Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

Amin Banayeeanzade, Qingchuan Yang, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Alfy Samuel, Robin Jia, Meisam Razaviyayn, Sai Praneeth Karimireddy.



What are the distributional properties of LLMs that constrain their ability to generate outputs that are both valid and diverse?

We identify two bottlenecks:

1. Order Calibration

Valid tokens are not reliably ranked above invalid tokens. Any cutoff strategy inevitably drops many valid tokens (low recall) or includes many invalid ones (low precision).
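The precision/recall trade-off can be seen on a toy example. The ranking below is hypothetical (not from the paper); it just mimics the miscalibration described above, where valid and invalid tokens interleave in the probability-sorted order, so no top-k cutoff achieves both high precision and high recall:

```python
# Hypothetical toy ranking: tokens sorted by model probability, with a flag
# marking whether each token is a valid continuation. Valid and invalid
# tokens interleave -- the order miscalibration described above.
ranked_valid = [True, False, True, False, False, True, False, True, False, False]

def precision_recall_at_cutoff(flags, k):
    """Keep the top-k tokens; precision = fraction of kept tokens that are
    valid, recall = fraction of all valid tokens that survive the cutoff."""
    kept = flags[:k]
    tp = sum(kept)
    return tp / k, tp / sum(flags)

for k in (2, 5, 10):
    p, r = precision_recall_at_cutoff(ranked_valid, k)
    print(f"top-{k}: precision={p:.2f}, recall={r:.2f}")
# Tight cutoffs drop valid tokens (low recall); loose cutoffs admit
# invalid ones (low precision); no k gives both.
```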

[Interactive figure: token probability (log scale) vs. token index (sorted by probability) for a sampled model response, with an adjustable cutoff line. Hover or click a bar to preview different responses.]

2. Shape Calibration

The model's distribution has a sharp drop in the head and a heavy, long tail. Increasing the temperature shifts probability mass toward the tail; but because invalid tokens vastly outnumber valid ones there, most valid tokens remain suppressed at every temperature.
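A minimal sketch of this effect, using standard temperature-scaled softmax on hypothetical logits (a two-token "valid" head and a long flat "invalid" tail; the specific values are illustrative, not from the paper):

```python
import math

def softmax_with_temperature(logits, T):
    """Temperature-scaled softmax: p_i proportional to exp(logit_i / T)."""
    scaled = [l / T for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical shape: a sharp head (few high-logit valid tokens) and a long,
# near-flat tail of many invalid tokens.
logits = [5.0, 4.5] + [0.0] * 1000

for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    head_mass = sum(p[:2])
    print(f"T={T}: head mass = {head_mass:.3f}, tail mass = {1 - head_mass:.3f}")
```

As T grows, the head's mass drains into the tail; since the tail is dominated by the many invalid tokens, raising the temperature buys tail mass rather than valid diversity.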

[Interactive figure: token probability (log scale) vs. token index (sorted by probability) for a sampled model response, with a temperature slider (shown: 0.5). Hover or click a bar to preview different responses.]

Note: probability values in both plots are shown on a log scale, and the x-axis is subsampled non-uniformly for readability.