Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

Amin Banayeeanzade, Qingchuan Yang, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Alfy Samuel, Robin Jia, Meisam Razaviyayn, Sai Praneeth Karimireddy.



What are the distributional properties of LLMs that constrain their ability to generate outputs that are both valid and diverse?

We identify two bottlenecks:

1. Order Calibration

Valid tokens are not reliably ranked above invalid tokens. Any cutoff strategy inevitably drops many valid tokens (low recall) or includes many invalid ones (low precision).
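The precision/recall trade-off can be seen on a toy example. The ranking below is hypothetical (not from the paper); it just mimics the miscalibration described above, where valid and invalid tokens interleave in the probability-sorted order, so no top-k cutoff achieves both high precision and high recall:

```python
# Hypothetical toy ranking: tokens sorted by model probability, with a flag
# marking whether each token is a valid continuation. Valid and invalid
# tokens interleave -- the order miscalibration described above.
ranked_valid = [True, False, True, False, False, True, False, True, False, False]

def precision_recall_at_cutoff(flags, k):
    """Keep the top-k tokens; precision = fraction of kept tokens that are
    valid, recall = fraction of all valid tokens that survive the cutoff."""
    kept = flags[:k]
    tp = sum(kept)
    return tp / k, tp / sum(flags)

for k in (2, 5, 10):
    p, r = precision_recall_at_cutoff(ranked_valid, k)
    print(f"top-{k}: precision={p:.2f}, recall={r:.2f}")
# Tight cutoffs drop valid tokens (low recall); loose cutoffs admit
# invalid ones (low precision); no k gives both.
```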

[Interactive figure: token probability (log scale) vs. token index (sorted by probability) for a sampled model response, with an adjustable cutoff line. Hover or click a bar to preview different responses.]

2. Shape Calibration

The model's distribution has a sharp drop in the head and a heavy, long tail. Increasing the temperature shifts probability mass toward the tail; but because invalid tokens vastly outnumber valid ones there, most valid tokens remain suppressed at every temperature.
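A minimal sketch of this effect, using standard temperature-scaled softmax on hypothetical logits (a two-token "valid" head and a long flat "invalid" tail; the specific values are illustrative, not from the paper):

```python
import math

def softmax_with_temperature(logits, T):
    """Temperature-scaled softmax: p_i proportional to exp(logit_i / T)."""
    scaled = [l / T for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical shape: a sharp head (few high-logit valid tokens) and a long,
# near-flat tail of many invalid tokens.
logits = [5.0, 4.5] + [0.0] * 1000

for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    head_mass = sum(p[:2])
    print(f"T={T}: head mass = {head_mass:.3f}, tail mass = {1 - head_mass:.3f}")
```

As T grows, the head's mass drains into the tail; since the tail is dominated by the many invalid tokens, raising the temperature buys tail mass rather than valid diversity.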

[Interactive figure: token probability (log scale) vs. token index (sorted by probability) for a sampled model response, with a temperature slider (shown: 0.5). Hover or click a bar to preview different responses.]

Note: probability values in both plots are shown on a log scale, and the x-axis is subsampled non-uniformly for readability.