Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
What distributional properties of LLMs constrain their ability to generate outputs that are both valid and diverse?
We identify two bottlenecks:
1. Order Calibration
Valid tokens are not reliably ranked above invalid ones. Any probability cutoff therefore either drops many valid tokens (low recall) or admits many invalid ones (low precision).
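The precision–recall tradeoff can be illustrated with a toy simulation (the score distributions below are assumptions for illustration, not measurements from any model): when valid and invalid tokens receive overlapping scores, no single cutoff separates them cleanly.

```python
import random

random.seed(0)

# Hypothetical, overlapping score distributions for valid and invalid
# tokens, mimicking a miscalibrated ranking of the vocabulary.
valid = [random.gauss(0.6, 0.15) for _ in range(1000)]
invalid = [random.gauss(0.4, 0.15) for _ in range(1000)]

def precision_recall(cutoff):
    """Keep tokens scoring >= cutoff; report precision and recall."""
    kept_valid = sum(s >= cutoff for s in valid)
    kept_invalid = sum(s >= cutoff for s in invalid)
    kept = kept_valid + kept_invalid
    precision = kept_valid / kept if kept else 1.0
    recall = kept_valid / len(valid)
    return precision, recall

for cutoff in (0.3, 0.5, 0.7):
    p, r = precision_recall(cutoff)
    print(f"cutoff={cutoff:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the cutoff improves precision but sacrifices recall, and vice versa; no threshold recovers both, which is the order-calibration bottleneck in miniature.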
[Interactive demo: a model response and its generated continuation, with bars showing each token's probability (log scale); hovering or clicking a bar previews different responses.]