Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
Motivation
LLMs lack diversity: They collapse to a narrow set of outputs even when many valid alternatives exist. Ask GPT-5.5 to name a random city in the world. It repeatedly outputs “Valparaíso, Chile”. (Try it!)
Increasing temperature does not solve the issue: It produces invalid or nonsensical continuations before sufficient diversity is recovered. Top-token filtering methods such as min-p either drop many valid alternatives or include invalid tokens. We therefore ask:
We identify two bottlenecks: (1) order calibration, and (2) shape calibration.
1. Order Calibration
Valid tokens are not reliably ranked above invalid tokens. As a result, any probability cutoff either drops many valid tokens (low recall) or admits many invalid ones (low precision).
- Valid and invalid tokens are interleaved throughout the distribution.
- No matter where the cutoff is placed, the truncated distribution cannot recover both validity and diversity at the same time.
- The effect compounds multiplicatively over generation steps; longer sequences show more severe trade-offs.
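The trade-off described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the six-token distribution, the valid/invalid labels, the helper names (`min_p_keep`, `precision_recall`, `valid_mass`), and the simplification that each generation step is independent with the same per-step validity. It is not the evaluation procedure used in this work.

```python
import numpy as np

def min_p_keep(probs, min_p):
    """min-p style cutoff: keep tokens with prob >= min_p * max prob."""
    probs = np.asarray(probs, dtype=float)
    return probs >= min_p * probs.max()

def precision_recall(keep, valid):
    """Precision and recall of the kept set with respect to valid tokens."""
    keep = np.asarray(keep, dtype=bool)
    valid = np.asarray(valid, dtype=bool)
    kept_valid = int(np.sum(keep & valid))
    precision = kept_valid / max(int(keep.sum()), 1)
    recall = kept_valid / max(int(valid.sum()), 1)
    return precision, recall

def valid_mass(probs, keep, valid):
    """Probability that one sample from the renormalized kept set is valid."""
    probs = np.asarray(probs, dtype=float)
    return float(probs[keep & valid].sum() / probs[keep].sum())

# Hypothetical next-token distribution, sorted by probability.
# Valid and invalid tokens are interleaved (poor order calibration).
probs = np.array([0.40, 0.20, 0.15, 0.10, 0.08, 0.07])
valid = np.array([True, False, True, True, False, True])

steps = 10  # assumed sequence length for the compounding illustration
for min_p in (0.6, 0.3, 0.1):
    keep = min_p_keep(probs, min_p)
    precision, recall = precision_recall(keep, valid)
    per_step = valid_mass(probs, keep, valid)
    # Assuming independent steps, sequence-level validity is per_step ** steps.
    print(f"min_p={min_p}: precision={precision:.2f} recall={recall:.2f} "
          f"valid over {steps} steps={per_step ** steps:.3f}")
```

In this toy setting, the strict cutoff keeps only the top token (perfect precision, a quarter of the valid tokens), while the loose cutoff recovers every valid token but also admits every invalid one. No threshold reaches perfect precision and perfect recall together, because an invalid token sits at rank two.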