Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

Amin Banayeeanzade*, Qingchuan Yang*, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Alfy Samuel, Robin Jia, Meisam Razaviyayn, Sai Praneeth Karimireddy.

University of Southern California · Capital One · *Equal contribution

Motivation

LLMs lack diversity: They collapse to a narrow set of outputs even when many valid alternatives exist. Ask GPT-5.5 to name a random city in the world. It repeatedly outputs “Valparaíso, Chile”. (Try it!)

Increasing temperature does not solve the issue: It produces invalid or nonsensical continuations before sufficient diversity is recovered. Top-token filtering methods such as min-p either drop many valid alternatives or include invalid tokens. We therefore ask:

What are the distributional properties of LLMs that constrain their ability to generate outputs that are both valid and diverse?

We identify two bottlenecks: (1) order calibration, and (2) shape calibration.

1. Order Calibration

Valid tokens are not reliably ranked above invalid tokens. As a result, any probability cutoff either drops many valid tokens (low recall) or includes many invalid ones (low precision).
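The precision/recall trade-off described above can be made concrete with a small sketch. The probabilities and validity labels below are made up for illustration (in practice the labels would have to come from some external judge), and the helper name `precision_recall_at_cutoffs` is hypothetical rather than anything from the paper.

```python
def precision_recall_at_cutoffs(probs, is_valid):
    """Return (k, precision, recall) for every top-k cutoff.

    probs:    next-token probabilities (toy values here).
    is_valid: parallel list of booleans marking valid tokens (assumed labels).
    """
    # Rank tokens from most to least probable, as a cutoff sampler would.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    total_valid = sum(bool(v) for v in is_valid)
    results = []
    kept_valid = 0
    for k, idx in enumerate(order, start=1):
        kept_valid += bool(is_valid[idx])
        precision = kept_valid / k  # share of kept tokens that are valid
        recall = kept_valid / total_valid if total_valid else 0.0  # share of valid tokens kept
        results.append((k, precision, recall))
    return results


# Toy distribution in which valid and invalid tokens are interleaved by rank.
probs = [0.50, 0.20, 0.10, 0.08, 0.06, 0.04, 0.02]
is_valid = [True, False, True, False, True, True, False]

curve = precision_recall_at_cutoffs(probs, is_valid)
for k, precision, recall in curve:
    print(f"top-{k}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy ranking, the only cutoff with perfect precision keeps a single token, and full recall is reached only after invalid tokens have been admitted, so no cutoff achieves both.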

Figure (interactive on the original page): token probability (log scale) against token index (sorted by probability), with an adjustable cutoff. Panels: model, model response, generated continuation.

Takeaways

Valid and invalid tokens are interleaved throughout the distribution.
No matter how the cutoff is adjusted, the model fails to recover both validity and diversity at the same time.
The effect compounds multiplicatively over generation steps; longer sequences show more severe trade-offs.
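As a rough illustration of the compounding effect, suppose each generation step independently keeps a fixed fraction of the desired quantity (for example, the per-step probability of sampling a valid token). Both the independence assumption and the 0.95 per-step rate are simplifications chosen for this sketch, not values from the paper.

```python
def sequence_level_rate(per_step_rate, num_steps):
    """Rate for a whole sequence when every step independently succeeds
    with probability per_step_rate (a simplifying assumption)."""
    return per_step_rate ** num_steps


# Hypothetical per-step rate of 0.95 across increasing sequence lengths.
for steps in (1, 10, 50):
    print(f"{steps:>2} steps: {sequence_level_rate(0.95, steps):.4f}")
```

Even a small per-step loss shrinks quickly as the sequence grows, which is consistent with longer sequences showing more severe trade-offs.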

2. Shape Calibration

The model's distribution drops sharply after the head and then has a heavy, long tail. At low temperature, only a few tokens dominate. Increasing the temperature shifts probability mass towards the tail, but because invalid tokens vastly outnumber valid ones, their combined mass grows much faster with temperature, while the remaining valid tokens stay suppressed.
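The temperature behavior can be sketched with a toy distribution. The logits below (one dominant head token, 9 other valid tokens, and 1000 invalid tail tokens) are invented for illustration, and the helper names are hypothetical; real model distributions will differ in detail.

```python
import math


def temperature_softmax(logits, temperature):
    """Softmax over logits divided by temperature (max-subtracted for stability)."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mass_by_group(logits, groups, temperature):
    """Sum the probability mass of each labeled group of tokens."""
    probs = temperature_softmax(logits, temperature)
    mass = {}
    for prob, group in zip(probs, groups):
        mass[group] = mass.get(group, 0.0) + prob
    return mass


# Toy setup (assumed values): sharp head, a few valid alternatives, long invalid tail.
logits = [10.0] + [4.0] * 9 + [0.0] * 1000
groups = ["head"] + ["other_valid"] * 9 + ["invalid"] * 1000

for temperature in (0.5, 1.0, 1.5, 2.0):
    mass = mass_by_group(logits, groups, temperature)
    print(
        f"T={temperature}: head={mass['head']:.3f} "
        f"other_valid={mass['other_valid']:.3f} invalid={mass['invalid']:.3f}"
    )
```

In this toy setup, raising the temperature moves most of the head's mass to the invalid tail, because the tail has far more tokens, while the other valid tokens gain comparatively little.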
