Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

Amin Banayeeanzade*, Qingchuan Yang*, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Alfy Samuel, Robin Jia, Meisam Razaviyayn, Sai Praneeth Karimireddy.
University of Southern California  ·  Capital One  ·  *Equal contribution

Motivation

LLMs lack diversity: They collapse to a narrow set of outputs even when many valid alternatives exist. Ask GPT-5.5 to name a random city in the world. It repeatedly outputs “Valparaíso, Chile”. (Try it!)

Increasing temperature does not solve the issue: It produces invalid or nonsensical continuations before sufficient diversity is recovered. Top-token filtering methods such as min-p either drop many valid alternatives or include invalid tokens. We therefore ask:


What are the distributional properties of LLMs that constrain their ability to generate outputs that are both valid and diverse?

We identify two bottlenecks: (1) order calibration, and (2) shape calibration.


1. Order Calibration

Valid tokens are not reliably ranked above invalid tokens. As a result, any cutoff strategy either drops many valid tokens (low recall) or admits many invalid ones (low precision).


[Interactive figure: for a selected model and response, token probability (log scale) plotted against token index (sorted by probability), with an adjustable cutoff.]

Takeaways
  • Valid and invalid tokens are interleaved throughout the distribution.
  • No matter how the cutoff is adjusted, the model fails to recover both validity and diversity at the same time.
  • The effect compounds multiplicatively over generation steps; longer sequences show more severe trade-offs.
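
The takeaways above can be illustrated with a minimal sketch. It assumes a hypothetical ranking of ten tokens in which valid and invalid tokens are interleaved; the names `precision_recall_at_cutoff`, `toy_ranking`, and `sequence_validity` are illustrative and do not come from the paper. For simplicity, precision here counts tokens rather than weighting them by probability mass, and the compounding estimate assumes every step is independently valid with the same probability.

```python
from typing import List, Tuple


def precision_recall_at_cutoff(valid_sorted: List[bool], k: int) -> Tuple[float, float]:
    """Precision and recall of keeping only the top-k tokens.

    valid_sorted[i] is True if the i-th most probable token is valid.
    Precision is count-based (an illustrative simplification).
    """
    kept = valid_sorted[:k]
    true_positives = sum(kept)
    total_valid = sum(valid_sorted)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_valid if total_valid else 1.0
    return precision, recall


def sequence_validity(per_step_validity: float, steps: int) -> float:
    """Chance that all steps are valid, assuming independent, identical steps."""
    return per_step_validity ** steps


# Hypothetical ranking, sorted by probability: True = valid, False = invalid.
toy_ranking = [True, True, False, True, False, False, True, False, True, False]

if __name__ == "__main__":
    for k in (2, 4, 7, 10):
        p, r = precision_recall_at_cutoff(toy_ranking, k)
        print(f"cutoff k={k:2d}  precision={p:.2f}  recall={r:.2f}")
    # Per-step validity of 0.75 shrinks quickly over a longer continuation.
    for steps in (1, 4, 16):
        print(f"steps={steps:2d}  sequence validity={sequence_validity(0.75, steps):.3f}")
```

In this toy ranking, no cutoff reaches full recall without also admitting invalid tokens, and a per-step validity below 1 decays geometrically with sequence length.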

2. Shape Calibration

The model's distribution drops sharply after the head and has a heavy, long tail. At low temperature, only a few tokens dominate. Increasing the temperature shifts probability mass towards the tail, but because invalid tokens vastly outnumber valid ones, the invalid tokens gain mass much faster, and the remaining valid tokens stay suppressed.


[Interactive figure: for a selected model and response, token probability (log scale) plotted against token index (sorted by probability), with an adjustable temperature (shown at 0.5).]

Takeaways
  • A few valid tokens have large logit values, many other valid tokens have small logit values, and a large number of invalid tokens make up the majority of the tail.
  • Increasing the temperature narrows the gap between the head and the tail, but since the tail is dominated by invalid tokens, the probabilities of invalid tokens grow much more than those of most valid tokens.
  • The effect compounds when generating longer sequences.
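
The shape effect described above can be reproduced on a toy distribution. The sketch below assumes hypothetical logits (two dominant valid tokens, twenty lower-scoring valid tokens, and two thousand invalid tail tokens); these values, and the names `softmax_with_temperature` and `mass_by_group`, are illustrative and not taken from the paper. It applies standard temperature scaling (dividing logits by the temperature before the softmax) and reports how much probability mass each group receives.

```python
import math
from typing import Dict, List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature, with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits (assumed for illustration only).
HEAD_VALID = [10.0, 9.0]        # a few dominant valid tokens
OTHER_VALID = [4.0] * 20        # many valid tokens with small logits
INVALID_TAIL = [0.0] * 2000     # invalid tokens vastly outnumber valid ones
LOGITS = HEAD_VALID + OTHER_VALID + INVALID_TAIL


def mass_by_group(temperature: float) -> Dict[str, float]:
    """Total probability assigned to each token group at a given temperature."""
    probs = softmax_with_temperature(LOGITS, temperature)
    n_head, n_other = len(HEAD_VALID), len(OTHER_VALID)
    return {
        "head_valid": sum(probs[:n_head]),
        "other_valid": sum(probs[n_head:n_head + n_other]),
        "invalid": sum(probs[n_head + n_other:]),
    }


if __name__ == "__main__":
    for t in (0.5, 1.0, 1.5, 2.0):
        m = mass_by_group(t)
        print(f"T={t:.1f}  head_valid={m['head_valid']:.3f}  "
              f"other_valid={m['other_valid']:.3f}  invalid={m['invalid']:.3f}")
```

With these assumed logits, raising the temperature from 1.0 to 2.0 moves the invalid tail from roughly 6% to roughly 84% of the mass, while the lower-scoring valid tokens only rise from roughly 3% to roughly 6%.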

Note: Probability values are shown on a log scale. Additionally, the x-axis in both plots is subsampled non-uniformly for readability.