# Temperature-Variety Calibration Results

**Date**: 2026-03-02
**Cost**: ~$0.02 (subscriptions absorbed most)
**Data**: `raw_latest.json`, `metrics_latest.json`

## Setup

- **Models**: gemini-flash (5 runs), gpt-5.2 (5 runs), grok-fast (10 runs)
- **Temperatures**: default vs 2.0
- **Prompts**: creative, analytical, code, persuasive
- **Anthropic excluded**: max temperature is capped at 1.0, so t=2.0 can't be tested
- **GPT-5.2 note**: minimum temperature is 1.0, so its "default" run is effectively t=1.0
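The setup above implies the following run matrix (a hypothetical reconstruction for bookkeeping; model labels and run counts are taken from the bullets, everything else is illustrative):

```python
# Run matrix implied by the setup bullets above.
RUNS_PER_CONDITION = {"gemini-flash": 5, "gpt-5.2": 5, "grok-fast": 10}
TEMPERATURES = ["default", 2.0]
PROMPTS = ["creative", "analytical", "code", "persuasive"]

# Each model does its run count per (temperature, prompt) cell.
total_calls = (
    sum(RUNS_PER_CONDITION.values()) * len(TEMPERATURES) * len(PROMPTS)
)
print(total_calls)  # 160, consistent with the call count under Limitations
```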

## Metrics Used

Both metrics are mechanical — no LLM judge, no human eval.

1. **Unique n-gram ratio** — fraction of unique 3-grams across N outputs. Higher = more lexically diverse. Crude: conflates surface rephrasing with genuine diversity.
2. **Pairwise cosine distance** — embedded outputs with `text-embedding-3-small`, computed cosine distance between all pairs. Higher = outputs further apart in semantic space. Better than n-grams but still doesn't capture structural diversity (same argument in different order scores as similar).
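Both metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not the actual analysis code: tokenization is naive whitespace splitting, and the embedding vectors are assumed to be precomputed (e.g. by `text-embedding-3-small`).

```python
from itertools import combinations
from math import sqrt

def unique_ngram_ratio(texts, n=3):
    """Fraction of distinct n-grams among all n-grams pooled across outputs.

    1.0 means no n-gram repeats anywhere; repetition across outputs
    drags the ratio down. Naive whitespace tokenization for illustration.
    """
    all_ngrams = []
    for text in texts:
        tokens = text.split()
        all_ngrams += [
            tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        ]
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

def mean_pairwise_cosine_distance(vectors):
    """Mean of (1 - cosine similarity) over all unordered output pairs.

    `vectors` are assumed to be embeddings of the N outputs; higher mean
    distance = outputs spread further apart in semantic space.
    """
    def cos_dist(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return 1.0 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

    pairs = list(combinations(vectors, 2))
    return sum(cos_dist(a, b) for a, b in pairs) / len(pairs)
```

For example, two identical 4-token outputs give `unique_ngram_ratio(...) == 0.5` (2 distinct 3-grams out of 4 total), and orthogonal embedding vectors give a pairwise cosine distance of 1.0.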

## Results (cosine distance, higher = more diverse)

| Model        | Default | t=2.0  | Δ       | Verdict |
|--------------|--------:|-------:|--------:|---------|
| grok-fast    |  0.0818 | 0.1576 | +0.0758 | **Yes — nearly 2x diversity** |
| gemini-flash |  0.1409 | 0.1324 | -0.0085 | No meaningful change |
| gpt-5.2      |  0.1256 | 0.0927 | -0.0329 | **Worse** — more random ≠ more diverse |

### By prompt type (cosine distance)

| Prompt     | Model        | Default | t=2.0  | Δ       |
|------------|--------------|--------:|-------:|--------:|
| creative   | grok-fast    |  0.1858 | 0.3379 | +0.1521 |
| creative   | gpt-5.2      |  0.3063 | 0.1826 | -0.1237 |
| creative   | gemini-flash |  0.2141 | 0.2289 | +0.0148 |
| analytical | grok-fast    |  0.0566 | 0.0811 | +0.0245 |
| analytical | gpt-5.2      |  0.1115 | 0.1243 | +0.0128 |
| analytical | gemini-flash |  0.1146 | 0.1311 | +0.0165 |
| code       | grok-fast    |  0.0323 | 0.1215 | +0.0892 |
| code       | gpt-5.2      |  0.0441 | 0.0077 | -0.0364 |
| code       | gemini-flash |  0.1836 | 0.0534 | -0.1302 |
| persuasive | grok-fast    |  0.0524 | 0.0900 | +0.0376 |
| persuasive | gpt-5.2      |  0.0407 | 0.0563 | +0.0156 |
| persuasive | gemini-flash |  0.0512 | 0.1162 | +0.0650 |
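The headline per-model numbers look like unweighted means of the four prompt-type rows. A quick consistency check (values copied verbatim from the two tables; the assumption that the aggregation is an unweighted mean is mine):

```python
# Per-prompt cosine distances from the by-prompt table,
# ordered creative, analytical, code, persuasive.
BY_PROMPT = {
    "grok-fast":    {"default": [0.1858, 0.0566, 0.0323, 0.0524],
                     "t2.0":    [0.3379, 0.0811, 0.1215, 0.0900]},
    "gpt-5.2":      {"default": [0.3063, 0.1115, 0.0441, 0.0407],
                     "t2.0":    [0.1826, 0.1243, 0.0077, 0.0563]},
    "gemini-flash": {"default": [0.2141, 0.1146, 0.1836, 0.0512],
                     "t2.0":    [0.2289, 0.1311, 0.0534, 0.1162]},
}
# Headline per-model numbers from the main results table.
HEADLINE = {
    "grok-fast":    {"default": 0.0818, "t2.0": 0.1576},
    "gpt-5.2":      {"default": 0.1256, "t2.0": 0.0927},
    "gemini-flash": {"default": 0.1409, "t2.0": 0.1324},
}
for model, conds in BY_PROMPT.items():
    for cond, vals in conds.items():
        mean = sum(vals) / len(vals)
        # Agreement within 4-decimal rounding of the published tables.
        assert abs(mean - HEADLINE[model][cond]) < 1e-4, (model, cond, mean)
```

All six headline cells match the prompt-level means to within rounding, so the two tables are internally consistent.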

## Conclusions

1. **Temperature cranking works reliably only for grok-fast.** Consistent improvement across all prompt types, especially creative (+0.15) and code (+0.09).
2. **GPT-5.2 gets worse at t=2.0** for creative/code tasks. More randomness produces incoherent variation, not meaningful diversity.
3. **Gemini-flash is mostly unaffected** — slight improvements in some categories, slight degradation in others.
4. **Confirms the prior scoring-calibration finding**: inter-model disagreement far exceeds intra-model temperature variation. For genuine diversity, use multiple models or prompt reframing rather than temperature alone.

## Limitations

- Mechanical metrics only (n-grams + embedding distance). No judge for semantic/structural diversity.
- N=5-10 per condition — enough for directional signal, not for statistical significance.
- GPT "default" is already t=1.0 (forced minimum), so the comparison is t=1.0 vs t=2.0, not t=0 vs t=2.0.
- 150/160 calls completed (10 GPT/grok stragglers timed out).
