Judge Calibration Report

Generated: 2026-02-25 17:04 UTC

Question: Do repeated judge calls at temp=1.0 produce meaningfully different scores? Is a single call with n=3 choices equivalent to 3 independent calls?

Methodology

gpt-4o

| Config | σ | Range | Cost/sample | Latency (ms) | Errors |
|---|---|---|---|---|---|
| 5x_t1.0 | 6.95 | 15 | $0.0125 | 12735 | 0 |
| 3x_t1.0 | 3.85 | 6.7 | $0.0072 | 7144 | 0 |
| 3x_t0.5 | 2.89 | 5 | $0.0070 | 6484 | 0 |
| 1x_n3_t1.0 | 7.7 | 13.3 | $0.0041 | 3202 | 0 |
| 1x_t1.0 | N/A | N/A | $0.0024 | 2207 | 0 |
| 1x_t0.0 | N/A | N/A | $0.0024 | 2153 | 0 |
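The config labels appear to encode the number of independent calls, an optional n (choices per call), and the temperature. The sketch below parses them under that reading; the `JudgeConfig` class and `parse_config` helper are hypothetical names introduced for illustration, not part of the tool that produced this report.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    repeats: int        # independent judge calls per sample (assumed meaning of "5x")
    choices: int        # completions requested per call (assumed meaning of "n3")
    temperature: float  # sampling temperature (assumed meaning of "t1.0")


# Matches labels such as "5x_t1.0" or "1x_n3_t1.0".
_LABEL = re.compile(r"^(\d+)x(?:_n(\d+))?_t(\d+(?:\.\d+)?)$")


def parse_config(label: str) -> JudgeConfig:
    """Split a config label into repeats, choices per call, and temperature."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized config label: {label!r}")
    repeats, choices, temperature = match.groups()
    return JudgeConfig(int(repeats), int(choices or 1), float(temperature))


def scores_per_sample(config: JudgeConfig) -> int:
    """Total number of judge scores collected for one sample."""
    return config.repeats * config.choices


if __name__ == "__main__":
    for label in ["5x_t1.0", "3x_t0.5", "1x_n3_t1.0", "1x_t0.0"]:
        cfg = parse_config(label)
        print(label, cfg, scores_per_sample(cfg))
```

Under this reading, 3x_t1.0 and 1x_n3_t1.0 both yield three scores per sample, which is what makes them directly comparable.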

Per-sample scores

| Sample | Prompt (truncated) | 5x_t1.0 | 3x_t1.0 | 3x_t0.5 | 1x_n3_t1.0 | 1x_t1.0 | 1x_t0.0 |
|---|---|---|---|---|---|---|---|
| 20 | Find all Python files that import sqlite3 and list what tabl | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) |
| 76 | Refactor trading/earnings/backtest/app.py to move all import | 100 (σ=11.2) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 94 | Fix any datetime.utcnow() deprecation warnings in learning/s | 25 (σ=20.9) | 25 (σ=14.4) | 0 (σ=0.0) | 25 (σ=14.4) | 25 (σ=0.0) | 0 (σ=0.0) |
| 101 | Audit logging setup across all projects. Which use loguru? W | 100 (σ=11.2) | 100 (σ=14.4) | 100 (σ=14.4) | 100 (σ=0.0) | 75 (σ=0.0) | 100 (σ=0.0) |
| 105 | Map the import dependency graph between top-level directorie | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=14.4) | 75 (σ=0.0) | 75 (σ=0.0) |
| 106 | Map the import dependency graph between top-level directorie | 75 (σ=13.7) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=14.4) | 75 (σ=0.0) | 75 (σ=0.0) |
| 111 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=14.4) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 117 | What configuration patterns are used? YAML, JSON, .env, clic | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 135 | Find all HTTP/WebSocket server endpoints defined in the code | 75 (σ=13.7) | 75 (σ=0.0) | 100 (σ=0.0) | 100 (σ=14.4) | 75 (σ=0.0) | 75 (σ=0.0) |
| 141 | Fix the Gradio layout so the sidebar doesn't collapse when t | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=14.4) | 75 (σ=14.4) | 100 (σ=0.0) | 75 (σ=0.0) |
| 143 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=14.4) | 100 (σ=0.0) | 100 (σ=0.0) |
| 164 | Find magic numbers and hardcoded strings in trading/earnings | 75 (σ=11.2) | 75 (σ=0.0) | 50 (σ=14.4) | 75 (σ=0.0) | 50 (σ=0.0) | 75 (σ=0.0) |
| 179 | What Gradio apps exist in this repo? List each with its port | 100 (σ=11.2) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=14.4) | 100 (σ=0.0) | 100 (σ=0.0) |
| 180 | What Gradio apps exist in this repo? List each with its port | 100 (σ=11.2) | 75 (σ=14.4) | 100 (σ=0.0) | 100 (σ=14.4) | 75 (σ=0.0) | 100 (σ=0.0) |
| 191 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
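Each cell in the table above pairs a central score with a per-sample σ. The report does not include the raw scores, but the σ values are consistent with a sample standard deviation (n−1 denominator) over the repeated scores. A minimal sketch, assuming that definition and using hypothetical score sets chosen because they reproduce σ values that appear in the table:

```python
from statistics import median, stdev


def summarize(scores):
    """Return (median, sample σ rounded to 1 decimal) for one sample's scores.

    A single score has no spread, so σ is reported as 0.0, matching how the
    1x configs appear in the table.
    """
    sigma = stdev(scores) if len(scores) > 1 else 0.0
    return median(scores), round(sigma, 1)


# Hypothetical raw scores: one of three judges disagrees by 25 points.
three_calls = summarize([75, 100, 100])
# Hypothetical raw scores: one of five judges disagrees by 25 points.
five_calls = summarize([75, 100, 100, 100, 100])

if __name__ == "__main__":
    print(three_calls)  # median 100 with σ 14.4
    print(five_calls)   # median 100 with σ 11.2
```

This explains why the table contains so few distinct σ values: on a rubric with 25-point steps, a single dissenting score among three calls always gives σ=14.4, and among five calls σ=11.2.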

3× independent repeats vs 1× call with n=3 choices

Mean median difference: 3.3 points

Max difference: 25 points

Within 5 pts: 13/15 (87%)

Mean σ (repeats): 3.84

Mean σ (choices): 7.68
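The agreement figures above can be reproduced from the gpt-4o per-sample scores. The sketch below assumes the comparison is the absolute difference between the 3x_t1.0 and 1x_n3_t1.0 scores for each sample; the `compare` helper is a hypothetical name, and the two lists are copied from the per-sample table in sample order (20 through 191).

```python
from statistics import mean

# gpt-4o scores per sample, copied from the per-sample table.
REPEATS_3X = [75, 100, 25, 100, 75, 75, 100, 100, 75, 75, 100, 75, 100, 75, 100]
CHOICES_N3 = [75, 100, 25, 100, 75, 75, 100, 100, 100, 75, 100, 75, 100, 100, 100]


def compare(a, b, tolerance=5):
    """Summarize per-sample absolute differences between two configs."""
    if len(a) != len(b):
        raise ValueError("both configs must cover the same samples")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return {
        "mean_diff": round(mean(diffs), 1),
        "max_diff": max(diffs),
        "within": sum(d <= tolerance for d in diffs),
        "total": len(diffs),
    }


result = compare(REPEATS_3X, CHOICES_N3)

if __name__ == "__main__":
    print(result)
```

The two disagreeing samples (135 and 180) each differ by a full 25-point rubric step, which is why the mean difference is small while the maximum is large.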

3x @ temp=1.0 vs 3x @ temp=0.5

Mean median difference: 6.7 points

Median difference: 0 points

Max difference: 25 points

Within 5 pts: 11/15 (73%)

Within 10 pts: 11/15 (73%)

grok-3-mini

| Config | σ | Range | Cost/sample | Latency (ms) | Errors |
|---|---|---|---|---|---|
| 5x_t1.0 | 4.22 | 8.7 | $0.0022 | 42503 | 0 |
| 3x_t1.0 | 2.69 | 4.7 | $0.0012 | 25240 | 0 |
| 3x_t0.5 | 4.04 | 7 | $0.0012 | 24979 | 0 |
| 1x_n3_t1.0 | 2.26 | 4 | $0.0012 | 9853 | 0 |
| 1x_t1.0 | N/A | N/A | $0.0004 | 8677 | 0 |
| 1x_t0.0 | N/A | N/A | $0.0004 | 8525 | 0 |
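The cost and latency figures above allow a direct comparison between three independent calls and one call with n=3. A small sketch of that arithmetic, using the grok-3-mini numbers reported above; the `relative` helper and the dictionary layout are illustrative assumptions.

```python
def relative(candidate, baseline):
    """Ratio of candidate to baseline, rounded to 2 decimals."""
    return round(candidate / baseline, 2)


# grok-3-mini figures from the report (cost in USD per sample, latency in ms).
repeats_3x = {"cost": 0.0012, "latency_ms": 25240}  # 3x_t1.0
choices_n3 = {"cost": 0.0012, "latency_ms": 9853}   # 1x_n3_t1.0

cost_ratio = relative(choices_n3["cost"], repeats_3x["cost"])
latency_ratio = relative(choices_n3["latency_ms"], repeats_3x["latency_ms"])

if __name__ == "__main__":
    print(f"n=3 cost vs 3 repeats:    {cost_ratio}x")
    print(f"n=3 latency vs 3 repeats: {latency_ratio}x")
```

For this model the single n=3 call costs the same as three repeats but takes roughly 40% of the wall-clock time.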

Per-sample scores

| Sample | Prompt (truncated) | 5x_t1.0 | 3x_t1.0 | 3x_t0.5 | 1x_n3_t1.0 | 1x_t1.0 | 1x_t0.0 |
|---|---|---|---|---|---|---|---|
| 20 | Find all Python files that import sqlite3 and list what tabl | 75 (σ=11.2) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=14.4) | 75 (σ=0.0) | 95 (σ=0.0) |
| 76 | Refactor trading/earnings/backtest/app.py to move all import | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 94 | Fix any datetime.utcnow() deprecation warnings in learning/s | 100 (σ=0.0) | 100 (σ=14.4) | 100 (σ=14.4) | 100 (σ=0.0) | 100 (σ=0.0) | 75 (σ=0.0) |
| 101 | Audit logging setup across all projects. Which use loguru? W | 100 (σ=10.8) | 75 (σ=8.7) | 100 (σ=14.4) | 95 (σ=5.0) | 95 (σ=0.0) | 95 (σ=0.0) |
| 105 | Map the import dependency graph between top-level directorie | 100 (σ=2.7) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 106 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=2.9) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 111 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 117 | What configuration patterns are used? YAML, JSON, .env, clic | 75 (σ=13.7) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 135 | Find all HTTP/WebSocket server endpoints defined in the code | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 141 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 143 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 164 | Find magic numbers and hardcoded strings in trading/earnings | 75 (σ=11.2) | 75 (σ=0.0) | 100 (σ=28.9) | 100 (σ=14.4) | 75 (σ=0.0) | 75 (σ=0.0) |
| 179 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 180 | What Gradio apps exist in this repo? List each with its port | 100 (σ=13.7) | 100 (σ=14.4) | 100 (σ=2.9) | 100 (σ=0.0) | 50 (σ=0.0) | 100 (σ=0.0) |
| 191 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |

3× independent repeats vs 1× call with n=3 choices

Mean median difference: 3 points

Max difference: 25 points

Within 5 pts: 13/15 (87%)

Mean σ (repeats): 2.69

Mean σ (choices): 2.25
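The two "Mean σ" figures above appear to be plain averages of the per-sample σ values over all 15 samples, including those with σ=0.0. A sketch under that assumption, using the grok-3-mini σ values from the per-sample table; the `mean_sigma` helper is a hypothetical name.

```python
from statistics import mean

# grok-3-mini per-sample σ values, copied from the per-sample table
# in sample order (20 through 191).
SIGMA_REPEATS = [0.0, 0.0, 14.4, 8.7, 0.0, 2.9, 0.0, 0.0,
                 0.0, 0.0, 0.0, 0.0, 0.0, 14.4, 0.0]   # 3x_t1.0
SIGMA_CHOICES = [14.4, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0,
                 0.0, 0.0, 0.0, 14.4, 0.0, 0.0, 0.0]   # 1x_n3_t1.0


def mean_sigma(sigmas):
    """Average per-sample σ across all samples, rounded to 2 decimals."""
    return round(mean(sigmas), 2)


repeats_mean = mean_sigma(SIGMA_REPEATS)
choices_mean = mean_sigma(SIGMA_CHOICES)

if __name__ == "__main__":
    print(repeats_mean, choices_mean)
```

Because most samples have σ=0.0, each mean is driven by three or four samples, so small differences between the two figures should be read with caution.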

3x @ temp=1.0 vs 3x @ temp=0.5

Mean median difference: 3.3 points

Median difference: 0 points

Max difference: 25 points

Within 5 pts: 13/15 (87%)

Within 10 pts: 13/15 (87%)

claude-opus-4-6

| Config | σ | Range | Cost/sample | Latency (ms) | Errors |
|---|---|---|---|---|---|
| 5x_t1.0 | 0.86 | 1.7 | $0.0166 | 11321 | 9 |
| 3x_t1.0 | 1.73 | 3 | $0.0093 | 5895 | 9 |
| 3x_t0.5 | 0.0 | 0 | $0.0172 | 12046 | 4 |
| 1x_n3_t1.0 | N/A | N/A | $0.0000 | 0 | 0 |
| 1x_t1.0 | N/A | N/A | $0.0069 | 4627 | 2 |
| 1x_t0.0 | N/A | N/A | $0.0056 | 4036 | 4 |

Per-sample scores

Config columns: 5x_t1.0, 3x_t1.0, 3x_t0.5, 1x_n3_t1.0, 1x_t1.0, 1x_t0.0. Each row below has fewer than six scores, so the scores are listed in their original order rather than under a specific config column.

| Sample | Prompt (truncated) | Scores (in original order) |
|---|---|---|
| 20 | Find all Python files that import sqlite3 and list what tabl | 30 (σ=0.0), 30 (σ=0.0), 30 (σ=0.0) |
| 76 | Refactor trading/earnings/backtest/app.py to move all import | 30 (σ=0.0), 30 (σ=2.9), 30 (σ=0.0) |
| 94 | Fix any datetime.utcnow() deprecation warnings in learning/s | 5 (σ=0.0), 5 (σ=0.0), 5 (σ=0.0) |
| 101 | Audit logging setup across all projects. Which use loguru? W | 52 (σ=5.8), 52 (σ=0.0) |
| 105 | Map the import dependency graph between top-level directorie | 62 (σ=0.0), 62 (σ=0.0), 62 (σ=0.0) |
| 106 | Map the import dependency graph between top-level directorie | 42 (σ=0.0), 42 (σ=0.0), 42 (σ=0.0) |
| 111 | Map the import dependency graph between top-level directorie | 52 (σ=1.3), 52 (σ=1.7), 52 (σ=0.0), 52 (σ=0.0), 52 (σ=0.0) |
| 117 | What configuration patterns are used? YAML, JSON, .env, clic | 62 (σ=0.0), 62 (σ=0.0) |
| 135 | Find all HTTP/WebSocket server endpoints defined in the code | 72 (σ=0.0), 72 (σ=0.0) |
| 141 | Fix the Gradio layout so the sidebar doesn't collapse when t | 35 (σ=0.0), 35 (σ=0.0) |
| 143 | Fix the Gradio layout so the sidebar doesn't collapse when t | 25 (σ=0.0), 25 (σ=0.0), 25 (σ=0.0), 25 (σ=0.0) |
| 164 | Find magic numbers and hardcoded strings in trading/earnings | 15 (σ=0.0), 15 (σ=0.0), 15 (σ=0.0), 15 (σ=0.0) |
| 179 | What Gradio apps exist in this repo? List each with its port | 52 (σ=3.8), 52 (σ=0.0), 52 (σ=0.0) |
| 180 | What Gradio apps exist in this repo? List each with its port | 62 (σ=0.0), 62 (σ=0.0), 62 (σ=0.0), 62 (σ=0.0) |
| 191 | What Gradio apps exist in this repo? List each with its port | 72 (σ=0.0), 72 (σ=0.0), 72 (σ=0.0), 72 (σ=0.0) |

3x @ temp=1.0 vs 3x @ temp=0.5

Mean median difference: 0 points

Median difference: 0 points

Max difference: 0 points

Within 5 pts: 4/4 (100%)

Within 10 pts: 4/4 (100%)
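The 4/4 denominators above suggest that only samples with a score under both configs entered this comparison, presumably because judge errors left the other samples without a pair. A minimal sketch of that kind of pairwise filtering, assuming missing scores are represented as `None`; the `paired_diffs` helper and the example values are made up for illustration and are not taken from the report.

```python
from typing import List, Optional, Sequence


def paired_diffs(
    a: Sequence[Optional[float]], b: Sequence[Optional[float]]
) -> List[float]:
    """Absolute differences for samples that have a score in both configs."""
    if len(a) != len(b):
        raise ValueError("both configs must list the same samples")
    return [abs(x - y) for x, y in zip(a, b) if x is not None and y is not None]


# Illustrative values only: None marks a sample whose judge calls all failed.
temp_1_0 = [30, None, 5, 52, None]
temp_0_5 = [30, 30, None, 52, 62]

diffs = paired_diffs(temp_1_0, temp_0_5)
within_5 = sum(d <= 5 for d in diffs)

if __name__ == "__main__":
    print(f"Within 5 pts: {within_5}/{len(diffs)}")
```

With this filtering, a perfect agreement rate can rest on very few samples, so the denominator matters as much as the percentage.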

gemini-3.1-pro-preview

| Config | σ | Range | Cost/sample | Latency (ms) | Errors |
|---|---|---|---|---|---|
| 5x_t1.0 | 1.48 | 4 | $0.0217 | 43316 | 0 |
| 3x_t1.0 | 2.24 | 4.3 | $0.0134 | 38378 | 0 |
| 3x_t0.5 | 0.96 | 1.7 | $0.0127 | 22472 | 0 |
| 1x_n3_t1.0 | N/A | N/A | $0.0000 | 0 | 0 |
| 1x_t1.0 | N/A | N/A | $0.0043 | 16288 | 0 |
| 1x_t0.0 | N/A | N/A | $0.0041 | 10754 | 0 |

Per-sample scores

| Sample | Prompt (truncated) | 5x_t1.0 | 3x_t1.0 | 3x_t0.5 | 1x_n3_t1.0 | 1x_t1.0 | 1x_t0.0 |
|---|---|---|---|---|---|---|---|
| 20 | Find all Python files that import sqlite3 and list what tabl | 100 (σ=4.5) | 90 (σ=8.7) | 100 (σ=14.4) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 76 | Refactor trading/earnings/backtest/app.py to move all import | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 94 | Fix any datetime.utcnow() deprecation warnings in learning/s | 75 (σ=17.7) | 75 (σ=25.0) | 75 (σ=0.0) |  | 75 (σ=0.0) | 75 (σ=0.0) |
| 101 | Audit logging setup across all projects. Which use loguru? W | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) |  | 75 (σ=0.0) | 75 (σ=0.0) |
| 105 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 106 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 111 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 117 | What configuration patterns are used? YAML, JSON, .env, clic | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 135 | Find all HTTP/WebSocket server endpoints defined in the code | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 141 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 143 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 164 | Find magic numbers and hardcoded strings in trading/earnings | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 179 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 180 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |
| 191 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |  | 100 (σ=0.0) | 100 (σ=0.0) |

3x @ temp=1.0 vs 3x @ temp=0.5

Mean median difference: 0.7 points

Median difference: 0 points

Max difference: 10 points

Within 5 pts: 14/15 (93%)

Within 10 pts: 15/15 (100%)

Conclusions