Generated: 2026-02-25 17:04 UTC
Question: Do repeated judge calls at temp=1.0 produce meaningfully different scores? Is n=3 choices equivalent to 3 independent calls?

Strategies: 5x_t1.0 (gold standard), 3x_t1.0, 3x_t0.5 (lower temp), 1x_n3_t1.0 (batched choices, OpenAI/xAI only), 1x_t1.0 (single shot), 1x_t0.0 (deterministic)

| Strategy | Range | Cost/sample | Latency (ms) | Errors |
|---|---|---|---|---|
| 5x_t1.0 | 15 | $0.0125 | 12735 | 0 |
| 3x_t1.0 | 6.7 | $0.0072 | 7144 | 0 |
| 3x_t0.5 | 5 | $0.0070 | 6484 | 0 |
| 1x_n3_t1.0 | 13.3 | $0.0041 | 3202 | 0 |
| 1x_t1.0 | N/A | $0.0024 | 2207 | 0 |
| 1x_t0.0 | N/A | $0.0024 | 2153 | 0 |

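
The strategy labels appear to encode the call pattern: `<calls>x`, an optional `n<choices>`, and `t<temperature>`. The following is a minimal sketch of a parser under that reading. The `Strategy` dataclass and `parse_strategy` function are hypothetical names used for illustration, not part of the benchmark harness.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    calls: int          # number of separate judge requests
    choices: int        # completions requested per call (the "n" in the label)
    temperature: float  # sampling temperature

    @property
    def total_scores(self) -> int:
        # Scores collected per sample under this strategy.
        return self.calls * self.choices


# Assumed label grammar: "<calls>x[_n<choices>]_t<temperature>"
_LABEL = re.compile(r"^(\d+)x(?:_n(\d+))?_t(\d+(?:\.\d+)?)$")


def parse_strategy(label: str) -> Strategy:
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognised strategy label: {label!r}")
    calls, choices, temperature = match.groups()
    # A label without "_n<k>" means one completion per call.
    return Strategy(int(calls), int(choices or 1), float(temperature))


for label in ["5x_t1.0", "3x_t1.0", "3x_t0.5", "1x_n3_t1.0", "1x_t1.0", "1x_t0.0"]:
    print(label, parse_strategy(label))
```

Under this reading, `3x_t1.0` and `1x_n3_t1.0` both yield three scores per sample, which is what makes them directly comparable.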
| Sample | Prompt (truncated) | 5x_t1.0 | 3x_t1.0 | 3x_t0.5 | 1x_n3_t1.0 | 1x_t1.0 | 1x_t0.0 |
|---|---|---|---|---|---|---|---|
| 20 | Find all Python files that import sqlite3 and list what tabl | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) |
| 76 | Refactor trading/earnings/backtest/app.py to move all import | 100 (σ=11.2) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 94 | Fix any datetime.utcnow() deprecation warnings in learning/s | 25 (σ=20.9) | 25 (σ=14.4) | 0 (σ=0.0) | 25 (σ=14.4) | 25 (σ=0.0) | 0 (σ=0.0) |
| 101 | Audit logging setup across all projects. Which use loguru? W | 100 (σ=11.2) | 100 (σ=14.4) | 100 (σ=14.4) | 100 (σ=0.0) | 75 (σ=0.0) | 100 (σ=0.0) |
| 105 | Map the import dependency graph between top-level directorie | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=14.4) | 75 (σ=0.0) | 75 (σ=0.0) |
| 106 | Map the import dependency graph between top-level directorie | 75 (σ=13.7) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=14.4) | 75 (σ=0.0) | 75 (σ=0.0) |
| 111 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=14.4) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 117 | What configuration patterns are used? YAML, JSON, .env, clic | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 135 | Find all HTTP/WebSocket server endpoints defined in the code | 75 (σ=13.7) | 75 (σ=0.0) | 100 (σ=0.0) | 100 (σ=14.4) | 75 (σ=0.0) | 75 (σ=0.0) |
| 141 | Fix the Gradio layout so the sidebar doesn't collapse when t | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=14.4) | 75 (σ=14.4) | 100 (σ=0.0) | 75 (σ=0.0) |
| 143 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=14.4) | 100 (σ=0.0) | 100 (σ=0.0) |
| 164 | Find magic numbers and hardcoded strings in trading/earnings | 75 (σ=11.2) | 75 (σ=0.0) | 50 (σ=14.4) | 75 (σ=0.0) | 50 (σ=0.0) | 75 (σ=0.0) |
| 179 | What Gradio apps exist in this repo? List each with its port | 100 (σ=11.2) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=14.4) | 100 (σ=0.0) | 100 (σ=0.0) |
| 180 | What Gradio apps exist in this repo? List each with its port | 100 (σ=11.2) | 75 (σ=14.4) | 100 (σ=0.0) | 100 (σ=14.4) | 75 (σ=0.0) | 100 (σ=0.0) |
| 191 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
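
The σ values in the table are consistent with a sample standard deviation (n−1 denominator) over the repeated scores, and the per-strategy Range figures with the mean of per-sample (max − min). The sketch below illustrates both for a single sample. The raw scores are hypothetical, chosen only because they reproduce σ values that appear in the table; the report does not list individual scores.

```python
import statistics


def summarize(scores):
    """Median, sample standard deviation, and range for one sample's scores."""
    return {
        "median": statistics.median(scores),
        # stdev needs at least two values; a single-shot run has no spread.
        "sigma": round(statistics.stdev(scores), 1) if len(scores) > 1 else 0.0,
        "range": max(scores) - min(scores),
    }


# Hypothetical raw scores (not taken from the report).
three = summarize([75, 75, 100])             # a 3-score strategy
five = summarize([100, 100, 100, 100, 75])   # a 5-score strategy
single = summarize([75])                     # a single-shot strategy

print(three)
print(five)
print(single)
```

A single dissenting score on a 25-point scale is enough to produce σ=14.4 with three scores or σ=11.2 with five, so most non-zero σ cells in the table reflect one outlier rather than broad disagreement.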
Repeats (3x_t1.0) vs. choices (1x_n3_t1.0):
Mean median difference: 3.3 points
Max difference: 25 points
Within 5 pts: 13/15 (87%)
Mean σ (repeats): 3.84
Mean σ (choices): 7.68

Mean median difference: 6.7 points
Median difference: 0 points
Max difference: 25 points
Within 5 pts: 11/15 (73%)
Within 10 pts: 11/15 (73%)
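
As a check on how these agreement figures can be derived, the sketch below recomputes them from the 3x_t1.0 and 1x_n3_t1.0 median columns of the table above. It reproduces the 3.3-point mean difference, 25-point maximum, and 13/15 within-5 count; the `compare` helper is a hypothetical name, and the harness's actual implementation is not shown in the report.

```python
import statistics

# Medians per sample, copied from the table above (samples 20 .. 191).
repeats = [75, 100, 25, 100, 75, 75, 100, 100, 75, 75, 100, 75, 100, 75, 100]    # 3x_t1.0
choices = [75, 100, 25, 100, 75, 75, 100, 100, 100, 75, 100, 75, 100, 100, 100]  # 1x_n3_t1.0


def compare(a, b, tolerance=5):
    """Agreement statistics between two strategies' per-sample medians."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return {
        "mean_diff": round(statistics.mean(diffs), 1),
        "median_diff": statistics.median(diffs),
        "max_diff": max(diffs),
        "within": sum(d <= tolerance for d in diffs),
        "total": len(diffs),
    }


result = compare(repeats, choices)
print(result)
```

Because scores cluster on 25-point steps, the "within 5" and "within 10" counts usually coincide: two medians either match or differ by a full step.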

| Strategy | Range | Cost/sample | Latency (ms) | Errors |
|---|---|---|---|---|
| 5x_t1.0 | 8.7 | $0.0022 | 42503 | 0 |
| 3x_t1.0 | 4.7 | $0.0012 | 25240 | 0 |
| 3x_t0.5 | 7 | $0.0012 | 24979 | 0 |
| 1x_n3_t1.0 | 4 | $0.0012 | 9853 | 0 |
| 1x_t1.0 | N/A | $0.0004 | 8677 | 0 |
| 1x_t0.0 | N/A | $0.0004 | 8525 | 0 |

| Sample | Prompt (truncated) | 5x_t1.0 | 3x_t1.0 | 3x_t0.5 | 1x_n3_t1.0 | 1x_t1.0 | 1x_t0.0 |
|---|---|---|---|---|---|---|---|
| 20 | Find all Python files that import sqlite3 and list what tabl | 75 (σ=11.2) | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=14.4) | 75 (σ=0.0) | 95 (σ=0.0) |
| 76 | Refactor trading/earnings/backtest/app.py to move all import | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 94 | Fix any datetime.utcnow() deprecation warnings in learning/s | 100 (σ=0.0) | 100 (σ=14.4) | 100 (σ=14.4) | 100 (σ=0.0) | 100 (σ=0.0) | 75 (σ=0.0) |
| 101 | Audit logging setup across all projects. Which use loguru? W | 100 (σ=10.8) | 75 (σ=8.7) | 100 (σ=14.4) | 95 (σ=5.0) | 95 (σ=0.0) | 95 (σ=0.0) |
| 105 | Map the import dependency graph between top-level directorie | 100 (σ=2.7) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 106 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=2.9) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 111 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 117 | What configuration patterns are used? YAML, JSON, .env, clic | 75 (σ=13.7) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 135 | Find all HTTP/WebSocket server endpoints defined in the code | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 141 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 143 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 164 | Find magic numbers and hardcoded strings in trading/earnings | 75 (σ=11.2) | 75 (σ=0.0) | 100 (σ=28.9) | 100 (σ=14.4) | 75 (σ=0.0) | 75 (σ=0.0) |
| 179 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
| 180 | What Gradio apps exist in this repo? List each with its port | 100 (σ=13.7) | 100 (σ=14.4) | 100 (σ=2.9) | 100 (σ=0.0) | 50 (σ=0.0) | 100 (σ=0.0) |
| 191 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) |
Repeats (3x_t1.0) vs. choices (1x_n3_t1.0):
Mean median difference: 3 points
Max difference: 25 points
Within 5 pts: 13/15 (87%)
Mean σ (repeats): 2.69
Mean σ (choices): 2.25

Mean median difference: 3.3 points
Median difference: 0 points
Max difference: 25 points
Within 5 pts: 13/15 (87%)
Within 10 pts: 13/15 (87%)

| Strategy | Range | Cost/sample | Latency (ms) | Errors |
|---|---|---|---|---|
| 5x_t1.0 | 1.7 | $0.0166 | 11321 | 9 |
| 3x_t1.0 | 3 | $0.0093 | 5895 | 9 |
| 3x_t0.5 | 0 | $0.0172 | 12046 | 4 |
| 1x_n3_t1.0 | N/A | $0.0000 | 0 | 0 |
| 1x_t1.0 | N/A | $0.0069 | 4627 | 2 |
| 1x_t0.0 | N/A | $0.0056 | 4036 | 4 |

| Sample | Prompt (truncated) | 5x_t1.0 | 3x_t1.0 | 3x_t0.5 | 1x_n3_t1.0 | 1x_t1.0 | 1x_t0.0 |
|---|---|---|---|---|---|---|---|
| 20 | Find all Python files that import sqlite3 and list what tabl | — | — | 30 (σ=0.0) | — | 30 (σ=0.0) | 30 (σ=0.0) |
| 76 | Refactor trading/earnings/backtest/app.py to move all import | 30 (σ=0.0) | 30 (σ=2.9) | — | — | 30 (σ=0.0) | — |
| 94 | Fix any datetime.utcnow() deprecation warnings in learning/s | — | — | 5 (σ=0.0) | — | 5 (σ=0.0) | 5 (σ=0.0) |
| 101 | Audit logging setup across all projects. Which use loguru? W | — | 52 (σ=5.8) | — | — | 52 (σ=0.0) | — |
| 105 | Map the import dependency graph between top-level directorie | 62 (σ=0.0) | — | 62 (σ=0.0) | — | 62 (σ=0.0) | — |
| 106 | Map the import dependency graph between top-level directorie | — | — | 42 (σ=0.0) | — | 42 (σ=0.0) | 42 (σ=0.0) |
| 111 | Map the import dependency graph between top-level directorie | 52 (σ=1.3) | 52 (σ=1.7) | 52 (σ=0.0) | — | 52 (σ=0.0) | 52 (σ=0.0) |
| 117 | What configuration patterns are used? YAML, JSON, .env, clic | — | — | — | — | 62 (σ=0.0) | 62 (σ=0.0) |
| 135 | Find all HTTP/WebSocket server endpoints defined in the code | — | — | 72 (σ=0.0) | — | — | 72 (σ=0.0) |
| 141 | Fix the Gradio layout so the sidebar doesn't collapse when t | — | — | 35 (σ=0.0) | — | — | 35 (σ=0.0) |
| 143 | Fix the Gradio layout so the sidebar doesn't collapse when t | — | 25 (σ=0.0) | 25 (σ=0.0) | — | 25 (σ=0.0) | 25 (σ=0.0) |
| 164 | Find magic numbers and hardcoded strings in trading/earnings | 15 (σ=0.0) | 15 (σ=0.0) | 15 (σ=0.0) | — | 15 (σ=0.0) | — |
| 179 | What Gradio apps exist in this repo? List each with its port | 52 (σ=3.8) | — | — | — | 52 (σ=0.0) | 52 (σ=0.0) |
| 180 | What Gradio apps exist in this repo? List each with its port | 62 (σ=0.0) | — | 62 (σ=0.0) | — | 62 (σ=0.0) | 62 (σ=0.0) |
| 191 | What Gradio apps exist in this repo? List each with its port | — | 72 (σ=0.0) | 72 (σ=0.0) | — | 72 (σ=0.0) | 72 (σ=0.0) |
Mean median difference: 0 points
Median difference: 0.0 points
Max difference: 0 points
Within 5 pts: 4/4 (100%)
Within 10 pts: 4/4 (100%)
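
When judge calls fail, some strategy/sample cells in the table are empty, so a pairwise comparison can only use samples where both strategies returned a score. That would explain a denominator of 4 rather than 15. The sketch below shows such pairwise filtering, using the 5x_t1.0 and 3x_t0.5 columns from the table above purely as an illustration. The report does not state which pair the figures refer to, and `None` stands in for an empty cell.

```python
# Medians keyed by sample id, copied from the table above; None = no score.
five_x = {
    20: None, 76: 30, 94: None, 101: None, 105: 62,
    106: None, 111: 52, 117: None, 135: None, 141: None,
    143: None, 164: 15, 179: 52, 180: 62, 191: None,
}
three_half = {
    20: 30, 76: None, 94: 5, 101: None, 105: 62,
    106: 42, 111: 52, 117: None, 135: 72, 141: 35,
    143: 25, 164: 15, 179: None, 180: 62, 191: 72,
}

# Keep only samples scored under both strategies.
paired = {
    sample: (five_x[sample], three_half[sample])
    for sample in five_x
    if five_x[sample] is not None and three_half.get(sample) is not None
}
diffs = [abs(a - b) for a, b in paired.values()]

print(f"comparable samples: {sorted(paired)}")
print(f"max difference: {max(diffs)}")
print(f"within 5 pts: {sum(d <= 5 for d in diffs)}/{len(diffs)}")
```

With so few overlapping samples, perfect agreement here says little about the strategies themselves; the error counts matter more than the difference statistics for this model.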

| Strategy | Range | Cost/sample | Latency (ms) | Errors |
|---|---|---|---|---|
| 5x_t1.0 | 4 | $0.0217 | 43316 | 0 |
| 3x_t1.0 | 4.3 | $0.0134 | 38378 | 0 |
| 3x_t0.5 | 1.7 | $0.0127 | 22472 | 0 |
| 1x_n3_t1.0 | N/A | $0.0000 | 0 | 0 |
| 1x_t1.0 | N/A | $0.0043 | 16288 | 0 |
| 1x_t0.0 | N/A | $0.0041 | 10754 | 0 |

| Sample | Prompt (truncated) | 5x_t1.0 | 3x_t1.0 | 3x_t0.5 | 1x_n3_t1.0 | 1x_t1.0 | 1x_t0.0 |
|---|---|---|---|---|---|---|---|
| 20 | Find all Python files that import sqlite3 and list what tabl | 100 (σ=4.5) | 90 (σ=8.7) | 100 (σ=14.4) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 76 | Refactor trading/earnings/backtest/app.py to move all import | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 94 | Fix any datetime.utcnow() deprecation warnings in learning/s | 75 (σ=17.7) | 75 (σ=25.0) | 75 (σ=0.0) | — | 75 (σ=0.0) | 75 (σ=0.0) |
| 101 | Audit logging setup across all projects. Which use loguru? W | 75 (σ=0.0) | 75 (σ=0.0) | 75 (σ=0.0) | — | 75 (σ=0.0) | 75 (σ=0.0) |
| 105 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 106 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 111 | Map the import dependency graph between top-level directorie | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 117 | What configuration patterns are used? YAML, JSON, .env, clic | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 135 | Find all HTTP/WebSocket server endpoints defined in the code | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 141 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 143 | Fix the Gradio layout so the sidebar doesn't collapse when t | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 164 | Find magic numbers and hardcoded strings in trading/earnings | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 179 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 180 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
| 191 | What Gradio apps exist in this repo? List each with its port | 100 (σ=0.0) | 100 (σ=0.0) | 100 (σ=0.0) | — | 100 (σ=0.0) | 100 (σ=0.0) |
Mean median difference: 0.7 points
Median difference: 0 points
Max difference: 10 points
Within 5 pts: 14/15 (93%)
Within 10 pts: 15/15 (100%)