Judge Calibration Report

Quis custodiet ipsos custodes?
Who will judge the judges? — Juvenal, Satires VI

The Problem

When an AI system scores documents — rating writing quality, evaluating research claims, assessing investment memos — how do you know the scores mean anything? An LLM judge might give everything an 80. Or it might rank garbage above gold. Without validation, every downstream decision built on those scores is a guess.

What We're Testing

We test 3 LLM judge models (opus, sonnet, haiku) on their ability to score text quality using a rubric with four criteria: clarity, evidence, structure, and actionability. Each model reads a document and returns a 0-100 score with sub-scores. The question: are those scores trustworthy?
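Concretely, each judge response can be thought of as an overall score plus per-criterion sub-scores. A minimal sketch of that shape (class and field names are illustrative, not the system's actual schema):

```python
from dataclasses import dataclass, field

CRITERIA = ("clarity", "evidence", "structure", "actionability")

@dataclass
class JudgeScore:
    """Hypothetical shape of one judge response (names are illustrative)."""
    overall: float                                   # 0-100 rubric score
    sub_scores: dict = field(default_factory=dict)   # criterion -> 0-100

    def validate(self) -> "JudgeScore":
        assert 0 <= self.overall <= 100, "overall score out of range"
        for name, value in self.sub_scores.items():
            assert name in CRITERIA, f"unknown criterion: {name}"
            assert 0 <= value <= 100, f"{name} sub-score out of range"
        return self

score = JudgeScore(78, {"clarity": 80, "evidence": 70}).validate()
```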

The Corpus: Known Quality Tiers

We hand-crafted 5 text samples at three quality levels:

| Tier | What It Looks Like | Example |
|---|---|---|
| Good | Specific data, concrete recommendations, cites sources | "The 95th percentile latency drops from 12.3s to 4.1s... use the existing CircuitBreaker class" |
| Mediocre | Plausible but vague: reads well, says nothing | "Revenue has been growing and the market appears favorable... the outlook is positive" |
| Poor | Superlatives, no evidence, unfalsifiable claims | "100x faster than all competitors... will revolutionize the entire industry" |

A good judge should score Good > Mediocre > Poor — consistently, across models. And when we deliberately degrade text (remove evidence, inject errors, add noise), scores should drop. That's the test.
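The degrade-and-rescore loop can be sketched as follows, with a toy specificity heuristic standing in for the LLM judge and a deterministic stand-in for the remove_evidence perturbation (the real system calls a model; everything here is illustrative):

```python
import re

def toy_judge(text: str) -> float:
    """Stand-in for an LLM judge: rewards numeric specificity. Illustrative only."""
    numbers = len(re.findall(r"\d+(?:\.\d+)?", text))
    words = max(len(text.split()), 1)
    return min(100.0, 100.0 * numbers / words)

def remove_evidence(text: str) -> str:
    """Deterministic stand-in for the remove_evidence perturbation:
    drops every other line that contains a number (~50% of numeric lines)."""
    kept, seen = [], 0
    for line in text.splitlines():
        if re.search(r"\d", line):
            seen += 1
            if seen % 2 == 1:      # drop odd-numbered numeric lines
                continue
        kept.append(line)
    return "\n".join(kept)

original = ("p95 latency drops from 12.3s to 4.1s\n"
            "throughput rose 40%\n"
            "use the existing CircuitBreaker class")
drop = toy_judge(original) - toy_judge(remove_evidence(original))
# A trustworthy judge scores the degraded text lower, so drop > 0.
```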

Can You Trust These Scores?
Yes, with caveats.

This report validates the LLM judges that score everything else in the system. We degrade known-good text in 7 systematic ways and check whether judges notice. Monotonicity effect sizes are small (Cohen's d < 0.5 across the board): the judges detect degradation directionally, but not strongly enough to clear the formal d > 0.5 threshold with only n=5 samples.

The system that checks its own checkers.
Models Tested: 3 (opus, sonnet, haiku)
Corpus Items: 5 (hand-labeled good / mediocre / poor)
Perturbation Types: 7 (systematic degradation axes)
Cross-Model Agreement: 0.97 mean Kendall tau (1.0 = perfect)

How We Test Judges

The core principle: A trustworthy judge must reliably score degraded text lower than the original. If removing all the evidence from a document doesn't change its score, the judge isn't reading the evidence.

Three Axes of Judge Quality

| Axis | Question | Metric | Pass Criterion |
|---|---|---|---|
| Monotonicity | Does degrading input lower scores? | Cohen's d effect size | d > 0.5 and mean_drop > 0 |
| Discrimination | Does the judge spread scores across the range? | Cluster percentage (densest 20-point band) | cluster_pct < 60% |
| Cross-Model | Do different models agree on ranking? | Kendall tau rank correlation | τ > 0.6 (moderate+ agreement) |

Seven Degradation Types

PerturbationTargetsWhat It Does
remove_evidenceEvidenceStrip ~50% of numeric lines, replace backtick code with [removed]
add_fluffClarityInsert irrelevant filler sentences (~30% per line)
vague_ifyEvidenceReplace numbers → "several", percentages → "some percentage"
inject_errorsAccuracyRandomly scale numbers by 0.1x, 0.5x, 2x, or 10x
scramble_orderStructureShuffle paragraphs or lines randomly
duplicate_contentClarityRepeat ~25% of non-empty lines
strip_actionabilityActionabilityRemove imperative sentences (Use/Run/Always/Never/...)
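As a concrete example, vague_ify can be sketched with two regex substitutions (the regexes here are illustrative, not the exact ones the harness uses):

```python
import re

def vague_ify(text: str) -> str:
    """Sketch of the vague_ify perturbation: percentages become 'some percentage',
    remaining numbers become 'several'. Illustrative regexes only."""
    text = re.sub(r"\d+(?:\.\d+)?\s*%", "some percentage", text)
    text = re.sub(r"\d+(?:\.\d+)?", "several", text)
    return text

out = vague_ify("Latency fell 12.3%, from 950ms to 410ms.")
# -> "Latency fell some percentage, from severalms to severalms."
```

The degraded sentence still reads fluently, which is exactly why this is the hardest perturbation for a judge to catch.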

Score Matrix — How Each Model Rates the Corpus

What to look for: Good judges should give high scores to "good" items, low scores to "poor" items, and mediocre items should land in between. The gap between tiers shows discrimination power.
| Corpus Item | Quality | opus | sonnet | haiku |
|---|---|---|---|---|
| good_code_review | good | 93 | 96 | 88 |
| good_technical | good | 62 | 72 | 72 |
| mediocre_analysis | mediocre | 15 | 22 | 28 |
| poor_academic | poor | 8 | 8 | 18 |
| poor_claims | poor | 8 | 8 | 15 |
| Mean | | 37.2 | 41.2 | 44.2 |

opus: quality ordering correct (good=78 > mediocre=15 > poor=8)
sonnet: quality ordering correct (good=84 > mediocre=22 > poor=8)
haiku: quality ordering correct (good=80 > mediocre=28 > poor=16)

Monotonicity — Does Degradation Lower Scores?

The key test: We apply each perturbation to every corpus item and re-score. A PASS requires mean_drop > 0 and Cohen's d > 0.5 (medium+ effect size). Higher d = the judge is more sensitive to that type of degradation.
| Perturbation | opus (pass / d / % correct) | sonnet (pass / d / % correct) | haiku (pass / d / % correct) |
|---|---|---|---|
| remove_evidence | FAIL / 0.11 / 40% | FAIL / 0.13 / 20% | FAIL / 0.12 / 40% |
| add_fluff | FAIL / 0.49 / 100% | FAIL / 0.37 / 100% | FAIL / 0.40 / 80% |
| vague_ify | FAIL / 0.36 / 40% | FAIL / 0.25 / 20% | FAIL / 0.19 / 40% |
| inject_errors | FAIL / 0.10 / 40% | FAIL / 0.12 / 40% | FAIL / 0.08 / 40% |
| scramble_order | FAIL / 0.06 / 60% | FAIL / 0.11 / 80% | FAIL / 0.02 / 20% |
| duplicate_content | FAIL / 0.14 / 100% | FAIL / 0.12 / 80% | FAIL / 0.11 / 60% |
| strip_actionability | FAIL / 0.04 / 20% | FAIL / 0.04 / 20% | FAIL / 0.04 / 20% |
| Total Pass | 0/7 | 0/7 | 0/7 |
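The report does not show the formula behind its d values; the numbers are consistent with the paired-samples form (mean per-item drop divided by the standard deviation of those drops), sketched here with hypothetical scores:

```python
from statistics import mean, stdev

def cohens_d_paired(original_scores, degraded_scores):
    """Paired-samples Cohen's d: mean of per-item drops / std of those drops.
    Assumption: this is the form the report's harness uses."""
    drops = [o - g for o, g in zip(original_scores, degraded_scores)]
    return mean(drops) / stdev(drops)

# Hypothetical scores: one item reacts strongly, one drifts up, three barely
# move. The mean drop is 5.8 points, yet d stays under the 0.5 threshold.
d = cohens_d_paired([93, 62, 15, 8, 8], [60, 67, 14, 8, 8])
```

This is why a sizeable mean drop can still fail the d > 0.5 criterion at n=5: a single high-variance item inflates the denominator.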

Mean Score Drop by Perturbation

(Points; negative = degraded text scored lower than the original.)

| Perturbation | opus | sonnet | haiku |
|---|---|---|---|
| remove_evidence | -4.0 | -4.8 | -3.8 |
| add_fluff | -15.0 | -12.8 | -11.6 |
| vague_ify | -11.4 | -8.8 | -5.8 |
| inject_errors | -3.6 | -4.4 | -2.6 |
| scramble_order | -2.4 | -4.2 | -0.6 |
| duplicate_content | -5.0 | -4.6 | -3.8 |
| strip_actionability | -1.6 | -1.6 | -1.2 |

Discrimination — Do Judges Use the Full Range?

The clustering problem: A judge that scores everything 70-90 is sorting noise, not signal. We measure what percentage of scores fall in the densest 20-point band. Good judges spread scores across the range; bad judges cluster.
| Model | Cluster % | Verdict | Range | Std | IQR |
|---|---|---|---|---|---|
| opus | 60% | Clustered | 85 | 38.5 | 54 |
| sonnet | 60% | Clustered | 88 | 40.4 | 64 |
| haiku | 60% | Clustered | 73 | 33.5 | 54 |
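The cluster metric can be reimplemented as the share of scores inside the densest 20-point sliding window (anchoring the window's low edge at each observed score is sufficient to find the densest band):

```python
def cluster_pct(scores, band=20):
    """Percent of scores inside the densest `band`-point window."""
    best = max(sum(lo <= s <= lo + band for s in scores) for lo in scores)
    return 100.0 * best / len(scores)

# opus corpus scores: the 8, 8, 15 cluster puts 3 of 5 scores (60%)
# inside one 20-point band, matching the figure above.
pct = cluster_pct([93, 62, 15, 8, 8])
```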

Score Distribution Histograms

[Per-model histograms of the 0-100 score distribution (opus, sonnet, haiku) appeared here; the chart data did not survive the text export.]

Cross-Model Agreement — Do Models Agree on Rankings?

Why this matters: If two models rank items the same way (high tau), the rubric is unambiguous. If they disagree (low tau), the rubric has room for interpretation — different models read the criteria differently. Cross-model agreement tests rubric quality, not just judge quality.
| Model Pair | Kendall τ | p-value | Interpretation |
|---|---|---|---|
| opus vs sonnet | 1.000 | 0.0192 (significant) | Strong agreement |
| opus vs haiku | 0.949 | 0.0230 (significant) | Strong agreement |
| sonnet vs haiku | 0.949 | 0.0230 (significant) | Strong agreement |
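The τ values can be reproduced from the score matrix with tie-corrected Kendall τ-b; a pure-Python sketch (scipy.stats.kendalltau computes the same statistic):

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Tie-corrected Kendall tau-b over two equal-length score lists."""
    c = d = tx = ty = 0            # concordant, discordant, x-only ties, y-only ties
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue           # tied in both lists: drops out of every term
            elif dx == 0:
                tx += 1
            elif dy == 0:
                ty += 1
            elif dx * dy > 0:
                c += 1
            else:
                d += 1
    return (c - d) / sqrt((c + d + tx) * (c + d + ty))

opus  = [93, 62, 15,  8,  8]   # per-item scores from the matrix above
haiku = [88, 72, 28, 18, 15]
tau = kendall_tau_b(opus, haiku)   # rounds to 0.949, matching the table
```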

Per-Item Scores Across Models

| Item | opus | sonnet | haiku | Spread |
|---|---|---|---|---|
| good_code_review | 93 | 96 | 88 | 8 |
| good_technical | 62 | 72 | 72 | 10 |
| mediocre_analysis | 15 | 22 | 28 | 13 |
| poor_academic | 8 | 8 | 18 | 10 |
| poor_claims | 8 | 8 | 15 | 7 |

Perturbation Deep-Dive

Each perturbation targets a specific quality dimension. Below we show which models detect each degradation and how strongly. The effect size (Cohen's d) tells you how much the judge cares.
remove_evidence (Evidence) · opus FAIL · sonnet FAIL · haiku FAIL
Strips ~50% of numeric lines, replaces backtick-quoted code with [removed]. A judge that cares about evidence should notice when half the data disappears.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 4.0 | 0.11 | 40% | 5 |
| sonnet | 4.8 | 0.13 | 20% | 5 |
| haiku | 3.8 | 0.12 | 40% | 5 |

add_fluff (Clarity) · opus FAIL · sonnet FAIL · haiku FAIL
Inserts irrelevant filler sentences at ~30% probability per line. Tests whether the judge penalizes noise or just measures word count.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 15.0 | 0.49 | 100% | 5 |
| sonnet | 12.8 | 0.37 | 100% | 5 |
| haiku | 11.6 | 0.40 | 80% | 5 |

vague_ify (Evidence) · opus FAIL · sonnet FAIL · haiku FAIL
Replaces specific numbers with 'several', percentages with 'some percentage'. The hardest test: the text reads fluently, but all specificity is gone.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 11.4 | 0.36 | 40% | 5 |
| sonnet | 8.8 | 0.25 | 20% | 5 |
| haiku | 5.8 | 0.19 | 40% | 5 |

inject_errors (Accuracy) · opus FAIL · sonnet FAIL · haiku FAIL
Randomly multiplies/divides numbers by 0.1x to 10x. Tests whether the judge catches factual inconsistencies.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 3.6 | 0.10 | 40% | 5 |
| sonnet | 4.4 | 0.12 | 40% | 5 |
| haiku | 2.6 | 0.08 | 40% | 5 |

scramble_order (Structure) · opus FAIL · sonnet FAIL · haiku FAIL
Shuffles paragraphs or lines randomly. Does the judge care about logical flow and coherence?

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 2.4 | 0.06 | 60% | 5 |
| sonnet | 4.2 | 0.11 | 80% | 5 |
| haiku | 0.6 | 0.02 | 20% | 5 |

duplicate_content (Clarity) · opus FAIL · sonnet FAIL · haiku FAIL
Repeats ~25% of non-empty lines. Tests whether the judge penalizes redundancy.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 5.0 | 0.14 | 100% | 5 |
| sonnet | 4.6 | 0.12 | 80% | 5 |
| haiku | 3.8 | 0.11 | 60% | 5 |

strip_actionability (Actionability) · opus FAIL · sonnet FAIL · haiku FAIL
Removes imperative sentences (Use/Run/Always/Never...). A judge scoring actionability should notice when all recommendations disappear.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 1.6 | 0.04 | 20% | 5 |
| sonnet | 1.6 | 0.04 | 20% | 5 |
| haiku | 1.2 | 0.04 | 20% | 5 |

Conclusions — Why You Should Trust These Scores

Note on sample size: With n=5 corpus items, the Cohen's d threshold is strict: even mean score drops of 10-15 points produce d < 0.5 because inter-item variance is high. At this scale, directional accuracy (the % of tests where the degraded text scored below the original) is the more informative signal. Cluster % is also less meaningful at n=5 (60% = 3 out of 5 items, which is effectively forced for any judge that separates the quality tiers).
| Rank | Model | Range | Correct Ordering | Direction (% drops correct) | Verdict |
|---|---|---|---|---|---|
| 1 | sonnet | 88 pts | YES | 7/7 (100%) | BEST JUDGE |
| 2 | opus | 85 pts | YES | 7/7 (100%) | GOOD |
| 3 | haiku | 73 pts | YES | 7/7 (100%) | GOOD |

Key takeaway: This system validates its own judges before using them. Every score in the pipeline (draft quality, claim confidence, research assessments) is produced by a judge that has been run through these calibration checks. The system that checks its own checkers is what makes automated scoring trustworthy.

Appendix: Rubric Criteria

| Criterion | Weight | Description |
|---|---|---|
| clarity | 1.0 | Clear, precise language without unnecessary jargon or filler |
| evidence | 1.0 | Claims supported by specific data, examples, or references |
| structure | 1.0 | Logical organization with coherent flow between ideas |
| actionability | 1.0 | Concrete, actionable recommendations or conclusions |