Judge Calibration Report

Quis custodiet ipsos custodes?
Who will judge the judges? — Juvenal, Satires VI

The Problem

When an AI system scores documents — rating writing quality, evaluating research claims, assessing investment memos — how do you know the scores mean anything? An LLM judge might give everything an 80. Or it might rank garbage above gold. Without validation, every downstream decision built on those scores is a guess.

What We're Testing

We test 3 LLM judge models (opus, sonnet, haiku) on their ability to score text quality using a rubric with four criteria: clarity, evidence, structure, and actionability. Each model reads a document and returns a 0-100 score with sub-scores. The question: are those scores trustworthy?
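Concretely, each judge response can be thought of as an overall score plus per-criterion sub-scores. A minimal sketch of that shape (class and field names are illustrative, not the system's actual schema):

```python
from dataclasses import dataclass, field

CRITERIA = ("clarity", "evidence", "structure", "actionability")

@dataclass
class JudgeScore:
    """Hypothetical shape of one judge response (names are illustrative)."""
    overall: float                                   # 0-100 rubric score
    sub_scores: dict = field(default_factory=dict)   # criterion -> 0-100

    def validate(self) -> "JudgeScore":
        assert 0 <= self.overall <= 100, "overall score out of range"
        for name, value in self.sub_scores.items():
            assert name in CRITERIA, f"unknown criterion: {name}"
            assert 0 <= value <= 100, f"{name} sub-score out of range"
        return self

score = JudgeScore(78, {"clarity": 80, "evidence": 70}).validate()
```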

The Corpus: Known Quality Tiers

We hand-crafted 5 text samples at three quality levels:

| Tier | What It Looks Like | Example |
|---|---|---|
| Good | Specific data, concrete recommendations, cites sources | "The 95th percentile latency drops from 12.3s to 4.1s... use the existing CircuitBreaker class" |
| Mediocre | Plausible but vague: reads well, says nothing | "Revenue has been growing and the market appears favorable... the outlook is positive" |
| Poor | Superlatives, no evidence, unfalsifiable claims | "100x faster than all competitors... will revolutionize the entire industry" |

A good judge should score Good > Mediocre > Poor — consistently, across models. And when we deliberately degrade text (remove evidence, inject errors, add noise), scores should drop. That's the test.
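The degrade-and-rescore loop can be sketched as follows, with a toy specificity heuristic standing in for the LLM judge and a deterministic stand-in for the remove_evidence perturbation (the real system calls a model; everything here is illustrative):

```python
import re

def toy_judge(text: str) -> float:
    """Stand-in for an LLM judge: rewards numeric specificity. Illustrative only."""
    numbers = len(re.findall(r"\d+(?:\.\d+)?", text))
    words = max(len(text.split()), 1)
    return min(100.0, 100.0 * numbers / words)

def remove_evidence(text: str) -> str:
    """Deterministic stand-in for the remove_evidence perturbation:
    drops every other line that contains a number (~50% of numeric lines)."""
    kept, seen = [], 0
    for line in text.splitlines():
        if re.search(r"\d", line):
            seen += 1
            if seen % 2 == 1:      # drop odd-numbered numeric lines
                continue
        kept.append(line)
    return "\n".join(kept)

original = ("p95 latency drops from 12.3s to 4.1s\n"
            "throughput rose 40%\n"
            "use the existing CircuitBreaker class")
drop = toy_judge(original) - toy_judge(remove_evidence(original))
# A trustworthy judge scores the degraded text lower, so drop > 0.
```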

Can You Trust These Scores?
Yes, with caveats.

This report validates the LLM judges that score everything else in the system. We degrade known-good text in 7 systematic ways and check whether judges notice. Monotonicity effect sizes are small (Cohen's d < 0.5 across the board): the judges detect degradation directionally, but not strongly enough to clear the formal d > 0.5 threshold with only n=5 samples.

The system that checks its own checkers.
Models Tested: 3 (opus, sonnet, haiku)
Corpus Items: 5 (hand-labeled good / mediocre / poor)
Perturbation Types: 7 (systematic degradation axes)
Cross-Model Agreement: 0.97 mean Kendall tau (1.0 = perfect)

How We Test Judges

The core principle: A trustworthy judge must reliably score degraded text lower than the original. If removing all the evidence from a document doesn't change its score, the judge isn't reading the evidence.

Three Axes of Judge Quality

| Axis | Question | Metric | Pass Criterion |
|---|---|---|---|
| Monotonicity | Does degrading input lower scores? | Cohen's d effect size | d > 0.5 and mean_drop > 0 |
| Discrimination | Does the judge spread scores across the range? | Cluster percentage (densest 20-point band) | cluster_pct < 60% |
| Cross-Model | Do different models agree on ranking? | Kendall tau rank correlation | τ > 0.6 (moderate+ agreement) |

Seven Degradation Types

PerturbationTargetsWhat It Does
remove_evidenceEvidenceStrip ~50% of numeric lines, replace backtick code with [removed]
add_fluffClarityInsert irrelevant filler sentences (~30% per line)
vague_ifyEvidenceReplace numbers → "several", percentages → "some percentage"
inject_errorsAccuracyRandomly scale numbers by 0.1x, 0.5x, 2x, or 10x
scramble_orderStructureShuffle paragraphs or lines randomly
duplicate_contentClarityRepeat ~25% of non-empty lines
strip_actionabilityActionabilityRemove imperative sentences (Use/Run/Always/Never/...)
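As a concrete example, vague_ify can be sketched with two regex substitutions (the regexes here are illustrative, not the exact ones the harness uses):

```python
import re

def vague_ify(text: str) -> str:
    """Sketch of the vague_ify perturbation: percentages become 'some percentage',
    remaining numbers become 'several'. Illustrative regexes only."""
    text = re.sub(r"\d+(?:\.\d+)?\s*%", "some percentage", text)
    text = re.sub(r"\d+(?:\.\d+)?", "several", text)
    return text

out = vague_ify("Latency fell 12.3%, from 950ms to 410ms.")
# -> "Latency fell some percentage, from severalms to severalms."
```

The degraded sentence still reads fluently, which is exactly why this is the hardest perturbation for a judge to catch.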

Score Matrix — How Each Model Rates the Corpus

What to look for: Good judges should give high scores to "good" items, low scores to "poor" items, and mediocre items should land in between. The gap between tiers shows discrimination power.
| Corpus Item | Quality | opus | sonnet | haiku |
|---|---|---|---|---|
| good_code_review | good | 93 | 96 | 88 |
| good_technical | good | 62 | 72 | 72 |
| mediocre_analysis | mediocre | 15 | 22 | 28 |
| poor_academic | poor | 8 | 8 | 18 |
| poor_claims | poor | 8 | 8 | 15 |
| Mean | | 37.2 | 41.2 | 44.2 |

opus: quality ordering correct (good=78 > mediocre=15 > poor=8)
sonnet: quality ordering correct (good=84 > mediocre=22 > poor=8)
haiku: quality ordering correct (good=80 > mediocre=28 > poor=16)

Monotonicity — Does Degradation Lower Scores?

The key test: We apply each perturbation to every corpus item and re-score. A PASS requires mean_drop > 0 and Cohen's d > 0.5 (medium+ effect size). Higher d = the judge is more sensitive to that type of degradation.
| Perturbation | opus (pass / d / % correct) | sonnet (pass / d / % correct) | haiku (pass / d / % correct) |
|---|---|---|---|
| remove_evidence | FAIL / 0.11 / 40% | FAIL / 0.13 / 20% | FAIL / 0.12 / 40% |
| add_fluff | FAIL / 0.49 / 100% | FAIL / 0.37 / 100% | FAIL / 0.40 / 80% |
| vague_ify | FAIL / 0.36 / 40% | FAIL / 0.25 / 20% | FAIL / 0.19 / 40% |
| inject_errors | FAIL / 0.10 / 40% | FAIL / 0.12 / 40% | FAIL / 0.08 / 40% |
| scramble_order | FAIL / 0.06 / 60% | FAIL / 0.11 / 80% | FAIL / 0.02 / 20% |
| duplicate_content | FAIL / 0.14 / 100% | FAIL / 0.12 / 80% | FAIL / 0.11 / 60% |
| strip_actionability | FAIL / 0.04 / 20% | FAIL / 0.04 / 20% | FAIL / 0.04 / 20% |
| Total Pass | 0/7 | 0/7 | 0/7 |
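The report does not show the formula behind its d values; the numbers are consistent with the paired-samples form (mean per-item drop divided by the standard deviation of those drops), sketched here with hypothetical scores:

```python
from statistics import mean, stdev

def cohens_d_paired(original_scores, degraded_scores):
    """Paired-samples Cohen's d: mean of per-item drops / std of those drops.
    Assumption: this is the form the report's harness uses."""
    drops = [o - g for o, g in zip(original_scores, degraded_scores)]
    return mean(drops) / stdev(drops)

# Hypothetical scores: one item reacts strongly, one drifts up, three barely
# move. The mean drop is 5.8 points, yet d stays under the 0.5 threshold.
d = cohens_d_paired([93, 62, 15, 8, 8], [60, 67, 14, 8, 8])
```

This is why a sizeable mean drop can still fail the d > 0.5 criterion at n=5: a single high-variance item inflates the denominator.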

Mean Score Drop by Perturbation

(Points; negative = degraded text scored lower than the original.)

| Perturbation | opus | sonnet | haiku |
|---|---|---|---|
| remove_evidence | -4.0 | -4.8 | -3.8 |
| add_fluff | -15.0 | -12.8 | -11.6 |
| vague_ify | -11.4 | -8.8 | -5.8 |
| inject_errors | -3.6 | -4.4 | -2.6 |
| scramble_order | -2.4 | -4.2 | -0.6 |
| duplicate_content | -5.0 | -4.6 | -3.8 |
| strip_actionability | -1.6 | -1.6 | -1.2 |

Discrimination — Do Judges Use the Full Range?

The clustering problem: A judge that scores everything 70-90 is sorting noise, not signal. We measure what percentage of scores fall in the densest 20-point band. Good judges spread scores across the range; bad judges cluster.
| Model | Cluster % | Verdict | Range | Std | IQR |
|---|---|---|---|---|---|
| opus | 60% | Clustered | 85 | 38.5 | 54 |
| sonnet | 60% | Clustered | 88 | 40.4 | 64 |
| haiku | 60% | Clustered | 73 | 33.5 | 54 |
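The cluster metric can be reimplemented as the share of scores inside the densest 20-point sliding window (anchoring the window's low edge at each observed score is sufficient to find the densest band):

```python
def cluster_pct(scores, band=20):
    """Percent of scores inside the densest `band`-point window."""
    best = max(sum(lo <= s <= lo + band for s in scores) for lo in scores)
    return 100.0 * best / len(scores)

# opus corpus scores: the 8, 8, 15 cluster puts 3 of 5 scores (60%)
# inside one 20-point band, matching the figure above.
pct = cluster_pct([93, 62, 15, 8, 8])
```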

Score Distribution Histograms

[Per-model histograms of the 0-100 score distribution (opus, sonnet, haiku) appeared here; the chart data did not survive the text export.]

Cross-Model Agreement — Do Models Agree on Rankings?

Why this matters: If two models rank items the same way (high tau), the rubric is unambiguous. If they disagree (low tau), the rubric has room for interpretation — different models read the criteria differently. Cross-model agreement tests rubric quality, not just judge quality.
| Model Pair | Kendall τ | p-value | Interpretation |
|---|---|---|---|
| opus vs sonnet | 1.000 | 0.0192 (significant) | Strong agreement |
| opus vs haiku | 0.949 | 0.0230 (significant) | Strong agreement |
| sonnet vs haiku | 0.949 | 0.0230 (significant) | Strong agreement |
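The τ values can be reproduced from the score matrix with tie-corrected Kendall τ-b; a pure-Python sketch (scipy.stats.kendalltau computes the same statistic):

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Tie-corrected Kendall tau-b over two equal-length score lists."""
    c = d = tx = ty = 0            # concordant, discordant, x-only ties, y-only ties
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue           # tied in both lists: drops out of every term
            elif dx == 0:
                tx += 1
            elif dy == 0:
                ty += 1
            elif dx * dy > 0:
                c += 1
            else:
                d += 1
    return (c - d) / sqrt((c + d + tx) * (c + d + ty))

opus  = [93, 62, 15,  8,  8]   # per-item scores from the matrix above
haiku = [88, 72, 28, 18, 15]
tau = kendall_tau_b(opus, haiku)   # rounds to 0.949, matching the table
```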

Per-Item Scores Across Models

| Item | opus | sonnet | haiku | Spread |
|---|---|---|---|---|
| good_code_review | 93 | 96 | 88 | 8 |
| good_technical | 62 | 72 | 72 | 10 |
| mediocre_analysis | 15 | 22 | 28 | 13 |
| poor_academic | 8 | 8 | 18 | 10 |
| poor_claims | 8 | 8 | 15 | 7 |

Perturbation Deep-Dive

Each perturbation targets a specific quality dimension. Below we show which models detect each degradation and how strongly. The effect size (Cohen's d) tells you how much the judge cares.
remove_evidence (Evidence) · opus FAIL · sonnet FAIL · haiku FAIL
Strips ~50% of numeric lines, replaces backtick-quoted code with [removed]. A judge that cares about evidence should notice when half the data disappears.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 4.0 | 0.11 | 40% | 5 |
| sonnet | 4.8 | 0.13 | 20% | 5 |
| haiku | 3.8 | 0.12 | 40% | 5 |

add_fluff (Clarity) · opus FAIL · sonnet FAIL · haiku FAIL
Inserts irrelevant filler sentences at ~30% probability per line. Tests whether the judge penalizes noise or just measures word count.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 15.0 | 0.49 | 100% | 5 |
| sonnet | 12.8 | 0.37 | 100% | 5 |
| haiku | 11.6 | 0.40 | 80% | 5 |

vague_ify (Evidence) · opus FAIL · sonnet FAIL · haiku FAIL
Replaces specific numbers with 'several', percentages with 'some percentage'. The hardest test: the text reads fluently, but all specificity is gone.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 11.4 | 0.36 | 40% | 5 |
| sonnet | 8.8 | 0.25 | 20% | 5 |
| haiku | 5.8 | 0.19 | 40% | 5 |

inject_errors (Accuracy) · opus FAIL · sonnet FAIL · haiku FAIL
Randomly multiplies/divides numbers by 0.1x to 10x. Tests whether the judge catches factual inconsistencies.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 3.6 | 0.10 | 40% | 5 |
| sonnet | 4.4 | 0.12 | 40% | 5 |
| haiku | 2.6 | 0.08 | 40% | 5 |

scramble_order (Structure) · opus FAIL · sonnet FAIL · haiku FAIL
Shuffles paragraphs or lines randomly. Does the judge care about logical flow and coherence?

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 2.4 | 0.06 | 60% | 5 |
| sonnet | 4.2 | 0.11 | 80% | 5 |
| haiku | 0.6 | 0.02 | 20% | 5 |

duplicate_content (Clarity) · opus FAIL · sonnet FAIL · haiku FAIL
Repeats ~25% of non-empty lines. Tests whether the judge penalizes redundancy.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 5.0 | 0.14 | 100% | 5 |
| sonnet | 4.6 | 0.12 | 80% | 5 |
| haiku | 3.8 | 0.11 | 60% | 5 |

strip_actionability (Actionability) · opus FAIL · sonnet FAIL · haiku FAIL
Removes imperative sentences (Use/Run/Always/Never...). A judge scoring actionability should notice when all recommendations disappear.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 1.6 | 0.04 | 20% | 5 |
| sonnet | 1.6 | 0.04 | 20% | 5 |
| haiku | 1.2 | 0.04 | 20% | 5 |

Conclusions — Why You Should Trust These Scores

Note on sample size: With n=5 corpus items, the Cohen's d threshold is strict: even mean score drops of 10-15 points produce d < 0.5 because inter-item variance is high. At this scale, directional accuracy (the % of tests where the degraded text scored below the original) is the more informative signal. Cluster % is also less meaningful at n=5 (60% = 3 out of 5 items, which is effectively forced for any judge that separates the quality tiers).
| Rank | Model | Range | Correct Ordering | Direction (% drops correct) | Verdict |
|---|---|---|---|---|---|
| 1 | sonnet | 88 pts | YES | 7/7 (100%) | BEST JUDGE |
| 2 | opus | 85 pts | YES | 7/7 (100%) | GOOD |
| 3 | haiku | 73 pts | YES | 7/7 (100%) | GOOD |

Key takeaway: This system validates its own judges before using them. Every score in the pipeline (draft quality, claim confidence, research assessments) is produced by a judge that has been run through these calibration checks. The system that checks its own checkers is what makes automated scoring trustworthy.

Appendix: Rubric Criteria

| Criterion | Weight | Description |
|---|---|---|
| clarity | 1.0 | Clear, precise language without unnecessary jargon or filler |
| evidence | 1.0 | Claims supported by specific data, examples, or references |
| structure | 1.0 | Logical organization with coherent flow between ideas |
| actionability | 1.0 | Concrete, actionable recommendations or conclusions |