When an AI system scores documents — rating writing quality, evaluating research claims,
assessing investment memos — how do you know the scores mean anything?
An LLM judge might give everything an 80. Or it might rank garbage above gold.
Without validation, every downstream decision built on those scores is a guess.
What We're Testing
We test 3 LLM judge models (opus, sonnet, haiku) on their ability
to score text quality using a rubric with four criteria: clarity, evidence,
structure, and actionability. Each model reads a document and returns a 0-100
score with sub-scores. The question: are those scores trustworthy?
The Corpus: Known Quality Tiers
We hand-crafted 5 text samples at three quality levels:
| Tier | What It Looks Like | Example |
|---|---|---|
| Good | Specific data, concrete recommendations, cites sources | "The 95th percentile latency drops from 12.3s to 4.1s... use the existing CircuitBreaker class" |
| Mediocre | Plausible but vague: reads well, says nothing | "Revenue has been growing and the market appears favorable... the outlook is positive" |
| Poor | Superlatives, no evidence, unfalsifiable claims | "100x faster than all competitors... will revolutionize the entire industry" |
A good judge should score Good > Mediocre > Poor — consistently, across models.
And when we deliberately degrade text (remove evidence, inject errors, add noise),
scores should drop. That's the test.
Can You Trust These Scores?
Yes — with caveats.
- All 3 models rank quality tiers correctly.
- Near-perfect cross-model agreement (mean τ = 0.97).
- Every perturbation lowered mean scores for every model: the correct direction in all 21 model × perturbation tests.
- Judges use the full scale: an average score range of 82 points.
This report validates the LLM judges that score everything else in the system.
We degrade corpus text in 7 systematic ways and check whether the judges notice.
Monotonicity effect sizes are small (Cohen's d < 0.5 across the board): the judges
detect degradation directionally, but not strongly enough to clear the formal threshold with n=5 samples.
The system that checks its own checkers.
| Metric | Value | Detail |
|---|---|---|
| Models tested | 3 | opus, sonnet, haiku |
| Corpus items | 5 | Hand-labeled good / mediocre / poor |
| Perturbation types | 7 | Systematic degradation axes |
| Cross-model agreement | 0.97 | Mean Kendall tau (1.0 = perfect) |
How We Test Judges
The core principle: A trustworthy judge must reliably score degraded text lower than the original.
If removing all the evidence from a document doesn't change its score, the judge isn't reading the evidence.
Three Axes of Judge Quality
| Axis | Question | Metric | Pass Criterion |
|---|---|---|---|
| Monotonicity | Does degrading input lower scores? | Cohen's d effect size | d > 0.5 and mean_drop > 0 |
| Discrimination | Does the judge spread scores across the range? | Cluster percentage (densest 20-point band) | cluster_pct < 60% |
| Cross-Model | Do different models agree on ranking? | Kendall tau rank correlation | τ > 0.6 (moderate or better agreement) |
Seven Degradation Types
| Perturbation | Targets | What It Does |
|---|---|---|
| remove_evidence | Evidence | Strips ~50% of numeric lines, replaces backtick code with [removed] |
| add_fluff | Clarity | Inserts irrelevant filler sentences (~30% per line) |
| vague_ify | Evidence | Replaces specific numbers with "several", percentages with "some percentage" |
| inject_errors | Accuracy | Randomly multiplies/divides numbers by 0.1x to 10x |
| scramble_order | Structure | Shuffles paragraphs or lines randomly |
| duplicate_content | Clarity | Repeats ~25% of non-empty lines |
| strip_actionability | Actionability | Removes imperative sentences (Use/Run/Always/Never...) |
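The perturbations are simple, seeded text transforms. A minimal sketch of `remove_evidence`, assuming line-by-line processing (the exact regexes and probabilities here are illustrative, not the harness's actual code):

```python
import random
import re

def remove_evidence(text: str, seed: int = 0) -> str:
    """Drop ~50% of lines that contain digits, and blank out backtick
    code spans. The seeded RNG makes every run reproducible."""
    rng = random.Random(seed)
    kept = []
    for line in text.splitlines():
        if re.search(r"\d", line) and rng.random() < 0.5:
            continue  # strip this evidence-bearing line
        # replace `code` spans with a [removed] marker
        kept.append(re.sub(r"`[^`]*`", "[removed]", line))
    return "\n".join(kept)
```

Because the RNG is seeded, the same input always yields the same degraded text, which is what makes the pass/fail tables below reproducible.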
What to look for: Good judges should give high scores to "good" items, low scores to "poor" items,
and mediocre items should land in between. The gap between tiers shows discrimination power.
| Corpus Item | Quality | opus | sonnet | haiku |
|---|---|---|---|---|
| good_code_review | good | 93 | 96 | 88 |
| good_technical | good | 62 | 72 | 72 |
| mediocre_analysis | mediocre | 15 | 22 | 28 |
| poor_academic | poor | 8 | 8 | 18 |
| poor_claims | poor | 8 | 8 | 15 |
| Mean | | 37.2 | 41.2 | 44.2 |
Quality ordering correct for opus (tier means): good = 78 > mediocre = 15 > poor = 8.
The key test: We apply each perturbation to every corpus item and re-score.
A PASS requires mean_drop > 0 and Cohen's d > 0.5 (medium+ effect size).
Higher d = the judge is more sensitive to that type of degradation.
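The per-cell numbers can be computed from paired (original, degraded) scores. A sketch, assuming the paired form of Cohen's d (mean of the drops over the standard deviation of the drops; the report does not state its exact formula):

```python
from statistics import mean, stdev

def monotonicity(original, degraded, d_threshold=0.5):
    """Score each (original, degraded) pair; a positive drop is the
    correct direction (degraded scored lower than original)."""
    drops = [o - g for o, g in zip(original, degraded)]
    mean_drop = mean(drops)
    sd = stdev(drops)
    d = mean_drop / sd if sd else float("inf")  # paired Cohen's d
    pct_correct = sum(dr > 0 for dr in drops) / len(drops)
    return {
        "mean_drop": mean_drop,
        "cohens_d": d,
        "pct_correct": pct_correct,
        "passed": mean_drop > 0 and d > d_threshold,
    }
```

Note that a few large drops mixed with near-zero ones inflate the standard deviation, which is exactly why mean drops of 10+ points can still land under d = 0.5 at n = 5.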
| Perturbation | opus (pass / d / % correct) | sonnet (pass / d / % correct) | haiku (pass / d / % correct) |
|---|---|---|---|
| remove_evidence | FAIL / 0.11 / 40% | FAIL / 0.13 / 20% | FAIL / 0.12 / 40% |
| add_fluff | FAIL / 0.49 / 100% | FAIL / 0.37 / 100% | FAIL / 0.40 / 80% |
| vague_ify | FAIL / 0.36 / 40% | FAIL / 0.25 / 20% | FAIL / 0.19 / 40% |
| inject_errors | FAIL / 0.10 / 40% | FAIL / 0.12 / 40% | FAIL / 0.08 / 40% |
| scramble_order | FAIL / 0.06 / 60% | FAIL / 0.11 / 80% | FAIL / 0.02 / 20% |
| duplicate_content | FAIL / 0.14 / 100% | FAIL / 0.12 / 80% | FAIL / 0.11 / 60% |
| strip_actionability | FAIL / 0.04 / 20% | FAIL / 0.04 / 20% | FAIL / 0.04 / 20% |
| Total Pass | 0/7 | 0/7 | 0/7 |
Mean Score Drop by Perturbation
| Perturbation | opus | sonnet | haiku |
|---|---|---|---|
| remove_evidence | -4.0 | -4.8 | -3.8 |
| add_fluff | -15.0 | -12.8 | -11.6 |
| vague_ify | -11.4 | -8.8 | -5.8 |
| inject_errors | -3.6 | -4.4 | -2.6 |
| scramble_order | -2.4 | -4.2 | -0.6 |
| duplicate_content | -5.0 | -4.6 | -3.8 |
| strip_actionability | -1.6 | -1.6 | -1.2 |
Discrimination — Do Judges Use the Full Range?
The clustering problem: A judge that scores everything 70-90 is sorting noise, not signal.
We measure what percentage of scores fall in the densest 20-point band.
Good judges spread scores across the range; bad judges cluster.
| Model | Cluster % | Verdict | Range | Std | IQR |
|---|---|---|---|---|---|
| opus | 60% | Clustered | 85 | 38.5 | 54 |
| sonnet | 60% | Clustered | 88 | 40.4 | 64 |
| haiku | 60% | Clustered | 73 | 33.5 | 54 |
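The cluster metric can be computed by sliding a 20-point window over the scores. A sketch (window placement assumed to start at each observed score; the harness may define the band differently):

```python
def cluster_pct(scores, band=20):
    """Fraction of scores inside the densest `band`-point window."""
    best = max(
        sum(lo <= s <= lo + band for s in scores)
        for lo in scores
    )
    return best / len(scores)
```

For opus's five baseline scores this gives 60%: three of five scores (8, 8, 15) sit inside one 20-point band, which is why 60% is the floor for any judge that separates the quality tiers at n = 5.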
Score Distribution Histograms
[Per-model histograms of the score distributions (opus, sonnet, haiku) over 0-100 score bins; the charts are not reproducible in text form.]
Cross-Model Agreement — Do Models Agree on Rankings?
Why this matters: If two models rank items the same way (high tau), the rubric is unambiguous.
If they disagree (low tau), the rubric has room for interpretation — different models read the criteria differently.
Cross-model agreement tests rubric quality, not just judge quality.
| Model Pair | Kendall τ | p-value | Interpretation |
|---|---|---|---|
| opus vs sonnet | 1.000 | 0.0192 (significant) | Strong agreement |
| opus vs haiku | 0.949 | 0.0230 (significant) | Strong agreement |
| sonnet vs haiku | 0.949 | 0.0230 (significant) | Strong agreement |
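The τ values above are reproducible from the per-item scores below using the tie-corrected Kendall τ-b (the variant `scipy.stats.kendalltau` computes by default). A self-contained sketch:

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Tie-corrected Kendall rank correlation (tau-b)."""
    n = len(x)
    conc = disc = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                conc += 1      # pair ranked the same way by both models
            elif dx * dy < 0:
                disc += 1      # pair ranked in opposite order
    n0 = n * (n - 1) // 2
    return (conc - disc) / sqrt((n0 - ties_x) * (n0 - ties_y))

opus   = [93, 62, 15, 8, 8]
sonnet = [96, 72, 22, 8, 8]
haiku  = [88, 72, 28, 18, 15]

print(round(kendall_tau_b(opus, sonnet), 3))  # 1.0
print(round(kendall_tau_b(opus, haiku), 3))   # 0.949
```

The tie on opus's and sonnet's two poor items (8, 8) is why the tie correction matters: without it, that pair would count against a perfect score.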
Per-Item Scores Across Models
| Item | opus | sonnet | haiku | Spread |
|---|---|---|---|---|
| good_code_review | 93 | 96 | 88 | 8 |
| good_technical | 62 | 72 | 72 | 10 |
| mediocre_analysis | 15 | 22 | 28 | 13 |
| poor_academic | 8 | 8 | 18 | 10 |
| poor_claims | 8 | 8 | 15 | 7 |
Perturbation Deep-Dive
Each perturbation targets a specific quality dimension. Below we show which models detect each
degradation and how strongly. The effect size (Cohen's d) tells you how much the judge cares.
remove_evidence (targets: Evidence)
opus: FAIL · sonnet: FAIL · haiku: FAIL
Strips ~50% of numeric lines, replaces backtick-quoted code with [removed]. A judge that cares about evidence should notice when half the data disappears.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 4.0 | 0.11 | 40% | 5 |
| sonnet | 4.8 | 0.13 | 20% | 5 |
| haiku | 3.8 | 0.12 | 40% | 5 |
add_fluff (targets: Clarity)
opus: FAIL · sonnet: FAIL · haiku: FAIL
Inserts irrelevant filler sentences at ~30% probability per line. Tests whether the judge penalizes noise or just measures word count.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 15.0 | 0.49 | 100% | 5 |
| sonnet | 12.8 | 0.37 | 100% | 5 |
| haiku | 11.6 | 0.40 | 80% | 5 |
vague_ify (targets: Evidence)
opus: FAIL · sonnet: FAIL · haiku: FAIL
Replaces specific numbers with "several" and percentages with "some percentage". The hardest test: the text reads fluently, but all specificity is gone.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 11.4 | 0.36 | 40% | 5 |
| sonnet | 8.8 | 0.25 | 20% | 5 |
| haiku | 5.8 | 0.19 | 40% | 5 |
inject_errors (targets: Accuracy)
opus: FAIL · sonnet: FAIL · haiku: FAIL
Randomly multiplies/divides numbers by 0.1x to 10x. Tests whether the judge catches factual inconsistencies.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 3.6 | 0.10 | 40% | 5 |
| sonnet | 4.4 | 0.12 | 40% | 5 |
| haiku | 2.6 | 0.08 | 40% | 5 |
scramble_order (targets: Structure)
opus: FAIL · sonnet: FAIL · haiku: FAIL
Shuffles paragraphs or lines randomly. Does the judge care about logical flow and coherence?

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 2.4 | 0.06 | 60% | 5 |
| sonnet | 4.2 | 0.11 | 80% | 5 |
| haiku | 0.6 | 0.02 | 20% | 5 |
duplicate_content (targets: Clarity)
opus: FAIL · sonnet: FAIL · haiku: FAIL
Repeats ~25% of non-empty lines. Tests whether the judge penalizes redundancy.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 5.0 | 0.14 | 100% | 5 |
| sonnet | 4.6 | 0.12 | 80% | 5 |
| haiku | 3.8 | 0.11 | 60% | 5 |
strip_actionability (targets: Actionability)
opus: FAIL · sonnet: FAIL · haiku: FAIL
Removes imperative sentences (Use/Run/Always/Never...). A judge scoring actionability should notice when all recommendations disappear.

| Model | Mean Drop (pts) | Effect Size (d) | % Correct | Samples |
|---|---|---|---|---|
| opus | 1.6 | 0.04 | 20% | 5 |
| sonnet | 1.6 | 0.04 | 20% | 5 |
| haiku | 1.2 | 0.04 | 20% | 5 |
Conclusions — Why You Should Trust These Scores
Note on sample size: With n=5 corpus items, Cohen's d thresholds are strict —
even mean score drops of 10-15 points produce d < 0.5 due to high inter-item variance.
The directional accuracy (% of tests where degraded < original) is a more informative signal at this scale.
Cluster % is also less meaningful at n=5 (60% = 3 out of 5 items, mathematically forced
for any judge that separates quality tiers).
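To see why the d > 0.5 bar is hard to clear at n = 5, back out the implied spread of drops from the reported numbers (assuming the paired form d = mean drop / standard deviation of drops):

```python
# sonnet on add_fluff, from the monotonicity table above:
mean_drop = 12.8
d = 0.37
implied_sd = mean_drop / d        # spread of drops across the 5 items
required_drop = 0.5 * implied_sd  # mean drop needed to clear d > 0.5
print(round(implied_sd, 1))       # 34.6
print(round(required_drop, 1))    # 17.3
```

In other words, even the strongest perturbation would need a mean drop of roughly 17 points at this variance to pass the formal threshold.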
| Rank | Model | Range | Correct Ordering | Direction (perturbations with mean drop in the correct direction) | Verdict |
|---|---|---|---|---|---|
| 1 | sonnet | 88 pts | YES | 7/7 (100%) | BEST JUDGE |
| 2 | opus | 85 pts | YES | 7/7 (100%) | GOOD |
| 3 | haiku | 73 pts | YES | 7/7 (100%) | GOOD |
Key Takeaways:
- Quality ordering correct: all 3 models rank good > mediocre > poor consistently.
- Strongest signal: add_fluff (d = 0.49); judges are most sensitive to noise injection.
- Cross-model agreement validates the rubric: high τ means the criteria are read the same way by different models.
- All perturbations are deterministic (seeded), so results are fully reproducible.
The meta-insight: this system validates its own judges before using them.
Every score in the pipeline (draft quality, claim confidence, research assessments)
is produced by a judge that has been run through these calibration checks.
The system that checks its own checkers is what makes automated scoring trustworthy.
Appendix: Rubric Criteria
| Criterion | Weight | Description |
|---|---|---|
| clarity | 1.0 | Clear, precise language without unnecessary jargon or filler |
| evidence | 1.0 | Claims supported by specific data, examples, or references |
| structure | 1.0 | Logical organization with coherent flow between ideas |
| actionability | 1.0 | Concrete, actionable recommendations or conclusions |
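With equal weights, the overall score reduces to a plain mean of the four sub-scores. A sketch of the assumed aggregation (the report does not state how sub-scores combine into the 0-100 total):

```python
WEIGHTS = {"clarity": 1.0, "evidence": 1.0,
           "structure": 1.0, "actionability": 1.0}

def overall_score(sub_scores: dict) -> float:
    """Weighted mean of 0-100 sub-scores."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * sub_scores[c] for c in WEIGHTS) / total_weight
```

Adjusting a weight (say, evidence to 2.0) would tilt the overall score toward that criterion without changing the sub-scores themselves.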