FLAWS Failure Analysis: Sonnet 4.6

Finding fLAWS in Scientific Papers — Status Report — 2026-03-24

Benchmark: FLAWS

What it measures: Whether an LLM can identify and localize claim-invalidating errors that have been deliberately inserted into peer-reviewed scientific papers. Tests deep scientific reading comprehension and error detection.

Source: Xasayi et al., 2025. arXiv:2511.21843 | GitHub | HuggingFace

Dataset: 713 paper-error pairs (448 Gemini-inserted + 265 GPT-inserted) from real ICML 2025 papers. Each paper has expert-crafted conceptual errors with ground-truth error locations.

Scoring: Top-k identification accuracy (k=1, k=3, k=10) using OR of word-level Levenshtein similarity (threshold 0.5) and LLM-as-judge evaluation. Model outputs up to k error candidate excerpts.
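The scoring rule can be sketched as follows. This is a minimal reconstruction from the description above, not the benchmark's actual code: the official harness additionally does subspan matching, and the judge here is an injectable callable rather than a real LLM call.

```python
from typing import Callable, Sequence

def word_levenshtein_sim(a: str, b: str) -> float:
    """Word-level Levenshtein similarity in [0, 1]:
    1 - edit_distance(words_a, words_b) / max(len(words_a), len(words_b))."""
    x, y = a.split(), b.split()
    if not x and not y:
        return 1.0
    prev = list(range(len(y) + 1))          # DP row for the empty prefix of x
    for i, wx in enumerate(x, 1):
        cur = [i]
        for j, wy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # delete wx
                           cur[j - 1] + 1,              # insert wy
                           prev[j - 1] + (wx != wy)))   # substitute
        prev = cur
    return 1.0 - prev[-1] / max(len(x), len(y))

def solved_at_k(candidates: Sequence[str],
                ground_truths: Sequence[str],
                k: int,
                judge: Callable[[str, str], bool] = lambda c, g: False,
                threshold: float = 0.5) -> bool:
    """True if any of the first k candidate excerpts matches any
    ground-truth span by Levenshtein similarity >= threshold OR by
    the LLM-as-judge callable."""
    return any(
        word_levenshtein_sim(c, g) >= threshold or judge(c, g)
        for c in candidates[:k] for g in ground_truths
    )
```

A paper counts as solved at k if either matching method fires for any of the first k candidates; this OR structure is why the combined accuracies below dominate both individual methods.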

| Model                  | k=1  | k=3   | k=10  |
|------------------------|------|-------|-------|
| GPT-5                  | 9.0% | 19.2% | 39.1% |
| DeepSeek Reasoner v3.1 | 5.9% | 16.3% | 35.2% |
| Claude Sonnet 4.5      | 5.2% | 12.6% | 21.5% |

Relevance: Directly tests the draft/ project's core capability — can multi-model analysis detect errors in documents? Deeply unsaturated: even GPT-5 finds the right error only 39% of the time.

Current State

Combined k=10 accuracy: 50%
Levenshtein k=10: 35%
Judge k=10: 45%
Avg predictions/paper: 10.0
Identification cost: $2.03
Combined k=1 accuracy: 10%

Best Score

50% combined k=10 on tune split (20 papers, OpenAI subset).

Exact Config

| Setting                | Value |
|------------------------|-------|
| Model                  | anthropic/claude-sonnet-4-6 |
| Tool use               | None |
| Vario config           | None (baseline, single-pass) |
| Thinking budget        | None (no extended thinking) |
| Temperature            | 0 |
| top_k                  | 10 (model returns exactly 10 candidate error excerpts) |
| Levenshtein threshold  | 0.5 |
| Judge model            | anthropic/claude-haiku-4-5-20251001 (temperature=0) |
| Prompt version         | Fixed: "You MUST return exactly {num_chunks}" (count repeated 3x) |
| Word limit per excerpt | 100 |

Reproduce

python -m benchmarks.eval.flaws_official sonnet --split tune --subset openai --no-subscription

Comparison to Published Frontier Scores

| Model                  | k=10 Accuracy | Dataset                | Notes |
|------------------------|---------------|------------------------|-------|
| GPT-5                  | 39.1%         | Full (713 papers)      | Published FLAWS paper |
| DeepSeek Reasoner v3.1 | 35.2%         | Full (713 papers)      | Published FLAWS paper |
| Sonnet 4.6 (ours)      | 50.0%         | Tune split (20 papers) | OpenAI subset only; will likely regress toward 30-40% on full eval split |
| Sonnet 4.5             | 21.5%         | Full (713 papers)      | Published FLAWS paper |

The 50% score is on 20 tune-split papers. The full eval split (245 papers) will produce a more reliable number. Small-sample variance means the true accuracy is likely in the 30-50% range.

Accuracy Breakdown by Method

| Method                      | k=1        | k=3        | k=10        |
|-----------------------------|------------|------------|-------------|
| Levenshtein (threshold=0.5) | 2/20 (10%) | 3/20 (15%) | 7/20 (35%)  |
| LLM Judge (Haiku 4.5)       | 1/20 (5%)  | 4/20 (20%) | 9/20 (45%)  |
| Combined (OR)               | 2/20 (10%) | 4/20 (20%) | 10/20 (50%) |

Failure Taxonomy

Of the 20 papers evaluated, 10 matched and 10 did not. The failures fall into three categories.

Distribution

Matched (both): 6 (30%)
Matched (Lev only): 1 (5%)
Matched (Judge only): 3 (15%)
Wrong error: 10 (50%)

A. Model finds wrong error entirely (7 of 10 failures)

The model identifies an error in a completely different part of the paper. Best similarity scores are below 0.2, indicating essentially no conceptual overlap with the ground truth. The model's candidates and the actual inserted error are in different sections, covering different topics.

| Paper        | Best Sim | Model's Prediction                            | Ground Truth |
|--------------|----------|-----------------------------------------------|--------------|
| 2502.03444v2 | 0.090    | GMM modes / latent space quality claim        | Position Encoding (2D RoPE applied to latents) |
| 2505.15025v1 | 0.105    | Quadratic function / linear policy derivation | Convex reformulation exactness claim |
| 2502.16658v1 | 0.138    | Distribution-free / DP volume optimality      | Gaussian mixture lemma + theorem proof |
| 2502.10158v3 | 0.077    | MNL model kappa interpretation                | Reward bounds + regret matching claim |
| 2503.07639v1 | 0.156    | MoE sparse activation / routing design        | Interpretability + activation sparsity finding |
| 2502.01362v2 | 0.109    | Bridge matching loss regression target        | Multistep distillation formulation |
| 2503.19595v2 | 0.194    | Gradient variance / pass@k objective          | GRPO integration (baseline removal) |

Example: 2502.10158v3 (sim=0.077)

Model's Top Prediction
A small kappa indicates a larger deviation from the linear model. Note that 1/kappa can be exponentially large, so it is crucial to avoid any dependency on 1/kappa in our regret bound.
Actual Inserted Error
Throughout this paper, we assume that Reward_h(s_h,a_h,s_{h+1}) in [0,1] for all h and all possible triples (s_h,a_h,s_{h+1}). AND Note that the lower bounds of Zhou (2021) match the second term of our regret bound, Omega(d*sqrt(HK)), without any rescaling.

The model focused on the MNL model's kappa parameter. The actual errors were a reward bound assumption change and a regret lower bound matching claim -- completely different sections.

B. Model finds related issue but not the specific inserted error (2 of 10 failures)

The model identifies an issue in the same conceptual area as the inserted error, but targets a different specific passage. Best similarity is around 0.14 -- close enough to be in the neighborhood, but not the right passage.

| Paper        | Best Sim | Model's Prediction                        | Ground Truth |
|--------------|----------|-------------------------------------------|--------------|
| 2506.11449v1 | 0.148    | Softmax denominator bug in TopK mechanism | Small-world network factor computation |
| 2502.16282v2 | 0.138    | Implicit alignment claims (intro section) | Alignment emergence experiment (results section) |

Example: 2506.11449v1 (sim=0.148)

Model's Top Prediction
$$\tilde{\alpha}_i = \min\left(\frac{k \exp(\alpha_i/T)}{\sum_j \exp(\alpha_i/T)},\ 1\right)$$
Model found: "Softmax denominator uses alpha_i instead of alpha_j -- making the sum n*exp(alpha_i/T), which simplifies the whole expression to k/n for all i"
Actual Inserted Error
\subsection{Diagonal Sparsity and Small World Networks} To test the small-worldness of networks trained with DynaDiag, we take a 90% sparse ViT-B/16...
The actual error was in the small-world network analysis section, not the TopK mechanism.

The model found what appears to be a genuine bug (wrong variable in softmax sum), but it is NOT the inserted error. This illustrates a key challenge: the model may find real errors that are not the planted one.

C. Extraction/matching artifact (1 of 10 failures)

The model's prediction is thematically adjacent to the ground truth but the excerpt targets a different granularity (e.g., model quotes table values while the GT is a paragraph about the same table).

| Paper        | Best Sim | Model's Prediction                       | Ground Truth |
|--------------|----------|------------------------------------------|--------------|
| 2505.21363v3 | 0.094    | Subgroup choice / mitigation performance | Subgroup-label noise sensitivity (table data) |

This is the only failure where improving extraction granularity or matching logic could plausibly help. The remaining 9 failures are detection problems, not matching problems.

For Reference: How the 10 Matches Broke Down

Levenshtein + Judge agree (6 papers, 30%): Both matching methods confirm. Model extracted an excerpt with substantial overlap to ground truth text.

Judge only (3 papers, 15%): The LLM judge recognized a conceptual match that Levenshtein missed. This occurs when the model found the right error but quoted a slightly different passage, or when LaTeX formatting reduces string similarity.

| Paper        | Lev Sim | Judge's Assessment |
|--------------|---------|--------------------|
| 2506.03363v1 | 0.089   | Found the correct theorem section about uniform distribution minimizer |
| 2506.04870v1 | 0.222   | Found the regularizer interpretation passage |
| 2506.13095v1 | 0.250   | Identified the anomaly score consistency regularization loss |

Levenshtein only (1 paper, 5%): Paper 2503.06337v4 -- Levenshtein subspan matching found a hit but the judge did not confirm. Possible Levenshtein false positive from partial word overlap.

Match Example: 2506.03363v1 -- Judge Rescued

Model's Top Predictions
Pred[0]: $$\beta_S = \sum_{S\subseteq T} \frac{\alpha_T}{2^{|T|}}$$
Pred[1]: $$\beta_S = \sum_{T\supseteq S} \frac{\alpha_T}{2^{|T|}}$$
The model identified the summation-index inversion ($S \subseteq T$ vs. $T \supseteq S$).
Actual Inserted Error
This result can be extended to allow for arbitrary distributions over combinations... the uniform distribution over combinations is the unique minimizer...
A 1342-character theorem block about uniform distribution optimality.

Lev sim = 0.089; Judge: CORRECT. Levenshtein fails entirely (the model quoted the equation, while the ground truth is a theorem paragraph). The judge recognized the conceptual overlap: the model found an error within the theorem's scope.

Match Example: 2502.02531v3 (sim=0.938)

Model Prediction (Pred[0])
while a linear model would always follow a power law scaling with training iterations L ~ t^{-beta}, we note that...
Ground Truth
[Overlapping text about power law scaling -- model extracted nearly identical passage]

sim = 0.938. Near-perfect match: the model identified the correct error on its first prediction.

Levenshtein vs Judge Agreement

|               | Judge: Match                     | Judge: No Match         |
|---------------|----------------------------------|-------------------------|
| Lev: Match    | 6 (agree: correct)               | 1 (Lev false positive?) |
| Lev: No Match | 3 (judge found conceptual match) | 10 (agree: no match)    |

Agreement rate: 80% (16/20). The 3 judge-only matches demonstrate the value of the LLM judge for handling excerpt granularity and LaTeX/text format mismatches.
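The agreement figures can be recomputed directly from the per-paper match flags (6 BOTH, 1 LEV-only, 3 JUDGE-only, 10 MISS, as in the Per-Paper Results section). A minimal sanity check:

```python
# Match flags in the order: 6 BOTH, 1 LEV-only, 3 JUDGE-only, 10 MISS
# (mirroring the Per-Paper Results table).
lev   = [True] * 6 + [True]  + [False] * 3 + [False] * 10
judge = [True] * 6 + [False] + [True] * 3  + [False] * 10

both       = sum(l and j for l, j in zip(lev, judge))
lev_only   = sum(l and not j for l, j in zip(lev, judge))
judge_only = sum(j and not l for l, j in zip(lev, judge))
neither    = sum(not l and not j for l, j in zip(lev, judge))

agreement = (both + neither) / len(lev)
print(both, lev_only, judge_only, neither, agreement)  # 6 1 3 10 0.8
```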

Logbook

All runs attempted on this benchmark, in chronological order. The decisive variable was the prompt change in Run 3.

Key finding: The 0% to 50% improvement was driven by the prompt change (demanding exactly 10 candidates instead of "at most 10"). The old prompt produced only 1-2 candidates per paper, making it nearly impossible to hit the planted error. The model change (Haiku to Sonnet) appears secondary: Sonnet with the old prompt still scored 0% (Run 2), so the prompt fix was necessary; whether it would suffice with the weaker model is unconfirmed, since Run 3's 2-paper sample is uninformative.
Run 1 -- Smoke test: 0% (k=10), 1.3 avg preds/paper. Baseline validation on 5 papers.
  Model: anthropic/claude-haiku-4-5-20251001 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: old ("return ... candidate error chunks", no count enforcement) | Sample: 5 (random)

Run 2 -- Sonnet, 50 papers (random), old prompt: 0% (k=10), 1.3 avg preds/paper. Stronger model, same prompt -- still only 1-2 candidates output.
  Model: anthropic/claude-sonnet-4-6 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: old ("return ... candidate error chunks") | Sample: 50 (random) | Flag: --no-subscription

Run 3 -- Haiku, 2 papers (tune), fixed prompt: 0% (k=10), 10.0 avg preds/paper. Prompt fix confirmed working -- 10 candidates now output. 0% on 2 papers is uninformative.
  Model: anthropic/claude-haiku-4-5-20251001 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: fixed ("You MUST return exactly {num_chunks}", count repeated 3x) | Sample: 2 (tune split)

Run 4 -- Sonnet, 20 papers (tune), fixed prompt: 50% (k=10), 10.0 avg preds/paper. Current best. $2.03 identification + ~$0.04 judging.
  Model: anthropic/claude-sonnet-4-6 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: anthropic/claude-haiku-4-5-20251001 | Prompt: fixed | Sample: 20 (tune split, OpenAI subset) | Flag: --no-subscription

Prompt Change Detail

Before (Runs 1-2): old prompt -- 0% accuracy
  • Avg 1.34 predictions per paper
  • 4/50 papers had 0 extracted predictions
  • Model gave lengthy analysis but only 1 :error_text: block
  • Prompt said "return ... candidate error chunks" but did not enforce a count
After (Runs 3-4): fixed prompt -- 50% accuracy
  • Exactly 10.0 predictions per paper
  • 0/20 papers had extraction failures
  • Prompt says "You MUST return exactly {num_chunks}" and repeats the count 3x
  • 100% extraction success rate
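For illustration, extraction with hard count enforcement might look like the sketch below. The `:error_text:`/`:end:` delimiter pair is an assumption -- the report only shows that candidates come back in `:error_text:` blocks, not the harness's full syntax.

```python
import re

# Hypothetical delimiters: the report mentions ":error_text:" blocks;
# the closing ":end:" marker is assumed purely for this sketch.
BLOCK_RE = re.compile(r":error_text:\s*(.*?)\s*:end:", re.DOTALL)

def extract_candidates(response: str, num_chunks: int = 10) -> list[str]:
    """Pull candidate error excerpts out of a model response and fail
    loudly if the count is off (the old prompt averaged 1.34 blocks)."""
    candidates = BLOCK_RE.findall(response)
    if len(candidates) != num_chunks:
        raise ValueError(
            f"expected {num_chunks} candidates, got {len(candidates)}")
    return candidates
```

Failing loudly on a wrong count turns the old prompt's silent under-production into a retryable error instead of a near-guaranteed miss.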

Matching Analysis

Similarity Distribution (200 total predictions across 20 papers)

| Similarity bin | Count |
|----------------|-------|
| [0.0, 0.1)     | 148   |
| [0.1, 0.2)     | 43    |
| [0.2, 0.3)     | 3     |
| [0.3, 0.5)     | 2     |
| [0.5, 1.0]     | 4     |

Bimodal distribution. 95.5% of predictions have <0.2 similarity to any ground truth span. Predictions either match well (>0.5) or not at all (<0.2). There is almost no "near miss" zone.
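The binning can be reproduced with a small helper; the bin edges are taken from the table above, with the last bin closed on the right as written ([0.5, 1.0]).

```python
def bucket_similarities(sims: list[float]) -> list[int]:
    """Count similarity scores per bin:
    [0.0, 0.1), [0.1, 0.2), [0.2, 0.3), [0.3, 0.5), [0.5, 1.0]."""
    edges = [0.0, 0.1, 0.2, 0.3, 0.5, 1.0]
    counts = [0] * (len(edges) - 1)
    for s in sims:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            # Half-open bins, except the final bin which includes 1.0.
            if edges[i] <= s < edges[i + 1] or (last and s == 1.0):
                counts[i] += 1
                break
    return counts
```

Applied to the run's 200 prediction similarities, this yields the 148/43/3/2/4 split in the table.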

Threshold Sensitivity

| Threshold       | Lev Matches | Accuracy | Delta from 0.5        |
|-----------------|-------------|----------|-----------------------|
| 0.50 (official) | 7/20        | 35%      | --                    |
| 0.40            | 7/20        | 35%      | +0                    |
| 0.30            | 7/20        | 35%      | +0                    |
| 0.20            | 9/20        | 45%      | +2                    |
| 0.15            | 18/20       | 90%      | +11 (false positives) |

Lowering threshold from 0.5 to 0.3 gains zero additional matches. At 0.2, two papers are added but with questionable validity. At 0.15, nearly everything matches (meaningless). The official 0.5 threshold is appropriate.
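A threshold sweep of this kind is a one-liner given per-paper best similarities. Note that the official harness also does subspan matching, so counting raw best-sim values against a threshold (as below) does not exactly reproduce the table; the sketch only illustrates the mechanics.

```python
def sweep_thresholds(best_sims: list[float],
                     thresholds=(0.5, 0.4, 0.3, 0.2, 0.15)) -> dict[float, int]:
    """For each threshold, count papers whose best Levenshtein
    similarity clears it."""
    return {t: sum(s >= t for s in best_sims) for t in thresholds}
```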

LLM Judge Value-Add

The LLM judge (Haiku 4.5) adds 3 papers over Levenshtein alone (combined 10/20 vs Lev-only 7/20). These are cases where the model found the right error but quoted a different-granularity passage, or where LaTeX formatting reduced string similarity below 0.5.

Conclusion

Matching is NOT the bottleneck. Lowering thresholds adds nothing. The judge already catches formatting mismatches. All 10 failures are detection failures: the model genuinely does not find the planted error. Improvement must come from better error detection, not better matching.

Next Steps

1. Extended thinking (16K-32K budget)
Expected impact: HIGH | Cost: 2-3x current (~$5 for 20 papers)
The model needs to reason deeply about mathematical correctness across 30-80K characters of LaTeX. Extended thinking gives space to trace proofs step by step, cross-reference claims across sections, and verify mathematical consistency of definitions, theorems, and proofs. Directly addresses the dominant "finds wrong error" failure mode.
2. Multi-pass analysis (section by section instead of whole paper)
Expected impact: HIGH | Cost: 3-5x current (multiple LLM calls per paper)
Instead of one-shot analysis of the full paper, break the task into: (1) extract all claims/theorems/definitions, (2) for each, verify internal consistency and correctness, (3) rank by severity. Many inserted errors are in appendix proofs and secondary results -- areas the model skips when analyzing the whole paper at once.
3. Multi-model consensus
Expected impact: MEDIUM | Cost: 3x current (~$6 for 3 models on 20 papers)
Run 3+ models independently, aggregate predictions. Models have different blind spots -- where Sonnet misses a proof error, GPT or Gemini might catch it. Union of predictions increases recall without losing precision (already at k=10).
4. Run on full eval split (245 papers)
Expected impact: reliable score estimate | Cost: ~$25 identification + ~$2 judging
The tune split (20 papers) gives noisy estimates. The full eval split produces a publishable number. Required before claiming any particular accuracy level.
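Steps 2 and 3 above could be combined in a pipeline like the following sketch. `call_model` is a hypothetical stand-in for an LLM call that returns findings for one section; the data shapes and function names are assumptions, not the project's actual code.

```python
from typing import Callable

def per_section_findings(sections: dict[str, str],
                         call_model: Callable[[str], list[dict]]) -> list[dict]:
    """Steps (1)+(2): analyze each section independently, so appendix
    proofs get the same scrutiny as the main body."""
    findings = []
    for name, text in sections.items():
        findings.extend(call_model(f"Check section '{name}':\n{text}"))
    return findings  # each finding: {"excerpt": str, "severity": float}

def consensus_top_k(per_model: dict[str, list[dict]], k: int = 10) -> list[str]:
    """Step (3) plus multi-model consensus: pool findings from all
    models, rank by severity, de-duplicate on normalized excerpt text."""
    pooled = sorted((f for fs in per_model.values() for f in fs),
                    key=lambda f: f["severity"], reverse=True)
    seen, top = set(), []
    for f in pooled:
        key = " ".join(f["excerpt"].lower().split())
        if key not in seen:
            seen.add(key)
            top.append(f["excerpt"])
        if len(top) == k:
            break
    return top
```

De-duplicating on normalized text keeps the union from wasting k-slots on near-identical candidates from different models, which is how the union can raise recall without diluting precision.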

Root Cause Analysis

Why Does the Model Find the Wrong Error?

  1. Paper length overwhelms attention. ICML papers are 30-80K characters of LaTeX. The inserted error is typically a subtle change in one theorem, one equation, or one experimental description. The model must evaluate the entire paper and identify which specific passage contains a conceptual flaw. With 10+ theorems and proofs per paper, the search space is vast.
  2. Multiple plausible errors exist. In several cases (e.g., 2506.11449v1), the model found what appears to be a genuine bug (wrong variable in a softmax). The paper may have multiple issues, but only the inserted one counts.
  3. Errors are in supporting sections. Many inserted errors are in appendix proofs, table captions, or secondary results -- not the main claims the model naturally focuses on. The model tends to scrutinize the abstract, introduction, and main theorems.
  4. No external knowledge baseline. The prompt constrains the model to knowledge available at time of publication. This prevents comparing against known results or established theorems, limiting the model to internal consistency checks only.

Cost Analysis

| Component                         | Total  | Per Paper |
|-----------------------------------|--------|-----------|
| Error identification (Sonnet 4.6) | $2.03  | ~$0.10    |
| Judge evaluation (Haiku 4.5)      | ~$0.04 | ~$0.002   |
| Total                             | $2.07  | ~$0.10    |

Projecting to full eval split (245 papers): ~$25. With extended thinking at 16K budget, expect 2-3x cost increase (~$50-75).
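The projection is simple proportional scaling from the tune run (a sketch of the arithmetic, not a cost model):

```python
# Scale the 20-paper tune-run cost to the 245-paper eval split.
tune_papers, eval_papers = 20, 245
identification_cost = 2.03   # dollars, Sonnet 4.6, tune run
judging_cost = 0.04          # dollars, Haiku 4.5 judge, tune run

scale = eval_papers / tune_papers          # 12.25x
projected = (identification_cost + judging_cost) * scale
print(f"${projected:.2f}")                 # roughly $25
```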

Per-Paper Results

| Paper        | #Pred | #GT | Best Sim | Lev k10 | Judge k10 | Combined | Mode  |
|--------------|-------|-----|----------|---------|-----------|----------|-------|
| 2506.06895v2 | 10    | 2   | 0.552    | Y       | Y         | Y        | BOTH  |
| 2502.02531v3 | 10    | 11  | 0.938    | Y       | Y         | Y        | BOTH  |
| 2502.16025v2 | 10    | 4   | 0.938    | Y       | Y         | Y        | BOTH  |
| 2505.04163v1 | 10    | 7   | 0.950    | Y       | Y         | Y        | BOTH  |
| 2505.06744v1 | 10    | 3   | 0.179    | Y       | Y         | Y        | BOTH  |
| 2505.19097v2 | 10    | 11  | 0.600    | Y       | Y         | Y        | BOTH  |
| 2503.06337v4 | 10    | 10  | 0.118    | Y       | N         | Y        | LEV   |
| 2506.03363v1 | 10    | 2   | 0.089    | N       | Y         | Y        | JUDGE |
| 2506.04870v1 | 10    | 4   | 0.222    | N       | Y         | Y        | JUDGE |
| 2506.13095v1 | 10    | 2   | 0.250    | N       | Y         | Y        | JUDGE |
| 2502.03444v2 | 10    | 4   | 0.090    | N       | N         | N        | MISS  |
| 2505.15025v1 | 10    | 4   | 0.105    | N       | N         | N        | MISS  |
| 2502.16658v1 | 10    | 7   | 0.138    | N       | N         | N        | MISS  |
| 2502.10158v3 | 10    | 4   | 0.077    | N       | N         | N        | MISS  |
| 2506.11449v1 | 10    | 4   | 0.148    | N       | N         | N        | MISS  |
| 2502.16282v2 | 10    | 4   | 0.138    | N       | N         | N        | MISS  |
| 2503.07639v1 | 10    | 6   | 0.156    | N       | N         | N        | MISS  |
| 2502.01362v2 | 10    | 2   | 0.109    | N       | N         | N        | MISS  |
| 2503.19595v2 | 10    | 2   | 0.194    | N       | N         | N        | MISS  |
| 2505.21363v3 | 10    | 10  | 0.094    | N       | N         | N        | MISS  |

Methodology Notes