FLAWS Failure Analysis: Sonnet 4.6

Finding fLAWS in Scientific Papers — Status Report — 2026-03-24

Benchmark: FLAWS

What it measures: Whether an LLM can identify and localize claim-invalidating errors that have been deliberately inserted into peer-reviewed scientific papers. Tests deep scientific reading comprehension and error detection.

Source: Xasayi et al., 2025. arXiv:2511.21843 | GitHub | HuggingFace

Dataset: 713 paper-error pairs (448 Gemini-inserted + 265 GPT-inserted) from real ICML 2025 papers. Each paper has expert-crafted conceptual errors with ground-truth error locations.

Scoring: Top-k identification accuracy (k=1, k=3, k=10) using OR of word-level Levenshtein similarity (threshold 0.5) and LLM-as-judge evaluation. Model outputs up to k error candidate excerpts.
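The scoring rule can be sketched as follows. This is a minimal reconstruction from the description above, not the benchmark's actual code: the official harness additionally does subspan matching, and the judge here is an injectable callable rather than a real LLM call.

```python
from typing import Callable, Sequence

def word_levenshtein_sim(a: str, b: str) -> float:
    """Word-level Levenshtein similarity in [0, 1]:
    1 - edit_distance(words_a, words_b) / max(len(words_a), len(words_b))."""
    x, y = a.split(), b.split()
    if not x and not y:
        return 1.0
    prev = list(range(len(y) + 1))          # DP row for the empty prefix of x
    for i, wx in enumerate(x, 1):
        cur = [i]
        for j, wy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # delete wx
                           cur[j - 1] + 1,              # insert wy
                           prev[j - 1] + (wx != wy)))   # substitute
        prev = cur
    return 1.0 - prev[-1] / max(len(x), len(y))

def solved_at_k(candidates: Sequence[str],
                ground_truths: Sequence[str],
                k: int,
                judge: Callable[[str, str], bool] = lambda c, g: False,
                threshold: float = 0.5) -> bool:
    """True if any of the first k candidate excerpts matches any
    ground-truth span by Levenshtein similarity >= threshold OR by
    the LLM-as-judge callable."""
    return any(
        word_levenshtein_sim(c, g) >= threshold or judge(c, g)
        for c in candidates[:k] for g in ground_truths
    )
```

A paper counts as solved at k if either matching method fires for any of the first k candidates; this OR structure is why the combined accuracies below dominate both individual methods.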

| Model                  | k=1  | k=3   | k=10  |
|------------------------|------|-------|-------|
| GPT-5                  | 9.0% | 19.2% | 39.1% |
| DeepSeek Reasoner v3.1 | 5.9% | 16.3% | 35.2% |
| Claude Sonnet 4.5      | 5.2% | 12.6% | 21.5% |

Relevance: Directly tests the draft/ project's core capability — can multi-model analysis detect errors in documents? Deeply unsaturated: even GPT-5 finds the right error only 39% of the time.

Current State

Combined k=10 accuracy: 50%
Levenshtein k=10: 35%
Judge k=10: 45%
Avg predictions/paper: 10.0
Identification cost: $2.03
Combined k=1 accuracy: 10%

Best Score

50% combined k=10 on tune split (20 papers, OpenAI subset).

Exact Config

| Setting                | Value |
|------------------------|-------|
| Model                  | anthropic/claude-sonnet-4-6 |
| Tool use               | None |
| Vario config           | None (baseline, single-pass) |
| Thinking budget        | None (no extended thinking) |
| Temperature            | 0 |
| top_k                  | 10 (model returns exactly 10 candidate error excerpts) |
| Levenshtein threshold  | 0.5 |
| Judge model            | anthropic/claude-haiku-4-5-20251001 (temperature=0) |
| Prompt version         | Fixed: "You MUST return exactly {num_chunks}" (count repeated 3x) |
| Word limit per excerpt | 100 |

Reproduce

python -m benchmarks.eval.flaws_official sonnet --split tune --subset openai --no-subscription

Comparison to Published Frontier Scores

| Model                  | k=10 Accuracy | Dataset                | Notes |
|------------------------|---------------|------------------------|-------|
| GPT-5                  | 39.1%         | Full (713 papers)      | Published FLAWS paper |
| DeepSeek Reasoner v3.1 | 35.2%         | Full (713 papers)      | Published FLAWS paper |
| Sonnet 4.6 (ours)      | 50.0%         | Tune split (20 papers) | OpenAI subset only; will likely regress toward 30-40% on full eval split |
| Sonnet 4.5             | 21.5%         | Full (713 papers)      | Published FLAWS paper |

The 50% score is on 20 tune-split papers. The full eval split (245 papers) will produce a more reliable number. Small-sample variance means the true accuracy is likely in the 30-50% range.

Accuracy Breakdown by Method

| Method                      | k=1        | k=3        | k=10        |
|-----------------------------|------------|------------|-------------|
| Levenshtein (threshold=0.5) | 2/20 (10%) | 3/20 (15%) | 7/20 (35%)  |
| LLM Judge (Haiku 4.5)       | 1/20 (5%)  | 4/20 (20%) | 9/20 (45%)  |
| Combined (OR)               | 2/20 (10%) | 4/20 (20%) | 10/20 (50%) |

Failure Taxonomy

Of the 20 papers evaluated, 10 matched and 10 did not. The failures fall into three categories.

Distribution

Matched (both): 6 (30%)
Matched (Lev only): 1 (5%)
Matched (Judge only): 3 (15%)
Wrong error: 10 (50%)

A. Model finds wrong error entirely (7 of 10 failures)

The model identifies an error in a completely different part of the paper. Best similarity scores are below 0.2, indicating essentially no conceptual overlap with the ground truth. The model's candidates and the actual inserted error are in different sections, covering different topics.

| Paper        | Best Sim | Model's Prediction                            | Ground Truth |
|--------------|----------|-----------------------------------------------|--------------|
| 2502.03444v2 | 0.090    | GMM modes / latent space quality claim        | Position Encoding (2D RoPE applied to latents) |
| 2505.15025v1 | 0.105    | Quadratic function / linear policy derivation | Convex reformulation exactness claim |
| 2502.16658v1 | 0.138    | Distribution-free / DP volume optimality      | Gaussian mixture lemma + theorem proof |
| 2502.10158v3 | 0.077    | MNL model kappa interpretation                | Reward bounds + regret matching claim |
| 2503.07639v1 | 0.156    | MoE sparse activation / routing design        | Interpretability + activation sparsity finding |
| 2502.01362v2 | 0.109    | Bridge matching loss regression target        | Multistep distillation formulation |
| 2503.19595v2 | 0.194    | Gradient variance / pass@k objective          | GRPO integration (baseline removal) |

Example: 2502.10158v3 (sim=0.077)

Model's Top Prediction
A small kappa indicates a larger deviation from the linear model. Note that 1/kappa can be exponentially large, so it is crucial to avoid any dependency on 1/kappa in our regret bound.
Actual Inserted Error
Throughout this paper, we assume that Reward_h(s_h,a_h,s_{h+1}) in [0,1] for all h and all possible triples (s_h,a_h,s_{h+1}). AND Note that the lower bounds of Zhou (2021) match the second term of our regret bound, Omega(d*sqrt(HK)), without any rescaling.

The model focused on the MNL model's kappa parameter. The actual errors were a reward bound assumption change and a regret lower bound matching claim -- completely different sections.

B. Model finds related issue but not the specific inserted error (2 of 10 failures)

The model identifies an issue in the same conceptual area as the inserted error, but targets a different specific passage. Best similarity is around 0.14 -- close enough to be in the neighborhood, but not the right passage.

| Paper        | Best Sim | Model's Prediction                        | Ground Truth |
|--------------|----------|-------------------------------------------|--------------|
| 2506.11449v1 | 0.148    | Softmax denominator bug in TopK mechanism | Small-world network factor computation |
| 2502.16282v2 | 0.138    | Implicit alignment claims (intro section) | Alignment emergence experiment (results section) |

Example: 2506.11449v1 (sim=0.148)

Model's Top Prediction
$$\tilde{\alpha}_i = \min\left(\frac{k \exp(\alpha_i/T)}{\sum_j \exp(\alpha_i/T)},\ 1\right)$$
Model found: "Softmax denominator uses alpha_i instead of alpha_j -- making the sum n*exp(alpha_i/T), which simplifies the whole expression to k/n for all i"
Actual Inserted Error
\subsection{Diagonal Sparsity and Small World Networks} To test the small-worldness of networks trained with DynaDiag, we take a 90% sparse ViT-B/16...
The actual error was in the small-world network analysis section, not the TopK mechanism.

The model found what appears to be a genuine bug (wrong variable in softmax sum), but it is NOT the inserted error. This illustrates a key challenge: the model may find real errors that are not the planted one.

C. Extraction/matching artifact (1 of 10 failures)

The model's prediction is thematically adjacent to the ground truth but the excerpt targets a different granularity (e.g., model quotes table values while the GT is a paragraph about the same table).

| Paper        | Best Sim | Model's Prediction                       | Ground Truth |
|--------------|----------|------------------------------------------|--------------|
| 2505.21363v3 | 0.094    | Subgroup choice / mitigation performance | Subgroup-label noise sensitivity (table data) |

This is the only failure where improving extraction granularity or matching logic could plausibly help. The remaining 9 failures are detection problems, not matching problems.

For Reference: How the 10 Matches Broke Down

Levenshtein + Judge agree (6 papers, 30%): Both matching methods confirm. Model extracted an excerpt with substantial overlap to ground truth text.

Judge only (3 papers, 15%): The LLM judge recognized a conceptual match that Levenshtein missed. This occurs when the model found the right error but quoted a slightly different passage, or when LaTeX formatting reduces string similarity.

| Paper        | Lev Sim | Judge's Assessment |
|--------------|---------|--------------------|
| 2506.03363v1 | 0.089   | Found the correct theorem section about uniform distribution minimizer |
| 2506.04870v1 | 0.222   | Found the regularizer interpretation passage |
| 2506.13095v1 | 0.250   | Identified the anomaly score consistency regularization loss |

Levenshtein only (1 paper, 5%): Paper 2503.06337v4 -- Levenshtein subspan matching found a hit but the judge did not confirm. Possible Levenshtein false positive from partial word overlap.

Match Example: 2506.03363v1 -- Judge Rescued

Model's Top Predictions
Pred[0]: $$\beta_S = \sum_{S\subseteq T} \frac{\alpha_T}{2^{|T|}}$$
Pred[1]: $$\beta_S = \sum_{T\supseteq S} \frac{\alpha_T}{2^{|T|}}$$
The model identified the summation-index inversion ($S \subseteq T$ vs. $T \supseteq S$).
Actual Inserted Error
This result can be extended to allow for arbitrary distributions over combinations... the uniform distribution over combinations is the unique minimizer...
A 1342-character theorem block about uniform distribution optimality.

Lev sim = 0.089; Judge: CORRECT. Levenshtein fails entirely (the model quoted the equation, while the ground truth is a theorem paragraph). The judge recognized the conceptual overlap: the model found an error within the theorem's scope.

Match Example: 2502.02531v3 (sim=0.938)

Model Prediction (Pred[0])
while a linear model would always follow a power law scaling with training iterations L ~ t^{-beta}, we note that...
Ground Truth
[Overlapping text about power law scaling -- model extracted nearly identical passage]

sim = 0.938. Near-perfect match: the model identified the correct error on its first prediction.

Levenshtein vs Judge Agreement

|               | Judge: Match                     | Judge: No Match         |
|---------------|----------------------------------|-------------------------|
| Lev: Match    | 6 (agree: correct)               | 1 (Lev false positive?) |
| Lev: No Match | 3 (judge found conceptual match) | 10 (agree: no match)    |

Agreement rate: 80% (16/20). The 3 judge-only matches demonstrate the value of the LLM judge for handling excerpt granularity and LaTeX/text format mismatches.
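The agreement figures can be recomputed directly from the per-paper match flags (6 BOTH, 1 LEV-only, 3 JUDGE-only, 10 MISS, as in the Per-Paper Results section). A minimal sanity check:

```python
# Match flags in the order: 6 BOTH, 1 LEV-only, 3 JUDGE-only, 10 MISS
# (mirroring the Per-Paper Results table).
lev   = [True] * 6 + [True]  + [False] * 3 + [False] * 10
judge = [True] * 6 + [False] + [True] * 3  + [False] * 10

both       = sum(l and j for l, j in zip(lev, judge))
lev_only   = sum(l and not j for l, j in zip(lev, judge))
judge_only = sum(j and not l for l, j in zip(lev, judge))
neither    = sum(not l and not j for l, j in zip(lev, judge))

agreement = (both + neither) / len(lev)
print(both, lev_only, judge_only, neither, agreement)  # 6 1 3 10 0.8
```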

Logbook

All runs attempted on this benchmark, in chronological order. The decisive variable was the prompt change in Run 3.

Key finding: The 0% to 50% improvement was driven by the prompt change (demanding exactly 10 candidates instead of "at most 10"). The old prompt produced only 1-2 candidates per paper, making it nearly impossible to hit the planted error. The model change (Haiku to Sonnet) appears secondary: Sonnet with the old prompt still scored 0% (Run 2), so the prompt fix was necessary; whether it would suffice with the weaker model is unconfirmed, since Run 3's 2-paper sample is uninformative.
Run 1 -- Smoke test: 0% (k=10), 1.3 avg preds/paper. Baseline validation on 5 papers.
  Model: anthropic/claude-haiku-4-5-20251001 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: old ("return ... candidate error chunks", no count enforcement) | Sample: 5 (random)

Run 2 -- Sonnet, 50 papers (random), old prompt: 0% (k=10), 1.3 avg preds/paper. Stronger model, same prompt -- still only 1-2 candidates output.
  Model: anthropic/claude-sonnet-4-6 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: old ("return ... candidate error chunks") | Sample: 50 (random) | Flag: --no-subscription

Run 3 -- Haiku, 2 papers (tune), fixed prompt: 0% (k=10), 10.0 avg preds/paper. Prompt fix confirmed working -- 10 candidates now output. 0% on 2 papers is uninformative.
  Model: anthropic/claude-haiku-4-5-20251001 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: fixed ("You MUST return exactly {num_chunks}", count repeated 3x) | Sample: 2 (tune split)

Run 4 -- Sonnet, 20 papers (tune), fixed prompt: 50% (k=10), 10.0 avg preds/paper. Current best. $2.03 identification + ~$0.04 judging.
  Model: anthropic/claude-sonnet-4-6 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: anthropic/claude-haiku-4-5-20251001 | Prompt: fixed | Sample: 20 (tune split, OpenAI subset) | Flag: --no-subscription

Prompt Change Detail

Before (Runs 1-2): old prompt -- 0% accuracy
  • Avg 1.34 predictions per paper
  • 4/50 papers had 0 extracted predictions
  • Model gave lengthy analysis but only 1 :error_text: block
  • Prompt said "return ... candidate error chunks" but did not enforce a count
After (Runs 3-4): fixed prompt -- 50% accuracy
  • Exactly 10.0 predictions per paper
  • 0/20 papers had extraction failures
  • Prompt says "You MUST return exactly {num_chunks}" and repeats the count 3x
  • 100% extraction success rate
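For illustration, extraction with hard count enforcement might look like the sketch below. The `:error_text:`/`:end:` delimiter pair is an assumption -- the report only shows that candidates come back in `:error_text:` blocks, not the harness's full syntax.

```python
import re

# Hypothetical delimiters: the report mentions ":error_text:" blocks;
# the closing ":end:" marker is assumed purely for this sketch.
BLOCK_RE = re.compile(r":error_text:\s*(.*?)\s*:end:", re.DOTALL)

def extract_candidates(response: str, num_chunks: int = 10) -> list[str]:
    """Pull candidate error excerpts out of a model response and fail
    loudly if the count is off (the old prompt averaged 1.34 blocks)."""
    candidates = BLOCK_RE.findall(response)
    if len(candidates) != num_chunks:
        raise ValueError(
            f"expected {num_chunks} candidates, got {len(candidates)}")
    return candidates
```

Failing loudly on a wrong count turns the old prompt's silent under-production into a retryable error instead of a near-guaranteed miss.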

Matching Analysis

Similarity Distribution (200 total predictions across 20 papers)

| Similarity bin | Count |
|----------------|-------|
| [0.0, 0.1)     | 148   |
| [0.1, 0.2)     | 43    |
| [0.2, 0.3)     | 3     |
| [0.3, 0.5)     | 2     |
| [0.5, 1.0]     | 4     |

Bimodal distribution. 95.5% of predictions have <0.2 similarity to any ground truth span. Predictions either match well (>0.5) or not at all (<0.2). There is almost no "near miss" zone.
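The binning can be reproduced with a small helper; the bin edges are taken from the table above, with the last bin closed on the right as written ([0.5, 1.0]).

```python
def bucket_similarities(sims: list[float]) -> list[int]:
    """Count similarity scores per bin:
    [0.0, 0.1), [0.1, 0.2), [0.2, 0.3), [0.3, 0.5), [0.5, 1.0]."""
    edges = [0.0, 0.1, 0.2, 0.3, 0.5, 1.0]
    counts = [0] * (len(edges) - 1)
    for s in sims:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            # Half-open bins, except the final bin which includes 1.0.
            if edges[i] <= s < edges[i + 1] or (last and s == 1.0):
                counts[i] += 1
                break
    return counts
```

Applied to the run's 200 prediction similarities, this yields the 148/43/3/2/4 split in the table.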

Threshold Sensitivity

| Threshold       | Lev Matches | Accuracy | Delta from 0.5        |
|-----------------|-------------|----------|-----------------------|
| 0.50 (official) | 7/20        | 35%      | --                    |
| 0.40            | 7/20        | 35%      | +0                    |
| 0.30            | 7/20        | 35%      | +0                    |
| 0.20            | 9/20        | 45%      | +2                    |
| 0.15            | 18/20       | 90%      | +11 (false positives) |

Lowering threshold from 0.5 to 0.3 gains zero additional matches. At 0.2, two papers are added but with questionable validity. At 0.15, nearly everything matches (meaningless). The official 0.5 threshold is appropriate.
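A threshold sweep of this kind is a one-liner given per-paper best similarities. Note that the official harness also does subspan matching, so counting raw best-sim values against a threshold (as below) does not exactly reproduce the table; the sketch only illustrates the mechanics.

```python
def sweep_thresholds(best_sims: list[float],
                     thresholds=(0.5, 0.4, 0.3, 0.2, 0.15)) -> dict[float, int]:
    """For each threshold, count papers whose best Levenshtein
    similarity clears it."""
    return {t: sum(s >= t for s in best_sims) for t in thresholds}
```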

LLM Judge Value-Add

The LLM judge (Haiku 4.5) adds 3 papers over Levenshtein alone (combined 10/20 vs Lev-only 7/20). These are cases where the model found the right error but quoted a different-granularity passage, or where LaTeX formatting reduced string similarity below 0.5.

Conclusion

Matching is NOT the bottleneck. Lowering thresholds adds nothing. The judge already catches formatting mismatches. All 10 failures are detection failures: the model genuinely does not find the planted error. Improvement must come from better error detection, not better matching.

Next Steps

1. Extended thinking (16K-32K budget)
Expected impact: HIGH | Cost: 2-3x current (~$5 for 20 papers)
The model needs to reason deeply about mathematical correctness across 30-80K characters of LaTeX. Extended thinking gives space to trace proofs step by step, cross-reference claims across sections, and verify mathematical consistency of definitions, theorems, and proofs. Directly addresses the dominant "finds wrong error" failure mode.
2. Multi-pass analysis (section by section instead of whole paper)
Expected impact: HIGH | Cost: 3-5x current (multiple LLM calls per paper)
Instead of one-shot analysis of the full paper, break the task into: (1) extract all claims/theorems/definitions, (2) for each, verify internal consistency and correctness, (3) rank by severity. Many inserted errors are in appendix proofs and secondary results -- areas the model skips when analyzing the whole paper at once.
3. Multi-model consensus
Expected impact: MEDIUM | Cost: 3x current (~$6 for 3 models on 20 papers)
Run 3+ models independently, aggregate predictions. Models have different blind spots -- where Sonnet misses a proof error, GPT or Gemini might catch it. Union of predictions increases recall without losing precision (already at k=10).
4. Run on full eval split (245 papers)
Expected impact: reliable score estimate | Cost: ~$25 identification + ~$2 judging
The tune split (20 papers) gives noisy estimates. The full eval split produces a publishable number. Required before claiming any particular accuracy level.
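Steps 2 and 3 above could be combined in a pipeline like the following sketch. `call_model` is a hypothetical stand-in for an LLM call that returns findings for one section; the data shapes and function names are assumptions, not the project's actual code.

```python
from typing import Callable

def per_section_findings(sections: dict[str, str],
                         call_model: Callable[[str], list[dict]]) -> list[dict]:
    """Steps (1)+(2): analyze each section independently, so appendix
    proofs get the same scrutiny as the main body."""
    findings = []
    for name, text in sections.items():
        findings.extend(call_model(f"Check section '{name}':\n{text}"))
    return findings  # each finding: {"excerpt": str, "severity": float}

def consensus_top_k(per_model: dict[str, list[dict]], k: int = 10) -> list[str]:
    """Step (3) plus multi-model consensus: pool findings from all
    models, rank by severity, de-duplicate on normalized excerpt text."""
    pooled = sorted((f for fs in per_model.values() for f in fs),
                    key=lambda f: f["severity"], reverse=True)
    seen, top = set(), []
    for f in pooled:
        key = " ".join(f["excerpt"].lower().split())
        if key not in seen:
            seen.add(key)
            top.append(f["excerpt"])
        if len(top) == k:
            break
    return top
```

De-duplicating on normalized text keeps the union from wasting k-slots on near-identical candidates from different models, which is how the union can raise recall without diluting precision.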

Root Cause Analysis

Why Does the Model Find the Wrong Error?

  1. Paper length overwhelms attention. ICML papers are 30-80K characters of LaTeX. The inserted error is typically a subtle change in one theorem, one equation, or one experimental description. The model must evaluate the entire paper and identify which specific passage contains a conceptual flaw. With 10+ theorems and proofs per paper, the search space is vast.
  2. Multiple plausible errors exist. In several cases (e.g., 2506.11449v1), the model found what appears to be a genuine bug (wrong variable in a softmax). The paper may have multiple issues, but only the inserted one counts.
  3. Errors are in supporting sections. Many inserted errors are in appendix proofs, table captions, or secondary results -- not the main claims the model naturally focuses on. The model tends to scrutinize the abstract, introduction, and main theorems.
  4. No external knowledge baseline. The prompt constrains the model to knowledge available at time of publication. This prevents comparing against known results or established theorems, limiting the model to internal consistency checks only.

Cost Analysis

| Component                         | Total  | Per Paper |
|-----------------------------------|--------|-----------|
| Error identification (Sonnet 4.6) | $2.03  | ~$0.10    |
| Judge evaluation (Haiku 4.5)      | ~$0.04 | ~$0.002   |
| Total                             | $2.07  | ~$0.10    |

Projecting to full eval split (245 papers): ~$25. With extended thinking at 16K budget, expect 2-3x cost increase (~$50-75).
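The projection is simple proportional scaling from the tune run (a sketch of the arithmetic, not a cost model):

```python
# Scale the 20-paper tune-run cost to the 245-paper eval split.
tune_papers, eval_papers = 20, 245
identification_cost = 2.03   # dollars, Sonnet 4.6, tune run
judging_cost = 0.04          # dollars, Haiku 4.5 judge, tune run

scale = eval_papers / tune_papers          # 12.25x
projected = (identification_cost + judging_cost) * scale
print(f"${projected:.2f}")                 # roughly $25
```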

Per-Paper Results

| Paper        | #Pred | #GT | Best Sim | Lev k10 | Judge k10 | Combined | Mode  |
|--------------|-------|-----|----------|---------|-----------|----------|-------|
| 2506.06895v2 | 10    | 2   | 0.552    | Y       | Y         | Y        | BOTH  |
| 2502.02531v3 | 10    | 11  | 0.938    | Y       | Y         | Y        | BOTH  |
| 2502.16025v2 | 10    | 4   | 0.938    | Y       | Y         | Y        | BOTH  |
| 2505.04163v1 | 10    | 7   | 0.950    | Y       | Y         | Y        | BOTH  |
| 2505.06744v1 | 10    | 3   | 0.179    | Y       | Y         | Y        | BOTH  |
| 2505.19097v2 | 10    | 11  | 0.600    | Y       | Y         | Y        | BOTH  |
| 2503.06337v4 | 10    | 10  | 0.118    | Y       | N         | Y        | LEV   |
| 2506.03363v1 | 10    | 2   | 0.089    | N       | Y         | Y        | JUDGE |
| 2506.04870v1 | 10    | 4   | 0.222    | N       | Y         | Y        | JUDGE |
| 2506.13095v1 | 10    | 2   | 0.250    | N       | Y         | Y        | JUDGE |
| 2502.03444v2 | 10    | 4   | 0.090    | N       | N         | N        | MISS  |
| 2505.15025v1 | 10    | 4   | 0.105    | N       | N         | N        | MISS  |
| 2502.16658v1 | 10    | 7   | 0.138    | N       | N         | N        | MISS  |
| 2502.10158v3 | 10    | 4   | 0.077    | N       | N         | N        | MISS  |
| 2506.11449v1 | 10    | 4   | 0.148    | N       | N         | N        | MISS  |
| 2502.16282v2 | 10    | 4   | 0.138    | N       | N         | N        | MISS  |
| 2503.07639v1 | 10    | 6   | 0.156    | N       | N         | N        | MISS  |
| 2502.01362v2 | 10    | 2   | 0.109    | N       | N         | N        | MISS  |
| 2503.19595v2 | 10    | 2   | 0.194    | N       | N         | N        | MISS  |
| 2505.21363v3 | 10    | 10  | 0.094    | N       | N         | N        | MISS  |

Methodology Notes