FLAWS: Finding Flaws in Scientific Papers — Status Report — 2026-03-24
What it measures: Whether an LLM can identify and localize claim-invalidating errors that have been deliberately inserted into peer-reviewed scientific papers. Tests deep scientific reading comprehension and error detection.
Source: Xasayi et al., 2025. arXiv:2511.21843 | GitHub | HuggingFace
Dataset: 713 paper-error pairs (448 Gemini-inserted + 265 GPT-inserted) from real ICML 2025 papers. Each paper has expert-crafted conceptual errors with ground-truth error locations.
Scoring: Top-k identification accuracy (k=1, k=3, k=10). A candidate counts as a hit if EITHER its word-level Levenshtein similarity to a ground-truth span reaches the 0.5 threshold OR an LLM judge confirms a conceptual match (OR of the two methods). The model outputs up to k candidate error excerpts.
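A minimal sketch of this scoring rule. The helper names are mine, and the similarity definition assumes `1 - edit_distance / max_length` over word tokens; the real harness may differ in details such as subspan matching.

```python
def word_levenshtein_sim(a: str, b: str) -> float:
    """Word-level Levenshtein similarity: 1 - distance / max token count."""
    x, y = a.split(), b.split()
    if not x and not y:
        return 1.0
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        cur = [i]
        for j, wy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (wx != wy)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(x), len(y))

def top_k_hit(predictions, ground_truths, k, threshold=0.5, judge=None):
    """A paper scores at k if any of the first k candidates matches by
    Levenshtein similarity OR (optionally) by judge verdict."""
    for pred in predictions[:k]:
        lev_ok = any(word_levenshtein_sim(pred, gt) >= threshold
                     for gt in ground_truths)
        judge_ok = judge(pred, ground_truths) if judge else False
        if lev_ok or judge_ok:
            return True
    return False
```

Note that k=1 accuracy can only be lower than k=10 accuracy under this rule, since the candidate list is simply truncated.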
| Model | k=1 | k=3 | k=10 |
|---|---|---|---|
| GPT-5 | 9.0% | 19.2% | 39.1% |
| DeepSeek Reasoner v3.1 | 5.9% | 16.3% | 35.2% |
| Claude Sonnet 4.5 | 5.2% | 12.6% | 21.5% |
Relevance: Directly tests the draft/ project's core capability — can multi-model analysis detect errors in documents? Deeply unsaturated: even GPT-5 finds the right error only 39% of the time.
Current best result: 50% combined k=10 accuracy on the tune split (20 papers, OpenAI subset).
| Setting | Value |
|---|---|
| Model | anthropic/claude-sonnet-4-6 |
| Tool use | None |
| Vario config | None (baseline, single-pass) |
| Thinking budget | None (no extended thinking) |
| Temperature | 0 |
| top_k | 10 (model returns exactly 10 candidate error excerpts) |
| Levenshtein threshold | 0.5 |
| Judge model | anthropic/claude-haiku-4-5-20251001 (temperature=0) |
| Prompt version | Fixed: "You MUST return exactly {num_chunks}" (count repeated 3x) |
| Word limit per excerpt | 100 |
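The decisive prompt fix was enforcing the candidate count. A hypothetical reconstruction of that prompt (wording is mine, not the exact template; it only illustrates the "count repeated 3x" enforcement from the config above):

```python
def build_prompt(paper_text: str, num_chunks: int = 10,
                 word_limit: int = 100) -> str:
    """Assemble the error-identification prompt, repeating the required
    candidate count three times so the model does not stop early."""
    return (
        f"Identify the {num_chunks} most likely inserted errors in the paper below.\n"
        f"You MUST return exactly {num_chunks} candidate error excerpts.\n"
        f"Each excerpt must be quoted verbatim, at most {word_limit} words.\n"
        f"Return exactly {num_chunks} excerpts -- no more, no fewer.\n\n"
        f"{paper_text}"
    )
```

Without this enforcement, both Haiku and Sonnet returned only 1-2 candidates per paper (see the run log below), which caps k=10 accuracy at whatever k=2 accuracy is.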
| Model | k=10 Accuracy | Dataset | Notes |
|---|---|---|---|
| GPT-5 | 39.1% | Full (713 papers) | Published FLAWS paper |
| DeepSeek Reasoner v3.1 | 35.2% | Full (713 papers) | Published FLAWS paper |
| Sonnet 4.6 (ours) | 50.0% | Tune split (20 papers) | OpenAI subset only; will likely regress toward 30-40% on full eval split |
| Sonnet 4.5 | 21.5% | Full (713 papers) | Published FLAWS paper |
The 50% score is on 20 tune-split papers. The full eval split (245 papers) will produce a more reliable number. Small-sample variance means the true accuracy is likely in the 30-50% range.
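A quick sanity check on that variance claim (not part of the benchmark): the 95% Wilson score interval for 10 hits out of 20 is roughly 30-70%, so the binomial uncertainty alone is wide; the narrower 30-50% expectation additionally assumes regression toward the published full-split numbers.

```python
from math import sqrt

def wilson_ci(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(10, 20)  # roughly (0.30, 0.70)
```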
| Method | k=1 | k=3 | k=10 |
|---|---|---|---|
| Levenshtein (threshold=0.5) | 2/20 (10%) | 3/20 (15%) | 7/20 (35%) |
| LLM Judge (Haiku 4.5) | 1/20 (5%) | 4/20 (20%) | 9/20 (45%) |
| Combined (OR) | 2/20 (10%) | 4/20 (20%) | 10/20 (50%) |
Of the 20 papers evaluated, 10 matched and 10 did not. The failures fall into three categories.
**Category 1 -- wrong section (7 papers).** The model identifies an error in a completely different part of the paper. Best similarity scores are below 0.2, indicating essentially no overlap with the ground truth: the model's candidates and the actual inserted error sit in different sections and cover different topics.
| Paper | Best Sim | Model's Prediction | Ground Truth |
|---|---|---|---|
| 2502.03444v2 | 0.090 | GMM modes / latent space quality claim | Position Encoding (2D RoPE applied to latents) |
| 2505.15025v1 | 0.105 | Quadratic function / linear policy derivation | Convex reformulation exactness claim |
| 2502.16658v1 | 0.138 | Distribution-free / DP volume optimality | Gaussian mixture lemma + theorem proof |
| 2502.10158v3 | 0.077 | MNL model kappa interpretation | Reward bounds + regret matching claim |
| 2503.07639v1 | 0.156 | MoE sparse activation / routing design | Interpretability + activation sparsity finding |
| 2502.01362v2 | 0.109 | Bridge matching loss regression target | Multistep distillation formulation |
| 2503.19595v2 | 0.194 | Gradient variance / pass@k objective | GRPO integration (baseline removal) |
The model focused on the MNL model's kappa parameter. The actual errors were a reward bound assumption change and a regret lower bound matching claim -- completely different sections.
**Category 2 -- right area, wrong passage (2 papers).** The model identifies an issue in the same conceptual area as the inserted error but targets a different specific passage. Best similarity is in the 0.09-0.15 range -- close enough to be in the neighborhood, but not the right passage.
| Paper | Best Sim | Model's Prediction | Ground Truth |
|---|---|---|---|
| 2506.11449v1 | 0.148 | Softmax denominator bug in TopK mechanism | Small-world network factor computation |
| 2502.16282v2 | 0.138 | Implicit alignment claims (intro section) | Alignment emergence experiment (results section) |
The model found what appears to be a genuine bug (wrong variable in softmax sum), but it is NOT the inserted error. This illustrates a key challenge: the model may find real errors that are not the planted one.
**Category 3 -- granularity mismatch (1 paper).** The model's prediction is thematically adjacent to the ground truth, but the excerpt targets a different granularity (e.g., the model quotes table values while the ground truth is a paragraph about the same table).
| Paper | Best Sim | Model's Prediction | Ground Truth |
|---|---|---|---|
| 2505.21363v3 | 0.094 | Subgroup choice / mitigation performance | Subgroup-label noise sensitivity (table data) |
This is the only failure where improving extraction granularity or matching logic could plausibly help. The remaining 9 failures are detection problems, not matching problems.
Levenshtein + Judge agree (6 papers, 30%): Both matching methods confirm. Model extracted an excerpt with substantial overlap to ground truth text.
Judge only (3 papers, 15%): The LLM judge recognized a conceptual match that Levenshtein missed. This occurs when the model found the right error but quoted a slightly different passage, or when LaTeX formatting reduces string similarity.
| Paper | Lev Sim | Judge's Assessment |
|---|---|---|
| 2506.03363v1 | 0.089 | Found the correct theorem section about uniform distribution minimizer |
| 2506.04870v1 | 0.222 | Found the regularizer interpretation passage |
| 2506.13095v1 | 0.250 | Identified the anomaly score consistency regularization loss |
Levenshtein only (1 paper, 5%): Paper 2503.06337v4 -- Levenshtein subspan matching found a hit but the judge did not confirm. Possible Levenshtein false positive from partial word overlap.
Example (Lev sim=0.089, judge: CORRECT): Levenshtein fails entirely because the model quoted the equation while the ground truth is a theorem paragraph. The judge recognized the conceptual overlap -- the model found an error within the theorem's scope.
Example (sim=0.938): near-perfect match; the model identified the correct error on the first try.
| | Judge: Match | Judge: No Match |
|---|---|---|
| Lev: Match | 6 (agree: correct) | 1 (Lev false positive?) |
| Lev: No Match | 3 (Judge found conceptual match) | 10 (agree: no match) |
Agreement rate: 80% (16/20). The 3 judge-only matches demonstrate the value of the LLM judge for handling excerpt granularity and LaTeX/text format mismatches.
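Recomputing the headline numbers directly from the 2x2 matrix above confirms they are internally consistent:

```python
# Cell counts from the confusion matrix (20 tune-split papers).
both, lev_only, judge_only, neither = 6, 1, 3, 10
n = both + lev_only + judge_only + neither     # 20 papers

combined = (both + lev_only + judge_only) / n  # OR of the two methods: 10/20
agreement = (both + neither) / n               # cells where methods agree: 16/20
```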
All runs attempted on this benchmark, in chronological order. The decisive variable was the prompt change in Run 3.
| Run | Description | Score (k=10) | Avg preds/paper | Notes |
|---|---|---|---|---|
| 1 | Haiku smoke test | 0% | 1.3 | Baseline validation, 5 questions |
| 2 | Sonnet, 50Q random, old prompt | 0% | 1.3 | Stronger model, same prompt -- still only 1-2 candidates output |
| 3 | Haiku, 2Q tune, fixed prompt | 0% | 10.0 | Prompt fix confirmed working -- 10 candidates now output. 0% on 2 papers is uninformative. |
| 4 | Sonnet, 20Q tune, fixed prompt | 50% | 10.0 | Current best. $2.03 identification + ~$0.04 judging. |

Run configurations:

- Run 1: Model: anthropic/claude-haiku-4-5-20251001 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: old ("return ... candidate error chunks", no count enforcement) | Sample: 5 (random)
- Run 2: Model: anthropic/claude-sonnet-4-6 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: old ("return ... candidate error chunks") | Sample: 50 (random) | Flag: --no-subscription
- Run 3: Model: anthropic/claude-haiku-4-5-20251001 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: haiku-4.5 | Prompt: fixed ("You MUST return exactly {num_chunks}", count repeated 3x) | Sample: 2 (tune split)
- Run 4: Model: anthropic/claude-sonnet-4-6 | Tools: none | Vario: none | Thinking: none | Temp: 0 | Lev threshold: 0.5 | Judge: anthropic/claude-haiku-4-5-20251001 | Prompt: fixed | Sample: 20 (tune split, OpenAI subset) | Flag: --no-subscription
Bimodal distribution: 95.5% of predictions have <0.2 similarity to any ground-truth span. Predictions either match well (>0.5) or not at all (<0.2); there is almost no "near miss" zone.
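That bimodality suggests a simple three-way bucketing of predictions by best similarity (boundaries 0.2 and 0.5, taken from the observation above; the bucket names are mine):

```python
def bucket(best_sim: float) -> str:
    """Classify a prediction's best ground-truth similarity."""
    if best_sim > 0.5:
        return "match"       # substantial textual overlap
    if best_sim >= 0.2:
        return "near-miss"   # almost empty zone in practice
    return "miss"            # no meaningful overlap
```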
| Threshold | Lev Matches | Accuracy | Delta from 0.5 |
|---|---|---|---|
| 0.50 (official) | 7/20 | 35% | -- |
| 0.40 | 7/20 | 35% | +0 |
| 0.30 | 7/20 | 35% | +0 |
| 0.20 | 9/20 | 45% | +2 |
| 0.15 | 18/20 | 90% | +11 (false positives) |
Lowering threshold from 0.5 to 0.3 gains zero additional matches. At 0.2, two papers are added but with questionable validity. At 0.15, nearly everything matches (meaningless). The official 0.5 threshold is appropriate.
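The sweep itself is a one-liner over per-paper best similarities. Note the real harness also does subspan matching per prediction, so counts from the appendix's "Best Sim" column alone will not exactly reproduce the table; the sketch below is illustrative, with made-up similarity values.

```python
def sweep_thresholds(best_sims, thresholds=(0.5, 0.4, 0.3, 0.2, 0.15)):
    """Count papers whose best similarity clears each threshold."""
    return {t: sum(s >= t for s in best_sims) for t in thresholds}
```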
The LLM judge (Haiku 4.5) adds 3 papers over Levenshtein alone (combined 10/20 vs Lev-only 7/20). These are cases where the model found the right error but quoted a different-granularity passage, or where LaTeX formatting reduced string similarity below 0.5.
| Component | Total | Per Paper |
|---|---|---|
| Error identification (Sonnet 4.6) | $2.03 | $0.10 |
| Judge evaluation (Haiku 4.5) | ~$0.04 | ~$0.002 |
| Total | $2.07 | $0.10 |
Projecting to full eval split (245 papers): ~$25. With extended thinking at 16K budget, expect 2-3x cost increase (~$50-75).
| Paper | #Pred | #GT | Best Sim | Lev k10 | Judge k10 | Combined | Mode |
|---|---|---|---|---|---|---|---|
| 2506.06895v2 | 10 | 2 | 0.552 | Y | Y | Y | BOTH |
| 2502.02531v3 | 10 | 11 | 0.938 | Y | Y | Y | BOTH |
| 2502.16025v2 | 10 | 4 | 0.938 | Y | Y | Y | BOTH |
| 2505.04163v1 | 10 | 7 | 0.950 | Y | Y | Y | BOTH |
| 2505.06744v1 | 10 | 3 | 0.179 | Y | Y | Y | BOTH |
| 2505.19097v2 | 10 | 11 | 0.600 | Y | Y | Y | BOTH |
| 2503.06337v4 | 10 | 10 | 0.118 | Y | N | Y | LEV |
| 2506.03363v1 | 10 | 2 | 0.089 | N | Y | Y | JUDGE |
| 2506.04870v1 | 10 | 4 | 0.222 | N | Y | Y | JUDGE |
| 2506.13095v1 | 10 | 2 | 0.250 | N | Y | Y | JUDGE |
| 2502.03444v2 | 10 | 4 | 0.090 | N | N | N | MISS |
| 2505.15025v1 | 10 | 4 | 0.105 | N | N | N | MISS |
| 2502.16658v1 | 10 | 7 | 0.138 | N | N | N | MISS |
| 2502.10158v3 | 10 | 4 | 0.077 | N | N | N | MISS |
| 2506.11449v1 | 10 | 4 | 0.148 | N | N | N | MISS |
| 2502.16282v2 | 10 | 4 | 0.138 | N | N | N | MISS |
| 2503.07639v1 | 10 | 6 | 0.156 | N | N | N | MISS |
| 2502.01362v2 | 10 | 2 | 0.109 | N | N | N | MISS |
| 2503.19595v2 | 10 | 2 | 0.194 | N | N | N | MISS |
| 2505.21363v3 | 10 | 10 | 0.094 | N | N | N | MISS |
Identification model: anthropic/claude-sonnet-4-6, temperature=0, no extended thinking. Judge model: anthropic/claude-haiku-4-5-20251001, temperature=0.