# FLAWS Failure Analysis: Sonnet 4.6

**Tune split (20 papers, OpenAI subset) — 2026-03-24**

[Full HTML report](https://static.localhost/benchmarks/results/reports/flaws_failure_analysis.html)

## Executive Summary

The prompt fix requiring exactly 10 error candidates per paper improved results from **0% to 50%** accuracy at k=10 (combined Levenshtein + Judge). The previous run averaged only 1.34 predictions/paper; the fixed prompt achieves exactly 10.0.

10 of 20 papers (50%) remain unmatched. The dominant failure mode is the model finding a **completely different error** than the one inserted — not a matching/formatting issue. Lowering the Levenshtein threshold from 0.5 to 0.3 adds zero matches.

## Key Metrics

| Metric                       | Value          |
|------------------------------|----------------|
| Combined k=10 accuracy       | 10/20 (50%)    |
| Levenshtein k=10             | 7/20 (35%)     |
| Judge k=10                   | 9/20 (45%)     |
| Combined k=1 accuracy        | 2/20 (10%)     |
| Avg predictions per paper    | 10.0           |
| Identification cost          | $2.03          |
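The harness's exact scoring code isn't reproduced here, but the combined metric works as described above: a paper counts as solved at k=10 if any of its 10 candidates clears the Levenshtein similarity threshold or is accepted by the LLM judge. A minimal sketch (function names and the normalization formula are assumptions; FLAWS' implementation may normalize differently):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Normalized similarity in [0, 1]; 1.0 means identical strings.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def paper_matched(predictions, ground_truth, judge_accepts, threshold=0.5):
    # Combined k=10 rule: ANY of the (up to 10) candidates matches if it
    # clears the string-similarity threshold OR the judge accepts it.
    return any(
        similarity(p, ground_truth) >= threshold or judge_accepts(p)
        for p in predictions[:10]
    )
```

This also explains why "Matched (Judge only)" rows exist: a conceptually correct prediction phrased differently can fall below 0.5 similarity yet still be accepted by the judge.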

## Comparison to Published Results

| Model                       | k=10 Accuracy | Notes                              |
|-----------------------------|---------------|-------------------------------------|
| GPT-5                       | 39.1%         | Published (full dataset)            |
| **Sonnet 4.6 (ours)**       | **50.0%**     | Tune split (20 papers) only         |
| Sonnet 4.5                  | 21.5%         | Published (full dataset)            |
| Sonnet 4.6 (old prompt)     | 0.0%          | Before fix — 1.34 preds/paper       |

Note: the 50% figure comes from only 20 tune-split papers; expect it to regress toward 30-40% on the full eval split.

## Failure Taxonomy

| Category              | Count | Pct  | Description                                                   |
|-----------------------|-------|------|---------------------------------------------------------------|
| Matched (both)        | 6     | 30%  | Both Levenshtein and Judge agree                              |
| Matched (Lev only)    | 1     | 5%   | String match found, Judge disagrees                           |
| Matched (Judge only)  | 3     | 15%  | Conceptual match the Judge caught despite low string sim      |
| Wrong error           | 10    | 50%  | Model found a different error than the one inserted           |

## Threshold Sensitivity

| Threshold | Lev Matches | Accuracy | Delta |
|-----------|-------------|----------|-------|
| 0.50      | 7/20        | 35%      | --    |
| 0.40      | 7/20        | 35%      | +0    |
| 0.30      | 7/20        | 35%      | +0    |
| 0.20      | 9/20        | 45%      | +2    |

**Conclusion:** Matching is not the bottleneck. The official threshold (0.5) is appropriate.
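The sweep above is cheap to reproduce from cached scores: keep each paper's best Levenshtein similarity and re-count at each threshold, no re-matching required. A sketch (the `sweep` helper is hypothetical, not part of the FLAWS harness):

```python
def sweep(best_sims, thresholds=(0.5, 0.4, 0.3, 0.2)):
    # best_sims: one float per paper -- the best similarity any of its
    # predictions achieved against the ground-truth error.
    # Returns {threshold: number of papers matched at that threshold}.
    return {t: sum(s >= t for s in best_sims) for t in thresholds}
```

A flat curve from 0.5 down to 0.3, as observed here, means the unmatched papers' best candidates aren't near-misses in wording; they describe different errors entirely.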

## Root Causes (Wrong Error)

1. **Paper length overwhelms attention** — 30-80K chars of LaTeX; subtle error in one theorem among many
2. **Multiple plausible errors exist** — model sometimes finds real bugs that aren't the planted one
3. **Errors in supporting sections** — appendix proofs, table captions, secondary results
4. **No external knowledge baseline** — limited to internal consistency checks

## Recommendations

1. **Enable extended thinking (HIGH)** — 16K-32K thinking budget for proof tracing
2. **Multi-pass analysis (HIGH)** — Section-by-section verification instead of one-shot
3. **Multi-model consensus (MEDIUM)** — Union predictions from 3+ models
4. **Prompt: emphasize proofs/appendices (MEDIUM)** — Guide attention to where errors hide
5. **Run full eval split** — 245 papers, ~$25 estimated cost
6. **Do NOT invest in matching improvements** — not the bottleneck
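For recommendation 3, one simple consensus scheme is a round-robin union of each model's ranked candidate lists, deduplicated, capped at k=10 so the benchmark's prediction budget is respected. A sketch under those assumptions (helper name hypothetical):

```python
def union_top_k(per_model_preds, k=10):
    # Merge ranked prediction lists from several models round-robin,
    # skipping duplicates, until k candidates are collected. Taking one
    # candidate per model per rank keeps each model's top picks early
    # in the merged list, which matters for k=1 accuracy.
    seen, merged = set(), []
    for rank in range(k):
        for preds in per_model_preds:
            if rank < len(preds) and preds[rank] not in seen:
                seen.add(preds[rank])
                merged.append(preds[rank])
            if len(merged) == k:
                return merged
    return merged
```

Since the dominant failure mode is "wrong error," a union over models with different attention patterns directly targets the 50% of papers where a single model never surfaces the planted error at all.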
