# SimpleQA Failure Analysis — Sonnet 4.6

**Date:** 2026-03-24 | **Split:** tune (50 questions) | **Grader:** Haiku 4.5

## Results

| Metric                    | Value       |
|---------------------------|-------------|
| Correct                   | 15/50 (30%) |
| Incorrect                 | 29/50 (58%) |
| Not Attempted             | 6/50 (12%)  |
| Accuracy Given Attempted  | 34.1%       |
| F1                        | 0.319       |
| Cost                      | $0.14       |
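
The two derived rows follow from the three raw counts. The sketch below is a minimal illustration, assuming F1 is the harmonic mean of overall accuracy and accuracy given attempted (the usual SimpleQA definition, which reproduces the 0.319 above). The function name `simpleqa_metrics` is illustrative and not part of the evaluation harness.

```python
def simpleqa_metrics(correct: int, incorrect: int, not_attempted: int) -> dict:
    """Derive summary metrics from the three raw grade counts."""
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect  # refusals are excluded from "attempted"

    overall = correct / total if total else 0.0
    given_attempted = correct / attempted if attempted else 0.0

    # Assumed definition: harmonic mean of the two accuracies.
    denom = overall + given_attempted
    f1 = 2 * overall * given_attempted / denom if denom else 0.0

    return {
        "overall_accuracy": overall,
        "accuracy_given_attempted": given_attempted,
        "f1": f1,
    }


metrics = simpleqa_metrics(correct=15, incorrect=29, not_attempted=6)
print(f"{metrics['accuracy_given_attempted']:.1%}")  # 34.1%
print(f"{metrics['f1']:.3f}")                        # 0.319
```

Because refusals lower overall accuracy but not accuracy given attempted, this F1 rewards a model that declines instead of guessing wrong.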

## Failure Taxonomy

| Category                  | Count | % of Failures | Description |
|---------------------------|-------|---------------|-------------|
| Confident Hallucination   | 18    | 51%           | Gives wrong answer with no hedging — most damaging |
| Hedged but Wrong          | 6     | 17%           | Expresses uncertainty but commits to wrong answer |
| Near Miss                 | 5     | 14%           | Close but wrong (off by 1 day/year, similar acronym) |
| Knowledge Gap / Refusal   | 6     | 17%           | Correctly declines to answer — best failure mode |
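
The percentages in this table use the 35 non-correct responses (29 incorrect plus 6 not attempted) as the denominator, not all 50 questions. A small sketch of that arithmetic, using the counts from the table; the variable and function names are illustrative.

```python
# Counts copied from the taxonomy table above.
taxonomy = {
    "Confident Hallucination": 18,
    "Hedged but Wrong": 6,
    "Near Miss": 5,
    "Knowledge Gap / Refusal": 6,
}
TOTAL_QUESTIONS = 50


def failure_shares(counts: dict, total_questions: int) -> dict:
    """Return (% of failures, % of all questions) for each category."""
    failures = sum(counts.values())
    return {
        name: (round(100 * n / failures), round(100 * n / total_questions))
        for name, n in counts.items()
    }


for name, (pct_failures, pct_all) in failure_shares(taxonomy, TOTAL_QUESTIONS).items():
    print(f"{name:<26} {pct_failures:>3}% of failures, {pct_all:>3}% of all questions")
```

The rounded failure shares sum to 99% rather than 100% because each is rounded independently.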

## Key Findings

1. **Confident hallucination is the dominant failure mode** (36% of all questions). The model fabricates plausible-sounding facts with no uncertainty markers. Sub-types: wrong dates/years (8), wrong names/entities (6), wrong numbers (4).

2. **The model is poorly calibrated.** When it answers (44/50 questions), it is right only 34% of the time. It hedges on just 6 of its 29 wrong answers and declines to answer 6 questions; 18 wrong answers are stated with full confidence, and the remaining 5 are near misses.

3. **Correct answers are shorter and more confident.** Average response length is 227 characters for correct answers, 256 for wrong answers, and 467 for refusals. Only 1 of 15 correct answers contains hedging language.

4. **No grading errors found.** All 50 grades are defensible under the SimpleQA rubric.

5. **Worst topics by accuracy:** Art (0%), History (0%), Geography (17%). **Best topics:** Sports (50%), Other (50%), Politics (42%).
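
The length and hedging statistics in finding 3 can be approximated with a simple pass over the response text. The sketch below is only an illustration: this summary does not say how hedging was detected, so the phrase list in `HEDGE_PATTERN` and the helper names are assumptions, and a different list would yield different counts.

```python
import re
from statistics import mean

# Hypothetical phrase list: an assumption, not the detector used for this analysis.
HEDGE_PATTERN = re.compile(
    r"\b(i think|i believe|i'm not (?:sure|certain)|not entirely (?:sure|certain)"
    r"|possibly|probably|might be|if i recall)\b",
    re.IGNORECASE,
)


def has_hedging(answer: str) -> bool:
    """True if the answer contains any phrase from the assumed hedge list."""
    return HEDGE_PATTERN.search(answer) is not None


def summarize(answers: list) -> dict:
    """Average character length and hedged-answer count for one grade bucket."""
    return {
        "n": len(answers),
        "avg_chars": mean(len(a) for a in answers) if answers else 0.0,
        "hedged": sum(has_hedging(a) for a in answers),
    }


# Made-up responses for illustration only, not drawn from the evaluation data.
sample = ["It was founded in 1987.", "I believe it was founded in 1987."]
print(summarize(sample))
```

Running `summarize` separately on the correct, wrong, and refused buckets gives a per-bucket comparison of the kind reported above.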

## Top Recommendations

1. **"Refuse if unsure" system prompt** — directly attacks 18 confident hallucinations. High impact.
2. **Web search augmentation** (`--native-web-search`) — addresses root cause (knowledge gaps). High impact, +20-30pp expected.
3. **Self-consistency / majority vote** — fabricated details vary across samples; correct answers don't. Medium impact, 3-5x cost.
4. **Terse answer instruction** — reduces self-talk-induced errors (model talks itself into wrong answers). Medium impact.
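
Recommendations 1 and 3 can be combined: sample several answers, take the majority, and abstain when agreement is weak. The following is a minimal sketch under stated assumptions, not the harness implementation. `sample_fn` stands in for whatever call returns one model answer, the `min_agreement` threshold of 0.6 is an arbitrary starting point, and `normalize` is a deliberately crude matcher.

```python
from collections import Counter
from typing import Callable, Optional


def normalize(answer: str) -> str:
    """Crude normalization so trivially different strings count as one vote."""
    return " ".join(answer.lower().split()).strip(" .")


def majority_vote(
    question: str,
    sample_fn: Callable[[str], str],  # hypothetical: returns one sampled answer
    n_samples: int = 5,
    min_agreement: float = 0.6,       # assumed threshold, tune on the split
) -> Optional[str]:
    """Return the most common answer, or None (abstain) if agreement is too low."""
    votes = Counter(normalize(sample_fn(question)) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer if count / n_samples >= min_agreement else None


def canned_sampler(answers):
    """Test stub that replays fixed answers in place of a real model call."""
    it = iter(answers)
    return lambda _question: next(it)


stable = canned_sampler(["Paris", "paris.", "Paris", "Lyon", "Paris"])
print(majority_vote("Capital of France?", stable))   # paris

unstable = canned_sampler(["1987", "1988", "1990", "1987", "1991"])
print(majority_vote("Founding year?", unstable))     # None
```

Returning `None` maps to a "Not Attempted" grade, so low-agreement questions become refusals instead of confident hallucinations. Exact string matching after normalization will undercount agreement between paraphrased answers, so a real setup would likely need a looser equivalence check.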

## Full Report

See [simpleqa_failure_analysis.html](simpleqa_failure_analysis.html) for detailed examples, per-question results, and complete analysis.
