SimpleQA Failure Analysis

Status report | Last updated 2026-03-24 | Conventions: REPORT_SPEC.md

Benchmark: SimpleQA

What it measures: Factoid question-answering accuracy — short, unambiguous questions with verified ground-truth answers. Tests whether a model can recall specific facts (dates, names, numbers) without hallucinating.

Source: OpenAI, 2024. GitHub | Blog post | Paper

Dataset: 4,326 questions across 10 topics (Science & Technology, Politics, Sports, Music, History, Art, Geography, TV Shows, Video Games, Other) with 5 answer types (Person, Date, Number, Place, Other).

Scoring: Two-pass — model generates a response, then an LLM grader classifies it as CORRECT (A), INCORRECT (B), or NOT_ATTEMPTED (C). F1 = harmonic mean of is_correct rate and accuracy-given-attempted.

ModelCorrect %NotesSource
Gemini 3 Pro72.1%llm-stats.com
GPT-4.562.5%OpenAI simple-evals
GPT-5 (thinking)55.0%GPT-5 System Card
Gemini 2.5 Pro54.0%llm-stats.com
o349.4%OpenAI simple-evals
GPT-5 (no thinking)46.0%GPT-5 System Card
o142.6%OpenAI simple-evals
GPT-4.141.6%OpenAI simple-evals
GPT-4o38.2%SimpleQA paper
Claude Sonnet 4.6 (ours)30.0%tune split, n=50This report
Claude 3.5 Sonnet28.9%SimpleQA paper
Claude 3 Opus23.5%SimpleQA paper

Note: No published SimpleQA scores exist for Claude Opus 4.5/4.6 or Sonnet 4.5/4.6. Claude models hedge heavily — on SimpleQA Verified, Opus 4 only attempts 35.5% of questions (54.1% correct when it does answer). Scores above 90% on third-party leaderboards typically indicate tool/browsing augmentation.

Relevance: Tests factual recall — a prerequisite for draft document analysis, intel dossier generation, and any task where confident wrong answers are dangerous. Directly measures the hallucination problem that vario strategies (web search, critique_revise, diverse_verify) are designed to mitigate.

Current State

30%
Correct (15/50)
58%
Incorrect (29/50)
12%
Not Attempted (6/50)
0.319
F1 Score
30%
58%
12%

Best Run Configuration

Model
anthropic/claude-sonnet-4-6
Tools
None (no web search, no native code)
Vario strategy
None (single-shot)
Thinking budget
None (0 reasoning tokens)
Temperature
0 (default)
Grader
anthropic/claude-haiku-4-5-20251001
Sample
50 questions (tune split, seed 42)
Subscription
No (--no-subscription)
Cost
$0.14 (gen $0.06, grading $0.08)
Avg latency
9,199ms per question

Reproduce

python -m benchmarks.eval.simpleqa_official sonnet --split tune --no-subscription

Comparison to Published Scores

Our 30% correct on the tune split is consistent with the published Claude 3.5 Sonnet score of 28.9%. On a different random 50-question sample (no split), we scored 38%, which is closer to GPT-4o's published 38.2%. The tune split appears harder than a random sample, likely due to more obscure questions in the fixed indices.

The current score is less than half the frontier (GPT-4.5 at 62.5%). No augmentation (web search, thinking, strategies) has been tested yet.

Data Quality

All 50 responses finished cleanly: 50/50 stop finish reason, 0 errors, 0 timeouts, 0 truncated. No extraction or grading infrastructure issues. All failures are model capability issues, not eval pipeline bugs.

Failure Taxonomy

35 failures across 50 questions (29 incorrect + 6 not_attempted), categorized by failure mode:

Category Count % of All (50) % of Failures (35) Description
Confident Hallucination 18 36% 51% Wrong answer, no uncertainty markers. Most damaging.
Hedged but Wrong 6 12% 17% Expresses uncertainty but commits to a wrong answer.
Near Miss 5 10% 14% Close but wrong (off by 1 day/year, similar acronym).
Knowledge Gap / Refusal 6 12% 17% Model declines to answer. Best failure mode — calibrated.
Grading Error 0 0% 0% No cases where grading was clearly wrong.

Key observation: Confident hallucination is the dominant failure mode (36% of all questions, 51% of failures). The model fabricates plausible-sounding facts with full confidence. Sub-types: wrong dates/years (8), wrong names/entities (6), wrong numbers (4).

If the model had refused on all 18 confident hallucinations instead of guessing, accuracy-given-attempted would jump from 34% to 58% and F1 from 0.319 to 0.401.

Confident Hallucination — Examples

Q2: In what year was Rosario Crocetta appointed Councillor for Culture in the City Council of Gela?
Expected: 1998
Model said: 1994
Rosario Crocetta was appointed Councillor for Culture in the City Council of Gela with the Federation of the Greens in **1994**.
Completely confident, no hedging. 4 years off. Fabricated a plausible year.
Q7: On what day, month, and year did Taiwanese conductor and violinist Felix Chen die?
Expected: April 9, 2018
Model said: October 3, 2020
Felix Chen (陳秋盛), the Taiwanese conductor and violinist, died on **October 3, 2020**.
Wrong month, wrong day, wrong year. Confidently stated with Chinese characters for verisimilitude. Pure fabrication.
Q13: What is the middle name of Motaz Azaiza, Palestinian journalist?
Expected: Hilal
Model said: Moeen
Motaz Azaiza's full name is **Motaz Moeen Azaiza**. His middle name is **Moeen**.
Fabricated a plausible Arabic middle name with full confidence.

Hedged but Wrong — Examples

Q1: Name the mission director of the Rohini Satellite 1 (RS-1) satellite launch in 1980.
Expected: Dr. Kalam
Model said: S. C. Gupta (then mentioned Kalam as Project Director)
The mission director of the Rohini Satellite 1 (RS-1) launch in 1980 was **S. C. Gupta**. However, I should note that I'm not entirely confident in this specific detail. The RS-1 was launched on July 18, 1980, aboard the SLV-3 rocket, with **A.P.J. Abdul Kalam** serving as the Project Director of the SLV-3 program.
Mentioned the correct answer (Kalam) but assigned the wrong role (Project Director instead of Mission Director). Primary answer contradicts gold target.
Q23: Which county refused to lower flags after the 2016 Orlando nightclub shooting?
Expected: Baldwin County in Alabama
Model said: Catawba County, North Carolina (then said it wasn't sure)
Following the 2016 Orlando nightclub shooting, **Brevard County, Florida** was notably discussed, but the county that refused to lower its flags to half-staff was **Catawba County, North Carolina**. [...] I'm not fully confident in the specific county name.
Named two wrong counties while hedging. The hedging shows the model sensed its uncertainty but still committed to wrong answers.
Q47: How many years did Sarah Young serve as a missionary in Japan?
Expected: 8
Model said: 2 years (from 1990 to 1992)
Sarah Young (the author of *Jesus Calling*) served as a missionary in Japan for **2 years**, from 1990 to 1992, with her husband before returning to the United States. However, I'm not fully confident in the specific details here.
Off by a factor of 4 (2 vs 8 years). Hedging was warranted but the model still committed to a wrong answer instead of refusing.

Near Miss — Examples

Q31: When was Louis Armstrong arrested at New Orleans Juvenile Court?
Expected: December 31, 1912
Model said: January 1, 1913
Louis Armstrong was arrested on **New Year's Day, January 1, 1913**, for firing a pistol into the air in celebration.
Off by 1 day at the NYE/NY boundary. Model knows the event (NYE celebration, pistol firing) but Dec 31 vs Jan 1 is a genuine error.
Q46: At what event in 2004 did David Hanson present K-Bot?
Expected: American Association for the Advancement of Science (AAAS) conference
Model said: AAAI (Association for the Advancement of Artificial Intelligence) conference
David Hanson presented K-Bot at the **AAAI (Association for the Advancement of Artificial Intelligence) conference** in 2004.
AAAS vs AAAI — similar acronym, both plausible venues. Classic near-miss hallucination.
Q35: When was Maurice Strong elected first head of UN Environment?
Expected: December 1972
Model said: January 1973
Maurice Strong was unanimously elected by the UN General Assembly to be the first Executive Director of the UN Environment Programme (UNEP) in **January 1973**.
Off by 1 month (Dec 1972 vs Jan 1973). Very close. UNEP was established by a GA resolution in December 1972.

Knowledge Gap / Refusal — Examples

Q14: According to the 2021 rules of Battle of the Nations, how many rounds does each longsword duel last?
Expected: 1
I don't have specific details about the 2021 Battle of the Nations ruleset for longsword duels readily available in my knowledge. [...] I'd recommend checking the official Battle of the Nations website.
Niche competitive rules question. Appropriate refusal.
Q26: What is the name of Track 6 on the Mario Kart 64 Greatest Hits Soundtrack released in 1997?
Expected: Koopa Castle
I don't have specific detailed track listing information for the **Mario Kart 64 Greatest Hits Soundtrack** released in 1997 readily available. I wouldn't want to guess incorrectly.
Specific track listing from a 1997 video game soundtrack. Reasonable to decline.
Q42: What are the dimensions in centimeters of the painting "Moving House" by Vasnetsov?
Expected: 53.5 x 67.2 cm
I don't have specific detailed catalog information about the exact dimensions in centimeters of **"Moving House"** by **Apollinary Vasnetsov** readily available with confidence.
Precise physical dimensions of a specific painting. Museum catalog data — appropriate refusal.

Logbook

All SimpleQA runs to date. No augmentation (web search, thinking, vario strategies) has been tested yet.

Run Model Score F1 Sample Tools Strategy Thinking Cost Notes
1 anthropic/claude-haiku-4-5-20251001 0% (0/3) 0 3 (random, seed 42) None None None $0.01 Smoke test only. Haiku refused all 3 questions (100% NOT_ATTEMPTED). Too weak — not useful as a solver, only as a grader.
2 anthropic/claude-sonnet-4-6 38% (19/50) ~0.42 50 (random, seed 42, no split) None None None ~$0.14 First real run. Random 50 from full dataset. 42% incorrect, 20% not attempted. Closer to published Claude 3.5 Sonnet (28.9%) and GPT-4o (38.2%).
3 anthropic/claude-sonnet-4-6 30% (15/50) 0.319 50 (tune split, seed 42) None None None $0.14 Current best. Tune split is harder than random sample. 58% incorrect, 12% not attempted. This is the run analyzed in this report.

All runs: temperature 0, API (not subscription), graded by anthropic/claude-haiku-4-5-20251001. Reproduce run 3: python -m benchmarks.eval.simpleqa_official sonnet --split tune --no-subscription

The drop from 38% (random) to 30% (tune split) suggests the tune split contains harder questions. The tune split is fixed (100 indices in benchmarks/configs/simpleqa_split.json) and should be used for all iteration; the eval split is reserved for final reporting.

Next Steps

Prioritized by expected impact. Nothing beyond single-shot vanilla generation has been tested yet — significant headroom likely exists.

1. "Refuse if unsure" system prompt
High Impact

Prepend: "Answer the following factual question. If you are not highly confident in your answer, say 'I don't know' rather than guessing."

Directly attacks the 18 confident hallucinations (36% of all questions). Even converting half to NOT_ATTEMPTED would lift accuracy-given-attempted from 34% to ~48%.

Expected: +10-15pp accuracy-given-attempted, +0.05-0.10 F1 | Cost: $0 (prompt change only)
2. Web search augmentation
High Impact

Two options: --native-web-search (provider-native, Anthropic's built-in) or --tools web_search (our tool-use wrapper). SimpleQA questions are factoid lookups — the canonical use case for retrieval augmentation.

Addresses the root cause (knowledge gaps) rather than symptoms (hallucination). Published RAG literature shows +20-30pp on factoid QA.

Expected: +20-30pp correct rate | Cost: ~$0.30-0.50 per 50Q (search API costs)
3. Extended thinking budget
Medium Impact

--thinking-budget 4096 or higher. May help the model catch its own hallucinations through explicit reasoning. The 5 near-miss questions might benefit most.

Uncertain payoff: thinking helps on reasoning tasks but factoid recall may not benefit from more deliberation. Could even hurt if the model "reasons" itself into a wrong answer.

Expected: uncertain, possibly +3-8pp | Cost: ~2-4x generation cost (~$0.12-0.24 per 50Q)
4. Multi-model consensus (vario diverse_verify or critique_revise)
Medium Impact

Sample 3-5 answers (same or different models), take the majority. If no majority, refuse. Fabricated details vary across samples; correct answers are stable. Also testable via vario's diverse_verify or critique_revise strategies.

Expected: +5-10pp correct rate | Cost: 3-5x generation cost (~$0.18-0.30 per 50Q)
5. Calibrated confidence scoring
Low Impact

Ask the model to output a confidence score (1-10) alongside its answer. Threshold at 7+ to answer, else refuse. The 6 hedged-wrong answers show the model sometimes detects uncertainty — an explicit confidence score might surface this more reliably.

Expected: +5pp accuracy-given-attempted | Cost: minimal
6. Compare against other models
Low Impact (diagnostic)

Run the same tune split with GPT-5.2, Gemini 3.1 Pro, and Grok to calibrate whether 30% is Sonnet-specific or dataset-specific. Helps separate model weakness from question difficulty.

Expected: diagnostic only | Cost: ~$0.15-$1.00 per model (varies)

Supplementary: Topic Breakdown

Topic Total Correct Incorrect Not Attempted Accuracy
Sports632150%
Other422050%
Politics1257042%
Science & Tech834138%
Music513120%
Geography615017%
Art40310%
History20200%
Video Games20110%
TV Shows10010%

Worst topics: Geography (17%), Art (0%), History (0%). Best topics: Sports (50%), Other (50%), Politics (42%). Questions about niche cultural facts, obscure dates, and specific geographic details are hardest.

Supplementary: Response Characteristics

Metric Correct Incorrect Not Attempted
Avg response length (chars) 227 256 467
Avg latency (ms) 8,884 9,031 10,794
Contains hedging language 1/15 (7%) 7/29 (24%) 6/6 (100%)

Correct answers are shorter, faster, and almost never hedge. Incorrect answers look structurally similar to correct ones — confident and direct, making them indistinguishable without ground truth. Not-attempted answers are 2x longer as the model explains why it cannot answer.

Supplementary: Hallucination Patterns

Pattern Count Examples Mechanism
Obscure person/entity details 10 Nathalie Menigon's birthday, Motaz Azaiza's middle name, Felix Chen's death date Model has partial knowledge of the person but not the specific detail asked. Fills in a plausible answer from adjacent knowledge.
Precise dates and years 8 Founding years, appointment years, event dates Often gets the right century/decade but wrong specific year. Partial knowledge with gap-filling.
Small-town/obscure location stats 3 Grand Mound population, Taz Russky population, Combita founding Census data and municipal facts for tiny places. Model fabricates plausible numbers.
Niche cultural products 3 Kinoko Teikoku album, Demon's Souls weapon weight, cricket team name Specific data from niche domains (Japanese indie music, video game stats, apartheid-era sports).

Appendix: All Results (Run 3)

Q# Question (truncated) Expected Grade Category
0Three cities where Arvind Kejriwal spent childhoodSonipat, Ghaziabad, HisarCORRECT-
1Mission director of RS-1 launch 1980Dr. KalamINCORRECTHedged wrong
2Year Crocetta appointed Councillor for Culture1998INCORRECTConfident hallucination
3Year Meyrick described Cydalima mysteris1886INCORRECTConfident hallucination
4Theme song performer for Shirley ValentinePatti AustinCORRECT-
5Weight of Phosphorescent Pole in Demon's Souls4.0 unitsINCORRECTConfident hallucination
6Year Kathryn Shaw stepped down from Studio 582020INCORRECTConfident hallucination
7Death date of Felix ChenApril 9, 2018INCORRECTConfident hallucination
8Kinoko Teikoku album released 2014Fake World WonderlandINCORRECTConfident hallucination
9Last Olympic fencing weapon to go electricalSabreCORRECT-
10Tokyo ward where AT-1 phono cartridge was createdShinjukuCORRECT-
11Number of genes in endometriosis GWAS review36INCORRECTHedged wrong
12Year Kristin Otto retired from swimming1989CORRECT-
13Middle name of Motaz AzaizaHilalINCORRECTConfident hallucination
14Longsword duel rounds in Battle of the Nations1NOT_ATTEMPTEDKnowledge gap
15Ciara's Tampa performance date for Jackie TourMay 16, 2015NOT_ATTEMPTEDKnowledge gap
16ECHR ruling date for Carola van Kuck12 June 2003CORRECT-
17Injuries in Weesp train disaster 191842INCORRECTHedged wrong
182020 Census population of Grand Mound, Iowa615INCORRECTConfident hallucination
19Date of Peter Struck's armed forces announcementJanuary 13, 2004INCORRECTConfident hallucination
20Enzyme with EC number 3.1.4.2Glycerophosphocholine phosphodiesteraseCORRECT-
21Years Ibrahim Rugova at University of Paris1976 to 1977INCORRECTNear miss
22Year Salgar, Antioquia founded1880CORRECT-
23County refusing to lower flags after Orlando shootingBaldwin County in AlabamaINCORRECTHedged wrong
24Word of the Decade 2010-2019 (ADS)theyCORRECT-
25Year Mascarin Peak renamed2003INCORRECTHedged wrong
26Track 6 on Mario Kart 64 soundtrackKoopa CastleNOT_ATTEMPTEDKnowledge gap
27Population of Taz Russky 2010175INCORRECTConfident hallucination
28Male victim in 2012 Delhi gang rapeAwindra Pratap PandeyCORRECT-
29Birthdate of Nathalie Menigon28 February 1957INCORRECTConfident hallucination
30Section 4.3.2 title in semantic maps paperMDS and formal paradigmsNOT_ATTEMPTEDKnowledge gap
31Date Louis Armstrong was arrestedDecember 31, 1912INCORRECTNear miss
32Designer of 50 francs with Little PrinceRoger PfundINCORRECTConfident hallucination
33Consecutive terms of Babanrao Gholap5CORRECT-
34Village in Ontario settled 1824 by Mr. A. HurdPrince AlbertINCORRECTConfident hallucination
35Month/year Maurice Strong elected to head UNEPDecember 1972INCORRECTNear miss
36Muller substitution minute in 2014 CL semifinal74CORRECT-
37What Alison Garrs overdosed on in Happy ValleyDiazepamNOT_ATTEMPTEDKnowledge gap
38Year Combita, Boyaca founded1586INCORRECTConfident hallucination
39Years Constable worked on Waterloo Bridge13INCORRECTConfident hallucination
40Age of Aniol Serrasolses kayaking glacial waterfall32INCORRECTHedged wrong
412017 case Natasha Merle was involved inBuck v. DavisINCORRECTConfident hallucination
42Dimensions of "Moving House" by Vasnetsov53.5 x 67.2 cmNOT_ATTEMPTEDKnowledge gap
43Year Iyanaga became Dean at Tokyo University1965INCORRECTConfident hallucination
441988 author defining Mammalia phylogeneticallyTimothy RoweCORRECT-
45Year Clive Derby-Lewis became Bedfordview councillor1972CORRECT-
46Event where David Hanson presented K-Bot in 2004AAAS conferenceINCORRECTNear miss
47Years Sarah Young served as missionary in Japan8INCORRECTHedged wrong
48Cricket team of Dr. Abu Baker AsvatThe CrescentsINCORRECTConfident hallucination
49Year Kiyosi Ito appointed to Cabinet Statistics Bureau1939CORRECT-

Run 3 details: Model: anthropic/claude-sonnet-4-6 | Grader: anthropic/claude-haiku-4-5-20251001 | Split: tune (50 questions) | Seed: 42 | Temperature: 0 | No tools, no web search, no thinking budget | API (not subscription) | Cost: $0.14 (gen: $0.06, grading: $0.08) | Avg latency: 9,199ms

Generated: 2026-03-24 | Conventions: REPORT_SPEC.md