SimpleQA Failure Analysis

Status report | Last updated 2026-03-24 | Conventions: REPORT_SPEC.md

Benchmark: SimpleQA

What it measures: Factoid question-answering accuracy — short, unambiguous questions with verified ground-truth answers. Tests whether a model can recall specific facts (dates, names, numbers) without hallucinating.

Source: OpenAI, 2024. GitHub | Blog post | Paper

Dataset: 4,326 questions across 10 topics (Science & Technology, Politics, Sports, Music, History, Art, Geography, TV Shows, Video Games, Other) with 5 answer types (Person, Date, Number, Place, Other).

Scoring: Two-pass — model generates a response, then an LLM grader classifies it as CORRECT (A), INCORRECT (B), or NOT_ATTEMPTED (C). F1 = harmonic mean of is_correct rate and accuracy-given-attempted.
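The F1 definition above can be checked with a few lines of Python. This is a minimal sketch, not the harness's own scoring code: the function name `simpleqa_f1` and the list-of-letters input are assumptions made for illustration, and only the A/B/C grade letters come from the grader description.

```python
from collections import Counter


def simpleqa_f1(grades):
    """Compute correct rate, accuracy-given-attempted, and F1 from grade letters.

    grades: iterable of "A" (CORRECT), "B" (INCORRECT), "C" (NOT_ATTEMPTED).
    """
    counts = Counter(grades)
    total = sum(counts.values())
    correct = counts["A"]
    attempted = counts["A"] + counts["B"]

    correct_rate = correct / total if total else 0.0
    acc_given_attempted = correct / attempted if attempted else 0.0

    # Harmonic mean of the two rates; defined as 0 when both are 0.
    denom = correct_rate + acc_given_attempted
    f1 = 2 * correct_rate * acc_given_attempted / denom if denom else 0.0
    return correct_rate, acc_given_attempted, f1


# Counts from the run analyzed in this report: 15 correct, 29 incorrect, 6 not attempted.
grades = ["A"] * 15 + ["B"] * 29 + ["C"] * 6
correct_rate, acc, f1 = simpleqa_f1(grades)
print(f"correct={correct_rate:.3f} acc_given_attempted={acc:.3f} f1={f1:.3f}")
```

Because both rates share the same numerator, F1 rewards refusing over guessing wrong: a refusal shrinks the attempted denominator, while a wrong guess does not.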

Model	Correct %	Notes	Source
Gemini 3 Pro	72.1%		llm-stats.com
GPT-4.5	62.5%		OpenAI simple-evals
GPT-5 (thinking)	55.0%		GPT-5 System Card
Gemini 2.5 Pro	54.0%		llm-stats.com
o3	49.4%		OpenAI simple-evals
GPT-5 (no thinking)	46.0%		GPT-5 System Card
o1	42.6%		OpenAI simple-evals
GPT-4.1	41.6%		OpenAI simple-evals
GPT-4o	38.2%		SimpleQA paper
Claude Sonnet 4.6 (ours)	30.0%	tune split, n=50	This report
Claude 3.5 Sonnet	28.9%		SimpleQA paper
Claude 3 Opus	23.5%		SimpleQA paper

Note: No published SimpleQA scores exist for Claude Opus 4.5/4.6 or Sonnet 4.5/4.6. Claude models hedge heavily — on SimpleQA Verified, Opus 4 only attempts 35.5% of questions (54.1% correct when it does answer). Scores above 90% on third-party leaderboards typically indicate tool/browsing augmentation.

Relevance: Tests factual recall — a prerequisite for draft document analysis, intel dossier generation, and any task where confident wrong answers are dangerous. Directly measures the hallucination problem that vario strategies (web search, critique_revise, diverse_verify) are designed to mitigate.

Current State

Correct: 30% (15/50)
Incorrect: 58% (29/50)
Not Attempted: 12% (6/50)
F1 Score: 0.319

Best Run Configuration

Model: anthropic/claude-sonnet-4-6
Tools: None (no web search, no native code)
Vario strategy: None (single-shot)
Thinking budget: None (0 reasoning tokens)
Temperature: 0 (default)
Grader: anthropic/claude-haiku-4-5-20251001
Sample: 50 questions (tune split, seed 42)
Subscription: No (--no-subscription)
Cost: $0.14 (gen $0.06, grading $0.08)
Avg latency: 9,199ms per question

Reproduce

python -m benchmarks.eval.simpleqa_official sonnet --split tune --no-subscription

Comparison to Published Scores

Our 30% correct on the tune split is consistent with the published Claude 3.5 Sonnet score of 28.9%. On a different random 50-question sample (no split), we scored 38%, which is closer to GPT-4o's published 38.2%. The tune split may be harder than a random sample, possibly due to more obscure questions in the fixed indices, but at n=50 the standard error on a 30% score is about 6.5pp, so the 8pp gap is within sampling noise.

The current score is just under half of GPT-4.5's 62.5% and well under half of the top published score in the table above (Gemini 3 Pro at 72.1%). No augmentation (web search, thinking, strategies) has been tested yet.
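To judge how much weight a 50-question score can bear, a confidence interval helps. The sketch below uses the Wilson score interval, a standard choice for small-sample binomial proportions. It is an illustration, not part of the eval harness; the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# The two Sonnet runs: 15/50 on the tune split, 19/50 on the random sample.
for label, correct in [("tune split", 15), ("random sample", 19)]:
    lo, hi = wilson_interval(correct, 50)
    print(f"{label}: {correct}/50 -> [{lo:.1%}, {hi:.1%}]")
```

The two intervals overlap substantially, so these runs alone cannot establish that one sample is harder than the other.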

Data Quality

All 50 responses finished cleanly: 50/50 stop finish reason, 0 errors, 0 timeouts, 0 truncated. No extraction or grading infrastructure issues. All failures are model capability issues, not eval pipeline bugs.

Failure Taxonomy

35 failures across 50 questions (29 incorrect + 6 not_attempted), categorized by failure mode:

Category	Count	% of All (50)	% of Failures (35)	Description
Confident Hallucination	18	36%	51%	Wrong answer, no uncertainty markers. Most damaging.
Hedged but Wrong	6	12%	17%	Expresses uncertainty but commits to a wrong answer.
Near Miss	5	10%	14%	Close but wrong (off by 1 day/year, similar acronym).
Knowledge Gap / Refusal	6	12%	17%	Model declines to answer. Best failure mode — calibrated.
Grading Error	0	0%	0%	No cases where grading was clearly wrong.

Key observation: Confident hallucination is the dominant failure mode (36% of all questions, 51% of failures). The model fabricates plausible-sounding facts with full confidence. Sub-types: wrong dates/years (8), wrong names/entities (6), wrong numbers (4).

If the model had refused on all 18 confident hallucinations instead of guessing, attempts would drop from 44 to 26, accuracy-given-attempted would rise from 34% (15/44) to 58% (15/26), and F1 from 0.319 to 0.395.
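The counterfactual above is plain arithmetic on the run's counts and can be recomputed for any number of converted answers. A minimal sketch, assuming the F1 definition from the Scoring section; the helper name `what_if_refused` is hypothetical.

```python
def what_if_refused(correct, incorrect, not_attempted, converted):
    """Re-score a run after turning `converted` incorrect answers into refusals."""
    if not 0 <= converted <= incorrect:
        raise ValueError("converted must be between 0 and the incorrect count")
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect - converted

    # Refusing never adds correct answers, so the correct rate is unchanged.
    correct_rate = correct / total
    acc_given_attempted = correct / attempted if attempted else 0.0

    denom = correct_rate + acc_given_attempted
    f1 = 2 * correct_rate * acc_given_attempted / denom if denom else 0.0
    return acc_given_attempted, f1


# Run 3 counts: 15 correct, 29 incorrect, 6 not attempted.
for converted in (0, 9, 18):
    acc, f1 = what_if_refused(15, 29, 6, converted)
    print(f"{converted:>2} refusals: acc_given_attempted={acc:.1%} f1={f1:.3f}")
```

This is an upper bound on what a refusal prompt can achieve, since it assumes the model refuses only on questions it would have answered incorrectly and never on ones it would have answered correctly.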

Confident Hallucination — Examples

Q2: In what year was Rosario Crocetta appointed Councillor for Culture in the City Council of Gela?

Expected: 1998

Model said: 1994

Rosario Crocetta was appointed Councillor for Culture in the City Council of Gela with the Federation of the Greens in **1994**.

Completely confident, no hedging. 4 years off. Fabricated a plausible year.

Q7: On what day, month, and year did Taiwanese conductor and violinist Felix Chen die?

Expected: April 9, 2018

Model said: October 3, 2020

Felix Chen (陳秋盛), the Taiwanese conductor and violinist, died on **October 3, 2020**.

Wrong month, wrong day, wrong year. Confidently stated with Chinese characters for verisimilitude. Pure fabrication.

Q13: What is the middle name of Motaz Azaiza, Palestinian journalist?

Expected: Hilal

Model said: Moeen

Motaz Azaiza's full name is **Motaz Moeen Azaiza**. His middle name is **Moeen**.

Fabricated a plausible Arabic middle name with full confidence.

Hedged but Wrong — Examples

Q1: Name the mission director of the Rohini Satellite 1 (RS-1) satellite launch in 1980.

Expected: Dr. Kalam

Model said: S. C. Gupta (then mentioned Kalam as Project Director)

The mission director of the Rohini Satellite 1 (RS-1) launch in 1980 was **S. C. Gupta**. However, I should note that I'm not entirely confident in this specific detail. The RS-1 was launched on July 18, 1980, aboard the SLV-3 rocket, with **A.P.J. Abdul Kalam** serving as the Project Director of the SLV-3 program.

Mentioned the correct answer (Kalam) but assigned the wrong role (Project Director instead of Mission Director). Primary answer contradicts gold target.

Q23: Which county refused to lower flags after the 2016 Orlando nightclub shooting?

Expected: Baldwin County in Alabama

Model said: Catawba County, North Carolina (then said it wasn't sure)

Following the 2016 Orlando nightclub shooting, **Brevard County, Florida** was notably discussed, but the county that refused to lower its flags to half-staff was **Catawba County, North Carolina**. [...] I'm not fully confident in the specific county name.

Named two wrong counties while hedging. The hedging shows the model sensed its uncertainty but still committed to wrong answers.

Q47: How many years did Sarah Young serve as a missionary in Japan?

Expected: 8

Model said: 2 years (from 1990 to 1992)

Sarah Young (the author of *Jesus Calling*) served as a missionary in Japan for **2 years**, from 1990 to 1992, with her husband before returning to the United States. However, I'm not fully confident in the specific details here.

Off by a factor of 4 (2 vs 8 years). Hedging was warranted but the model still committed to a wrong answer instead of refusing.

Near Miss — Examples

Q31: When was Louis Armstrong arrested at New Orleans Juvenile Court?

Expected: December 31, 1912

Model said: January 1, 1913

Louis Armstrong was arrested on **New Year's Day, January 1, 1913**, for firing a pistol into the air in celebration.

Off by 1 day at the NYE/NY boundary. Model knows the event (NYE celebration, pistol firing) but Dec 31 vs Jan 1 is a genuine error.

Q46: At what event in 2004 did David Hanson present K-Bot?

Expected: American Association for the Advancement of Science (AAAS) conference

Model said: AAAI (Association for the Advancement of Artificial Intelligence) conference

David Hanson presented K-Bot at the **AAAI (Association for the Advancement of Artificial Intelligence) conference** in 2004.

AAAS vs AAAI — similar acronym, both plausible venues. Classic near-miss hallucination.

Q35: When was Maurice Strong elected first head of UN Environment?

Expected: December 1972

Model said: January 1973

Maurice Strong was unanimously elected by the UN General Assembly to be the first Executive Director of the UN Environment Programme (UNEP) in **January 1973**.

Off by 1 month (Dec 1972 vs Jan 1973). Very close. UNEP was established by a GA resolution in December 1972.

Knowledge Gap / Refusal — Examples

Q14: According to the 2021 rules of Battle of the Nations, how many rounds does each longsword duel last?

Expected: 1

I don't have specific details about the 2021 Battle of the Nations ruleset for longsword duels readily available in my knowledge. [...] I'd recommend checking the official Battle of the Nations website.

Niche competitive rules question. Appropriate refusal.

Q26: What is the name of Track 6 on the Mario Kart 64 Greatest Hits Soundtrack released in 1997?

Expected: Koopa Castle

I don't have specific detailed track listing information for the **Mario Kart 64 Greatest Hits Soundtrack** released in 1997 readily available. I wouldn't want to guess incorrectly.

Specific track listing from a 1997 video game soundtrack. Reasonable to decline.

Q42: What are the dimensions in centimeters of the painting "Moving House" by Vasnetsov?

Expected: 53.5 x 67.2 cm

I don't have specific detailed catalog information about the exact dimensions in centimeters of **"Moving House"** by **Apollinary Vasnetsov** readily available with confidence.

Precise physical dimensions of a specific painting. Museum catalog data — appropriate refusal.

Logbook

All SimpleQA runs to date. No augmentation (web search, thinking, vario strategies) has been tested yet.

Run	Model	Score	F1	Sample	Tools	Strategy	Thinking	Cost	Notes
1	`anthropic/claude-haiku-4-5-20251001`	0% (0/3)	0	3 (random, seed 42)	None	None	None	$0.01	Smoke test only. Haiku refused all 3 questions (100% NOT_ATTEMPTED). Too weak — not useful as a solver, only as a grader.
2	`anthropic/claude-sonnet-4-6`	38% (19/50)	~0.42	50 (random, seed 42, no split)	None	None	None	~$0.14	First real run. Random 50 from full dataset. 42% incorrect, 20% not attempted. In line with published GPT-4o (38.2%) and above published Claude 3.5 Sonnet (28.9%).
3	`anthropic/claude-sonnet-4-6`	30% (15/50)	0.319	50 (tune split, seed 42)	None	None	None	$0.14	Current best on the tune split. Scored 8pp below the random sample in run 2. 58% incorrect, 12% not attempted. This is the run analyzed in this report.

All runs: temperature 0, API (not subscription), graded by anthropic/claude-haiku-4-5-20251001. Reproduce run 3: python -m benchmarks.eval.simpleqa_official sonnet --split tune --no-subscription

The drop from 38% (random) to 30% (tune split) may mean the tune split contains harder questions, though an 8pp gap between two 50-question samples is within sampling noise. The tune split is fixed (100 indices in benchmarks/configs/simpleqa_split.json) and should be used for all iteration; the eval split is reserved for final reporting.

Next Steps

Prioritized by expected impact. Nothing beyond single-shot vanilla generation has been tested yet — significant headroom likely exists.

1. "Refuse if unsure" system prompt

High Impact

Prepend: "Answer the following factual question. If you are not highly confident in your answer, say 'I don't know' rather than guessing."

Directly attacks the 18 confident hallucinations (36% of all questions). Even converting half (9) to NOT_ATTEMPTED would lift accuracy-given-attempted from 34% (15/44) to ~43% (15/35).

Expected: +10-15pp accuracy-given-attempted, +0.05-0.10 F1 | Cost: $0 (prompt change only)

2. Web search augmentation

High Impact

Two options: --native-web-search (provider-native, Anthropic's built-in) or --tools web_search (our tool-use wrapper). SimpleQA questions are factoid lookups — the canonical use case for retrieval augmentation.

Addresses the root cause (knowledge gaps) rather than symptoms (hallucination). Published RAG literature shows +20-30pp on factoid QA.

Expected: +20-30pp correct rate | Cost: ~$0.30-0.50 per 50Q (search API costs)

3. Extended thinking budget

Medium Impact

--thinking-budget 4096 or higher. May help the model catch its own hallucinations through explicit reasoning. The 5 near-miss questions might benefit most.

Uncertain payoff: thinking helps on reasoning tasks but factoid recall may not benefit from more deliberation. Could even hurt if the model "reasons" itself into a wrong answer.

Expected: uncertain, possibly +3-8pp | Cost: ~2-4x generation cost (~$0.12-0.24 per 50Q)

4. Multi-model consensus (vario diverse_verify or critique_revise)

Medium Impact

Sample 3-5 answers (same or different models), take the majority. If no majority, refuse. Fabricated details vary across samples; correct answers are stable. Also testable via vario's diverse_verify or critique_revise strategies.

Expected: +5-10pp correct rate | Cost: 3-5x generation cost (~$0.18-0.30 per 50Q)
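The "majority or refuse" rule described in this step can be sketched as follows. This is an illustration under stated assumptions, not vario's diverse_verify implementation: the helper names are hypothetical, and answers are compared by normalized exact string match, whereas a real run would likely need answer extraction or an LLM judge to decide whether two answers are equivalent.

```python
from collections import Counter


def normalize(answer):
    """Crude normalization so trivially different strings compare equal."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def majority_or_refuse(answers, min_share=0.5):
    """Return the majority answer, or None (refuse) if no answer exceeds min_share."""
    if not answers:
        return None
    counts = Counter(normalize(a) for a in answers)
    top, votes = counts.most_common(1)[0]
    return top if votes / len(answers) > min_share else None


# Hypothetical samples, for illustration only.
print(majority_or_refuse(["1998", "1998.", "1994"]))  # agreement: returns "1998"
print(majority_or_refuse(["1994", "1996", "1998"]))   # no majority: returns None
```

Note that all runs so far use temperature 0, where repeated samples from one model would be near-identical; this approach needs either a nonzero temperature or different models to produce diverse answers.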

5. Calibrated confidence scoring

Low Impact

Ask the model to output a confidence score (1-10) alongside its answer. Threshold at 7+ to answer, else refuse. The 6 hedged-wrong answers show the model sometimes detects uncertainty — an explicit confidence score might surface this more reliably.

Expected: +5pp accuracy-given-attempted | Cost: minimal
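A threshold gate of this kind could look like the sketch below. The output format ("Answer: ..." on one line, "Confidence: N" on another) is an assumption made for this example, not something the report specifies, and `gate_by_confidence` is a hypothetical helper name.

```python
import re

# Assumed response format (not from the report):
#   Answer: <text>
#   Confidence: <integer 1-10>
ANSWER_RE = re.compile(r"answer\s*:\s*(.+)", re.IGNORECASE)
CONFIDENCE_RE = re.compile(r"confidence\s*:\s*(\d{1,2})", re.IGNORECASE)


def gate_by_confidence(response, threshold=7):
    """Return the answer if stated confidence >= threshold, else None (refuse)."""
    ans = ANSWER_RE.search(response)
    conf = CONFIDENCE_RE.search(response)
    if not ans or not conf:
        return None  # unparseable output is treated as a refusal
    score = int(conf.group(1))
    if not 1 <= score <= 10:
        return None  # out-of-range score is treated as a refusal
    return ans.group(1).strip() if score >= threshold else None


# Hypothetical responses, for illustration only.
print(gate_by_confidence("Answer: Sabre\nConfidence: 9"))  # returns "Sabre"
print(gate_by_confidence("Answer: 1994\nConfidence: 4"))   # returns None
```

Whether self-reported scores actually separate correct from incorrect answers is the open question; sweeping the threshold on the tune split would show whether 7 is a sensible cutoff.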

6. Compare against other models

Low Impact (diagnostic)

Run the same tune split with GPT-5.2, Gemini 3.1 Pro, and Grok to calibrate whether 30% is Sonnet-specific or dataset-specific. Helps separate model weakness from question difficulty.

Expected: diagnostic only | Cost: ~$0.15-$1.00 per model (varies)

Supplementary: Topic Breakdown

Topic	Total	Correct	Incorrect	Not Attempted	Accuracy
Sports	6	3	2	1	50%
Other	4	2	2	0	50%
Politics	12	5	7	0	42%
Science & Tech	8	3	4	1	38%
Music	5	1	3	1	20%
Geography	6	1	5	0	17%
Art	4	0	3	1	0%
History	2	0	2	0	0%
Video Games	2	0	1	1	0%
TV Shows	1	0	0	1	0%

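Worst topics: Art (0/4), History (0/2), Video Games (0/2), and TV Shows (0/1) had no correct answers, followed by Geography (17%). Best topics: Sports (50%), Other (50%), Politics (42%). Per-topic samples are small (1 to 12 questions), so these rates are noisy. Questions about niche cultural facts, obscure dates, and specific geographic details appear hardest.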

Supplementary: Response Characteristics

Metric	Correct	Incorrect	Not Attempted
Avg response length (chars)	227	256	467
Avg latency (ms)	8,884	9,031	10,794
Contains hedging language	1/15 (7%)	7/29 (24%)	6/6 (100%)

Correct answers are the shortest and fastest, and almost never hedge. Incorrect answers look structurally similar to correct ones: 76% (22/29) contain no hedging language, and length and latency differ only slightly, so they are hard to tell apart without ground truth. Not-attempted answers are roughly 2x longer (467 vs 227 chars for correct answers) because the model explains why it cannot answer.
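The report does not say how the "Contains hedging language" row was computed. One simple way to approximate it is a phrase match, sketched below. The marker list is an assumption seeded from the responses quoted in this report, and `contains_hedging` is a hypothetical helper; a keyword list like this will miss paraphrases and should be spot-checked by hand.

```python
import re

# Hypothetical marker list, seeded from phrases in the responses quoted in this report.
HEDGE_PATTERNS = [
    r"not (?:entirely|fully|completely) (?:confident|sure|certain)",
    r"i don't have specific",
    r"i wouldn't want to guess",
    r"i'm not (?:sure|certain)",
]
HEDGE_RE = re.compile("|".join(HEDGE_PATTERNS), re.IGNORECASE)


def contains_hedging(response):
    """True if the response matches any hedging marker."""
    # Normalize curly apostrophes so both apostrophe styles match.
    return bool(HEDGE_RE.search(response.replace("\u2019", "'")))


# Excerpts from Q1 (hedged) and Q2 (confident) as quoted earlier in this report.
print(contains_hedging("However, I should note that I'm not entirely confident in this specific detail."))
print(contains_hedging("Rosario Crocetta was appointed Councillor for Culture in **1994**."))
```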

Supplementary: Hallucination Patterns

Pattern	Count	Examples	Mechanism
Obscure person/entity details	10	Nathalie Menigon's birthday, Motaz Azaiza's middle name, Felix Chen's death date	Model has partial knowledge of the person but not the specific detail asked. Fills in a plausible answer from adjacent knowledge.
Precise dates and years	8	Founding years, appointment years, event dates	Often gets the right century/decade but wrong specific year. Partial knowledge with gap-filling.
Small-town/obscure location stats	3	Grand Mound population, Taz Russky population, Combita founding	Census data and municipal facts for tiny places. Model fabricates plausible numbers.
Niche cultural products	3	Kinoko Teikoku album, Demon's Souls weapon weight, cricket team name	Specific data from niche domains (Japanese indie music, video game stats, apartheid-era sports).

Appendix: All Results (Run 3)

Q#	Question (truncated)	Expected	Grade	Category
0	Three cities where Arvind Kejriwal spent childhood	Sonipat, Ghaziabad, Hisar	CORRECT	-
1	Mission director of RS-1 launch 1980	Dr. Kalam	INCORRECT	Hedged wrong
2	Year Crocetta appointed Councillor for Culture	1998	INCORRECT	Confident hallucination
3	Year Meyrick described Cydalima mysteris	1886	INCORRECT	Confident hallucination
4	Theme song performer for Shirley Valentine	Patti Austin	CORRECT	-
5	Weight of Phosphorescent Pole in Demon's Souls	4.0 units	INCORRECT	Confident hallucination
6	Year Kathryn Shaw stepped down from Studio 58	2020	INCORRECT	Confident hallucination
7	Death date of Felix Chen	April 9, 2018	INCORRECT	Confident hallucination
8	Kinoko Teikoku album released 2014	Fake World Wonderland	INCORRECT	Confident hallucination
9	Last Olympic fencing weapon to go electrical	Sabre	CORRECT	-
10	Tokyo ward where AT-1 phono cartridge was created	Shinjuku	CORRECT	-
11	Number of genes in endometriosis GWAS review	36	INCORRECT	Hedged wrong
12	Year Kristin Otto retired from swimming	1989	CORRECT	-
13	Middle name of Motaz Azaiza	Hilal	INCORRECT	Confident hallucination
14	Longsword duel rounds in Battle of the Nations	1	NOT_ATTEMPTED	Knowledge gap
15	Ciara's Tampa performance date for Jackie Tour	May 16, 2015	NOT_ATTEMPTED	Knowledge gap
16	ECHR ruling date for Carola van Kuck	12 June 2003	CORRECT	-
17	Injuries in Weesp train disaster 1918	42	INCORRECT	Hedged wrong
18	2020 Census population of Grand Mound, Iowa	615	INCORRECT	Confident hallucination
19	Date of Peter Struck's armed forces announcement	January 13, 2004	INCORRECT	Confident hallucination
20	Enzyme with EC number 3.1.4.2	Glycerophosphocholine phosphodiesterase	CORRECT	-
21	Years Ibrahim Rugova at University of Paris	1976 to 1977	INCORRECT	Near miss
22	Year Salgar, Antioquia founded	1880	CORRECT	-
23	County refusing to lower flags after Orlando shooting	Baldwin County in Alabama	INCORRECT	Hedged wrong
24	Word of the Decade 2010-2019 (ADS)	they	CORRECT	-
25	Year Mascarin Peak renamed	2003	INCORRECT	Hedged wrong
26	Track 6 on Mario Kart 64 soundtrack	Koopa Castle	NOT_ATTEMPTED	Knowledge gap
27	Population of Taz Russky 2010	175	INCORRECT	Confident hallucination
28	Male victim in 2012 Delhi gang rape	Awindra Pratap Pandey	CORRECT	-
29	Birthdate of Nathalie Menigon	28 February 1957	INCORRECT	Confident hallucination
30	Section 4.3.2 title in semantic maps paper	MDS and formal paradigms	NOT_ATTEMPTED	Knowledge gap
31	Date Louis Armstrong was arrested	December 31, 1912	INCORRECT	Near miss
32	Designer of 50 francs with Little Prince	Roger Pfund	INCORRECT	Confident hallucination
33	Consecutive terms of Babanrao Gholap	5	CORRECT	-
34	Village in Ontario settled 1824 by Mr. A. Hurd	Prince Albert	INCORRECT	Confident hallucination
35	Month/year Maurice Strong elected to head UNEP	December 1972	INCORRECT	Near miss
36	Muller substitution minute in 2014 CL semifinal	74	CORRECT	-
37	What Alison Garrs overdosed on in Happy Valley	Diazepam	NOT_ATTEMPTED	Knowledge gap
38	Year Combita, Boyaca founded	1586	INCORRECT	Confident hallucination
39	Years Constable worked on Waterloo Bridge	13	INCORRECT	Confident hallucination
40	Age of Aniol Serrasolses kayaking glacial waterfall	32	INCORRECT	Hedged wrong
41	2017 case Natasha Merle was involved in	Buck v. Davis	INCORRECT	Confident hallucination
42	Dimensions of "Moving House" by Vasnetsov	53.5 x 67.2 cm	NOT_ATTEMPTED	Knowledge gap
43	Year Iyanaga became Dean at Tokyo University	1965	INCORRECT	Confident hallucination
44	1988 author defining Mammalia phylogenetically	Timothy Rowe	CORRECT	-
45	Year Clive Derby-Lewis became Bedfordview councillor	1972	CORRECT	-
46	Event where David Hanson presented K-Bot in 2004	AAAS conference	INCORRECT	Near miss
47	Years Sarah Young served as missionary in Japan	8	INCORRECT	Hedged wrong
48	Cricket team of Dr. Abu Baker Asvat	The Crescents	INCORRECT	Confident hallucination
49	Year Kiyosi Ito appointed to Cabinet Statistics Bureau	1939	CORRECT	-
