Status report | Last updated 2026-03-24 | Conventions: REPORT_SPEC.md
What it measures: Factoid question-answering accuracy — short, unambiguous questions with verified ground-truth answers. Tests whether a model can recall specific facts (dates, names, numbers) without hallucinating.
Source: OpenAI, 2024. GitHub | Blog post | Paper
Dataset: 4,326 questions across 10 topics (Science & Technology, Politics, Sports, Music, History, Art, Geography, TV Shows, Video Games, Other) with 5 answer types (Person, Date, Number, Place, Other).
Scoring: Two-pass — model generates a response, then an LLM grader classifies it as CORRECT (A), INCORRECT (B), or NOT_ATTEMPTED (C). F1 = harmonic mean of is_correct rate and accuracy-given-attempted.
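The metric definitions above can be made concrete with a small sketch (grade letters as in the official scheme; field names here are illustrative, not from the official harness):

```python
# Sketch: compute SimpleQA-style metrics from per-question grades.
# Grades: "A" = CORRECT, "B" = INCORRECT, "C" = NOT_ATTEMPTED.
def simpleqa_f1(grades: list[str]) -> dict[str, float]:
    n = len(grades)
    correct = grades.count("A")
    attempted = correct + grades.count("B")
    is_correct = correct / n if n else 0.0
    acc_given_attempted = correct / attempted if attempted else 0.0
    denom = is_correct + acc_given_attempted
    f1 = 2 * is_correct * acc_given_attempted / denom if denom else 0.0
    return {"is_correct": is_correct,
            "acc_given_attempted": acc_given_attempted,
            "f1": f1}

# Run 3 of this report: 15 correct, 29 incorrect, 6 not attempted.
metrics = simpleqa_f1(["A"] * 15 + ["B"] * 29 + ["C"] * 6)  # f1 ≈ 0.319
```

Plugging in this report's run-3 counts reproduces the reported F1 of 0.319.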
| Model | Correct % | Notes | Source |
|---|---|---|---|
| Gemini 3 Pro | 72.1% | | llm-stats.com |
| GPT-4.5 | 62.5% | | OpenAI simple-evals |
| GPT-5 (thinking) | 55.0% | | GPT-5 System Card |
| Gemini 2.5 Pro | 54.0% | | llm-stats.com |
| o3 | 49.4% | | OpenAI simple-evals |
| GPT-5 (no thinking) | 46.0% | | GPT-5 System Card |
| o1 | 42.6% | | OpenAI simple-evals |
| GPT-4.1 | 41.6% | | OpenAI simple-evals |
| GPT-4o | 38.2% | | SimpleQA paper |
| Claude Sonnet 4.6 (ours) | 30.0% | tune split, n=50 | This report |
| Claude 3.5 Sonnet | 28.9% | | SimpleQA paper |
| Claude 3 Opus | 23.5% | | SimpleQA paper |
Note: No published SimpleQA scores exist for Claude Opus 4.5/4.6 or Sonnet 4.5/4.6. Claude models hedge heavily — on SimpleQA Verified, Opus 4 only attempts 35.5% of questions (54.1% correct when it does answer). Scores above 90% on third-party leaderboards typically indicate tool/browsing augmentation.
Relevance: Tests factual recall — a prerequisite for draft document analysis, intel dossier generation, and any task where confident wrong answers are dangerous. Directly measures the hallucination problem that vario strategies (web search, critique_revise, diverse_verify) are designed to mitigate.
Models: anthropic/claude-sonnet-4-6 (solver), anthropic/claude-haiku-4-5-20251001 (grader), API (--no-subscription). Reproduce: python -m benchmarks.eval.simpleqa_official sonnet --split tune --no-subscription
Our 30% correct on the tune split is consistent with the published Claude 3.5 Sonnet score of 28.9%. On a different random 50-question sample (no split), we scored 38%, which is closer to GPT-4o's published 38.2%. The tune split appears harder than a random sample, likely due to more obscure questions in the fixed indices.
The current score is less than half the frontier (Gemini 3 Pro at 72.1%; the best OpenAI model, GPT-4.5, at 62.5%). No augmentation (web search, thinking, strategies) has been tested yet.
All 50 responses finished cleanly: 50/50 stop finish reason, 0 errors, 0 timeouts, 0 truncated. No extraction or grading infrastructure issues. All failures are model capability issues, not eval pipeline bugs.
35 failures across 50 questions (29 incorrect + 6 not_attempted), categorized by failure mode:
| Category | Count | % of All (50) | % of Failures (35) | Description |
|---|---|---|---|---|
| Confident Hallucination | 18 | 36% | 51% | Wrong answer, no uncertainty markers. Most damaging. |
| Hedged but Wrong | 6 | 12% | 17% | Expresses uncertainty but commits to a wrong answer. |
| Near Miss | 5 | 10% | 14% | Close but wrong (off by 1 day/year, similar acronym). |
| Knowledge Gap / Refusal | 6 | 12% | 17% | Model declines to answer. Best failure mode — calibrated. |
| Grading Error | 0 | 0% | 0% | No cases where grading was clearly wrong. |
Key observation: Confident hallucination is the dominant failure mode (36% of all questions, 51% of failures). The model fabricates plausible-sounding facts with full confidence. Sub-types: wrong dates/years (8), wrong names/entities (6), wrong numbers (4).
If the model had refused on all 18 confident hallucinations instead of guessing, accuracy-given-attempted would jump from 34% to 58% (15/26) and F1 from 0.319 to roughly 0.395.
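This counterfactual follows directly from the metric definitions and the run's raw counts; a quick check:

```python
# What-if: recompute the metrics if the 18 confident hallucinations had
# been refusals instead of wrong answers (n = 50 throughout).
def metrics(correct: int, incorrect: int, n: int = 50) -> tuple[float, float]:
    is_correct = correct / n
    acc_given_attempted = correct / (correct + incorrect)
    f1 = 2 * is_correct * acc_given_attempted / (is_correct + acc_given_attempted)
    return acc_given_attempted, f1

baseline = metrics(15, 29)        # aga ≈ 0.341, F1 ≈ 0.319
what_if = metrics(15, 29 - 18)    # aga ≈ 0.577, F1 ≈ 0.395
```

The is_correct rate (15/50) is unchanged by refusals, so the F1 gain comes entirely from the accuracy-given-attempted term.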
All SimpleQA runs to date. No augmentation (web search, thinking, vario strategies) has been tested yet.
| Run | Model | Score | F1 | Sample | Tools | Strategy | Thinking | Cost | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | anthropic/claude-haiku-4-5-20251001 | 0% (0/3) | 0 | 3 (random, seed 42) | None | None | None | $0.01 | Smoke test only. Haiku refused all 3 questions (100% NOT_ATTEMPTED). Too weak — not useful as a solver, only as a grader. |
| 2 | anthropic/claude-sonnet-4-6 | 38% (19/50) | ~0.42 | 50 (random, seed 42, no split) | None | None | None | ~$0.14 | First real run. Random 50 from full dataset. 42% incorrect, 20% not attempted. Closer to published Claude 3.5 Sonnet (28.9%) and GPT-4o (38.2%). |
| 3 | anthropic/claude-sonnet-4-6 | 30% (15/50) | 0.319 | 50 (tune split, seed 42) | None | None | None | $0.14 | Current best. Tune split is harder than random sample. 58% incorrect, 12% not attempted. This is the run analyzed in this report. |
All runs: temperature 0, API (not subscription), graded by anthropic/claude-haiku-4-5-20251001. Reproduce run 3: python -m benchmarks.eval.simpleqa_official sonnet --split tune --no-subscription
The drop from 38% (random) to 30% (tune split) suggests the tune split contains harder questions. The tune split is fixed (100 indices in benchmarks/configs/simpleqa_split.json) and should be used for all iteration; the eval split is reserved for final reporting.
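Loading the fixed split is straightforward; a sketch under an assumed schema (a JSON object with "tune" and "eval" index lists — check benchmarks/configs/simpleqa_split.json for the actual layout):

```python
import json

def load_split(path: str) -> tuple[list[int], list[int]]:
    """Return (tune_indices, eval_indices) from the split file.

    Assumed layout: {"tune": [...], "eval": [...]}, each a list of
    question indices into the full 4,326-question dataset.
    """
    with open(path) as f:
        split = json.load(f)
    return split["tune"], split["eval"]
```

Keeping the tune indices fixed is what makes run-over-run comparisons meaningful; the eval indices should never be touched during iteration.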
Prioritized by expected impact. Nothing beyond single-shot vanilla generation has been tested yet — significant headroom likely exists.
Prepend: "Answer the following factual question. If you are not highly confident in your answer, say 'I don't know' rather than guessing."
Directly attacks the 18 confident hallucinations (36% of all questions). Even converting half to NOT_ATTEMPTED would lift accuracy-given-attempted from 34% (15/44) to ~43% (15/35).
Two options: --native-web-search (provider-native, Anthropic's built-in) or --tools web_search (our tool-use wrapper). SimpleQA questions are factoid lookups — the canonical use case for retrieval augmentation.
Addresses the root cause (knowledge gaps) rather than symptoms (hallucination). Published RAG literature shows +20-30pp on factoid QA.
--thinking-budget 4096 or higher. May help the model catch its own hallucinations through explicit reasoning. The 5 near-miss questions might benefit most.
Uncertain payoff: thinking helps on reasoning tasks but factoid recall may not benefit from more deliberation. Could even hurt if the model "reasons" itself into a wrong answer.
Sample 3-5 answers (same or different models), take the majority. If no majority, refuse. Fabricated details vary across samples; correct answers are stable. Also testable via vario's diverse_verify or critique_revise strategies.
Ask the model to output a confidence score (1-10) alongside its answer. Threshold at 7+ to answer, else refuse. The 6 hedged-wrong answers show the model sometimes detects uncertainty — an explicit confidence score might surface this more reliably.
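One way to wire up the thresholding, assuming the model is prompted to end its reply with a line like "Confidence: 8/10" (this output format is an assumption for illustration, not an observed behavior):

```python
import re

# Matches a trailing self-reported score such as "Confidence: 8/10".
CONF_RE = re.compile(r"confidence:\s*(\d+)\s*/\s*10", re.IGNORECASE)

def answer_or_refuse(response: str, threshold: int = 7) -> str:
    """Keep the answer only if self-reported confidence clears the threshold."""
    m = CONF_RE.search(response)
    if m is None or int(m.group(1)) < threshold:
        return "NOT_ATTEMPTED"
    # Strip the confidence line; keep the answer text.
    return CONF_RE.sub("", response).strip()

answer_or_refuse("1972\nConfidence: 9/10")  # kept -> "1972"
answer_or_refuse("1972\nConfidence: 4/10")  # refused
```

Treating a missing confidence line as a refusal is a deliberately conservative default; the threshold itself should be tuned on the tune split.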
Run the same tune split with GPT-5.2, Gemini 3.1 Pro, and Grok to calibrate whether 30% is Sonnet-specific or dataset-specific. Helps separate model weakness from question difficulty.
| Topic | Total | Correct | Incorrect | Not Attempted | Accuracy |
|---|---|---|---|---|---|
| Sports | 6 | 3 | 2 | 1 | 50% |
| Other | 4 | 2 | 2 | 0 | 50% |
| Politics | 12 | 5 | 7 | 0 | 42% |
| Science & Tech | 8 | 3 | 4 | 1 | 38% |
| Music | 5 | 1 | 3 | 1 | 20% |
| Geography | 6 | 1 | 5 | 0 | 17% |
| Art | 4 | 0 | 3 | 1 | 0% |
| History | 2 | 0 | 2 | 0 | 0% |
| Video Games | 2 | 0 | 1 | 1 | 0% |
| TV Shows | 1 | 0 | 0 | 1 | 0% |
Worst topics: Art, History, Video Games, and TV Shows (0% each, small n), then Geography (17%). Best topics: Sports (50%), Other (50%), Politics (42%). Questions about niche cultural facts, obscure dates, and specific geographic details are hardest.
| Metric | Correct | Incorrect | Not Attempted |
|---|---|---|---|
| Avg response length (chars) | 227 | 256 | 467 |
| Avg latency (ms) | 8,884 | 9,031 | 10,794 |
| Contains hedging language | 1/15 (7%) | 7/29 (24%) | 6/6 (100%) |
Correct answers are shorter, faster, and almost never hedge. Incorrect answers look structurally similar to correct ones — confident and direct, making them indistinguishable without ground truth. Not-attempted answers are 2x longer as the model explains why it cannot answer.
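The "contains hedging language" flag above amounts to a phrase match; a sketch of the kind of heuristic involved (this phrase list is illustrative, not the exact criterion used to produce the table):

```python
# Illustrative hedging detector: flags uncertainty markers in a response.
HEDGE_PHRASES = (
    "i believe", "i think", "i'm not certain", "i am not certain",
    "if i recall", "possibly", "likely", "approximately", "may have",
)

def contains_hedging(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in HEDGE_PHRASES)

contains_hedging("I believe it was 1972.")  # True
contains_hedging("1972.")                   # False
```

Because incorrect answers mostly look as confident as correct ones (24% vs 7% hedging), this signal alone cannot rescue the confident-hallucination cases; it mainly helps route the hedged-wrong ones to refusal.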
| Pattern | Count | Examples | Mechanism |
|---|---|---|---|
| Obscure person/entity details | 10 | Nathalie Menigon's birthday, Motaz Azaiza's middle name, Felix Chen's death date | Model has partial knowledge of the person but not the specific detail asked. Fills in a plausible answer from adjacent knowledge. |
| Precise dates and years | 8 | Founding years, appointment years, event dates | Often gets the right century/decade but wrong specific year. Partial knowledge with gap-filling. |
| Small-town/obscure location stats | 3 | Grand Mound population, Taz Russky population, Combita founding | Census data and municipal facts for tiny places. Model fabricates plausible numbers. |
| Niche cultural products | 3 | Kinoko Teikoku album, Demon's Souls weapon weight, cricket team name | Specific data from niche domains (Japanese indie music, video game stats, apartheid-era sports). |
| Q# | Question (truncated) | Expected | Grade | Category |
|---|---|---|---|---|
| 0 | Three cities where Arvind Kejriwal spent childhood | Sonipat, Ghaziabad, Hisar | CORRECT | - |
| 1 | Mission director of RS-1 launch 1980 | Dr. Kalam | INCORRECT | Hedged wrong |
| 2 | Year Crocetta appointed Councillor for Culture | 1998 | INCORRECT | Confident hallucination |
| 3 | Year Meyrick described Cydalima mysteris | 1886 | INCORRECT | Confident hallucination |
| 4 | Theme song performer for Shirley Valentine | Patti Austin | CORRECT | - |
| 5 | Weight of Phosphorescent Pole in Demon's Souls | 4.0 units | INCORRECT | Confident hallucination |
| 6 | Year Kathryn Shaw stepped down from Studio 58 | 2020 | INCORRECT | Confident hallucination |
| 7 | Death date of Felix Chen | April 9, 2018 | INCORRECT | Confident hallucination |
| 8 | Kinoko Teikoku album released 2014 | Fake World Wonderland | INCORRECT | Confident hallucination |
| 9 | Last Olympic fencing weapon to go electrical | Sabre | CORRECT | - |
| 10 | Tokyo ward where AT-1 phono cartridge was created | Shinjuku | CORRECT | - |
| 11 | Number of genes in endometriosis GWAS review | 36 | INCORRECT | Hedged wrong |
| 12 | Year Kristin Otto retired from swimming | 1989 | CORRECT | - |
| 13 | Middle name of Motaz Azaiza | Hilal | INCORRECT | Confident hallucination |
| 14 | Longsword duel rounds in Battle of the Nations | 1 | NOT_ATTEMPTED | Knowledge gap |
| 15 | Ciara's Tampa performance date for Jackie Tour | May 16, 2015 | NOT_ATTEMPTED | Knowledge gap |
| 16 | ECHR ruling date for Carola van Kuck | 12 June 2003 | CORRECT | - |
| 17 | Injuries in Weesp train disaster 1918 | 42 | INCORRECT | Hedged wrong |
| 18 | 2020 Census population of Grand Mound, Iowa | 615 | INCORRECT | Confident hallucination |
| 19 | Date of Peter Struck's armed forces announcement | January 13, 2004 | INCORRECT | Confident hallucination |
| 20 | Enzyme with EC number 3.1.4.2 | Glycerophosphocholine phosphodiesterase | CORRECT | - |
| 21 | Years Ibrahim Rugova at University of Paris | 1976 to 1977 | INCORRECT | Near miss |
| 22 | Year Salgar, Antioquia founded | 1880 | CORRECT | - |
| 23 | County refusing to lower flags after Orlando shooting | Baldwin County in Alabama | INCORRECT | Hedged wrong |
| 24 | Word of the Decade 2010-2019 (ADS) | they | CORRECT | - |
| 25 | Year Mascarin Peak renamed | 2003 | INCORRECT | Hedged wrong |
| 26 | Track 6 on Mario Kart 64 soundtrack | Koopa Castle | NOT_ATTEMPTED | Knowledge gap |
| 27 | Population of Taz Russky 2010 | 175 | INCORRECT | Confident hallucination |
| 28 | Male victim in 2012 Delhi gang rape | Awindra Pratap Pandey | CORRECT | - |
| 29 | Birthdate of Nathalie Menigon | 28 February 1957 | INCORRECT | Confident hallucination |
| 30 | Section 4.3.2 title in semantic maps paper | MDS and formal paradigms | NOT_ATTEMPTED | Knowledge gap |
| 31 | Date Louis Armstrong was arrested | December 31, 1912 | INCORRECT | Near miss |
| 32 | Designer of 50 francs with Little Prince | Roger Pfund | INCORRECT | Confident hallucination |
| 33 | Consecutive terms of Babanrao Gholap | 5 | CORRECT | - |
| 34 | Village in Ontario settled 1824 by Mr. A. Hurd | Prince Albert | INCORRECT | Confident hallucination |
| 35 | Month/year Maurice Strong elected to head UNEP | December 1972 | INCORRECT | Near miss |
| 36 | Muller substitution minute in 2014 CL semifinal | 74 | CORRECT | - |
| 37 | What Alison Garrs overdosed on in Happy Valley | Diazepam | NOT_ATTEMPTED | Knowledge gap |
| 38 | Year Combita, Boyaca founded | 1586 | INCORRECT | Confident hallucination |
| 39 | Years Constable worked on Waterloo Bridge | 13 | INCORRECT | Confident hallucination |
| 40 | Age of Aniol Serrasolses kayaking glacial waterfall | 32 | INCORRECT | Hedged wrong |
| 41 | 2017 case Natasha Merle was involved in | Buck v. Davis | INCORRECT | Confident hallucination |
| 42 | Dimensions of "Moving House" by Vasnetsov | 53.5 x 67.2 cm | NOT_ATTEMPTED | Knowledge gap |
| 43 | Year Iyanaga became Dean at Tokyo University | 1965 | INCORRECT | Confident hallucination |
| 44 | 1988 author defining Mammalia phylogenetically | Timothy Rowe | CORRECT | - |
| 45 | Year Clive Derby-Lewis became Bedfordview councillor | 1972 | CORRECT | - |
| 46 | Event where David Hanson presented K-Bot in 2004 | AAAS conference | INCORRECT | Near miss |
| 47 | Years Sarah Young served as missionary in Japan | 8 | INCORRECT | Hedged wrong |
| 48 | Cricket team of Dr. Abu Baker Asvat | The Crescents | INCORRECT | Confident hallucination |
| 49 | Year Kiyosi Ito appointed to Cabinet Statistics Bureau | 1939 | CORRECT | - |
Run 3 details: Model: anthropic/claude-sonnet-4-6 | Grader: anthropic/claude-haiku-4-5-20251001 | Split: tune (50 questions) | Seed: 42 | Temperature: 0 | No tools, no web search, no thinking budget | API (not subscription) | Cost: $0.14 (gen: $0.06, grading: $0.08) | Avg latency: 9,199ms
Generated: 2026-03-24 | Conventions: REPORT_SPEC.md