February 17, 2026 — rivus/benchmarks
Each model was run on the same 100-question GPQA Diamond subset (seed=0) twice:
once closed-book (no tools), once with provider-native web search enabled.
All runs used --no-cache to ensure fresh API calls.
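The fixed-seed subset selection can be sketched as follows (a minimal illustration, not the harness's actual code; sample_subset is a hypothetical helper):

```python
import random

def sample_subset(question_ids, k=100, seed=0):
    # With a fixed seed, every run draws the identical k-question
    # subset, so closed-book and web-search runs are directly comparable.
    rng = random.Random(seed)
    return sorted(rng.sample(question_ids, k))
```

Sorting the draw keeps question order stable across runs regardless of sampling order.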
| Model | Closed-Book | + Native Web | Delta | Latency (closed) | Latency (web) |
|---|---|---|---|---|---|
| Gemini 3 Pro | 93% | 90% | −3% | 216s | 563s |
| GPT-5.2 Pro (reasoning: high) | 89% | 89% | +0% | 330s | 460s |
| Opus 4.6 | 85% | 87% | +2% | 92s | 89s |
| Grok 4.1 Fast Reasoning | 85% | 87% | +2% | 148s | 182s |
| GPT-5.2 | 75% | 68% | −7% | 67s | 73s |
GPQA Diamond contains PhD-level questions in physics, chemistry, and biology. The correct answers require deep domain knowledge and multi-step reasoning, not facts that can be Googled. Web results for these topics tend to be surface-level (Wikipedia summaries, study guides) that can mislead a model that already has the correct reasoning chain from training. The strongest model (Gemini 3 Pro at 93%) was hurt most — it had the most to lose from second-guessing itself with web noise.
The largest drop came from GPT-5.2 (non-Pro, no reasoning budget), which appears particularly susceptible to web-search distraction. Without extended reasoning to filter irrelevant search results, it incorporates misleading information into its answers. The Pro version with reasoning_effort: high was unaffected (89% → 89%), suggesting that reasoning budget acts as a filter against web noise.
| Model | Accuracy | Avg Latency | Cost | Tier |
|---|---|---|---|---|
| Gemini 3 Pro | 93% | 216s | $0* | Frontier |
| Gemini 3 Pro + web | 90% | 563s | $3.51 | Frontier |
| GPT-5.2 Pro (re:high) | 89% | 330s | $0* | Frontier |
| GPT-5.2 Pro + web | 89% | 460s | $0* | Frontier |
| Opus 4.6 + web | 87% | 89s | $3.31 | Frontier |
| Grok 4.1 Reasoning + web | 87% | 182s | $0* | Frontier |
| Opus 4.6 (think 16K) | 86% | 304s | $11.00 | Frontier |
| Opus 4.6 | 85% | 92s | $3.35 | Strong |
| Grok 4.1 Reasoning | 85% | 148s | $0* | Strong |
| MiniMax-M2.5 | 84% | 742s | $0 | Strong |
| Gemini 3 Flash | 83% | 48s | $0.29 | Strong |
| GPT-5.2 | 75% | 67s | $0* | Mid |
| GPT-5 Mini | 74% | 98s | $0* | Mid |
| GPT-5.2 + web | 68% | 73s | $0* | Hurt by web |
| Grok 4.1 Non-Reasoning | 64% | 25s | $0* | Budget |
| Haiku 4.5 | 58% | — | $0 | Budget |
* $0 cost = subscription/free tier or Responses API not returning cost metadata with web search tools.
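The tiering above is informal, but the accuracy-for-cost tradeoff behind it can be made explicit with a trivial ratio (a hypothetical helper, not part of the harness; only meaningful for rows with a real cost figure):

```python
def accuracy_per_dollar(accuracy_pct, cost_usd):
    # Percentage points of accuracy bought per dollar of API spend.
    # Undefined for $0 rows (subscription runs or missing cost metadata).
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return accuracy_pct / cost_usd
```

For example, Gemini 3 Flash buys roughly 286 accuracy points per dollar (83/0.29), versus roughly 25 for Opus 4.6 (85/3.35).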
We also tested whether an explicit thinking budget improves Opus 4.6 on GPQA Diamond.
| Configuration | Accuracy | Cost | Latency |
|---|---|---|---|
| Opus 4.6 (no thinking) | 85% | $3.35 | 92s |
| Opus 4.6 (think 16K budget) | 86% | $11.00 | 304s |
| Opus 4.6 + web search | 87% | $3.31 | 89s |
A 16K thinking budget costs 3.3x more ($11.00 vs $3.35) and takes 3.3x longer (304s vs 92s) for only +1% accuracy (85% → 86%); web search achieved +2% at lower cost. Per-question analysis showed 89% agreement between the think and no-think runs: thinking helped on 6 questions but hurt on 5 (overthinking/second-guessing).
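For reference, extended thinking on the Anthropic Messages API is enabled by a thinking block in the request. A minimal sketch of building those request parameters (thinking_request is a hypothetical helper; field names follow the Anthropic API):

```python
def thinking_request(prompt, budget_tokens=16384, max_tokens=20000):
    # Request parameters for an Anthropic Messages API call with extended
    # thinking. max_tokens must exceed the thinking budget, and the API
    # enforces a minimum budget of 1024 tokens.
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be >= 1024")
    if max_tokens <= budget_tokens:
        raise ValueError("max_tokens must exceed budget_tokens")
    return {
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```

The default of 16384 matches the think-16K configuration benchmarked above.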
On the 105-question MATH subset:

| Model | Accuracy | Cost | Avg Latency |
|---|---|---|---|
| Gemini 3 Flash | 95.2% | $0.16 | 5.3s |
| GPT-5 Mini | 93.3% | $0* | 8.6s |
| Haiku 4.5 | 92.4% | $0.27 | 3.2s |
| Grok 4.1 Non-Reasoning | 70.5% | $0* | 3.6s |
70% of questions were answered identically by all 4 models. Of the 32 disagreements, Grok was uniquely wrong on 21 (66%) — dragging down the agreement rate. The top 3 models (Gemini Flash, GPT-5 Mini, Haiku) agreed on 91% of questions.
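The disagreement breakdown above can be computed from a grid of per-model correctness flags. A sketch (uniquely_wrong is a hypothetical helper, not the harness's code):

```python
from collections import Counter

def uniquely_wrong(correct):
    # correct: {model_name: [bool per question]}, all lists the same length.
    # A model is "uniquely wrong" on a question when it is the only model
    # to miss it while every other model answers correctly.
    counts = Counter()
    n_questions = len(next(iter(correct.values())))
    for q in range(n_questions):
        wrong = [m for m, flags in correct.items() if not flags[q]]
        if len(wrong) == 1:
            counts[wrong[0]] += 1
    return counts
```

Run over the four models' MATH results, this is the calculation that attributes 21 of the 32 disagreements to Grok alone.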
Setup notes (web-search runs used native_web_search=True):
- Web search: web_search_options parameter or web_search tool, depending on provider
- --no-cache for fresh API calls
- Thinking: thinking.budget_tokens=16384 via the Anthropic API
- Model layer: lib/llm via litellm, with Anthropic OAuth subscription path + API billing fallback
- Harness: benchmarks/eval/ — AsyncBatchRunner with rich progress bar
- Run: python -m benchmarks.eval.gpqa_official <model> --sample 100 --native-web-search --no-cache

| Scenario | Recommendation |
|---|---|
| Best GPQA accuracy | Gemini 3 Pro, closed-book (93%) |
| Best accuracy/cost ratio | Gemini 3 Flash (83% at $0.29, 48s) |
| When to use web search | Current-events benchmarks, not knowledge-heavy science. Web search helps when training data is stale, not when deep reasoning is needed. |
| When to use thinking tokens | When you need the highest possible accuracy and cost/latency don't matter. +1% with 3.3x overhead. |
| Fast MATH evaluations | Gemini 3 Flash (95.2%, 5.3s, $0.16/105q) |
Generated from benchmarks/results/gpqa_official/ and benchmarks/results/math_official/.
Run CLI: python -m benchmarks.eval.gpqa_official <model> [--native-web-search] [--sample N]