GPQA Diamond: Web Search Impact on Frontier Models

February 17, 2026 — rivus/benchmarks

TL;DR: Native web search is net-negative on GPQA Diamond for the best models. Gemini 3 Pro dropped 3% (93% → 90%), GPT-5.2 dropped 7% (75% → 68%). Only mid-tier models gained modestly (+2%). PhD-level science questions need deep reasoning, not web lookup — search introduces noise that misleads rather than helps.

Web Search Impact: Closed-Book vs Native Web Search

Each model was run on the same 100-question GPQA Diamond subset (seed=0) twice: once closed-book (no tools), once with provider-native web search enabled. All runs used --no-cache to ensure fresh API calls.
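The paired runs can be scripted against the eval CLI quoted at the end of this post. A minimal sketch, assuming only the documented flags (`--sample`, `--native-web-search`) plus the `--no-cache` flag mentioned above; any seed handling is left to the harness:

```python
import shlex

def eval_commands(model: str, sample: int = 100) -> tuple[str, str]:
    """Build the closed-book and web-search invocations for one model.

    --sample / --native-web-search come from the documented CLI;
    --no-cache forces fresh API calls, as in the runs above.
    """
    base = (
        f"python -m benchmarks.eval.gpqa_official {shlex.quote(model)} "
        f"--sample {sample} --no-cache"
    )
    return base, base + " --native-web-search"

closed, web = eval_commands("gemini-3-pro")  # model id is illustrative
```

Running both commands per model, then diffing the graded results, produces the delta column below.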

| Model | Closed-Book | + Native Web | Delta | Latency (closed) | Latency (web) |
|---|---|---|---|---|---|
| Gemini 3 Pro | 93% | 90% | −3% | 216s | 563s |
| GPT-5.2 Pro (reasoning: high) | 89% | 89% | +0% | 330s | 460s |
| Opus 4.6 | 85% | 87% | +2% | 92s | 89s |
| Grok 4.1 Fast Reasoning | 85% | 87% | +2% | 148s | 182s |
| GPT-5.2 | 75% | 68% | −7% | 67s | 73s |
Why does web search hurt the best models?

GPQA Diamond contains PhD-level questions in physics, chemistry, and biology. The correct answers require deep domain knowledge and multi-step reasoning, not facts that can be Googled. Web results for these topics tend to be surface-level (Wikipedia summaries, study guides) and can mislead a model that already has a correct reasoning chain from training. Even the strongest model (Gemini 3 Pro at 93%) lost ground: it had the most to lose from second-guessing itself with web noise.

GPT-5.2: −7% regression

The largest drop. GPT-5.2 (non-Pro, no reasoning budget) appears particularly susceptible to web search distraction. Without extended reasoning to filter irrelevant search results, it incorporates misleading information into its answers. The Pro version with reasoning_effort: high was immune (89% → 89%), suggesting that reasoning budget acts as a filter against web noise.

All Models Ranked (GPQA Diamond, 100 questions)

| Model | Accuracy | Avg Latency | Cost | Tier |
|---|---|---|---|---|
| Gemini 3 Pro | 93% | 216s | $0* | Frontier |
| Gemini 3 Pro + web | 90% | 563s | $3.51 | Frontier |
| GPT-5.2 Pro (re:high) | 89% | 330s | $0* | Frontier |
| GPT-5.2 Pro + web | 89% | 460s | $0* | Frontier |
| Opus 4.6 + web | 87% | 89s | $3.31 | Frontier |
| Grok 4.1 Reasoning + web | 87% | 182s | $0* | Frontier |
| Opus 4.6 (think 16K) | 86% | 304s | $11.00 | Frontier |
| Opus 4.6 | 85% | 92s | $3.35 | Strong |
| Grok 4.1 Reasoning | 85% | 148s | $0* | Strong |
| MiniMax-M2.5 | 84% | 742s | $0 | Strong |
| Gemini 3 Flash | 83% | 48s | $0.29 | Strong |
| GPT-5.2 | 75% | 67s | $0* | Mid |
| GPT-5 Mini | 74% | 98s | $0* | Mid |
| GPT-5.2 + web | 68% | 73s | $0* | Hurt by web |
| Grok 4.1 Non-Reasoning | 64% | 25s | $0* | Budget |
| Haiku 4.5 | 58% | — | $0 | Budget |

* $0 cost = subscription/free tier, or the Responses API did not return cost metadata when web search tools were enabled.

Thinking Tokens: Opus 4.6

We tested whether an explicit thinking budget improves Opus 4.6 on GPQA Diamond.

| Configuration | Accuracy | Cost | Latency |
|---|---|---|---|
| Opus 4.6 (no thinking) | 85% | $3.35 | 92s |
| Opus 4.6 (think 16K budget) | 86% | $11.00 | 304s |
| Opus 4.6 + web search | 87% | $3.31 | 89s |
Thinking tokens: 3.3x cost for +1%

A 16K thinking budget costs 3.3x more ($11.00 vs $3.35) and takes 3.3x longer (304s vs 92s) for only +1% accuracy (85% → 86%). Web search achieved +2% at lower cost. Per-question analysis showed 89% agreement between the think and no-think runs: thinking fixed 6 questions but broke 5 (overthinking/second-guessing).
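The per-question comparison above can be reproduced with a small tally. A sketch, assuming each run has been graded into a per-question list of booleans (True = correct); the field names are illustrative:

```python
def compare_runs(base: list[bool], variant: list[bool]) -> dict:
    """Compare two graded runs question by question.

    Returns the agreement rate plus how many questions the variant
    fixed (base wrong -> variant right) and broke (base right -> variant wrong).
    """
    assert len(base) == len(variant), "runs must cover the same questions"
    pairs = list(zip(base, variant))
    agree = sum(b == v for b, v in pairs)
    helped = sum((not b) and v for b, v in pairs)
    hurt = sum(b and (not v) for b, v in pairs)
    return {
        "agreement": agree / len(base),
        "helped": helped,
        "hurt": hurt,
        "net": helped - hurt,
    }
```

For the Opus runs, `helped=6, hurt=5` nets out to the +1% seen in the table.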

MATH Benchmark (Fast Models, 7 subjects × 15 = 105 questions)

| Model | Accuracy | Cost | Avg Latency |
|---|---|---|---|
| Gemini 3 Flash | 95.2% | $0.16 | 5.3s |
| GPT-5 Mini | 93.3% | $0* | 8.6s |
| Haiku 4.5 | 92.4% | $0.27 | 3.2s |
| Grok 4.1 Non-Reasoning | 70.5% | $0* | 3.6s |
Cross-provider agreement on MATH

70% of questions were answered identically by all 4 models. Of the 32 disagreements, Grok was uniquely wrong on 21 (66%) — dragging down the agreement rate. The top 3 models (Gemini Flash, GPT-5 Mini, Haiku) agreed on 91% of questions.
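The agreement numbers can be computed with a tally like the following. A sketch: it uses per-question correctness as a proxy for "answered identically", and the model names in the usage are illustrative:

```python
from collections import Counter

def agreement_stats(results: dict[str, list[bool]]) -> tuple[int, Counter]:
    """Count full-agreement questions and, among disagreements,
    how often a single model is the sole wrong answer.

    `results` maps model name -> per-question correctness (True = correct).
    """
    models = list(results)
    n = len(results[models[0]])
    all_agree = 0
    sole_wrong = Counter()
    for i in range(n):
        marks = {m: results[m][i] for m in models}
        if len(set(marks.values())) == 1:
            all_agree += 1  # all correct or all wrong: identical outcome
        else:
            wrong = [m for m, ok in marks.items() if not ok]
            if len(wrong) == 1:
                sole_wrong[wrong[0]] += 1  # one model uniquely wrong
    return all_agree, sole_wrong
```

On the MATH runs this yields the 70% all-agree rate and Grok's 21 uniquely-wrong disagreements.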

Methodology

GPQA Diamond

100-question subset (seed=0), each model run twice: once closed-book (no tools) and once with provider-native web search enabled.

MATH

105 questions (7 subjects × 15 each), fast models only.

Infrastructure

All runs used --no-cache to force fresh API calls. Results are stored under benchmarks/results/gpqa_official/ and benchmarks/results/math_official/.
Key Findings

  1. Web search hurts the best models on knowledge-heavy benchmarks. Gemini 3 Pro (the leader at 93%) dropped to 90%. Web results for PhD-level science are surface-level noise that undermines correct reasoning chains.
  2. Reasoning budget protects against web noise. GPT-5.2 Pro (reasoning:high) was unaffected (89% → 89%), while GPT-5.2 (no reasoning) dropped 7%. Extended reasoning acts as a filter — the model can evaluate and discard irrelevant search results instead of being distracted by them.
  3. Mid-tier models benefit slightly. Opus and Grok each gained +2% with web search. At the 85% accuracy level, there are questions where the model is uncertain and web evidence can tip the balance correctly.
  4. Thinking tokens have poor ROI on GPQA. Opus with 16K thinking budget: +1% for 3.3x cost. Thinking helped on 6 questions but hurt on 5 (overthinking). Web search achieved +2% at lower cost.
  5. Gemini 3 Pro dominates closed-book GPQA. 93% — 4 points above the next model (GPT-5.2 Pro at 89%). The gap is consistent across question types.
  6. Grok non-reasoning is significantly weaker. 64% on GPQA, 70.5% on MATH — uniquely wrong on 21 of 32 cross-provider MATH disagreements. Not competitive for knowledge-heavy tasks.

Recommendations

| Scenario | Recommendation |
|---|---|
| Best GPQA accuracy | Gemini 3 Pro, closed-book (93%) |
| Best accuracy/cost ratio | Gemini 3 Flash (83% at $0.29, 48s) |
| When to use web search | Current-events benchmarks, not knowledge-heavy science. Web search helps when training data is stale, not when deep reasoning is needed. |
| When to use thinking tokens | When you need the highest possible accuracy and cost/latency don't matter. +1% with 3.3x overhead. |
| Fast MATH evaluations | Gemini 3 Flash (95.2%, 5.3s, $0.16/105q) |

Generated from benchmarks/results/gpqa_official/ and benchmarks/results/math_official/. Run CLI: python -m benchmarks.eval.gpqa_official <model> [--native-web-search] [--sample N]