Cross-Model Comparison
The same strategies were applied across four different models. How much they lift depends on the baseline.
Mid-range models benefit most
Haiku at 50% baseline gained +25pp. Gemini Flash at 81-87% gained +4-6pp. The more room there is to improve, the more strategies help.
Ceiling effects are real
Haiku at 92% baseline on 100 problems saw +0pp from generate & verify. When a model already solves nearly everything, there's nothing left to gain.
Overconfident verification hurts
MiniMax M2.5 dropped 3pp with generate & verify. Its self-verification systematically rejected correct answers. Not all models verify well.
Cost & Latency Reality Check
Strategies cost more. Here's how much.
| Strategy | Accuracy | Cost / Problem | Latency | Problems / $1 |
|---|---|---|---|---|
| baseline | 50.0% | ~$0 | 1.1ms | -- |
| temperature_sweep | 62.5% | $0.010 | 16.4s | 100 |
| majority_vote | 68.8% | $0.017 | 5.1s | 59 |
| best_of_n | 68.8% | $0.017 | 14.9s | 59 |
| weighted_vote | 68.8% | $0.017 | 6.4s | 59 |
| generate_and_verify | 75.0% | $0.017 | 10.1s | 59 |
| lens_ensemble | 62.5% | $0.018 | 61.9s | 56 |
For a batch of 100 problems, the total cost is $1-2. The quality improvement is 4-25 percentage points. At $0.017 per problem, majority_vote and generate_and_verify solve ~59 problems per dollar -- while delivering 19-25pp more accuracy than a free baseline call.
Most strategies add seconds, not minutes. majority_vote is the fastest strategy at 5.1s -- practical for interactive use. lens_ensemble, at just over a minute (61.9s), is best suited for batch processing. The right trade-off depends on whether you need realtime answers or batch quality.
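The per-dollar figures reduce to simple arithmetic. A back-of-envelope sketch, using the accuracy and cost numbers from the table above (the dict layout and variable names are just for illustration):

```python
# Cost math from the benchmark table: (accuracy %, cost per problem in $).
# The numbers are the benchmark's; the code structure is illustrative.
BASELINE_ACC = 50.0

strategies = {
    "temperature_sweep": (62.5, 0.010),
    "majority_vote": (68.8, 0.017),
    "generate_and_verify": (75.0, 0.017),
    "lens_ensemble": (62.5, 0.018),
}

results = {}
for name, (acc, cost) in strategies.items():
    lift = acc - BASELINE_ACC  # percentage points over baseline
    results[name] = {
        "lift_pp": lift,
        "problems_per_dollar": round(1 / cost),
        "cost_per_pp": cost / lift,  # dollars per point of accuracy gained
    }

for name, r in results.items():
    print(f"{name}: +{r['lift_pp']:.1f}pp, "
          f"{r['problems_per_dollar']} problems/$, "
          f"${r['cost_per_pp']:.5f}/pp")
```

Running this reproduces the table's problems-per-dollar column (100, 59, 59, 56) and shows why cost per point gained, not raw cost, is the number to compare.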
When Strategies Help Most
Not all situations benefit equally. Here's what the data shows about when to use strategies -- and when not to.
Sweet spot: 50-85% baselines
Models scoring in this range have the most headroom. Haiku at 50% gained +25pp. Gemini Flash at 81% gained +6pp. This is where strategies deliver the highest ROI.
Diminishing returns above 90%
Haiku at 92% (100-problem run) saw +0pp lift. When the model already solves 92 out of 100 problems, strategies can't find many errors to fix. The ceiling is real.
Overconfident verifiers backfire
MiniMax M2.5 lost 3pp with generate & verify. The model's verification step systematically rejected correct answers. Strategy selection should be model-aware.
Best bang for the buck
majority_vote: fastest (5.1s) and +18.8pp on Haiku at $0.017 per problem. generate_and_verify: highest absolute accuracy (+25pp) at the same cost, which makes it the cheapest per point gained, at moderate latency. Start with these two.
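The voting step at the heart of majority_vote is tiny. A minimal sketch of the idea, not Vario's implementation:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common answer among N sampled generations.
    Ties resolve to the first answer seen (Counter preserves insertion order)."""
    return Counter(answers).most_common(1)[0][0]

# Five samples, three agree: the vote smooths over the two wrong ones.
print(majority_vote(["42", "41", "42", "42", "17"]))  # -> 42
```

The strategy's power comes entirely from sampling diversity upstream; the aggregation itself is three lines.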
19 Strategies. 10 Stages. 9 Lenses.
Strategies are built from composable stages and analytical lenses. Mix them like building blocks to create new reasoning approaches.
Voting
- majority_vote
- weighted_vote
- best_of_n
- temperature_sweep
Verification
- generate_and_verify
- self_consistency
- chain_of_verification
- stepwise_verification
Iteration
- critique_and_refine
- debate
- progressive_refinement
- red_team_blue_team
- iterative_deepening
Ensemble & Hybrid
- lens_ensemble
- mixture_of_strategies
- adaptive_strategy
- meta_reasoning
- tournament
- cascade
10 Composable Stages
9 Analytical Lenses
Each strategy is a pipeline of stages. generate_and_verify = generate + verify + vote. lens_ensemble = generate (per lens) + score + synthesize. New strategies emerge from new combinations -- no code changes needed, just a pipeline definition.
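As a sketch of what "a pipeline of stages" can mean, here is a hypothetical composition. The `Stage` type, the `pipeline` combinator, and the toy stages are illustrative assumptions, not Vario's actual API; real stages would call models rather than return stubs:

```python
from collections import Counter
from typing import Any, Callable

Stage = Callable[[Any], Any]

def pipeline(*stages: Stage) -> Stage:
    """Compose stages left-to-right into a single callable strategy."""
    def run(state: Any) -> Any:
        for stage in stages:
            state = stage(state)
        return state
    return run

# Toy stages; real ones would sample models, check answers, and tally votes.
def generate(problem: str) -> dict:
    return {"problem": problem, "candidates": ["4", "5", "4"]}

def verify(state: dict) -> dict:
    # Keep candidates that pass a (stubbed) verification check.
    return {**state, "verified": list(state["candidates"])}

def vote(state: dict) -> str:
    return Counter(state["verified"]).most_common(1)[0][0]

# A new strategy is a new pipeline definition, not a new code path.
generate_and_verify = pipeline(generate, verify, vote)
print(generate_and_verify("2+2"))  # -> 4
```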
Full Data: Gemini 3-Flash
For completeness -- the same benchmark run on a stronger baseline model.
| Strategy | Accuracy | Cost / Problem | Latency |
|---|---|---|---|
| baseline | 81.3% | ~$0 | 0.8ms |
| majority_vote | 87.5% | $0.009 | 35.9s |
| best_of_n | 87.5% | $0.009 | 57.4s |
| generate_and_verify | 77.8% | $0.009 | 88.2s |
| lens_ensemble | 68.8% | $0.010 | 210.2s |
On Gemini Flash, majority_vote and best_of_n (+6.2pp each) beat generate_and_verify, which actually falls 3.5pp below baseline. The optimal strategy is model-dependent. This is why Vario supports 19 strategies -- the best one varies by model, domain, and problem difficulty.
100-Problem Validation
Larger-scale runs with generate & verify across four models. More problems, more statistical confidence.
| Model | Baseline | Gen & Verify | Lift | Cost / Run |
|---|---|---|---|---|
| Haiku 4.5 | 92% | 92% | +0pp | $1.79 |
| Gemini 3-Flash | 87% | 91% | +4pp | $1.01 |
| Grok-fast | 74% | 74% | +0pp | -- |
| MiniMax M2.5 | 91% | 88% | -3pp | $1.33 |
Gemini Flash: best cost-effectiveness
+4pp lift at $1.01 total. That's $0.25 per percentage point gained -- the best ROI in this cohort.
Haiku: ceiling effect confirmed
At 92% baseline, there's almost no room to improve. Verification confirms the answers the model already gets right, and the eight remaining errors are hard enough that it can't fix them either.
Case Study: Multi-Model Decisions in Practice
Real architectural decisions from a VC intelligence project. Four frontier models consulted independently via maxthink preset.
Setup: Building a VC intelligence tool for TenOneTen Ventures. Four design questions needed answering: (A) how to get full portfolio data, (B) script vs job handler architecture, (C) how to discover founders of portfolio companies, (D) how to show network reachability. Each question was sent to four frontier models — Opus, GPT Pro, Grok, Gemini — via Vario's maxthink preset. Every model answered independently.
Two patterns emerged that demonstrate the value of multi-model consultation.
Unique Insight Changes the Plan
Decision C (Founder Discovery): All 4 models agreed on Bright Data Crunchbase lookups ($0.01/company) for founder data. But Gemini alone noticed a critical detail the other 3 missed: Crunchbase returns LinkedIn IDs, and Apollo can enrich contacts by LinkedIn URL.
This is a data pipeline insight. The other models recommended Crunchbase for its structured data but didn't see the CB→Apollo handoff that makes the whole pipeline work better. Without multi-model consultation, this connection would have been missed.
Consensus Validates the Bold Choice
Decision D (Network/Reachability): The tempting shortcut was D2: skip network analysis for the demo and show a placeholder. Building the real thing is risky because it requires inference from sparse data (only 40 LinkedIn connections imported).
All 4 models independently said: Build it. It's the demo's punchline.
Opus: "Don't skip this — it's the demo's punchline. Even 1-2 connections is more impressive than a placeholder."
GPT Pro: "Label tiers: direct → 2-hop → inferred. Keep conservative."
Grok: "Existing data, quick inference, cost-free."
Gemini: "A placeholder defeats the purpose. Tip: manually verify at least one real path exists."
4/4 unanimous against the easy shortcut. This consensus gave confidence to invest the extra effort, knowing it wasn't just one model's opinion.
Disagreement Exposed a Bad Assumption
Decision A (Getting all 155 investments): GPT Pro recommended using an existing investments.csv Crunchbase export. We checked — the CSV exists but contains stale data from 2013-2015, missing TenOneTen's recent investments entirely.
The two models that recommended live scraping were correct. Multi-model consultation didn't prevent the bad recommendation, but the 2-2 split (two said CSV, two said live) flagged the question as worth verifying, which caught the error.
The value isn't that every model is right — it's that disagreement flags uncertainty and unique insights surface connections no single model sees.
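The disagreement signal itself is mechanical to detect. A minimal sketch, where the function name, model keys, and wording of the tiers are all hypothetical:

```python
from collections import Counter

def consensus_report(votes: dict[str, str]) -> str:
    """Classify multi-model agreement: unanimous, majority, or split."""
    counts = Counter(votes.values())
    top, n = counts.most_common(1)[0]
    if n == len(votes):
        return f"unanimous: {top}"
    if n * 2 > len(votes):
        return f"majority: {top} (check the dissenting answer)"
    return "split: flag for manual verification"

# A 2-2 split like Decision A's would be flagged automatically:
print(consensus_report({"opus": "live", "gpt_pro": "csv",
                        "grok": "live", "gemini": "csv"}))
```

Unanimity (Decision D) raises confidence; a split (Decision A) routes the question to a human check.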
What's Next
The benchmark results are the starting point. Here's where the research goes from here.
Data note
The benchmarks above are from early runs on MATH with budget-tier models and a small problem set (16-100 problems). Comprehensive benchmarks are in progress: more models (including frontier-tier Opus, GPT-Pro, Gemini Pro), more strategies head-to-head, more domains (MMLU-Pro, GPQA, WritingBench), larger sample sizes for statistical significance, and full cost/latency profiling. This page will be updated as better data lands.
Cross-domain validation
MMLU-Pro, GPQA, WritingBench. Does the lift hold on reasoning, science, and creative tasks?
Adaptive strategy selection
Pick the best strategy per problem type automatically. Route easy problems to baseline, hard ones to gen & verify.
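A routing sketch of that idea, assuming some upstream difficulty estimate exists; the function name, thresholds, and tiers are hypothetical, not measured:

```python
def route_strategy(estimated_difficulty: float) -> str:
    """Hypothetical router: spend nothing on easy problems,
    spend the full strategy budget only on the hard tail."""
    if estimated_difficulty < 0.3:
        return "baseline"           # model likely solves it in one shot
    if estimated_difficulty < 0.7:
        return "majority_vote"      # cheap smoothing for mid-range problems
    return "generate_and_verify"    # full verification for hard problems

print(route_strategy(0.1), route_strategy(0.5), route_strategy(0.9))
```

The open research question is the difficulty estimator itself: self-reported confidence, problem-type classifiers, and answer-distribution entropy are all candidates.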
Strategy tournaments
Head-to-head comparison on the same problems. Statistical significance testing across strategy pairs.
Ablation studies
Which stage contributes most? Is it the generation diversity, the verification, or the voting? Isolate the signal.