Vario: What If You Could Change One Number?

Every LLM call you make today is already a pipeline — just a very short one. Vario makes the pipeline longer, and everything interesting follows.

What We Know (from the research)

Five findings that shape how vario works. Details and citations in Section 7.

Generate → judge → select works. Up to +19 accuracy points on math reasoning. Gains scale with task difficulty. The single most effective pattern. (OpenAI o1, DeepMind ICLR 2025)
Smaller model + retries beats bigger model single-shot. A cheap model with best-of-N + verifier can match a model 10× its size at lower cost. (Inference Scaling Laws, ICLR 2025)
Self-critique is near-useless. Same model revising itself: +1.8% after 5 rounds. But a different model giving structured feedback: +80%. The bottleneck is error identification, not repair. (RefineBench, NVIDIA 2025)
Debate is oversold. Simple parallel generation + voting captures most gains attributed to multi-agent debate. Debate underperforms voting on 7/9 benchmarks. (NeurIPS 2025, ICLR 2025)
No single strategy wins everywhere. Easy tasks: skip multi-attempt (waste). Hard tasks: multi-attempt is the difference between "good enough" and "correct." The right approach adapts to the problem. (Art of Scaling Test-Time Compute, 2025)
Contents
  1. How You Use an LLM Today
  2. Change One Number
  3. What You Can Do With a Longer Pipeline
  4. What This Looks Like in Practice
  5. All the Way Down
  6. Why Streaming Matters
  7. What the Research Says (details)
  8. Demo: Vario Evaluating Itself
  9. How This Relates to Other Work
  10. Where This Goes
  11. Glossary — key terms defined

1. How You Use an LLM Today

Here's what happens when you call an LLM:

produce(n=1, model=sonnet) → done

That's a pipeline. A short one. You send a prompt, one model generates one response, and you get text back.

Described in pipeline notation, some constraints become visible that are easy to overlook: one attempt, one model, no scoring, no selection.

This isn't a criticism. For quick questions, a single call is efficient and often good enough. But notice: the pipeline notation makes the constraints explicit. Once you see them, you can change them.

2. Change One Number

What if you changed n from 1 to 5?

produce(n=5, model=sonnet) → ???

Now you have five candidates instead of one. But that creates a new question: which one is best? You need a way to evaluate them. So you add a scoring step:

produce(n=5, model=sonnet) → score(rubric) → ???

Now each candidate has a quality score. But you want one answer, not five. So you add selection:

produce(n=5, model=sonnet) → score(rubric) → reduce(top_1)

That's best-of-n: generate N candidates, score each against a rubric, keep the highest-scoring one. The simplest vario recipe. Nothing was invented — we changed a number and followed the implications.

The key insight: A single LLM call is the degenerate case of a pipeline — produce(n=1) with no scoring and no selection. Every improvement is either changing a parameter or adding a stage. Not a new concept.
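The whole recipe fits in a few lines of plain Python. This is a toy sketch, not vario's API: `generate` and `score` are deterministic stand-ins for a model call and a rubric judge.

```python
import random

def generate(prompt, model="sonnet", seed=0):
    # Stand-in for an LLM call; a real implementation would hit a provider API.
    rng = random.Random(seed)
    return {"text": f"answer {seed} to {prompt!r}", "quality": rng.randint(0, 100)}

def score(candidate):
    # Stand-in rubric scorer; a real one would use a judge model plus a rubric.
    return candidate["quality"]

def best_of_n(prompt, n=5):
    # produce(n) -> score(rubric) -> reduce(top_1)
    candidates = [generate(prompt, seed=i) for i in range(n)]   # produce
    scored = [(score(c), c) for c in candidates]                # score
    return max(scored, key=lambda pair: pair[0])                # reduce(top_1)

top_score, winner = best_of_n("What are the risks?", n=5)
```

Changing `n` is the only knob here; everything else follows from it, which is the point of the section.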

There's a second insight hiding here: explicit structure can be reasoned over. When the process is a pipeline — not a monolithic prompt hoping the LLM "figures it out" — you can inspect it, optimize it, and run thousands of items through it. An LLM that's told "generate five options, pick the best, and refine it" in a single prompt might do something vaguely like that. An explicit pipeline guarantees five candidates get generated, guarantees each gets scored against the rubric, and guarantees selection happens. The structure is auditable, repeatable, and the same pipeline that works on one question works on ten thousand.

What does this cost? About $0.03 and five seconds. The winning candidate typically scores 15–25 points higher than the median on a 0–100 rubric — because models make different tradeoffs on each generation, and selecting the best is better than hoping for the best.

3. What You Can Do With a Longer Pipeline

A longer pipeline lets you do things that humans naturally do when the stakes are high — but that a single LLM call can't. Each is a recognizable activity, not an abstraction:

Brainstorm

Generate many candidates in parallel, across multiple models. Different models have different training data, different biases, different strengths. You get genuine diversity, not five rephrases of the same idea.

produce(n=5, models=[sonnet, gemini, grok]) → all candidates

Research note: Heterogeneous multi-model generation produces 4–6% accuracy gains over single-model, and 30% fewer factual errors. The diversity is real, not cosmetic.
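A minimal sketch of the fan-out, assuming a hypothetical `call_model` stand-in rather than real provider clients: the same prompt goes to every model concurrently, and all candidates come back for scoring.

```python
import asyncio

MODELS = ["sonnet", "gemini", "grok"]  # model names from the text; stand-ins here

async def call_model(model, prompt):
    # Stand-in for a real provider call; returns (model, text).
    await asyncio.sleep(0)
    return model, f"{model} answer to {prompt!r}"

async def brainstorm(prompt, models=MODELS, n_per_model=2):
    # produce(n, models=[...]): fan the same prompt across models in parallel.
    calls = [call_model(m, prompt) for m in models for _ in range(n_per_model)]
    return await asyncio.gather(*calls)

candidates = asyncio.run(brainstorm("Name three acquisition risks"))
```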

Review

Score each candidate against explicit criteria. A different model acts as judge — this is critical. The TLDR above explains why: same-model self-critique adds almost nothing (+1.8%), but cross-model structured review is transformative (+80%).

produce(n=5) → score(rubric=[correctness, thoroughness, clarity])

Now each candidate has a quality signal. You can select the best, or use the scores as feedback for revision.

Select the best

Pick the winner. Or take a vote. Or combine the best parts. This is where the +19 accuracy points come from — not from generating better, but from selecting better.

produce(n=5) → score(rubric) → reduce(top_1)

Refine

Take the scoring feedback and use it to improve the output. Then score again. Repeat until quality converges or budget runs out. This is the editorial loop — draft, get feedback, revise — automated with explicit quality criteria.

produce → repeat(score → revise, until=converged)

Research note: Only works when the feedback comes from a different model or structured rubric. "Try again" without specific feedback is near-zero value.
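The loop can be sketched directly. The stand-in `score` and `revise` functions below are toys, but the convergence test (stop when a round improves the score by less than a threshold) is the real mechanism.

```python
def refine_until_converged(draft, score, revise, min_improvement=1.0, max_rounds=10):
    # repeat(score -> revise, until=converged): stop once a revision
    # improves the score by less than min_improvement, or the round
    # budget runs out.
    current, current_score = draft, score(draft)
    for _ in range(max_rounds):
        revised = revise(current, current_score)
        revised_score = score(revised)
        if revised_score - current_score < min_improvement:
            break  # converged
        current, current_score = revised, revised_score
    return current, current_score

# Toy stand-ins: quality is length, capped at 55; each revision adds detail.
toy_score = lambda text: min(len(text), 55)
toy_revise = lambda text, s: text + "!"
final, final_score = refine_until_converged("x" * 50, toy_score, toy_revise)
```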

Evaluate at scale

Run the same analysis across a corpus. 50 documents, 3 models, systematic comparison. Reveals tradeoffs that are invisible from a single model on a single document — like which model extracts more claims but at lower precision.

source(documents/) → fan_out(models) → task(analyze) → evaluate(vs_reference)
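Stripped of streaming and concurrency, the corpus evaluation above is a nested loop over documents and models. A toy sketch with stand-in `analyze` and `evaluate` functions:

```python
import tempfile
from pathlib import Path

def evaluate_at_scale(doc_dir, models, analyze, evaluate):
    # source(documents/) -> fan_out(models) -> task(analyze) -> evaluate(vs_reference)
    results = []
    for doc in sorted(Path(doc_dir).glob("*.txt")):        # source
        text = doc.read_text()
        for model in models:                               # fan_out
            analysis = analyze(text, model)                # task
            results.append((doc.name, model, evaluate(analysis)))  # evaluate
    return results

# Toy demo: two documents, two "models", evaluation = claim count.
with tempfile.TemporaryDirectory() as d:
    for name in ("a.txt", "b.txt"):
        Path(d, name).write_text("claim one. claim two.")
    rows = evaluate_at_scale(
        d, ["sonnet", "haiku"],
        analyze=lambda text, m: text.split("."),
        evaluate=lambda claims: sum(bool(c.strip()) for c in claims))
```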

Search deep

Each pipeline stage can itself contain a pipeline (Section 5). A search pipeline can spawn sub-searches on promising branches, each with its own scoring and selection. How deep to go is controlled by budget and convergence — hard problems get more compute, easy ones resolve quickly.

These are all compositions of the same nine ops: produce, score, revise, reduce, source, fan_out, task, evaluate, repeat. Each op is an async generator that transforms a stream of Items (Items in, Items out). No special machinery for each pattern — just different arrangements of the same pieces.

4. What This Looks Like in Practice

Quick quality boost: best-of-n

$ vario run "What are the risks of acquiring Acme Corp?" -r best_of_n -i acme_10k.pdf

  Results: 1 thing(s)
  [1]  score=91  model=sonnet  "Three primary risk categories emerge..."

  Stage            In  Out   Tokens      Cost     Latency
  produce           0    5    2,845    $0.021       3.2s
  score             5    5    1,230    $0.004       1.1s
  reduce            5    1        0    $0.000       0.0s

  Budget: $0.025 spent / $0.05 max

Five candidates generated, each scored on thoroughness and accuracy. Winner: score 91. Runner-up scored 84 and missed regulatory risk entirely. You see the difference because there is a comparison.

Nuanced decision: model debate

$ vario run "Is this founder's pivot history a red flag?" -i timeline.md -r model_debate

Four models generate independently. Each gets scored on reasoning depth and evidence quality. In one real run, Opus focused on the pattern (pivoting before testing product-market fit) while Gemini focused on the outcome (each pivot moved upmarket). Scoring identified which analysis was more actionable — not just which model sounded more confident.

High-stakes output: iterative refinement

Round 1: produce(3) → score → best=72 ("misses edge cases")
   ↓ revise with feedback
Round 2: score → best=81 ("stronger, regulatory risk understated")
   ↓ revise with feedback
Round 3: score → best=88 ("comprehensive")
   ↓ revise with feedback
Round 4: score → best=89 (improvement < 1.0 → converged, stop)

The repeat op handles the loop with stop conditions: score threshold, minimum improvement, drift detection, budget cap. The system decides when to stop — you specify quality criteria, not iteration counts.
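A sketch of how such stop conditions might compose. Drift detection is omitted, and the names here are illustrative, not vario's actual parameters:

```python
def should_stop(scores, spent=0.0, budget=0.05,
                threshold=90, min_improvement=1.0):
    # Stop conditions named in the text: score threshold, minimum
    # improvement, budget cap (drift detection omitted).
    if scores and scores[-1] >= threshold:
        return True                       # quality is good enough
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_improvement:
        return True                       # converged: improvement too small
    return spent >= budget                # out of budget
```

Any one condition firing ends the loop, so the caller specifies quality criteria rather than iteration counts.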

Evaluate at scale

source(documents/) → fan_out(models=[sonnet, haiku, grok]) → task(extract_claims) → evaluate(vs_reference)

50 documents × 3 models = 150 evaluations. Reveals that Sonnet extracts 18% more claims than Haiku but at lower precision — a tradeoff invisible from a single model on a single document.

Large jobs: effort allocation beyond context windows

The same pipeline that improves one answer scales to thousands. Each item is processed independently — the pipeline doesn't hold the whole corpus in memory, so there's no context window limit on the job. A 10,000-document extraction runs the same way as a 10-document one.

What makes this more than a for-loop: budget enforcement across the whole job rather than per call, results streaming out as items complete, and per-item provenance and cost tracking.

This is where vario diverges most from "call the API five times." It's not just about making one answer better — it's infrastructure for running LLM-heavy jobs at scale with the same quality controls and observability you'd want for any production data pipeline.

5. All the Way Down

The task op takes any async function and makes it a pipeline stage. If that function internally runs a vario pipeline, the outer pipeline doesn't know or care. Items in, Items out.

This means vario pipelines nest:

Outer: source(questions) → fan_out([baseline, best_of_n]) → task(run_strategy) → evaluate
   ↓
Inner (when strategy=best_of_n): produce(n=5) → score → reduce(top_1)
   ↓
Inner scoring: produce(n=1, model=haiku) → done

vario ab already works this way: the outer evaluation pipeline calls call_with_strategy for each question, which runs an inner recipe pipeline for the strategy arm. Two levels deep, and there's no limit.
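The mechanism is small enough to sketch: a `task` wrapper turns any callable into a stage, and the callable is free to run a whole sub-pipeline inside. Toy code, not vario's implementation:

```python
def run_pipeline(stages, items):
    # Chain stages: each takes an iterable of Items, yields Items.
    for stage in stages:
        items = stage(items)
    return list(items)

def task(fn):
    # task op: wrap any callable as a stage; the outer pipeline
    # doesn't know whether fn runs a whole sub-pipeline internally.
    return lambda items: (fn(item) for item in items)

def inner_best_of_n(question, n=3):
    # An inner pipeline hiding behind a task stage:
    # produce(n) -> score -> reduce(top_1), with a toy score.
    candidates = [f"{question} :: answer {i}" for i in range(n)]
    return max(candidates, key=lambda c: int(c.rsplit(" ", 1)[-1]))

answers = run_pipeline([task(inner_best_of_n)], ["q1", "q2"])
```

From the outer pipeline's point of view, `inner_best_of_n` is just a function over Items; nesting costs nothing structurally.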

This nesting opens up patterns that are hard to build ad hoc:

Eval any LLM job
Wrap any function in task, score the output, compare strategies. The function could be a RAG pipeline, an agent loop, a code generator — anything callable.
Deep search
A steer op spawns sub-pipelines to explore promising branches. Each branch is a full pipeline that can itself branch.
Meta-optimization
Outer pipeline A/B tests which inner recipe works best for a problem type. The system learns which process to use, not just which answer to give.
Recursive refinement
A refinement round could use best-of-n internally — each revision candidate is itself the winner of a sub-pipeline.

The key property: every level gets the same observability. Inner pipelines produce traces, costs, and provenance that roll up to the outer level. You can audit the full tree.

6. Why Streaming Matters

A detail that changes the user experience: vario pipelines are streaming, not batched.

Batch (traditional)

Generate all 5 candidates. Wait. Score all 5. Wait. Reduce. Return.

You see nothing until everything is done.

Streaming (vario)

Sonnet finishes first — score it immediately. Gemini finishes — score it. Best answer updates in real time.

Fast models yield results while slow models are still thinking.

Technically: every op is an async generator. Functions yield values one at a time as they become available, enabling pipeline processing without waiting for all data. produce yields Items as models finish (as_completed). score evaluates each Item as it arrives. The Heap, an anytime priority queue, keeps the best answer always accessible — you can read it or display it at any point, even while the pipeline is still running.

Only reduce is a barrier (it needs all Items to make a selection). Everything before it streams.
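A stripped-down sketch of the streaming architecture, using toy ops and simulated latency rather than real model calls:

```python
import asyncio
import random

async def produce(n=3):
    # Yield Items as simulated "models" finish; fastest first.
    async def call(i):
        await asyncio.sleep(random.random() * 0.01)  # varying latency
        return {"id": i, "text": f"candidate {i}"}
    tasks = [asyncio.create_task(call(i)) for i in range(n)]
    for fut in asyncio.as_completed(tasks):
        yield await fut

async def score(items):
    # Score each Item the moment it arrives; no batch barrier.
    async for item in items:
        item["score"] = item["id"]  # toy score
        yield item

async def reduce_top1(items):
    # The only barrier: selection needs every Item.
    best = None
    async for item in items:
        if best is None or item["score"] > best["score"]:
            best = item
    return best

winner = asyncio.run(reduce_top1(score(produce(n=3))))
```

Because each stage is an async generator, an Item flows through produce and score as soon as it exists; only the final reduce waits for the full set.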

7. What the Research Says

The patterns above aren't speculative — each has been studied. Here are the details behind the TLDR.

Generate → judge → select: the strongest pattern

Best-of-N with a verifier dramatically outperforms single-shot on reasoning. AIME: pass@1 = 74%, reranked@1000 = 93%. A 19-point lift from selection alone. (OpenAI o1 system card)
Compute-optimal test-time scaling beats naive best-of-N by 4×. Allocating more attempts to harder problems is 4× more efficient than uniform sampling. (DeepMind + UC Berkeley, ICLR 2025)
Smaller model + best-of-N + verifier matches a larger model at lower cost. Pareto-optimal tradeoff: a cheap model with retries beats an expensive model single-shot. (Inference Scaling Laws, ICLR 2025)
SWE-bench: pass@5 significantly exceeds pass@1 on hard real-world code tasks. Frontier models drop from 70%+ (Verified) to <25% (Pro); the multi-attempt gap grows with difficulty. (SWE-bench Verified/Pro, 2025)
Majority vote closes the last gap on near-perfect models. AIME 2025: 99.5% pass@1, 100% consensus@8. Eight samples + voting = perfect. (OpenAI o4-mini)

This maps directly to brainstorm → review → select. The research confirms it's the single most effective pattern, and that the gains scale with task difficulty.

Refinement: only with external feedback

Self-refinement without external feedback is near-useless. Gemini 2.5 Pro: +1.8% after 5 iterations; DeepSeek-R1: −0.1% (got worse). (RefineBench, NVIDIA, Nov 2025)
Guided refinement with structured feedback yields +80% gains. Checklist-based feedback reaches near-perfect scores within 5 turns. (Same paper)
Extended reasoning hurts on tasks with distractors. More thinking amplifies noise on certain task structures. (Anthropic: Inverse Scaling in Test-Time Compute, 2025)

This is why the refine pattern uses a different model as judge. "Try again" is theater; structured cross-model feedback is transformative. The bottleneck is error identification, not error repair.

Debate: simple voting wins

Majority voting alone captures most gains attributed to multi-agent debate. (NeurIPS 2025 Spotlight: "Debate or Vote")
Multi-Agent Debate underperforms self-consistency on 7/9 benchmarks. (ICLR 2025 MAD evaluation)
Best-of-N degrades at high N (>16): the verifier gets gamed. (ICML 2025)

This shapes the brainstorm pattern: parallel independent generation + scoring beats complex debate protocols. Vario's default N (5) sits in the empirically validated sweet spot of 3–8.

When to skip multi-attempt

The gains scale with task difficulty. Easy tasks: skip. Hard tasks: multi-attempt is the difference between "good enough" and "correct."

8. Demo: Vario Evaluating Itself

The most direct way to measure whether vario improves LLM output: run a benchmark with and without it.

$ vario ab math haiku --strategy best_of_n --sample 5 --report

  Phase 1: BASELINE (single-shot) — 35 questions
  baseline: 24/35 = 68.6%

  Phase 2: STRATEGY (best_of_n) — 35 questions
  strategy: 28/35 = 80.0%

  ┌──────────┬──────────┬──────────┬─────────┐
  │ Metric   │ Baseline │ Strategy │ Delta   │
  ├──────────┼──────────┼──────────┼─────────┤
  │ Accuracy │ 68.6%    │ 80.0%    │ +11.4pp │
  │ Cost     │ $0.0082  │ $0.0387  │ 4.7x    │
  │ Latency  │ 1,204ms  │ 3,891ms  │ 3.2x    │
  └──────────┴──────────┴──────────┴─────────┘

  Wins: 6 | Losses: 2 | Net: +4

Vario runs the comparison through its own pipeline — loading benchmark questions, calling call_with_strategy for each arm, checking correctness against gold answers. The system evaluating itself with its own tools.

The output is a comparison.json + optional HTML report. Every number is traceable to specific questions that flipped from wrong to right (or right to wrong).

Note: the numbers above are placeholders. Reproduce with actuals via: vario ab math haiku --strategy best_of_n --sample 5 --report

9. How This Relates to Other Work

Vario doesn't compete with most LLM tools — it composes with them. The distinction is what layer each operates at:

DSPy (prompt optimization): complementary. A DSPy-optimized prompt works better inside a vario pipeline.
LMQL / Guidance (constrained generation): complementary. Constrain structure, then use vario to select the best among valid outputs.
LangChain (integration framework): different goal. LangChain connects LLMs to tools; vario makes LLM output quality better.
Best-of-N sampling (single technique): subsumed. Best-of-N is one recipe; vario generalizes to refinement, debate, corpus processing.
RLHF / Constitutional AI (model training): different phase. Training improves the baseline; vario improves the ceiling at inference time.

The simplest way to think about it: vario operates at the process layer, between the model (fixed) and the application (specific). It makes any model's output better through systematic generation, evaluation, and selection — without training, fine-tuning, or changing the model itself.

10. Where This Goes

Near-term: the recipe disappears

Today you choose a recipe: best_of_n, model_debate, refine_until_converged. Near-term, the system chooses for you. You describe the problem and specify quality criteria; vario designs the pipeline. This already works in prototype:

$ vario run "Evaluate this acquisition" -r "try multiple perspectives, verify facts"
  Designed: model_debate_with_verify — "Multi-model generation + factual verification pass"

Medium-term: the pipeline adapts mid-run

A steer op observes pipeline state and decides what to do next. If all candidates scored >90, skip refinement. If scores are bimodal, split into two debate tracks. If budget is running low, switch to cheaper models for remaining stages. The pipeline becomes adaptive, not just sequential.
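What a steer decision might look like as code. This is speculative by the section's own framing; the state keys and return values are invented for illustration, and the bimodal-split case is omitted:

```python
def steer(state):
    # Hypothetical mid-run controller, following the examples above:
    # inspect pipeline state, decide the next move.
    scores = state.get("scores", [])
    if scores and min(scores) > 90:
        return "skip_refinement"           # every candidate already strong
    if state.get("spent", 0) / state["budget"] > 0.8:
        return "switch_to_cheaper_models"  # budget running low
    return "continue"
```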

North star: specify outcomes, not process

You say: "I need a thorough analysis of this company. Budget: $2. Confidence threshold: 85." Vario designs the process, selects the models, runs the pipeline, detects when quality is sufficient, and delivers a result with full provenance. The recipe is emergent — not hand-authored.

Why this compounds: Every scored run generates data about which models and processes work best for which problem types. The system gets better at designing processes as it accumulates execution history. You shift from crafting prompts to specifying quality criteria — a more natural way to direct AI work.

11. Glossary

Item
Vario's universal datum. Content (the text) + accumulated properties (model, score, cost) + provenance history tracking what each pipeline stage added.
Op
A streaming function that transforms Items. Nine built-in: produce, score, revise, reduce, source, fan_out, task, evaluate, repeat.
Recipe
A named pipeline configuration: which ops, in what order, with what parameters. Defined in YAML, or auto-designed from natural language.
Best-of-N
The simplest recipe: produce N candidates, score each, keep the best. produce(n=5) → score → reduce(top_1).
Heap
An anytime priority queue. Items are sorted by score as they arrive; the best answer is always readable, even while the pipeline is still running.
Async Generator
Python pattern where functions yield values one at a time as they become available. Enables pipeline processing without waiting for all data.
Provenance
The audit trail of an Item: which model generated it, what score it received and why, how many rounds of revision, and what each stage cost.

Vario is part of Rivus. Built by Tim Chklovski, 2025–2026.
