
Vario

By thinking 2–3× as much about a problem — generating alternatives, evaluating from more angles, building up background — you can almost always improve on a first-pass answer. Vario makes that effort computational, not human.

Ask an LLM a hard question and it generates one answer, immediately, with no preparation. It doesn’t build a rubric for what “good” means. It doesn’t look for precedent. It doesn’t consider alternative framings. It doesn’t check its own work.

These aren’t exotic techniques — they’re what any thoughtful person does on a question that matters. The gap between a quick take and a considered one is real and valuable. Vario exists to close that gap automatically.

The bet

The difference between “good enough” and “genuinely good” is almost always more effort applied in the right places. Vario makes that effort computational — generates alternatives broadly, evaluates from more angles, builds background before answering, and learns from the delta between what the quick take missed and what the deeper pass found.

Not 19 Strategies — A Language for Thinking

Vario is not “19 strategies that compete.” It’s a composable language for expressing how to think about a problem — and a learning system that discovers which thinking moves work for which situations.

The 19 current strategies (majority_vote, debate, tree_search, etc.) are specific instantiations, but the real value is the primitives they’re built from and the learning that accumulates across runs. A strategy is just a composition of moves. The moves are the atoms:

| Move | What It Does | Examples |
| --- | --- | --- |
| Gather | Collect more information before committing | Search for evidence, poll multiple sources, check assumptions |
| Frame | Choose a lens through which to analyze | “Think like an economist”, “What would a skeptic say?” |
| Generate | Produce candidate solutions | Multiple models, temperatures, prompts, styles |
| Verify | Check if a candidate is correct or good | Execute code, check math, LLM-as-judge, self-consistency |
| Critique | Identify weaknesses in a candidate | Adversarial review, find counterexamples, steelman the opposing view |
| Optimize | Define a loss function and improve against it | Set up rubric → score → iterate on low dimensions |
| Generalize | Collect precedents and extract patterns | “What worked last time?”, “What do similar problems share?” |
| Decompose | Break into independent sub-problems | Branch into parallel sub-tasks, then merge |
| Synthesize | Merge the best parts of multiple candidates | Debate, rubric-guided fusion, extract-and-combine |
| Hedge | Identify risks and failure modes | “How could this go wrong?”, red-team, pre-mortem |

A strategy like debate is really: Generate × 3 → Critique (mutual) → Synthesize (strongest arguments). tree_search is: Generate → Verify → Decompose (expand promising branches) → Synthesize. The current YAML strategies hard-wire these compositions — the goal is to make the composition itself learnable.
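As a rough illustration of strategies-as-data, the compositions above might be expressed as follows. The `Move` class and function names here are hypothetical, not vario's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Move:
    kind: str     # one of the ten atoms: gather, frame, generate, ...
    params: dict

def debate(n: int = 3) -> list[Move]:
    """debate = Generate x n -> Critique (mutual) -> Synthesize."""
    return [
        Move("generate", {"n": n, "diverse": True}),
        Move("critique", {"mode": "mutual"}),
        Move("synthesize", {"keep": "strongest_arguments"}),
    ]

def tree_search() -> list[Move]:
    """tree_search = Generate -> Verify -> Decompose -> Synthesize."""
    return [
        Move("generate", {}),
        Move("verify", {}),
        Move("decompose", {"expand": "promising_branches"}),
        Move("synthesize", {}),
    ]

# A strategy is just a value: the learning system can store, compare, and mutate it.
plan = debate(n=5)
```

Because a plan is plain data rather than hard-wired control flow, making "the composition itself learnable" reduces to searching over lists like these.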

Two Levels of Learning

Vario learns at two levels, each feeding the other:

Meta Level — How to Approach

“This looks like a problem where we need more info before generating” → lead with Gather. “This has a clear objective function” → lead with Optimize. “High uncertainty” → Generate diversity → Critique → Hedge.

Domain Level — What Works Here

“For code, Verify via execution works; for prose, LLM-as-judge with debiasing.” “Gemini excels at search-augmented tasks; Opus at nuanced reasoning.” “Temperature 0.9 helps creative, hurts factual.”

The move library (SQLite) currently tracks strategy × situation win/loss. The future is tracking move × context × outcome, so the system can compose novel strategies from primitives rather than selecting from a fixed menu.
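A minimal sketch of what move × context × outcome tracking could look like in SQLite. The table and column names below are assumptions for illustration, not vario's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE move_outcomes (
        move    TEXT NOT NULL,    -- e.g. 'verify'
        context TEXT NOT NULL,    -- situation tag, e.g. 'code', 'prose'
        won     INTEGER NOT NULL  -- 1 if the move improved the result
    )
""")
conn.executemany(
    "INSERT INTO move_outcomes VALUES (?, ?, ?)",
    [("verify", "code", 1), ("verify", "code", 1), ("verify", "prose", 0)],
)

# Win rate per move × context: the signal for composing novel strategies.
rows = conn.execute("""
    SELECT move, context, AVG(won) AS win_rate
    FROM move_outcomes
    GROUP BY move, context
    ORDER BY move, context
""").fetchall()
```

With data in this shape, "Verify via execution works for code" is just a high `win_rate` row rather than a hand-written heuristic.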

Why Not Just Prompt the LLM?

Fair question. A Claude Code session with tool calling already does “generate hypothesis, run code, see result, iterate.” So what do recipes add?

| Capability | Tool calling session | Recipes (vario) |
| --- | --- | --- |
| Interactive | Yes | Not yet |
| Parallel | No — sequential, one thought at a time | Yes — fan out across models and candidates |
| Multi-model | No — one model per session | Yes — right model for each step |
| Tool access | Yes | Yes — blocks can call anything |
| Headless / batch | No — needs live session | Yes — fire and forget |
| Cost-optimized | No — one model for everything | Yes — haiku for scoring, opus for synthesis |
| Tracked / comparable | No — sessions are ephemeral | Yes — structured traces, A/B testable |
| Learnable | No — starts from zero every session | Yes — which recipes work for which problems? |

The honest answer: for one-shot interactive tasks, just prompt the LLM. Recipes earn their keep when you need parallelism, multiple models, cost control, or — most importantly — when you want to learn what works across many runs.

Recipes as tools

A Claude Code session can call vario as a tool — offloading the parallel, multi-model, batch work while keeping the session interactive and adaptive. Like how a developer uses CI: you don’t run 50 tests one at a time in your terminal, you kick off a pipeline and get results back. The session decides when to reach for vario vs just thinking directly.

The Execute Block — Real-World Grounding

The killer feature isn’t “better prompting” — it’s the execute block, which bridges the LLM pipeline to real-world evaluation.

Without the execute block, everything is LLM-judging-LLM — turtles all the way down. With it, the feedback loop is grounded in reality. The LLM generates hypotheses; the real world evaluates them.
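As a sketch of what such grounding can look like, here is a hypothetical execute-style handler that runs candidate code for real. The function name and return shape are illustrative assumptions, not vario's handler contract:

```python
import os
import subprocess
import sys
import tempfile

def evaluate_candidate(code: str) -> dict:
    """Execute candidate Python code and report a grounded pass/fail verdict."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=10
        )
        # Reality, not an LLM judge, decides whether this candidate passed.
        return {"passed": proc.returncode == 0, "stderr": proc.stderr}
    finally:
        os.unlink(path)
```

The same pattern applies to backtests, compilers, or screenshot diffs: the handler returns an objective signal the pipeline can score against.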

The Compounding Advantage

Parallelism and multi-model are one-time efficiency gains. Learning what works is cumulative.

With structured recipes and tracked results, every run becomes data: traces can be compared, recipes A/B-tested, and outcomes fed back into the move library.

Raw prompting and tool calling are stateless — every session starts from zero. Recipes are stateful across runs — each run contributes to the knowledge of what works. That’s the real moat.

The analogy

Recipes are to tool calling what CI pipelines are to running commands manually. Parallel + multi-model = faster. Learning = smarter over time.

Extensible by Design

One way or another, you need to capture how to do something like score — what rubric, which model, how harsh, whether to give feedback or just a number. A prompt buries these choices inside a wall of text. A recipe makes them explicit, tunable, and composable.

This matters because the architecture is open to new processing types. Today we have 8 block types (produce, score, verify, revise, reduce, execute, enrich, loop). Adding a new one is:

  1. Write an async Python function with the standard (candidates, params, context) signature
  2. Register it in BLOCK_REGISTRY
  3. Use it in any recipe YAML immediately

No framework changes, no schema migration, no executor rewrite. The same extensibility applies to how existing blocks work — a score block’s behavior is fully determined by its params (rubric, model, feedback mode), all visible in the YAML, all tunable per-recipe.
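The three-step extension above can be sketched as follows. The registry shape and the example `gather` block are illustrative assumptions, not vario's actual code:

```python
import asyncio
from typing import Any

BLOCK_REGISTRY: dict[str, Any] = {}

# 1. An async function with the standard (candidates, params, context) signature.
async def gather(candidates: list, params: dict, context: dict) -> list:
    """Collect background material into the context before generation."""
    query = params.get("query", "")
    # Stand-in for a real web/API search call.
    context.setdefault("background", []).append(f"search results for: {query}")
    return candidates  # gather enriches context; candidates pass through

# 2. Register it.
BLOCK_REGISTRY["gather"] = gather

# 3. A recipe can now reference `type: gather`; the executor looks it up here.
ctx: dict = {}
asyncio.run(BLOCK_REGISTRY["gather"]([], {"query": "market size"}, ctx))
```

Because every block shares one signature, the executor never needs to know what a block does internally, only how to call it.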

Structured HOW

A recipe doesn’t just say “score this.” It says: use haiku, score on [correctness, reasoning quality], don’t give feedback, just score. That specificity is what makes recipes comparable and improvable.

Easy to extend

Need a “gather” block that searches the web before generating? A “decompose” block that splits problems into sub-tasks? A “steer” block that adaptively picks the next step? Write the function, register it, use it in YAML. The executor handles everything else.

This is the architectural bet: blocks are cheap to add, recipes are cheap to compose, and the learning system discovers which compositions work. The vocabulary of thinking moves grows over time without requiring redesign.

The Steer Block — Adaptive Orchestration

Fixed pipelines always run the same steps regardless of how the problem is going. best_of_n generates 5 candidates even if the first one scored 98. self_refine runs 3 rounds even if the answer stopped improving after 1. The pipeline doesn’t watch what’s happening.

The steer block replaces a fixed stage list with a decision loop. Given an action space of available blocks, it observes the current state — candidates, scores, budget remaining, what’s been tried — and picks what to do next:

| State observed | Steer decides |
| --- | --- |
| Scores are low and similar | Generate more diverse candidates |
| One candidate is close but has a fixable flaw | Score with feedback → revise |
| 80% of budget spent, best score is 85 | Stop, return best |
| Execute block found a real failure | Focus refinement there |
| 3 rounds with no improvement | Try a completely different approach |

With steer, a recipe collapses to: steer(action_space=[produce, score, revise, execute, reduce]). That’s it. Produce is just another action — steer can request more candidates at any point, with different params (wider search, different model, different lens). The steer block is the recipe — everything else is its action space.
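A toy version of such a decision loop, with invented state fields and thresholds (not vario's actual steer implementation), might look like:

```python
def steer(state: dict) -> str:
    """Observe the pipeline state and pick the next block to run."""
    if state["budget_spent_frac"] >= 0.8 and state["best_score"] >= 85:
        return "stop"                # good enough; don't burn the rest
    if state["rounds_without_improvement"] >= 3:
        return "switch_approach"     # plateau: try something structurally different
    if state["best_score"] >= 70 and state["has_fixable_flaw"]:
        return "revise"              # one candidate is close: targeted repair
    if state["score_spread"] < 5:
        return "produce"             # candidates too similar: widen the search
    return "score"                   # default: get more signal
```

In a learned version, these hand-written thresholds would be replaced by the accumulated action statistics described below: same observation-to-action shape, data-driven rules.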

Why this matters

This is where the learning system pays off. Steer doesn’t just make one-off decisions — it accumulates knowledge about which actions help in which states. “When scores plateau after score, switching to execute (grounded feedback) breaks the plateau 70% of the time.” The action policy improves with every run.
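One possible shape for that accumulation, with an invented log format and state tags for illustration:

```python
from collections import defaultdict

# Logged observations: (state tag, action taken, did it improve the result?)
log = [
    ("plateau_after_score", "execute", True),
    ("plateau_after_score", "execute", True),
    ("plateau_after_score", "execute", True),
    ("plateau_after_score", "execute", False),
    ("plateau_after_score", "produce", False),
]

stats = defaultdict(lambda: [0, 0])   # (state, action) -> [helped, tried]
for state, action, helped in log:
    stats[(state, action)][0] += int(helped)
    stats[(state, action)][1] += 1

def best_action(state: str) -> str:
    """Pick the action with the highest observed success rate in this state."""
    rates = {a: h / t for (s, a), (h, t) in stats.items() if s == state}
    return max(rates, key=rates.get)
```

A claim like "execute breaks the plateau 70% of the time" is then just a success-rate lookup, and the policy sharpens as the log grows.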

The Research Says This Works

Recent work validates every piece of vario’s approach — and reveals the gap vario fills:

Self-MoA
Li et al., Feb 2025
Same-model ensembles match or beat multi-model mixtures. Quality of each sample matters 1.4–3.2× more than diversity between models.
→ Strategy choice matters more than model diversity. The “how to think” question dominates.
Scaling Test-Time Compute
Snell et al., ICLR 2025
Adaptive compute allocation per difficulty yields 2–4× efficiency gains over uniform sampling. Easy problems need less, hard problems need more.
→ Vario’s move library routes effort where it pays off, not uniformly everywhere.
MetaScale
Chen et al., Mar 2025
GPT-4o + learned meta-strategies beats o1-mini. Uses multi-armed bandits (MAB) plus genetic evolution to discover “meta-thoughts” — but within a single model, without persistence.
→ Meta-strategy learning works. Vario adds what MetaScale lacks: persistence, multi-model, and interactive exploration.
Conductor
Guo et al., Dec 2024
RL-learned workflow orchestration hits 99.4% on MATH, 93.3% on AIME, beating GPT-5 solo — with only ~3 agent calls on average.
→ Learned composition of reasoning moves works. Conductor does it with RL; vario does it with experience-based learning.
Large Language Monkeys
Brown et al., Jul 2024
Coverage scales log-linearly with samples: 250 samples solve problems 0 of 1 can. But gains plateau without verification — the verification bottleneck.
→ Repeated sampling alone isn’t enough. Vario adds the verification, critique, and synthesis that make samples useful.
LLM Ensemble Survey
Chen et al., Feb 2025
First systematic taxonomy: pre-inference (routing), during-inference (collaboration), post-inference (voting/fusion). Most systems cover one phase.
→ Vario spans all three phases — and adds persistent learning across runs that no surveyed system has.
Vario’s unique position

No published system combines: persistent learning across runs + composable reasoning primitives + multi-model generation + interactive exploration. MetaScale discovers strategies but doesn’t persist them. Conductor learns compositions but on fixed training data. Self-MoA shows strategy matters more than model diversity. Vario sits at the intersection.

Full survey with 30+ papers: meta_llm_strategies.html

Polya’s Framework, Computationally

Vario implements a problem-solving structure humans have used since Polya wrote it down in 1945 — and that LLMs skip entirely. The four phases map directly to vario’s stage types:

| Polya phase | What it does | Vario stages / moves |
| --- | --- | --- |
| 1. Understand | What’s the unknown? Have I seen this before? | expand, route / Gather, Frame, Generalize |
| 2. Plan | Find a related problem. Pick an approach. | route + MoveLibrary / meta-level learning |
| 3. Execute | Carry out the plan. Check each step. | produce (engine — parallel across models) / Generate, Decompose |
| 4. Look back | Check the result. Could I derive it differently? | score, verify, refine / Verify, Critique, Synthesize, Hedge |

Polya considered step 1 the most important and the most neglected. Eighty years later, LLMs have the same blind spot. The expand stage exists to close it: think about what to think about, before generating any answers.

Architecture

```
Question

① UNDERSTAND ────────────────────────────────────────────
  [expand]  classify → build context for this problem type
            Gather:     what do we need to know?
            Frame:      which lenses apply?
            Generalize: what worked for similar problems?
  [route]   match situation → compose moves from library

② EXECUTE ───────────────────────────────────────────────
  [produce] engine fans out across models in parallel
                 ╱    │    ╲
            model₁  model₂  model₃   ← breadth: diverse perspectives
                 ╲    │    ╱

③ EVALUATE ──────────────────────────────────────────────
  [score]   judge against rubric     ← Verify
  [score]   identify weaknesses      ← Critique
  [verify]  check correctness        ← Verify
  [filter]  keep top candidates      ← Hedge

④ IMPROVE ───────────────────────────────────────────────
  [refine]  improve based on score   ← Optimize   ↺ iterate
  [reduce]  combine best elements    ← Synthesize
  [vote]    aggregate judgments → select

Answer

MoveLibrary ←──── record outcome (move × context × outcome, adaptive priority table)
```

Concrete Example: Investment Thesis Evaluation

When a user says “evaluate this investment thesis,” a sophisticated vario session composes moves from its library:

  1. Gather: “What are the key claims? Let me extract them first.”
  2. Frame: “Each claim needs different analysis — market size needs data, competitive moat needs adversarial thinking, team quality needs precedent matching.”
  3. Generate: Run market-size through search-augmented models; moat through adversarial lens; team through precedent-matching.
  4. Verify: Cross-check market numbers against multiple sources.
  5. Critique: “What’s the strongest bear case? What assumption would kill the thesis?”
  6. Hedge: “What are we most uncertain about? Where would more information change the conclusion?”
  7. Synthesize: Merge into structured assessment with confidence levels per claim.

This is not a single “strategy” — it’s a plan composed from moves, where the plan itself was chosen based on problem analysis.

Compute Tiers

Not every question needs a 10-stage pipeline. The research is clear: most quality gain comes early. Vario lets you dial compute to the problem.

| Budget | What you get | When to use |
| --- | --- | --- |
| 1× | One-shot with good system prompt | Factual lookup, simple tasks |
| 2–3× | CoT, step-back, single score pass, or rubric-first | Most tasks — this tier captures most of the quality gain |
| 5× | Self-consistency (5 samples), produce → score → revise | Important decisions, estimates, evaluations |
| 10× | Tree-of-thought, Reflexion loops, exhaustive comparison | High-stakes, complex, or formal reasoning |

Key finding: self-consistency (sample N, majority vote) is the single highest-ROI LLM strategy — +18% on GSM8K at 5× cost. (Wang et al. 2023)
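Self-consistency is simple enough to sketch in a few lines. Here `sample_answer` is a stand-in for a real LLM call, invented for illustration:

```python
import random
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most common answer among the samples."""
    return Counter(samples).most_common(1)[0][0]

def sample_answer(question: str) -> str:
    # Stand-in for one LLM sample: right ~70% of the time, noisy otherwise.
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

def self_consistency(question: str, n: int = 5) -> str:
    """Sample n answers independently and let the majority decide."""
    return majority_vote([sample_answer(question) for _ in range(n)])
```

The intuition: independent samples make correlated correct answers reinforce each other while uncorrelated errors split the vote, so accuracy rises with n at linear cost.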

Composable Building Blocks

Everything in Vario composes at three levels:

Moves

10 cognitive operations: Gather, Frame, Generate, Verify, Critique, Optimize, Generalize, Decompose, Synthesize, Hedge. The atomic primitives.

Strategies

Compositions of moves. 19 built-in (majority vote, reflexion, debate, tree search…) or define your own in YAML. Today’s strategies — tomorrow, learned compositions.

Move Library

Tracks which move compositions work for which situations. Adaptive priority tables evolve with experience — the system gets better at choosing.

Example: rubric-first evaluation

```yaml
name: rubric_first_evaluation
description: Build rubric before generating, then score against it
budget_usd: 2.00
stages:
  - type: enrich      # what does "good" mean here?
    params: {approach: rubric, model: sonnet}
  - type: produce     # multiple models, diverse candidates
    params: {models: [sonnet, gpt, gemini], n: 2}
  - type: score       # score against rubric + identify weaknesses
  - type: revise      # improve based on score
  - type: reduce      # combine best elements
```

Example: iterative optimization with real-world feedback

```yaml
name: iterate_with_execute
description: Generate variants, test them against reality, iterate
budget_usd: 5.00
stages:
  - type: produce     # seed batch of candidates
    params: {n: 5, model: grok-fast}
  - type: loop
    params: {max_rounds: 10, min_improvement: 0.02}
    stages:
      - type: execute   # run backtest / compile / screenshot — real-world eval
        params: {handler: finance.eval.evaluator.evaluate_variant_from_json}
      - type: score     # analyze what worked and why
        params: {model: haiku}
      - type: produce   # next batch informed by results
        params: {n: 3, model: grok-fast}
```

Example: composition — recipes reference other recipes

```yaml
# debate is a named recipe: produce → score → reduce → score
# use it as type: debate with param overrides
name: iterative_debate
budget_usd: 5.00
stages:
  - type: debate      # expands to 4 stages inline
    params: {generator: grok-fast, critic: haiku, n: 5}
  - type: loop
    params: {max_rounds: 3}
    stages:
      - type: execute
        params: {handler: my.evaluator}
      - type: debate    # same recipe, different params
        params: {generator: sonnet, n: 3}
```

Where It Targets

Vario is most valuable where a first draft is easy but a good answer requires judgment:

1. Decision Support

Investment thesis, architecture choice, strategy call. Build rubric, generate from multiple reasoning styles, evaluate, surface second-order effects, synthesize.

2. Research Synthesis

Ingest many sources. Each model interprets differently. Cross-source comparison surfaces agreements, contradictions, gaps. New synthesis, not summary of summaries.

3. Creative Production

Study a subfield deeply → extract what makes the best examples work → generate candidates informed by those principles → evaluate against exemplar-derived rubric → refine.

4. Prose & Argument

Multiple models draft with different rhetorical strategies. Evaluate argument structure, evidence quality, persuasiveness. Extract best paragraphs, synthesize into one stronger piece.

5. Domain Expert Building (the long game)

Accumulate domain experience from the web, users, and experts. Each question builds on everything learned before. After 100 questions in a domain, answers are measurably better than the first 10.

Theoretical Grounding

Vario doesn’t invent a new theory. It implements established ones computationally.

Polya’s How to Solve It
1945
The four-phase pipeline. Understand → Plan → Execute → Look Back.
Kahneman’s System 1 / System 2
2011
One-shot = System 1. Strategy pipeline = System 2. Vario is “System 2 for LLMs.”
Bloom’s Taxonomy
1956, rev. 2001
Task classification for route — remember, understand, apply, analyze, evaluate, create.
Klein’s Recognition-Primed Decisions
1998
MoveLibrary situation tags. Experts recognize patterns and match playbooks, not exhaustive comparison.
Schoenfeld’s Problem-Solving
1985
Extends Polya with metacognition — monitoring your own reasoning. Maps to Critique and Verify moves.
Flavell’s Metacognition
1979
Thinking about thinking. The expand and route stages are metacognitive acts — choosing how to think.

The Arc

Phase 1 (now)

Recipe Engine

20 recipes, composable blocks, execute block for real-world grounding, vario CLI, slim YAML format, NL → recipe parsing.

Phase 3

Autonomous Iteration

Headless recipe runs overnight. Execute blocks bridge to domain evaluators. System discovers which recipes work for which domains. Novel recipes composed from primitives.

Phase 4 (north star)

Parallel Superpowers

Any Claude Code session calls vario as a tool for parallel, multi-model, tracked exploration. The learning compounds. The system gets measurably better at choosing approaches.

Key Components

Blocks — Atomic operations: produce, score, execute, revise, reduce, verify, enrich, loop. Each is an async Python function that can call LLMs, tools, APIs, or spawn sessions.

Recipes — YAML compositions of blocks. Named, parameterized with $variables, composable (a recipe can reference other recipes via type:). Built-in library + user-defined.

Executor — Pipeline engine with loop/convergence detection, budget tracking, soft landing on budget exhaustion, and structured traces for every stage.

Execute block — Bridges to real-world evaluation: backtests, code execution, screenshots, external APIs. The grounding that makes feedback loops real.

Move Library — SQLite-backed record of which recipes work for which problem types. Adaptive priority tables evolve with experience.

vario CLI — Recipe runner. Takes NL or YAML, runs through the executor, returns results. Also usable as a tool from Claude Code sessions.
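Of these components, the executor's loop behavior is the most mechanical. A toy sketch of its convergence and budget handling, where parameter names mirror the recipe `loop` params but the real internals may differ:

```python
def run_loop(step, max_rounds: int = 10, min_improvement: float = 0.02,
             budget_usd: float = 5.0) -> float:
    """Iterate a step until convergence or budget exhaustion; keep the best score."""
    best, spent = 0.0, 0.0
    for round_no in range(max_rounds):
        score, cost = step(round_no)   # step returns (score, cost in USD)
        spent += cost
        improvement = score - best
        best = max(best, score)
        if spent >= budget_usd:
            break                      # soft landing: stop, return best so far
        if round_no > 0 and improvement < min_improvement:
            break                      # converged: gains too small to continue
    return best
```

The soft landing matters: blowing the budget mid-run still returns the best candidate found, rather than failing.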

Try It

The fastest way to see the difference: take a question you care about and run it two ways, once as a direct one-shot prompt and once through a vario recipe. Compare the answers.

Good test questions: anything where the first answer feels “fine” but you suspect there’s a better one. Evaluations, estimates, explanations, decisions with trade-offs. Or anything with a measurable objective — that’s where the execute block shines.

Open Vario Studio → vario.localhost

Full README (single source of truth)  ·  Related work survey (30+ papers)  ·  Benchmark data