
Vario

By thinking 2–3× as much about a problem — generating alternatives, evaluating from more angles, building up background — you can almost always improve on a first-pass answer. Vario makes that effort computational, not human.

Ask an LLM a hard question and it generates one answer, immediately, with no preparation. It doesn’t build a rubric for what “good” means. It doesn’t look for precedent. It doesn’t consider alternative framings. It doesn’t check its own work.

These aren’t exotic techniques — they’re what any thoughtful person does on a question that matters. The gap between a quick take and a considered one is real and valuable. Vario exists to close that gap automatically.

The bet

The difference between “good enough” and “genuinely good” is almost always more effort applied in the right places. Vario makes that effort computational — generates alternatives broadly, evaluates from more angles, builds background before answering, and learns from the delta between what the quick take missed and what the deeper pass found.

Not 19 Strategies — A Language for Thinking

Vario is not “19 strategies that compete.” It’s a composable language for expressing how to think about a problem — and a learning system that discovers which thinking moves work for which situations.

The 19 current strategies (majority_vote, debate, tree_search, etc.) are specific instantiations, but the real value is the primitives they’re built from and the learning that accumulates across runs. A strategy is just a composition of moves. The moves are the atoms:

| Move | What It Does | Examples |
| --- | --- | --- |
| Gather | Collect more information before committing | Search for evidence, poll multiple sources, check assumptions |
| Frame | Choose a lens through which to analyze | “Think like an economist”, “What would a skeptic say?” |
| Generate | Produce candidate solutions | Multiple models, temperatures, prompts, styles |
| Verify | Check if a candidate is correct or good | Execute code, check math, LLM-as-judge, self-consistency |
| Critique | Identify weaknesses in a candidate | Adversarial review, find counterexamples, steelman the opposing view |
| Optimize | Define a loss function and improve against it | Set up rubric → score → iterate on low dimensions |
| Generalize | Collect precedents and extract patterns | “What worked last time?”, “What do similar problems share?” |
| Decompose | Break into independent sub-problems | Branch into parallel sub-tasks, then merge |
| Synthesize | Merge the best parts of multiple candidates | Debate, rubric-guided fusion, extract-and-combine |
| Hedge | Identify risks and failure modes | “How could this go wrong?”, red-team, pre-mortem |

A strategy like debate is really: Generate × 3 → Critique (mutual) → Synthesize (strongest arguments). tree_search is: Generate → Verify → Decompose (expand promising branches) → Synthesize. The current YAML strategies hard-wire these compositions — the goal is to make the composition itself learnable.
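As a rough illustration of strategies-as-data, the compositions above might be expressed as follows. The `Move` class and function names here are hypothetical, not vario's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Move:
    kind: str     # one of the ten atoms: gather, frame, generate, ...
    params: dict

def debate(n: int = 3) -> list[Move]:
    """debate = Generate x n -> Critique (mutual) -> Synthesize."""
    return [
        Move("generate", {"n": n, "diverse": True}),
        Move("critique", {"mode": "mutual"}),
        Move("synthesize", {"keep": "strongest_arguments"}),
    ]

def tree_search() -> list[Move]:
    """tree_search = Generate -> Verify -> Decompose -> Synthesize."""
    return [
        Move("generate", {}),
        Move("verify", {}),
        Move("decompose", {"expand": "promising_branches"}),
        Move("synthesize", {}),
    ]

# A strategy is just a value: the learning system can store, compare, and mutate it.
plan = debate(n=5)
```

Because a plan is plain data rather than hard-wired control flow, making "the composition itself learnable" reduces to searching over lists like these.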

Two Levels of Learning

Vario learns at two levels, each feeding the other:

Meta Level — How to Approach

“This looks like a problem where we need more info before generating” → lead with Gather. “This has a clear objective function” → lead with Optimize. “High uncertainty” → Generate diversity → Critique → Hedge.

Domain Level — What Works Here

“For code, Verify via execution works; for prose, LLM-as-judge with debiasing.” “Gemini excels at search-augmented tasks; Opus at nuanced reasoning.” “Temperature 0.9 helps creative, hurts factual.”

The move library (SQLite) currently tracks strategy × situation win/loss. The future is tracking move × context × outcome, so the system can compose novel strategies from primitives rather than selecting from a fixed menu.
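A minimal sketch of what move × context × outcome tracking could look like in SQLite. The table and column names below are assumptions for illustration, not vario's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE move_outcomes (
        move    TEXT NOT NULL,    -- e.g. 'verify'
        context TEXT NOT NULL,    -- situation tag, e.g. 'code', 'prose'
        won     INTEGER NOT NULL  -- 1 if the move improved the result
    )
""")
conn.executemany(
    "INSERT INTO move_outcomes VALUES (?, ?, ?)",
    [("verify", "code", 1), ("verify", "code", 1), ("verify", "prose", 0)],
)

# Win rate per move × context: the signal for composing novel strategies.
rows = conn.execute("""
    SELECT move, context, AVG(won) AS win_rate
    FROM move_outcomes
    GROUP BY move, context
    ORDER BY move, context
""").fetchall()
```

With data in this shape, "Verify via execution works for code" is just a high `win_rate` row rather than a hand-written heuristic.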

Why Not Just Prompt the LLM?

Fair question. A Claude Code session with tool calling already does “generate hypothesis, run code, see result, iterate.” So what do recipes add?

| Capability | Tool calling session | Recipes (vario) |
| --- | --- | --- |
| Interactive | Yes | Not yet |
| Parallel | No — sequential, one thought at a time | Yes — fan out across models and candidates |
| Multi-model | No — one model per session | Yes — right model for each step |
| Tool access | Yes | Yes — blocks can call anything |
| Headless / batch | No — needs live session | Yes — fire and forget |
| Cost-optimized | No — one model for everything | Yes — haiku for scoring, opus for synthesis |
| Tracked / comparable | No — sessions are ephemeral | Yes — structured traces, A/B testable |
| Learnable | No — starts from zero every session | Yes — which recipes work for which problems? |

The honest answer: for one-shot interactive tasks, just prompt the LLM. Recipes earn their keep when you need parallelism, multiple models, cost control, or — most importantly — when you want to learn what works across many runs.

Recipes as tools

A Claude Code session can call vario as a tool — offloading the parallel, multi-model, batch work while keeping the session interactive and adaptive. Like how a developer uses CI: you don’t run 50 tests one at a time in your terminal, you kick off a pipeline and get results back. The session decides when to reach for vario vs just thinking directly.

The Execute Block — Real-World Grounding

The killer feature isn’t “better prompting” — it’s the execute block, which bridges the LLM pipeline to real-world evaluation.

Without the execute block, everything is LLM-judging-LLM — turtles all the way down. With it, the feedback loop is grounded in reality. The LLM generates hypotheses; the real world evaluates them.
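As a sketch of what such grounding can look like, here is a hypothetical execute-style handler that runs candidate code for real. The function name and return shape are illustrative assumptions, not vario's handler contract:

```python
import os
import subprocess
import sys
import tempfile

def evaluate_candidate(code: str) -> dict:
    """Execute candidate Python code and report a grounded pass/fail verdict."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=10
        )
        # Reality, not an LLM judge, decides whether this candidate passed.
        return {"passed": proc.returncode == 0, "stderr": proc.stderr}
    finally:
        os.unlink(path)
```

The same pattern applies to backtests, compilers, or screenshot diffs: the handler returns an objective signal the pipeline can score against.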

The Compounding Advantage

Parallelism and multi-model are one-time efficiency gains. Learning what works is cumulative.

With structured recipes and tracked results, every run becomes data: traces can be compared, recipes A/B-tested, and outcomes fed back into the move library.

Raw prompting and tool calling are stateless — every session starts from zero. Recipes are stateful across runs — each run contributes to the knowledge of what works. That’s the real moat.

The analogy

Recipes are to tool calling what CI pipelines are to running commands manually. Parallel + multi-model = faster. Learning = smarter over time.

Extensible by Design

One way or another, you need to capture how to do something like score — what rubric, which model, how harsh, whether to give feedback or just a number. A prompt buries these choices inside a wall of text. A recipe makes them explicit, tunable, and composable.

This matters because the architecture is open to new processing types. Today we have 8 block types (produce, score, verify, revise, reduce, execute, enrich, loop). Adding a new one is:

  1. Write an async Python function with the standard (candidates, params, context) signature
  2. Register it in BLOCK_REGISTRY
  3. Use it in any recipe YAML immediately

No framework changes, no schema migration, no executor rewrite. The same extensibility applies to how existing blocks work — a score block’s behavior is fully determined by its params (rubric, model, feedback mode), all visible in the YAML, all tunable per-recipe.
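The three-step extension above can be sketched as follows. The registry shape and the example `gather` block are illustrative assumptions, not vario's actual code:

```python
import asyncio
from typing import Any

BLOCK_REGISTRY: dict[str, Any] = {}

# 1. An async function with the standard (candidates, params, context) signature.
async def gather(candidates: list, params: dict, context: dict) -> list:
    """Collect background material into the context before generation."""
    query = params.get("query", "")
    # Stand-in for a real web/API search call.
    context.setdefault("background", []).append(f"search results for: {query}")
    return candidates  # gather enriches context; candidates pass through

# 2. Register it.
BLOCK_REGISTRY["gather"] = gather

# 3. A recipe can now reference `type: gather`; the executor looks it up here.
ctx: dict = {}
asyncio.run(BLOCK_REGISTRY["gather"]([], {"query": "market size"}, ctx))
```

Because every block shares one signature, the executor never needs to know what a block does internally, only how to call it.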

Structured HOW

A recipe doesn’t just say “score this.” It says: use haiku, score on [correctness, reasoning quality], don’t give feedback, just score. That specificity is what makes recipes comparable and improvable.

Easy to extend

Need a “gather” block that searches the web before generating? A “decompose” block that splits problems into sub-tasks? A “steer” block that adaptively picks the next step? Write the function, register it, use it in YAML. The executor handles everything else.

This is the architectural bet: blocks are cheap to add, recipes are cheap to compose, and the learning system discovers which compositions work. The vocabulary of thinking moves grows over time without requiring redesign.

The Steer Block — Adaptive Orchestration

Fixed pipelines always run the same steps regardless of how the problem is going. best_of_n generates 5 candidates even if the first one scored 98. self_refine runs 3 rounds even if the answer stopped improving after 1. The pipeline doesn’t watch what’s happening.

The steer block replaces a fixed stage list with a decision loop. Given an action space of available blocks, it observes the current state — candidates, scores, budget remaining, what’s been tried — and picks what to do next:

| State observed | Steer decides |
| --- | --- |
| Scores are low and similar | Generate more diverse candidates |
| One candidate is close but has a fixable flaw | Score with feedback → revise |
| 80% of budget spent, best score is 85 | Stop, return best |
| Execute block found a real failure | Focus refinement there |
| 3 rounds with no improvement | Try a completely different approach |

With steer, a recipe collapses to: steer(action_space=[produce, score, revise, execute, reduce]). That’s it. Produce is just another action — steer can request more candidates at any point, with different params (wider search, different model, different lens). The steer block is the recipe — everything else is its action space.
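A toy version of such a decision loop, with invented state fields and thresholds (not vario's actual steer implementation), might look like:

```python
def steer(state: dict) -> str:
    """Observe the pipeline state and pick the next block to run."""
    if state["budget_spent_frac"] >= 0.8 and state["best_score"] >= 85:
        return "stop"                # good enough; don't burn the rest
    if state["rounds_without_improvement"] >= 3:
        return "switch_approach"     # plateau: try something structurally different
    if state["best_score"] >= 70 and state["has_fixable_flaw"]:
        return "revise"              # one candidate is close: targeted repair
    if state["score_spread"] < 5:
        return "produce"             # candidates too similar: widen the search
    return "score"                   # default: get more signal
```

In a learned version, these hand-written thresholds would be replaced by the accumulated action statistics described below: same observation-to-action shape, data-driven rules.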

Why this matters

This is where the learning system pays off. Steer doesn’t just make one-off decisions — it accumulates knowledge about which actions help in which states. “When scores plateau after score, switching to execute (grounded feedback) breaks the plateau 70% of the time.” The action policy improves with every run.
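One possible shape for that accumulation, with an invented log format and state tags for illustration:

```python
from collections import defaultdict

# Logged observations: (state tag, action taken, did it improve the result?)
log = [
    ("plateau_after_score", "execute", True),
    ("plateau_after_score", "execute", True),
    ("plateau_after_score", "execute", True),
    ("plateau_after_score", "execute", False),
    ("plateau_after_score", "produce", False),
]

stats = defaultdict(lambda: [0, 0])   # (state, action) -> [helped, tried]
for state, action, helped in log:
    stats[(state, action)][0] += int(helped)
    stats[(state, action)][1] += 1

def best_action(state: str) -> str:
    """Pick the action with the highest observed success rate in this state."""
    rates = {a: h / t for (s, a), (h, t) in stats.items() if s == state}
    return max(rates, key=rates.get)
```

A claim like "execute breaks the plateau 70% of the time" is then just a success-rate lookup, and the policy sharpens as the log grows.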

The Research Says This Works

Recent work validates every piece of vario’s approach — and reveals the gap vario fills:

Self-MoA
Li et al., Feb 2025
Same-model ensembles match or beat multi-model mixtures. Quality of each sample matters 1.4–3.2× more than diversity between models.
→ Strategy choice matters more than model diversity. The “how to think” question dominates.
Scaling Test-Time Compute
Snell et al., ICLR 2025
Adaptive compute allocation per difficulty yields 2–4× efficiency gains over uniform sampling. Easy problems need less, hard problems need more.
→ Vario’s move library routes effort where it pays off, not uniformly everywhere.
MetaScale
Chen et al., Mar 2025
GPT-4o + learned meta-strategies beats o1-mini. Uses multi-armed bandits (MAB) plus genetic evolution to discover “meta-thoughts” — but within a single model, without persistence.
→ Meta-strategy learning works. Vario adds what MetaScale lacks: persistence, multi-model, and interactive exploration.
Conductor
Guo et al., Dec 2024
RL-learned workflow orchestration hits 99.4% on MATH, 93.3% on AIME, beating GPT-5 solo — with only ~3 agent calls on average.
→ Learned composition of reasoning moves works. Conductor does it with RL; vario does it with experience-based learning.
Large Language Monkeys
Brown et al., Jul 2024
Coverage scales log-linearly with samples: 250 samples solve problems 0 of 1 can. But gains plateau without verification — the verification bottleneck.
→ Repeated sampling alone isn’t enough. Vario adds the verification, critique, and synthesis that make samples useful.
LLM Ensemble Survey
Chen et al., Feb 2025
First systematic taxonomy: pre-inference (routing), during-inference (collaboration), post-inference (voting/fusion). Most systems cover one phase.
→ Vario spans all three phases — and adds persistent learning across runs that no surveyed system has.
Vario’s unique position

No published system combines: persistent learning across runs + composable reasoning primitives + multi-model generation + interactive exploration. MetaScale discovers strategies but doesn’t persist them. Conductor learns compositions but on fixed training data. Self-MoA shows strategy matters more than model diversity. Vario sits at the intersection.

Full survey with 30+ papers: meta_llm_strategies.html

Polya’s Framework, Computationally

Vario implements a problem-solving structure humans have used since Polya wrote it down in 1945 — and that LLMs skip entirely. The four phases map directly to vario’s stage types:

| Polya phase | What it does | Vario stages / moves |
| --- | --- | --- |
| 1. Understand | What’s the unknown? Have I seen this before? | expand, route / Gather, Frame, Generalize |
| 2. Plan | Find a related problem. Pick an approach. | route + MoveLibrary / meta-level learning |
| 3. Execute | Carry out the plan. Check each step. | produce (engine — parallel across models) / Generate, Decompose |
| 4. Look back | Check the result. Could I derive it differently? | score, verify, refine / Verify, Critique, Synthesize, Hedge |

Polya considered step 1 the most important and the most neglected. Eighty years later, LLMs have the same blind spot. The expand stage exists to close it: think about what to think about, before generating any answers.

Architecture

```
Question

① UNDERSTAND ────────────────────────────────────────────
  [expand]  classify → build context for this problem type
            Gather:     what do we need to know?
            Frame:      which lenses apply?
            Generalize: what worked for similar problems?
  [route]   match situation → compose moves from library

② EXECUTE ───────────────────────────────────────────────
  [produce] engine fans out across models in parallel
                 ╱    │    ╲
            model₁  model₂  model₃   ← breadth: diverse perspectives
                 ╲    │    ╱

③ EVALUATE ──────────────────────────────────────────────
  [score]   judge against rubric     ← Verify
  [score]   identify weaknesses      ← Critique
  [verify]  check correctness        ← Verify
  [filter]  keep top candidates      ← Hedge

④ IMPROVE ───────────────────────────────────────────────
  [refine]  improve based on score   ← Optimize   ↺ iterate
  [reduce]  combine best elements    ← Synthesize
  [vote]    aggregate judgments → select

Answer

MoveLibrary ←──── record outcome (move × context × outcome, adaptive priority table)
```

Concrete Example: Investment Thesis Evaluation

When a user says “evaluate this investment thesis,” a sophisticated vario session composes moves from its library:

  1. Gather: “What are the key claims? Let me extract them first.”
  2. Frame: “Each claim needs different analysis — market size needs data, competitive moat needs adversarial thinking, team quality needs precedent matching.”
  3. Generate: Run market-size through search-augmented models; moat through adversarial lens; team through precedent-matching.
  4. Verify: Cross-check market numbers against multiple sources.
  5. Critique: “What’s the strongest bear case? What assumption would kill the thesis?”
  6. Hedge: “What are we most uncertain about? Where would more information change the conclusion?”
  7. Synthesize: Merge into structured assessment with confidence levels per claim.

This is not a single “strategy” — it’s a plan composed from moves, where the plan itself was chosen based on problem analysis.

Compute Tiers

Not every question needs a 10-stage pipeline. The research is clear: most quality gain comes early. Vario lets you dial compute to the problem.

| Budget | What you get | When to use |
| --- | --- | --- |
| 1× | One-shot with good system prompt | Factual lookup, simple tasks |
| 2–3× | CoT, step-back, single score pass, or rubric-first | Most tasks — this tier captures most of the quality gain |
| 5× | Self-consistency (5 samples), produce → score → revise | Important decisions, estimates, evaluations |
| 10× | Tree-of-thought, Reflexion loops, exhaustive comparison | High-stakes, complex, or formal reasoning |

Key finding: self-consistency (sample N, majority vote) is the single highest-ROI LLM strategy — +18% on GSM8K at 5× cost. (Wang et al. 2023)
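Self-consistency is simple enough to sketch in a few lines. Here `sample_answer` is a stand-in for a real LLM call, invented for illustration:

```python
import random
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most common answer among the samples."""
    return Counter(samples).most_common(1)[0][0]

def sample_answer(question: str) -> str:
    # Stand-in for one LLM sample: right ~70% of the time, noisy otherwise.
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

def self_consistency(question: str, n: int = 5) -> str:
    """Sample n answers independently and let the majority decide."""
    return majority_vote([sample_answer(question) for _ in range(n)])
```

The intuition: independent samples make correlated correct answers reinforce each other while uncorrelated errors split the vote, so accuracy rises with n at linear cost.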

Composable Building Blocks

Everything in Vario composes at three levels:

Moves

10 cognitive operations: Gather, Frame, Generate, Verify, Critique, Optimize, Generalize, Decompose, Synthesize, Hedge. The atomic primitives.

Strategies

Compositions of moves. 19 built-in (majority vote, reflexion, debate, tree search…) or define your own in YAML. Today’s strategies — tomorrow, learned compositions.

Move Library

Tracks which move compositions work for which situations. Adaptive priority tables evolve with experience — the system gets better at choosing.

Example: rubric-first evaluation

```yaml
name: rubric_first_evaluation
description: Build rubric before generating, then score against it
budget_usd: 2.00
stages:
  - type: enrich      # what does "good" mean here?
    params: {approach: rubric, model: sonnet}
  - type: produce     # multiple models, diverse candidates
    params: {models: [sonnet, gpt, gemini], n: 2}
  - type: score       # score against rubric + identify weaknesses
  - type: revise      # improve based on score
  - type: reduce      # combine best elements
```

Example: iterative optimization with real-world feedback

```yaml
name: iterate_with_execute
description: Generate variants, test them against reality, iterate
budget_usd: 5.00
stages:
  - type: produce     # seed batch of candidates
    params: {n: 5, model: grok-fast}
  - type: loop
    params: {max_rounds: 10, min_improvement: 0.02}
    stages:
      - type: execute   # run backtest / compile / screenshot — real-world eval
        params: {handler: finance.eval.evaluator.evaluate_variant_from_json}
      - type: score     # analyze what worked and why
        params: {model: haiku}
      - type: produce   # next batch informed by results
        params: {n: 3, model: grok-fast}
```

Example: composition — recipes reference other recipes

```yaml
# debate is a named recipe: produce → score → reduce → score
# use it as type: debate with param overrides
name: iterative_debate
budget_usd: 5.00
stages:
  - type: debate      # expands to 4 stages inline
    params: {generator: grok-fast, critic: haiku, n: 5}
  - type: loop
    params: {max_rounds: 3}
    stages:
      - type: execute
        params: {handler: my.evaluator}
      - type: debate    # same recipe, different params
        params: {generator: sonnet, n: 3}
```

Where It Targets

Vario is most valuable where a first draft is easy but a good answer requires judgment:

1. Decision Support

Investment thesis, architecture choice, strategy call. Build rubric, generate from multiple reasoning styles, evaluate, surface second-order effects, synthesize.

2. Research Synthesis

Ingest many sources. Each model interprets differently. Cross-source comparison surfaces agreements, contradictions, gaps. New synthesis, not summary of summaries.

3. Creative Production

Study a subfield deeply → extract what makes the best examples work → generate candidates informed by those principles → evaluate against exemplar-derived rubric → refine.

4. Prose & Argument

Multiple models draft with different rhetorical strategies. Evaluate argument structure, evidence quality, persuasiveness. Extract best paragraphs, synthesize into one stronger piece.

5. Domain Expert Building (the long game)

Accumulate domain experience from the web, users, and experts. Each question builds on everything learned before. After 100 questions in a domain, answers are measurably better than the first 10.

Theoretical Grounding

Vario doesn’t invent a new theory. It implements established ones computationally.

Polya’s How to Solve It
1945
The four-phase pipeline. Understand → Plan → Execute → Look Back.
Kahneman’s System 1 / System 2
2011
One-shot = System 1. Strategy pipeline = System 2. Vario is “System 2 for LLMs.”
Bloom’s Taxonomy
1956, rev. 2001
Task classification for route — remember, understand, apply, analyze, evaluate, create.
Klein’s Recognition-Primed Decisions
1998
MoveLibrary situation tags. Experts recognize patterns and match playbooks, not exhaustive comparison.
Schoenfeld’s Problem-Solving
1985
Extends Polya with metacognition — monitoring your own reasoning. Maps to Critique and Verify moves.
Flavell’s Metacognition
1979
Thinking about thinking. The expand and route stages are metacognitive acts — choosing how to think.

The Arc

Phase 1 (now)

Recipe Engine

20 recipes, composable blocks, execute block for real-world grounding, vario CLI, slim YAML format, NL → recipe parsing.

Phase 3

Autonomous Iteration

Headless recipe runs overnight. Execute blocks bridge to domain evaluators. System discovers which recipes work for which domains. Novel recipes composed from primitives.

Phase 4 (north star)

Parallel Superpowers

Any Claude Code session calls vario as a tool for parallel, multi-model, tracked exploration. The learning compounds. The system gets measurably better at choosing approaches.

Key Components

Blocks — Atomic operations: produce, score, execute, revise, reduce, verify, enrich, loop. Each is an async Python function that can call LLMs, tools, APIs, or spawn sessions.

Recipes — YAML compositions of blocks. Named, parameterized with $variables, composable (a recipe can reference other recipes via type:). Built-in library + user-defined.

Executor — Pipeline engine with loop/convergence detection, budget tracking, soft landing on budget exhaustion, and structured traces for every stage.

Execute block — Bridges to real-world evaluation: backtests, code execution, screenshots, external APIs. The grounding that makes feedback loops real.

Move Library — SQLite-backed record of which recipes work for which problem types. Adaptive priority tables evolve with experience.

vario CLI — Recipe runner. Takes NL or YAML, runs through the executor, returns results. Also usable as a tool from Claude Code sessions.
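Of these components, the executor's loop behavior is the most mechanical. A toy sketch of its convergence and budget handling, where parameter names mirror the recipe `loop` params but the real internals may differ:

```python
def run_loop(step, max_rounds: int = 10, min_improvement: float = 0.02,
             budget_usd: float = 5.0) -> float:
    """Iterate a step until convergence or budget exhaustion; keep the best score."""
    best, spent = 0.0, 0.0
    for round_no in range(max_rounds):
        score, cost = step(round_no)   # step returns (score, cost in USD)
        spent += cost
        improvement = score - best
        best = max(best, score)
        if spent >= budget_usd:
            break                      # soft landing: stop, return best so far
        if round_no > 0 and improvement < min_improvement:
            break                      # converged: gains too small to continue
    return best
```

The soft landing matters: blowing the budget mid-run still returns the best candidate found, rather than failing.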

Try It

The fastest way to see the difference: take a question you care about and run it two ways, once as a direct one-shot prompt and once through a vario recipe. Compare the answers.

Good test questions: anything where the first answer feels “fine” but you suspect there’s a better one. Evaluations, estimates, explanations, decisions with trade-offs. Or anything with a measurable objective — that’s where the execute block shines.

Open Vario Studio → vario.localhost

Full README (single source of truth)  ·  Related work survey (30+ papers)  ·  Benchmark data