By thinking 2–3× as much about a problem — generating alternatives, evaluating from more angles, building up background — you can almost always improve on a first-pass answer. Vario makes that effort computational, not human.
Ask an LLM a hard question and it generates one answer, immediately, with no preparation. It doesn’t build a rubric for what “good” means. It doesn’t look for precedent. It doesn’t consider alternative framings. It doesn’t check its own work.
These aren’t exotic techniques — they’re what any thoughtful person does on a question that matters. The gap between a quick take and a considered one is real and valuable. Vario exists to close that gap automatically.
The difference between “good enough” and “genuinely good” is almost always more effort applied in the right places. Vario makes that effort computational: generating alternatives broadly, evaluating from more angles, building background before answering, and learning from the delta between what the quick take missed and what the deeper pass found.
Vario is not “19 strategies that compete.” It’s a composable language for expressing how to think about a problem — and a learning system that discovers which thinking moves work for which situations.
The 19 current strategies (majority_vote, debate, tree_search, etc.) are specific instantiations, but the real value is the primitives they’re built from and the learning that accumulates across runs. A strategy is just a composition of moves. The moves are the atoms:
| Move | What It Does | Examples |
|---|---|---|
| Gather | Collect more information before committing | Search for evidence, poll multiple sources, check assumptions |
| Frame | Choose a lens through which to analyze | “Think like an economist”, “What would a skeptic say?” |
| Generate | Produce candidate solutions | Multiple models, temperatures, prompts, styles |
| Verify | Check if a candidate is correct or good | Execute code, check math, LLM-as-judge, self-consistency |
| Critique | Identify weaknesses in a candidate | Adversarial review, find counterexamples, steelman the opposing view |
| Optimize | Define a loss function and improve against it | Set up rubric → score → iterate on low dimensions |
| Generalize | Collect precedents and extract patterns | “What worked last time?”, “What do similar problems share?” |
| Decompose | Break into independent sub-problems | Branch into parallel sub-tasks, then merge |
| Synthesize | Merge the best parts of multiple candidates | Debate, rubric-guided fusion, extract-and-combine |
| Hedge | Identify risks and failure modes | “How could this go wrong?”, red-team, pre-mortem |
A strategy like debate is really: Generate × 3 → Critique (mutual) → Synthesize (strongest arguments). tree_search is: Generate → Verify → Decompose (expand promising branches) → Synthesize. The current YAML strategies hard-wire these compositions — the goal is to make the composition itself learnable.
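Read as code, a composition like debate is just function chaining. A minimal sketch, with stub functions standing in for the real LLM-backed moves (everything here is illustrative, not vario's internals):

```python
# Hypothetical sketch: a strategy is a composition of move functions.
# Each move is stubbed; in vario these would be LLM-backed blocks.

def generate(problem, n=3):
    # Generate: produce n candidate answers (stubbed).
    return [f"candidate-{i} for {problem}" for i in range(n)]

def critique(candidates):
    # Critique: each candidate is reviewed against the others (stubbed).
    return {c: f"weaknesses of {c}" for c in candidates}

def synthesize(candidates, critiques):
    # Synthesize: merge the strongest surviving arguments (stubbed).
    return " + ".join(candidates)

def debate(problem):
    # debate = Generate x 3 -> Critique (mutual) -> Synthesize
    cands = generate(problem, n=3)
    crits = critique(cands)
    return synthesize(cands, crits)
```

Making the composition learnable means the system, not the YAML author, decides which chain of moves to run.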
Vario learns at two levels, each feeding the other:

At the problem level, which moves to lead with: “This looks like a problem where we need more info before generating” → lead with Gather. “This has a clear objective function” → lead with Optimize. “High uncertainty” → Generate diversity → Critique → Hedge.

At the technique level, which implementations work: “For code, Verify via execution works; for prose, LLM-as-judge with debiasing.” “Gemini excels at search-augmented tasks; Opus at nuanced reasoning.” “Temperature 0.9 helps creative, hurts factual.”
The move library (SQLite) currently tracks strategy × situation win/loss. The future is tracking move × context × outcome, so the system can compose novel strategies from primitives rather than selecting from a fixed menu.
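What move × context × outcome tracking could look like, as a minimal sketch (table and column names are illustrative, not vario's actual schema):

```python
import sqlite3

# Illustrative schema: record each move's outcome in a given context,
# then query win rates to inform future composition. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE move_outcomes (
        move    TEXT,    -- e.g. 'Verify'
        context TEXT,    -- e.g. 'code', 'prose'
        won     INTEGER  -- 1 if the move improved the result
    )
""")
rows = [("Verify", "code", 1), ("Verify", "code", 1), ("Verify", "prose", 0)]
conn.executemany("INSERT INTO move_outcomes VALUES (?, ?, ?)", rows)

def win_rate(move, context):
    # Fraction of runs where this move helped in this context.
    n, wins = conn.execute(
        "SELECT COUNT(*), SUM(won) FROM move_outcomes WHERE move=? AND context=?",
        (move, context)).fetchone()
    return (wins or 0) / n if n else 0.0
```

With data at this granularity, a composer can rank candidate moves per context instead of picking from a fixed strategy menu.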
Fair question. A Claude Code session with tool calling already does “generate hypothesis, run code, see result, iterate.” So what do recipes add?
| Capability | Tool calling session | Recipes (vario) |
|---|---|---|
| Interactive | Yes | Not yet |
| Parallel | No — sequential, one thought at a time | Yes — fan out across models and candidates |
| Multi-model | No — one model per session | Yes — right model for each step |
| Tool access | Yes | Yes — blocks can call anything |
| Headless / batch | No — needs live session | Yes — fire and forget |
| Cost-optimized | No — one model for everything | Yes — haiku for scoring, opus for synthesis |
| Tracked / comparable | No — sessions are ephemeral | Yes — structured traces, A/B testable |
| Learnable | No — starts from zero every session | Yes — which recipes work for which problems? |
The honest answer: for one-shot interactive tasks, just prompt the LLM. Recipes earn their keep when you need parallelism, multiple models, cost control, or — most importantly — when you want to learn what works across many runs.
A Claude Code session can call vario as a tool — offloading the parallel, multi-model, batch work while keeping the session interactive and adaptive. Like how a developer uses CI: you don’t run 50 tests one at a time in your terminal, you kick off a pipeline and get results back. The session decides when to reach for vario vs just thinking directly.
The killer feature isn’t “better prompting”; it’s the execute block, which bridges the LLM pipeline to real-world evaluation.
Without the execute block, everything is LLM-judging-LLM — turtles all the way down. With it, the feedback loop is grounded in reality. The LLM generates hypotheses; the real world evaluates them.
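A toy illustration of grounded feedback, with a hypothetical handler that scores a candidate by actually evaluating it rather than asking another model whether it looks right:

```python
# Hypothetical execute handler: check a candidate against reality.
# Here "reality" is arithmetic; in practice it would be a backtest,
# a test suite, or an external API.

def execute_handler(candidate):
    expr, claimed = candidate["expr"], candidate["claimed"]
    actual = eval(expr)  # stand-in for running real-world evaluation
    return {"passed": actual == claimed, "actual": actual}

result = execute_handler({"expr": "17 * 23", "claimed": 391})
```

The pass/fail signal comes from execution, not from a second LLM opinion, which is what breaks the turtles-all-the-way-down loop.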
Parallelism and multi-model are one-time efficiency gains. Learning what works is cumulative.
With structured recipes and tracked results, every run leaves a structured trace you can compare, A/B test, and learn from. Raw prompting and tool calling are stateless: every session starts from zero. Recipes are stateful across runs: each run contributes to the knowledge of what works. That’s the real moat.
Recipes are to tool calling what CI pipelines are to running commands manually. Parallel + multi-model = faster. Learning = smarter over time.
One way or another, you need to capture how to do something like score — what rubric, which model, how harsh, whether to give feedback or just a number. A prompt buries these choices inside a wall of text. A recipe makes them explicit, tunable, and composable.
This matters because the architecture is open to new processing types. Today we have 8 block types (produce, score, verify, revise, reduce, execute, enrich, loop). Adding a new one is:
1. Write an async function with the `(candidates, params, context)` signature
2. Register it in `BLOCK_REGISTRY`

No framework changes, no schema migration, no executor rewrite. The same extensibility applies to how existing blocks work: a score block’s behavior is fully determined by its params (rubric, model, feedback mode), all visible in the YAML, all tunable per-recipe.
A recipe doesn’t just say “score this.” It says: use haiku, score on [correctness, reasoning quality], don’t give feedback, just score. That specificity is what makes recipes comparable and improvable.
Need a “gather” block that searches the web before generating? A “decompose” block that splits problems into sub-tasks? A “steer” block that adaptively picks the next step? Write the function, register it, use it in YAML. The executor handles everything else.
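In spirit, the registration step is tiny. A minimal sketch, assuming the `(candidates, params, context)` signature and `BLOCK_REGISTRY` name from above (the decorator and the gather block itself are hypothetical):

```python
import asyncio

# Illustrative registry: every block shares one async signature, and adding
# a new processing type is just writing a function and registering it.
BLOCK_REGISTRY = {}

def register(name):
    def wrap(fn):
        BLOCK_REGISTRY[name] = fn
        return fn
    return wrap

@register("gather")
async def gather_block(candidates, params, context):
    # Hypothetical gather block: collect background before generating (stubbed).
    context["background"] = f"notes on {params['query']}"
    return candidates
```

Once registered, a `type: gather` stage in YAML would resolve to this function with no executor changes.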
This is the architectural bet: blocks are cheap to add, recipes are cheap to compose, and the learning system discovers which compositions work. The vocabulary of thinking moves grows over time without requiring redesign.
Fixed pipelines always run the same steps regardless of how the problem is going. best_of_n generates 5 candidates even if the first one scored 98. self_refine runs 3 rounds even if the answer stopped improving after 1. The pipeline doesn’t watch what’s happening.
The steer block replaces a fixed stage list with a decision loop. Given an action space of available blocks, it observes the current state — candidates, scores, budget remaining, what’s been tried — and picks what to do next:
| State observed | Steer decides |
|---|---|
| Scores are low and similar | Generate more diverse candidates |
| One candidate is close but has a fixable flaw | Score with feedback → revise |
| 80% of budget spent, best score is 85 | Stop, return best |
| Execute block found a real failure | Focus refinement there |
| 3 rounds with no improvement | Try a completely different approach |
With steer, a recipe collapses to: steer(action_space=[produce, score, revise, execute, reduce]). That’s it. Produce is just another action — steer can request more candidates at any point, with different params (wider search, different model, different lens). The steer block is the recipe — everything else is its action space.
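A skeletal steer loop, with a trivial hand-written policy standing in for the learned one (state fields and rules here are illustrative assumptions):

```python
# Skeletal steer loop: observe state, pick the next block, stop when good
# enough or out of budget. The policy rules are hand-written stand-ins
# for the learned action policy.

def policy(state):
    if state["budget_left"] <= 0 or state["best_score"] >= 90:
        return "stop"
    if state["rounds_no_improvement"] >= 3:
        return "produce"   # plateau: try a completely different approach
    return "revise" if state["has_feedback"] else "score"

def steer(state, actions):
    trace = []
    while (a := policy(state)) != "stop":
        trace.append(a)
        state = actions[a](state)
    return trace
```

The learned version replaces `policy` with a table (or model) keyed on observed state, updated from the outcome of every run.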
This is where the learning system pays off. Steer doesn’t just make one-off decisions — it accumulates knowledge about which actions help in which states. “When scores plateau after score, switching to execute (grounded feedback) breaks the plateau 70% of the time.” The action policy improves with every run.
Recent work validates every piece of vario’s approach — and reveals the gap vario fills:
No published system combines: persistent learning across runs + composable reasoning primitives + multi-model generation + interactive exploration. MetaScale discovers strategies but doesn’t persist them. Conductor learns compositions but on fixed training data. Self-MoA shows strategy matters more than model diversity. Vario sits at the intersection.
Full survey with 30+ papers: meta_llm_strategies.html
Vario implements a problem-solving structure humans have used since Polya wrote it down in 1945 — and that LLMs skip entirely. The four phases map directly to vario’s stage types:
| Polya phase | What it does | Vario stages / moves |
|---|---|---|
| 1. Understand | What’s the unknown? Have I seen this before? | `expand`, `route` / Gather, Frame, Generalize |
| 2. Plan | Find a related problem. Pick an approach. | `route` + MoveLibrary / meta-level learning |
| 3. Execute | Carry out the plan. Check each step. | `produce` (engine, parallel across models) / Generate, Decompose |
| 4. Look back | Check the result. Could I derive it differently? | `score` → `verify` → `refine` / Verify, Critique, Synthesize, Hedge |
Polya considered step 1 the most important and the most neglected. Eighty years later, LLMs have the same blind spot. The expand stage exists to close it: think about what to think about, before generating any answers.
When a user says “evaluate this investment thesis,” a sophisticated vario session composes moves from its library: Gather evidence, Frame through multiple lenses, Generate candidate analyses, Critique them, Hedge against failure modes, then Synthesize.
This is not a single “strategy” — it’s a plan composed from moves, where the plan itself was chosen based on problem analysis.
Not every question needs a 10-stage pipeline. The research is clear: most quality gain comes early. Vario lets you dial compute to the problem.
| Budget | What you get | When to use |
|---|---|---|
| 1× | One-shot with good system prompt | Factual lookup, simple tasks |
| 2× | CoT, step-back, single score pass, or rubric-first | Most tasks — this tier captures most of the quality gain |
| 5× | Self-consistency (5 samples), produce → score → revise | Important decisions, estimates, evaluations |
| 10× | Tree-of-thought, Reflexion loops, exhaustive comparison | High-stakes, complex, or formal reasoning |
Key finding: self-consistency (sample N, majority vote) is the single highest-ROI LLM strategy — +18% on GSM8K at 5× cost. (Wang et al. 2023)
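Self-consistency is simple enough to sketch in a few lines (the sampler is stubbed; in practice each sample is an independent high-temperature LLM call):

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    # Sample n answers independently, return the most common one.
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler: a noisy process that is right more often than wrong.
samples = iter(["42", "42", "17", "42", "39"])
answer = self_consistency(lambda: next(samples), n=5)
```

Majority voting only helps when answers can be compared for equality, which is why it shines on math and short-form tasks.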
Everything in Vario composes at three levels:
Moves: 10 cognitive operations (Gather, Frame, Generate, Verify, Critique, Optimize, Generalize, Decompose, Synthesize, Hedge). The atomic primitives.

Strategies: compositions of moves. 19 built-in (majority vote, reflexion, debate, tree search…) or define your own in YAML. Today’s strategies; tomorrow, learned compositions.

Move Library: tracks which move compositions work for which situations. Adaptive priority tables evolve with experience, so the system gets better at choosing.
Vario is most valuable where a first draft is easy but a good answer requires judgment:
Investment thesis, architecture choice, strategy call. Build rubric, generate from multiple reasoning styles, evaluate, surface second-order effects, synthesize.
Ingest many sources. Each model interprets differently. Cross-source comparison surfaces agreements, contradictions, gaps. New synthesis, not summary of summaries.
Study a subfield deeply → extract what makes the best examples work → generate candidates informed by those principles → evaluate against exemplar-derived rubric → refine.
Multiple models draft with different rhetorical strategies. Evaluate argument structure, evidence quality, persuasiveness. Extract best paragraphs, synthesize into one stronger piece.
Accumulate domain experience from the web, users, and experts. Each question builds on everything learned before. After 100 questions in a domain, answers are measurably better than the first 10.
Vario doesn’t invent a new theory. It implements established ones computationally.
Today: 20 recipes, composable blocks, execute block for real-world grounding, vario CLI, slim YAML format, NL → recipe parsing.

Next: recipe-as-type expansion with $params. A/B testing across recipes. Route block auto-selects. Steer block for adaptive control. Track record accumulates.

Then: headless recipe runs overnight. Execute blocks bridge to domain evaluators. System discovers which recipes work for which domains. Novel recipes composed from primitives.

End state: any Claude Code session calls vario as a tool for parallel, multi-model, tracked exploration. The learning compounds. The system gets measurably better at choosing approaches.
Blocks — Atomic operations: produce, score, execute, revise, reduce, verify, enrich, loop. Each is an async Python function that can call LLMs, tools, APIs, or spawn sessions.
Recipes — YAML compositions of blocks. Named, parameterized with $variables, composable (a recipe can reference other recipes via type:). Built-in library + user-defined.
Executor — Pipeline engine with loop/convergence detection, budget tracking, soft landing on budget exhaustion, and structured traces for every stage.
Execute block — Bridges to real-world evaluation: backtests, code execution, screenshots, external APIs. The grounding that makes feedback loops real.
Move Library — SQLite-backed record of which recipes work for which problem types. Adaptive priority tables evolve with experience.
vario CLI — Recipe runner. Takes NL or YAML, runs through the executor, returns results. Also usable as a tool from Claude Code sessions.
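A highly simplified executor loop, assuming hypothetical stage and budget shapes (the real executor adds convergence detection and richer traces; this only sketches budget tracking and soft landing):

```python
# Simplified executor: run stages in order, track budget, stop early
# ("soft landing") when budget runs out, and keep a structured trace.

def run_pipeline(stages, state, budget):
    trace = []
    for name, fn in stages:
        cost = fn(state)                 # each stage mutates state, returns its cost
        budget -= cost
        trace.append({"stage": name, "cost": cost, "budget_left": budget})
        if budget <= 0:                  # soft landing: stop with best-so-far
            break
    return state, trace
```

Because every stage emits a trace entry, runs are comparable after the fact, which is what the Move Library consumes.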
The fastest way to see the difference: take a question you care about and run it two ways.
- `vario best_of_n "your question"`: produce 5, score, pick best
- `vario parse "debate with 3 strong models"`: generates runnable recipe YAML
- `vario recipe.yaml "optimize" --handler my.evaluator`: execute block bridges to your code
- `vario list`: show all built-in recipes

Good test questions: anything where the first answer feels “fine” but you suspect there’s a better one. Evaluations, estimates, explanations, decisions with trade-offs. Or anything with a measurable objective, which is where the execute block shines.
Open Vario Studio → vario.localhost
Full README (single source of truth) · Related work survey (30+ papers) · Benchmark data