# Vario Pipeline Topologies

---

## 1. The Core Pattern: produce → score → select

Every vario topology is a variation of this diagram.

![Core Pattern](diagrams/02_best_of_n.png)

Three phases:
- **Produce** — generate candidates in parallel. Same model at temperature > 0 (diversity from sampling) or multiple models (diversity from different training)
- **Score** — evaluate each candidate independently. Rubric-based, verification, pairwise, programmatic, or vote (see scoring methods below)
- **Select** — pick the winner. Top score, majority vote, weighted vote, or synthesis

What changes between topologies is what happens *inside* each box.
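
The whole pattern fits in a few lines. A minimal sketch in Python, where `produce` and `score` are stand-ins for a real model call and a real judge call (the names are illustrative, not an actual vario API):

```python
from typing import Callable

def best_of_n(produce: Callable[[], str],
              score: Callable[[str], float],
              n: int = 4) -> str:
    # Produce: n independent candidates (parallel in a real system).
    candidates = [produce() for _ in range(n)]
    # Score: evaluate each candidate independently.
    scored = [(score(c), c) for c in candidates]
    # Select: top score wins.
    return max(scored, key=lambda pair: pair[0])[1]
```

Every topology below swaps out one of these three callables while keeping the other two fixed.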

### The single-shot call is the degenerate case

What everyone does today: `produce(n=1)`, no score, no select. One candidate, no quality signal, no comparison. The pipeline still exists — it's just trivially short. Everything below is what you get by making it longer.

### Voting: a different select phase

Same produce, same score — but select by vote instead of top score. The diagram is identical except the select box.

![Voting](diagrams/06_voting.png)

- Research: consensus@8 = 100% on AIME 2025 (o4-mini)
- Simple voting captures most gains attributed to debate (NeurIPS 2025)
- Best for definite-answer problems (math, MCQ) where "most common answer" is meaningful
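
A sketch of the swapped-in select phase, assuming candidates have already been reduced to a comparable final answer (e.g. the boxed number in a math solution):

```python
from collections import Counter

def select_by_vote(answers: list[str]) -> str:
    """Select phase for definite-answer problems: most common answer wins.
    No judge call needed; agreement itself is the quality signal."""
    return Counter(answers).most_common(1)[0][0]
```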

### Multi-model: a different produce phase

Same score, same select — but produce with different models instead of repeated samples from the same model.

![Multi-Model](diagrams/03_multi_model.png)

- Different training data → different blind spots → genuine diversity
- Research: +4-6% accuracy, 30% fewer factual errors (heterogeneous models)
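
The produce box barely changes shape: instead of sampling one model n times, call each model once. A sketch with models as plain callables (hypothetical stand-ins for real API clients):

```python
from typing import Callable

def multi_model_produce(task: str,
                        models: list[Callable[[str], str]]) -> list[str]:
    # One candidate per model: diversity comes from different training,
    # not from temperature. Scoring and selection downstream are unchanged.
    return [model(task) for model in models]
```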

---

### Scoring methods (pluggable)

How candidates get evaluated is independent of how they were generated:

| Method | How it works | Best for |
|--------|-------------|----------|
| **Rubric scoring** | Judge rates each candidate 0-100 on named criteria (correctness, thoroughness, clarity) | General quality — most common default |
| **Verification** | Judge checks specific claims for factual accuracy (pass/fail per claim) | Factual tasks, research |
| **Pairwise comparison** | Judge sees two candidates, picks the better one | Subjective quality, writing, style |
| **Programmatic eval** | Run code, check against test suite or reference answer | Code, math, structured output |
| **Majority vote** | No judge — count which answer appears most often | Definite-answer problems (MCQ, math) |

The topology is how candidates are *generated*; scoring is how they're *evaluated*. Mix and match.
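
The mix-and-match point can be made concrete by treating a scorer as just a function from candidate to number. A sketch with two toy scorers (the rubric one is a deliberately silly stand-in for a judge call):

```python
from typing import Callable

Scorer = Callable[[str], float]

def programmatic_scorer(reference: str) -> Scorer:
    # Programmatic eval: pass/fail against a reference answer.
    return lambda candidate: 1.0 if candidate.strip() == reference else 0.0

def length_penalty_scorer(target: int) -> Scorer:
    # Toy rubric stand-in: closer to a target length scores higher.
    return lambda candidate: 1.0 / (1 + abs(len(candidate) - target))

def select_top(candidates: list[str], scorer: Scorer) -> str:
    # The select phase doesn't care which scoring method produced the number.
    return max(candidates, key=scorer)
```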

### Tournament: a different scoring phase

Instead of scoring each candidate independently, compare every candidate pairwise against every other. Winner by aggregate preference.

![Tournament](diagrams/10_tournament.png)

- N candidates → N×(N-1)/2 pairwise comparisons (4 candidates = 6 pairs)
- More judge calls, but more reliable ranking
- Especially useful when rubric-based scoring is noisy or subjective — a judge that can't assign consistent 0-100 scores can still reliably say "A is better than B"
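
A round-robin sketch, with `prefer` as a hypothetical stand-in for a pairwise judge call:

```python
from itertools import combinations
from typing import Callable

def tournament(candidates: list[str],
               prefer: Callable[[str, str], str]) -> str:
    """Every pair judged once: N*(N-1)/2 comparisons.
    Winner is the candidate with the most pairwise wins."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[prefer(a, b)] += 1
    return max(candidates, key=wins.__getitem__)
```

In practice you'd also randomize presentation order per pair, since judges show position bias.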

---

## 2. Refinement Loop (score + feedback → revise → repeat)

The editorial loop: draft, get structured feedback, revise, repeat until quality converges.

![Refinement](diagrams/04_refinement.png)

- Feedback must come from a *different* model (same-model = +1.8%, cross-model = +80%)
- Typically converges in 3-4 rounds
- Stop conditions: score threshold, min improvement, drift detection, budget
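
The stop conditions compose into one loop. A sketch where `feedback` stands in for the cross-model judge (returning a score and revision notes) and `revise` for the author model applying them; both names and the default values are illustrative:

```python
from typing import Callable

def refine(draft: str,
           feedback: Callable[[str], tuple[float, str]],  # cross-model judge
           revise: Callable[[str, str], str],             # author applies notes
           threshold: float = 0.9,
           min_gain: float = 0.01,
           max_rounds: int = 4) -> str:
    score, notes = feedback(draft)
    for _ in range(max_rounds):                  # stop: budget
        if score >= threshold:                   # stop: score threshold
            break
        new_draft = revise(draft, notes)
        new_score, new_notes = feedback(new_draft)
        if new_score - score < min_gain:         # stop: converged / drifting
            break
        draft, score, notes = new_draft, new_score, new_notes
    return draft
```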

---

## 3. Confirm (fast draft → frontier verify)

Draft cheap, verify expensive. Output-first workflow.

![Confirm](diagrams/05_confirm.png)

- Cheap model gets you 80% there, expensive model catches the gaps
- Use when you need speed AND quality

---

## 4. Every Box Can Be Arbitrarily Complex

Here's the basic pipeline:

![Shallow](diagrams/08a_shallow.png)

Now deepen it. The same structure, but produce and score have expanded into sub-pipelines:

![Deep](diagrams/08b_deep.png)

The outer structure hasn't changed — still produce → score → select. But now "produce" internally researches relevant context before generating, runs its own score/select to pick the best draft, and hands that to the outer score. And "score" internally fact-checks claims and verifies reasoning rather than just asking a judge for a number.

This is what makes vario more than a retry loop. Any box can contain:

- **Research first** — retrieve prior analyses, fetch domain knowledge, search for relevant context — then generate with it. The produce box becomes a retrieval + generation pipeline.
- **Learn relevant principles** — query a knowledge base for domain-specific rules that apply to this problem, inject them into the prompt. The produce box incorporates accumulated expertise.
- **Generate plans, not just solutions** — produce multiple *approaches* to the problem, score them for feasibility, pick the best plan, then execute that plan. You don't jump to an answer — you first decide *how* to answer. This is what human experts do on hard problems.
- **Decompose the problem** — break the question into sub-questions, solve each through its own pipeline, then synthesize.

Research supports this layering. The DeepMind ICLR 2025 result on compute-optimal test-time scaling shows that allocating compute to *planning which approach to take* is more efficient than spending it on *more attempts with the same approach*. Generating 5 plans and executing the best one outperforms generating 50 solutions from one plan.

Every level gets its own traces, costs, and provenance — costs roll up so you can audit the full tree.
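
To sketch the decomposition case from the list above: a produce box that is internally its own pipeline. All names here are hypothetical; the point is that `solve` could itself be a full produce → score → select pipeline, which is where the recursion comes from:

```python
from typing import Callable

def decompose_produce(question: str,
                      split: Callable[[str], list[str]],       # sub-questions
                      solve: Callable[[str], str],             # per-part pipeline
                      synthesize: Callable[[list[str]], str]) -> str:
    """A 'produce' box that is internally decompose -> solve -> synthesize.
    From the outside it still just emits one candidate."""
    parts = split(question)
    answers = [solve(part) for part in parts]
    return synthesize(answers)
```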

---

## 5. Adaptive Routing (effort allocation)

Hard items get more pipeline. Easy items fast-path. Effort spent where it matters.

![Adaptive](diagrams/09_adaptive.png)

- Research: 4× efficiency from adaptive allocation vs uniform (DeepMind ICLR 2025)
- The pipeline observes its own output and decides how deep to go
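
The routing decision itself is simple; the hard part is the difficulty probe. A sketch with a cheap probe returning an estimate in [0, 1] (all callables are illustrative stand-ins):

```python
from typing import Callable

def adaptive(task: str,
             difficulty: Callable[[str], float],  # cheap probe, 0..1
             fast_path: Callable[[str], str],     # single call
             deep_path: Callable[[str], str],     # full pipeline
             cutoff: float = 0.5) -> str:
    """Spend the full pipeline only where the probe says it's needed."""
    return deep_path(task) if difficulty(task) >= cutoff else fast_path(task)
```

In a real system the probe might be the model's own confidence on a first draft, which is what "the pipeline observes its own output" means in practice.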

---

## Strengths at a Glance

| Topology | Core strength | Limitation |
|----------|--------------|------------|
| **Best-of-N** | Breadth — explores the output space | All candidates share the same context and training |
| **Multi-model** | Diversity — genuinely different perspectives | More expensive per candidate |
| **Voting** | Robustness — resistant to noisy scoring | Only works when there's a definite right answer |
| **Tournament** | Reliable ranking — consistent even when absolute scores aren't | O(n²) judge calls |
| **Refinement** | Depth — iteratively improves a single answer | Needs external feedback to work (not self-critique) |
| **Confirm** | Speed — cheap draft, expensive verification | Limited to problems where fast models get close |
| **Deepened boxes** | Power — incorporates research, planning, decomposition | Complexity; harder to debug |
| **Adaptive** | Efficiency — matches effort to difficulty | Requires knowing what "hard" looks like |
