# Strategies Benchmark v2 — Design

> **Status**: Approved
> **Date**: 2026-02-20
> **Goal**: Which problem-solving strategies are most effective? Does scaffolding lift models above their baseline capability? Does the effect transfer to frontier models?

## Dataset

- **100 MATH problems**, randomly sampled from the full 5K test set (hendrycks/competition_math)
- Natural difficulty distribution across 7 subjects (not balanced for any target accuracy)
- `seed=42` for reproducibility
- Same 100 questions for all models and strategies
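The seeded draw can be sketched with the stdlib alone; the pool size and helper name below are illustrative, not the actual dataset loader:

```python
import random

def sample_problem_indices(pool_size: int, n: int = 100, seed: int = 42) -> list[int]:
    """Deterministically sample n problem indices from a pool of pool_size."""
    # A local RNG instance keeps the draw independent of global random state,
    # so every model/strategy run sees the identical 100-problem set.
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), n))

indices = sample_problem_indices(5000)  # MATH test set has ~5K problems
```

Because the RNG is seeded locally, re-running the sampler at any point in the pipeline reproduces the same indices.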

## Models

| # | Model              | Alias          | $/M in | $/M out | Role                  |
|---|--------------------|----------------|--------|---------|------------------------|
| 1 | Gemini 3 Flash     | `gemini-flash` | $0.50  | $2.00   | Cheap test subject     |
| 2 | Grok 4.1 Fast      | `grok`         | $0.20  | $0.50   | Cheap test subject     |
| 3 | Haiku 4.5          | `haiku`        | $1.00  | $5.00   | Cheap test subject     |
| 4 | GPT-5.2            | `gpt`          | $2.50  | $10.00  | Frontier (all tiers)   |

All four models run all strategies. GPT-5.2 is included at full scope to answer the transferability question.

## Strategies (11, 3 tiers)

### Tier 1 — Single-pass

These make a single pass per problem (possibly with several parallel samples) and pick the best answer; there is no feedback loop.

| # | Name             | Calls/q | Description                                    |
|---|------------------|---------|------------------------------------------------|
| 1 | `baseline`       | 1       | Single shot, temp=0                            |
| 2 | `majority_vote`  | 5       | 5 samples at temp=0.8, pick most common answer |
| 3 | `best_of_n`      | 5+5     | 5 samples at temp=0.7, plus 5 LLM-judge calls to score them |
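Majority voting reduces to counting extracted answers; a minimal sketch:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent extracted answer among the samples."""
    counts = Counter(answers)
    # max() returns the first maximal key in insertion order, so ties
    # resolve to the answer that appeared earliest among the samples.
    return max(counts, key=counts.get)

majority_vote(["4", "4", "5", "4", "6"])  # -> "4"
```

Answers should be normalized (e.g. via the `\boxed{}` extractor) before voting, so that `1/2` and `\frac{1}{2}` count as the same answer.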

### Tier 2 — Iterative (2-3 rounds)

These add feedback loops: solve, evaluate, retry.

| # | Name                  | Calls/q | Description                                                       |
|---|-----------------------|---------|-------------------------------------------------------------------|
| 4 | `self_critique`       | ~6      | Solve → critique own solution → retry (up to 2 retries)           |
| 5 | `generate_and_verify` | ~10     | 5 candidates → verify each step-by-step → vote among verified     |
| 6 | `reflexion`           | ~6      | Solve → external critique → retry with critique in context        |
| 7 | `debate`              | ~8      | Generate 2 solutions → adversarial argument (2 rounds) → judge    |
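The solve → critique → retry pattern shared by these strategies can be sketched as below; `call_model` is a placeholder for the real API client, and the prompt wording is illustrative:

```python
def self_critique(problem: str, call_model, max_retries: int = 2) -> str:
    """Solve, critique the own solution, retry -- up to max_retries times."""
    solution = call_model(f"Solve step by step:\n{problem}")
    for _ in range(max_retries):
        critique = call_model(
            f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
            "Check each step. Reply VERDICT: CORRECT or list the errors."
        )
        if "VERDICT: CORRECT" in critique:
            break  # the model accepts its own solution; stop early
        solution = call_model(
            f"Problem:\n{problem}\n\nPrevious attempt:\n{solution}\n\n"
            f"Critique:\n{critique}\n\nWrite a corrected solution."
        )
    return solution
```

The early-exit explains the "~6" calls per question: the loop costs 1 solve plus up to 2 × (critique + retry) + 1 final critique, but terminates as soon as the model passes its own check.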

### Tier 3 — Deep (3+ rounds)

Multi-round, branching, or cross-model strategies.

| # | Name                     | Calls/q | Description                                                           |
|---|--------------------------|---------|-----------------------------------------------------------------------|
| 8 | `progressive_deepening`  | ~6      | Basic prompt → add hints if wrong → add worked example if still wrong |
| 9 | `tree_search`            | ~12     | Branch 3 approaches, verify each, backtrack if stuck, pick best       |
| 10| `cross_model_peer`       | ~8      | Flash solves ↔ Grok critiques (2 rounds), cheap model pairing        |
| 11| `cross_model_frontier`   | ~6      | Cheap model solves → GPT-5.2 critiques → cheap model retries          |

**Note on cross-model strategies:**
- `cross_model_peer` pairs two cheap models (Flash ↔ Grok) to test whether model diversity at the same tier helps.
- `cross_model_frontier` uses GPT-5.2 as a verifier/critic for cheap model generations, testing whether a stronger judge rescues weak generators.
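The frontier-critic loop differs from `self_critique` only in who reviews; a sketch, where `solve` and `critique` are placeholders for the two clients (cheap generator and GPT-5.2 critic) and the prompts are illustrative:

```python
def cross_model_frontier(problem: str, solve, critique, max_rounds: int = 2) -> str:
    """Cheap model solves; a frontier critic reviews; cheap model retries."""
    answer = solve(f"Solve:\n{problem}")
    for _ in range(max_rounds):
        verdict = critique(
            f"Problem:\n{problem}\nCandidate solution:\n{answer}\n"
            "If correct reply OK, otherwise explain the mistake."
        )
        if verdict.strip().startswith("OK"):
            break  # the stronger judge accepts; stop spending frontier calls
        answer = solve(
            f"Problem:\n{problem}\nYour earlier attempt:\n{answer}\n"
            f"A reviewer found this issue:\n{verdict}\nTry again."
        )
    return answer
```

`cross_model_peer` is the same loop with a second cheap model in the `critique` slot, which is what isolates the "stronger judge" effect from the "any second opinion" effect.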

## Correctness Evaluation

Official MATH `is_equiv()` function (LaTeX-aware symbolic comparison). Same as pilot and `benchmarks/eval/math_official.py`.

Answer extraction: `\boxed{}` parser with regex fallbacks.

## Cost Estimate

| Scope                    | Est. calls | Est. cost |
|--------------------------|-----------|-----------|
| 3 cheap models × tier 1  | 3,300     | ~$3       |
| 3 cheap models × tier 2  | 9,000     | ~$9       |
| 3 cheap models × tier 3  | 7,800     | ~$8       |
| GPT-5.2 × tier 1         | 1,100     | ~$11      |
| GPT-5.2 × tier 2         | 3,000     | ~$30      |
| GPT-5.2 × tier 3         | 2,600     | ~$26      |
| Cross-model experiments   | ~1,400    | ~$12      |
| **Total**                | **~28,200** | **~$99** |

**Hard budget cap: $150** (safety margin for retries and longer-than-expected outputs).
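One way the cap could be enforced in the runner (the class and its interface are illustrative; the real runner may track spend differently):

```python
class BudgetGuard:
    """Abort the run once cumulative spend crosses the hard cap."""

    def __init__(self, cap_usd: float = 150.0):
        self.cap = cap_usd
        self.spent = 0.0

    def charge(self, in_tokens: int, out_tokens: int,
               price_in: float, price_out: float) -> None:
        """Record one call's cost; prices are $ per million tokens."""
        self.spent += (in_tokens * price_in + out_tokens * price_out) / 1e6
        if self.spent > self.cap:
            raise RuntimeError(
                f"budget cap ${self.cap:.2f} exceeded (${self.spent:.2f} spent)"
            )
```

Raising mid-run is safe because each tier's results are persisted incrementally, so a budget abort still leaves partial results for analysis.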

## Execution Plan

1. **Smoke test** (10 problems, Flash only, baseline + majority_vote) — validate pipeline, ~$0.10
2. **Tier 1** on all 4 models — cheapest, confirms the 100 problems work end-to-end
3. **Tier 2** on all 4 models — iterative strategies
4. **Tier 3** on all 4 models + cross-model experiments
5. **Generate report** — HTML dashboard with per-tier accuracy, cost-effectiveness, heatmaps

Each tier's results are saved to `experiments.db` incrementally. If budget is exceeded mid-run, we get partial results for analysis.
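Incremental persistence can be as simple as committing per graded result; the schema below is illustrative, not the actual `experiments.db` layout:

```python
import sqlite3

def save_result(db_path: str, strategy: str, model: str,
                problem_id: int, correct: bool) -> None:
    """Append one graded result; committing per row keeps partial runs usable."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(strategy TEXT, model TEXT, problem_id INTEGER, correct INTEGER)"
    )
    con.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (strategy, model, problem_id, int(correct)),
    )
    con.commit()  # a crash or budget abort loses at most the in-flight row
    con.close()
```

Per-row commits are slower than batching, but at ~28K calls total the write volume is trivial and crash-safety matters more.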

## Output

- **Database**: `brain/strategies/data/experiments.db` (existing schema — experiments + results tables)
- **Report**: `brain/strategies/data/benchmark_v2.html` (shareable via `static.localhost`)
- **Analysis**:
  - Per-model accuracy by tier and strategy
  - Cost-effectiveness (accuracy gain per dollar)
  - Per-problem heatmap (which strategies rescue which problems)
  - Statistical significance via McNemar test (paired binary correctness)
  - Tier comparison: does deeper scaffolding actually help?
  - Transferability: do strategies that help cheap models also help GPT-5.2?

## Key Questions This Answers

1. **Which strategies work?** — Ranked by accuracy gain over baseline, broken down by tier.
2. **Do multi-round strategies justify their cost?** — Tier 2 vs tier 1 accuracy gain per dollar.
3. **Does depth help?** — Tier 3 vs tier 2. Are there diminishing returns?
4. **Does model diversity matter?** — Cross-model peer vs single-model strategies.
5. **Does a strong judge help?** — Cross-model frontier vs self-verification.
6. **Do strategies transfer to frontier?** — GPT-5.2 accuracy gains vs cheap model gains.
7. **What's unsolvable?** — Problems no model+strategy combination can solve.

## Relation to Pilot

The pilot (v1) ran 7 single-pass strategies on 16 balanced problems with Haiku and Flash. This experiment:
- Scales from 16 → 100 problems (natural distribution)
- Adds Grok and GPT-5.2
- Adds iterative and deep strategy tiers
- Tests cross-model collaboration
- Uses proper statistical tests

## Implementation Notes

- Reuse `brain/strategies/experiment.py` runner and `brain/strategies/schema.py` types
- New strategy implementations go in `brain/strategies/experiments/` as Python functions
- Monkey-patch `_check_correctness` with `is_equiv` (same as pilot)
- Cross-model strategies need to accept a `critic_model` parameter
- Progressive deepening needs access to problem metadata (subject, difficulty) for hint generation
