# Draft Review Recipe — Considerations & Future Directions

**Date**: 2026-02-27
**Companion to**: `2026-02-27-draft-review-design.md`

Captures ideas, research findings, and future directions beyond MVP.
Organized by theme, not priority.

> **Citation Standard for LLM Research**
>
> When citing SOTA findings in this doc or any review-system document:
> 1. **Date**: Earliest month/year the work was done
> 2. **Systems tested**: Which specific models were evaluated
> 3. **ELO context**: Arena ELO of those models vs current frontier
>
> A finding tested on GPT-4o (May 2024, ELO ~1346) was validated on what is
> now a mid-tier model, ~160 points below the current frontier (Opus 4.6 = 1503,
> Gemini 3.1 Pro = 1500). Results from such papers represent a *floor* —
> current models would likely perform substantially better.
>
> **ELO reference (Feb 2026)**: Opus 4.6 = 1503 · Gemini 3.1 Pro = 1500 ·
> Grok 4.20 = 1495 · Sonnet 4.6 = 1458 · GPT-4.5 = 1444 ·
> DeepSeek R1 = 1419 · GPT-4o (2024) = 1346 · Llama 3.3 70B = 1319

---

## 1. SOTA Research Findings (Feb 2026)

### What the literature says we should do

| # | Finding | Source | Systems tested | ELO gap | Implication |
|---|---------|--------|---------------|---------|-------------|
| F1 | Models converge on same obvious issues | DREAM (Feb 2025) [R1], FREE-MAD (Sep 2025) [R6] | GPT-4o (~1346), Claude 3.5 Sonnet (~1342) | ~160 pts below frontier | Need **anti-convergence** — likely worse with weaker models, but convergence is a structural issue that persists at all tiers |
| F2 | Breadth of critique > depth of iteration | DeepCritic (May 2025) [R4] | 7B fine-tuned critic vs GPT-4o (~1346) | 7B << frontier; GPT-4o ~160 pts below | Multi-lens correct; with stronger models the breadth advantage may narrow but principle holds |
| F3 | Models hallucinate 38% of critique findings | CriticGPT (Jun 2024) [R5], Nature study (2025) [R10] | GPT-4 fine-tuned (~1324 base) | ~180 pts below frontier | Require **evidence citations** — hallucination rate likely lower with frontier models but still non-zero |
| F4 | Structured output degrades creativity -18% | Tam et al. (2025) [R11] | GPT-4o (~1346), Claude 3.5 Sonnet (~1342) | ~160 pts below | **Structured envelope, freeform content** — tradeoff likely persists |
| F5 | Over-criticism buries signal | LLM-as-Judge survey (Nov 2024) [R8] | GPT-4 (~1324), Claude 3 Opus (~1300) | ~200 pts below | Cap at **top N most impactful** |
| F6 | 2-3 passes capture most value; effectiveness drops after iteration 3 | Debugging Decay Index (Jun 2025) [R12] | GPT-4o (~1346), GPT-3.5 (~1100) | 160-400 pts below | Iteration cap in V2; frontier models may sustain more passes but diminishing returns are structural |
| F7 | Quality peaks then degrades after ~6 iterations | Koivisto et al., Nature Sci Rep (2025) [R10] | GPT-4 (~1324) | ~180 pts below | Hard stop on iteration count |
| F8 | Sycophancy causes "disagreement collapse" in multi-agent debate | Leng et al. (Sep 2025) [R13] | GPT-4o (~1346), Claude 3.5 Sonnet (~1342) | ~160 pts below | Independent eval before seeing others' responses |
| F9 | Different models > different temperatures for diversity | Practical multi-model studies [R14] | Various | N/A | Using model diversity (already in design) is correct |
| F10 | Centralized judge resilient to sycophancy | LLM-as-Judge survey (Nov 2024) [R8] | GPT-4 (~1324) as judge | ~180 pts below | Our cross-lens synthesis acts as centralized judge — stronger judge = better |

### Anti-Convergence Techniques (prioritized for V2)

1. **Anonymize model outputs** in per-lens synthesis — don't label outputs by model name; this prevents favoritism toward any one provider. Pattern from llm-council (Karpathy, 2024) [T2].
2. **Devil's advocate variant** — one model per lens gets an adversarial system prompt. Inspired by DREAM's adversarial stance initialization [R1].
3. **Adversarial stance initialization** (DREAM, Feb 2025 [R1]) — tested on GPT-4o (~1346). Two agents given opposing stances. Achieved 95.2% labeling accuracy with 3.5% human involvement on relevance assessment. With frontier models, accuracy likely higher.
4. **Anti-conformity rule** (FREE-MAD, Sep 2025 [R6]) — tested on GPT-4o (~1346), Claude 3.5 Sonnet (~1342). Agents only change beliefs when they see clear evidence, not from social pressure. Single debate round sufficient — cut token costs dramatically.

### Critic & Judge Panel Composition

**MVP panel**: One flagship from each major provider — maximizes genuine diversity of training biases:

| Role | Model | ELO | Why |
|------|-------|-----|-----|
| Critic 1 | Opus 4.6 | 1503 | Strongest overall, deep reasoning |
| Critic 2 | Gemini 3.1 Pro | 1500 | Different training data, strong on GPQA (94.3%) |
| Critic 3 | Grok 4.20 | 1495 | Different provider, independent perspective |
| Critic 4 | GPT-4.5 | 1444 | OpenAI perspective, strong general capability |
| **Judge** (synthesis) | Opus 4.6 | 1503 | Calibrated judge per lib/gym findings |

**Rationale**: Cross-provider diversity matters more than same-provider model count. Four models from different providers will disagree more meaningfully than four Anthropic models. This maps to our `maxthink` preset.

**Lightweight panel** (fast preset): Haiku + Gemini Flash + Grok Fast + GPT-mini. Same provider diversity, lower cost. For quick iteration.

**Judge selection**: Per `lib/gym/calibration/` findings (Feb 2026), Opus discriminates well (scores range 20-72) while some models cluster (Gemini 3.1 Pro gave 100 to 12/15 items). Use Opus for synthesis/judging.

**Open question**: Should critic and judge be from different providers? (Avoids self-bias in judging own provider's critique.) Worth testing in the Review Gym.
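
A minimal sketch of how the two presets could be declared. The `PANELS` dict and the model identifier strings are illustrative placeholders, not real API model IDs:

```python
# Preset panel definitions; model identifier strings are illustrative
# placeholders, not real API model IDs.
PANELS = {
    "maxthink": {
        "critics": ["opus", "gemini-pro", "grok", "gpt"],
        "judge": "opus",
    },
    "fast": {
        "critics": ["haiku", "gemini-flash", "grok-fast", "gpt-mini"],
        "judge": "opus",  # keep the calibrated judge even on the fast preset
    },
}

def panel_for(preset: str) -> dict:
    # Fall back to the default preset for unknown names
    return PANELS.get(preset, PANELS["maxthink"])
```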

### Referenced Work (full citations)

| ID | Paper / Tool | Date | Systems Tested | ELO Range | URL | Key Finding |
|----|-------------|------|---------------|-----------|-----|-------------|
| R1 | DREAM: Debate-based Relevance Assessment (Meng et al.) | Feb 2025 | GPT-4o | ~1346 | [arxiv.org/html/2602.06526](https://arxiv.org/html/2602.06526) | Adversarial stance init → 95.2% accuracy, 3.5% human involvement |
| R2 | Self-Refine (Madaan et al., NeurIPS) | Mar 2023 | GPT-3.5 (~1100), GPT-4 (~1250 est.) | 1100-1250 | [arxiv.org/abs/2303.17651](https://arxiv.org/abs/2303.17651) | Single-model critique→refine loop, ~20% abs. improvement. Tested on models ~250-400 pts below frontier |
| R3 | CRITIC (Gou et al., ICLR) | May 2023 | GPT-3.5, GPT-4 (early) | 1100-1250 | [arxiv.org/abs/2305.11738](https://arxiv.org/abs/2305.11738) | Tool-interactive critique beats introspection-only. Very early models |
| R4 | DeepCritic | May 2025 | 7B fine-tuned vs GPT-4o (~1346) | 7B << 1346 | [arxiv.org/abs/2505.00662](https://arxiv.org/abs/2505.00662) | Dedicated 7B critic outperforms GPT-4o on error identification. Implies: fine-tuned small critic > prompted large general model |
| R5 | CriticGPT (McAleese et al., OpenAI) | Jun 2024 | GPT-4 fine-tuned (~1324 base) | ~1324 | [arxiv.org/abs/2407.00215](https://arxiv.org/abs/2407.00215) | Model critiques preferred over human >80%. Human+AI > either alone. 38% hallucinated findings |
| R6 | FREE-MAD: Consensus-Free Multi-Agent Debate | Sep 2025 | GPT-4o (~1346), Claude 3.5 Sonnet (~1342) | ~1342-1346 | [arxiv.org/abs/2509.11035](https://arxiv.org/abs/2509.11035) | Anti-conformity mechanism. Single round sufficient. Cuts token costs |
| R7 | DMAD: Breaking Mental Set | 2025 | Not specified (OpenReview) | Unknown | [openreview.net/forum?id=t6QHYUOQL7](https://openreview.net/forum?id=t6QHYUOQL7) | Distinct reasoning approaches per agent → highest idea diversity, outperforming humans |
| R8 | Survey on LLM-as-a-Judge (Gu et al.) | Nov 2024 | GPT-4 (~1324), Claude 3 Opus (~1300 est.) | ~1300-1324 | [arxiv.org/abs/2411.15594](https://arxiv.org/abs/2411.15594) | Multi-agent evaluators achieve higher reliability. 80% agreement with human preferences |
| R9 | Constitutional AI (Bai et al., Anthropic) | Dec 2022 | Claude early models | ~1100 est. | [arxiv.org/abs/2212.08073](https://arxiv.org/abs/2212.08073) | Explicit enumerated criteria → specific per-principle assessment. Pattern is model-independent |
| R10 | Iterative refinement quality degradation (Koivisto et al.) | 2025 | GPT-4 (~1324) | ~1324 | [nature.com/articles/s41598-025-31075-1](https://www.nature.com/articles/s41598-025-31075-1) | Quality peaks then degrades after ~6 iterations. Sweet spot around iteration 2-3 |
| R11 | Structured output creativity penalty (Tam et al.) | 2025 | GPT-4o (~1346), Claude 3.5 Sonnet (~1342) | ~1342-1346 | [openreview.net/pdf?id=vYkz5tzzjV](https://openreview.net/pdf?id=vYkz5tzzjV) | JSON: -18.3% insight, -17.1% novelty. Structured Markdown closest to baseline |
| R12 | Debugging Decay Index | Jun 2025 | GPT-4o (~1346), GPT-3.5 (~1100) | 1100-1346 | [arxiv.org/html/2506.18403v2](https://arxiv.org/html/2506.18403v2) | Effectiveness drops exponentially after 3rd debugging attempt |
| R13 | Sycophancy in Multi-Agent Debate (Leng et al.) | Sep 2025 | GPT-4o (~1346), Claude 3.5 Sonnet (~1342) | ~1342-1346 | [arxiv.org/html/2509.23055v1](https://arxiv.org/html/2509.23055v1) | All-sycophantic panels worst. Mix "peacemaker" + "troublemaker" roles |
| R14 | Reflexion (Shinn et al., NeurIPS) | Mar 2023 | GPT-3.5, GPT-4 (early) | 1100-1250 | [arxiv.org/abs/2303.11366](https://arxiv.org/abs/2303.11366) | Verbal reinforcement learning. Episodic memory for iterative improvement |
| R15 | RefCritic | Jul 2025 | Not specified | Unknown | [arxiv.org/html/2507.15024v1](https://arxiv.org/html/2507.15024v1) | Critique quality improves when trained on refinement feedback |
| R16 | Agent-as-a-Judge | Aug 2025 | Not specified | Unknown | [arxiv.org/html/2508.02994v1](https://arxiv.org/html/2508.02994v1) | Judge can take actions (run code, search, verify) rather than just read and score |
| R17 | CISC: Confidence Improves Self-Consistency | 2025 | Not specified | Unknown | [aclanthology.org/2025.findings-acl.1030.pdf](https://aclanthology.org/2025.findings-acl.1030.pdf) | Calibrated confidence scores + weighted voting outperforms basic self-consistency |

### Tools & Implementations

| ID | Tool | URL | Pattern |
|----|------|-----|---------|
| T1 | llm-consortium | [github.com/irthomasthomas/llm-consortium](https://github.com/irthomasthomas/llm-consortium) | Parallel multi-model + arbiter + confidence threshold + iteration |
| T2 | llm-council (Karpathy) | [github.com/karpathy/llm-council](https://github.com/karpathy/llm-council) | Anonymized peer ranking. FastAPI + React |
| T3 | LangGraph reflection | [github.com/langchain-ai/langgraph-reflection](https://github.com/langchain-ai/langgraph-reflection) | Generate→critique→refine state machine |
| T4 | DeepEval | [deepeval.com](https://deepeval.com) | 50+ research-backed metrics, pytest integration |
| T5 | GitHub Copilot Code Review | [github.blog (Apr 2025)](https://github.blog/changelog/2025-04-04-copilot-code-review-now-generally-available/) | Hybrid LLM + static analysis, customizable lenses, multi-model backend |

---

## 2. CSS & Report Generation

### Shared CSS Strategy

Current state: `strategies/report.py` has `_CSS` as a ~350-line Python string constant. This works but means two report generators duplicate CSS.

Plan:
- Extract shared CSS to `vario/static/report_base.css` as source of truth
- Both `strategies/report.py` and `review_report.py` read it at import time
- Self-contained HTML still works (CSS inlined into `<style>`)
- CSS is never LLM-generated — it's a static asset copied as a chunk
- Review-specific CSS additions are a small string constant appended

```python
_CSS_FILE = Path(__file__).parent / "static" / "report_base.css"
_BASE_CSS = _CSS_FILE.read_text() if _CSS_FILE.exists() else ""
_REVIEW_CSS = """/* review-specific additions */..."""
_CSS = _BASE_CSS + _REVIEW_CSS
```

### Foldable Report Sections

To optimize for browsing, the report should be designed for **progressive disclosure**:
- Executive summary + strength badge always visible
- Per-lens findings in `<details>` (collapsible) — already planned
- Individual model outputs nested inside per-lens (double-collapsible)
- Redline/diff views foldable — scan summary, unfold for details
- Consider: "reader mode" toggle that hides metadata/costs and shows only findings + suggestions
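
The collapsible layers reduce to a small HTML-emitting helper; `details_block` is a hypothetical name, shown only to illustrate how the double-collapsible nesting composes:

```python
import html

def details_block(summary: str, body_html: str, open_: bool = False) -> str:
    """Wrap a report section in a collapsible <details> element.

    Per-lens findings collapse by default; nesting a details_block inside
    another's body gives the double-collapsible per-model view.
    """
    attr = " open" if open_ else ""
    return (
        f"<details{attr}><summary>{html.escape(summary)}</summary>\n"
        f"{body_html}\n</details>"
    )
```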

---

## 3. Streamlining & Redline

### Brevity Levels & Audience Adaptation (V2+)

The review pipeline always runs identically — the output layer adapts.

**Brevity levels** (how much detail):

| Level | What you get |
|-------|-------------|
| `brief` | Strength score + top 3 suggestions + one paragraph |
| `standard` | Executive summary + prioritized suggestions + per-lens highlights |
| `full` | Everything — per-lens, per-model, raw outputs, cost breakdown |

**Audience adaptation** (assumed knowledge, jargon):

| Audience | Framing |
|----------|---------|
| `expert` | Field terminology, skip background, assume technique knowledge |
| `general` | Define terms, explain techniques, accessible to non-specialists |
| `executive` | Focus on impact and decisions, minimize methodology |

Both are presentation-layer concerns applied during the synthesis/report phase. The cross-lens synthesis prompt gets audience context: "Write for a {audience} reader at {brevity} depth."

**Further work**: What to present inline vs appendix vs linked detail. Progressive disclosure in HTML (collapsible sections already planned). Could also generate separate report variants (summary.html + full.html) from the same underlying data.
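
A sketch of how the synthesis prompt could receive that context. Only the final template sentence comes from this doc; the hint table and function name are illustrative:

```python
# Hint wording is illustrative; mirrors the brevity-levels table above.
BREVITY_HINTS = {
    "brief": "strength score, top 3 suggestions, one paragraph",
    "standard": "executive summary, prioritized suggestions, per-lens highlights",
    "full": "everything, including per-model outputs and costs",
}

def synthesis_preamble(audience: str, brevity: str) -> str:
    """Prepend audience/brevity context to the cross-lens synthesis prompt."""
    return (
        f"Write for a {audience} reader at {brevity} depth. "
        f"Include: {BREVITY_HINTS[brevity]}."
    )
```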

### Redline / Tracked Changes (V2)

For fine-grained language work, show what changed:

- **Coarse view**: section-level summary ("removed 2 paragraphs from §3, tightened §5")
- **Fine view**: word-level redline (`<del>In order to</del><ins>To</ins>`)
- Implementation: Python `difflib.HtmlDiff` or custom word-level diff
- CSS: deletions in red with strikethrough, additions in green with underline
- Report shows both views: summary for scanning, redline for close review

```html
<span class="del">In order to achieve</span><span class="add">To achieve</span>
<span class="del">a significant improvement in</span><span class="add">better</span>
```
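
The custom word-level diff could be a thin wrapper over `difflib.SequenceMatcher`. A minimal sketch, assuming span classes `del`/`add` as in the HTML above; whitespace handling is deliberately simplistic:

```python
import difflib
import html

def word_redline(old: str, new: str) -> str:
    """Emit a word-level redline as del/add spans."""
    a, b = old.split(), new.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == "equal":
            out.append(html.escape(" ".join(a[i1:i2])))
            continue
        if i1 < i2:  # "delete" or "replace": old words removed
            out.append(f'<span class="del">{html.escape(" ".join(a[i1:i2]))}</span>')
        if j1 < j2:  # "insert" or "replace": new words added
            out.append(f'<span class="add">{html.escape(" ".join(b[j1:j2]))}</span>')
    return " ".join(out)
```

`difflib.HtmlDiff` emits whole side-by-side line tables, which is why a word-level wrapper like this better fits the fine view.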

---

## 4. Specialized Structure Sub-Lenses

The "structure" lens can be decomposed into specialized skills for specific document elements:

| Sub-Lens | What it checks |
|----------|---------------|
| **TOC / Outline** | Is the table of contents logical? Do section titles accurately describe content? Is the hierarchy balanced (no 8 subsections under one header, 1 under another)? |
| **Overview / Introduction** | Does it establish context? Does the reader know what they'll learn and why it matters? Is the scope stated? |
| **Examples** | Are examples well-chosen? Do they illustrate the point or distract? Are they proportionate (too many, too few)? Do they progress from simple to complex? |
| **Abstract / Claims** | Does the abstract accurately represent the content? Are claims in the abstract supported in the body? Is the abstract self-contained? |
| **Conclusion** | Does it summarize without repeating? Does it answer the "so what?" Does it connect back to the opening? |
| **References / Evidence** | Are claims backed by citations? Are sources credible? Are there circular references? |

For MVP, these are all part of the single "structure" lens prompt. For V2, they can be broken out into individual sub-lenses if the structure review needs more depth per element.

---

## 5. Prompt Evolution & Experimentation

Moved to its own doc — applies to all prompt-driven systems, not just review:

**→ [`learning/docs/prompt-evolution.md`](https://static.localhost/learning/docs/prompt-evolution.md)**

Covers: reference corpus calibration, ablation experiments, degradation→restoration tests, A/B testing, user feedback loops, prompt health reports, versioning, and integration with gyms infrastructure.

---

## 6. Advanced Lens: Attention & Emotion Management

For reports, presentations, and documents meant to persuade or educate:

### What it evaluates

- **Attention arc**: Does the document manage the reader's attention? Where are the high-engagement moments vs. the valleys?
- **Payoff points**: Where does the reader get rewarded for their attention investment? Are payoffs front-loaded enough to sustain interest?
- **Cognitive load**: Are dense sections broken up? Is information introduced at a digestible pace?
- **Emotional journey**: For persuasive docs — does it build urgency, provide relief, end with motivation?
- **Scannability**: Can a reader who skims get 80% of the value? (Executive readers do this.)
- **Foldable sections**: Which parts should be expandable/collapsible for different reader depths?

### Output

Beyond findings, this lens produces a **reading experience map**:

```
Section 1 (Introduction):     ██████████ HIGH — strong hook
Section 2 (Background):       ████░░░░░░ MEDIUM — necessary but could be shorter
Section 3 (Method details):   ██░░░░░░░░ LOW — dense, no payoff yet
Section 4 (Results):          █████████░ HIGH — the payoff
Section 5 (Discussion):       ███████░░░ MEDIUM — connects dots
Section 6 (Appendix):         █░░░░░░░░░ LOW — reference only
```

Suggests: make §3 foldable/skippable, move key result preview to §2, add "why this matters" teaser before §3.
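
The bars in the map reduce to a one-liner, assuming each section gets a 0.0-1.0 engagement score (helper name hypothetical):

```python
def engagement_bar(score: float, width: int = 10) -> str:
    """Render a 0.0-1.0 engagement score as a filled/empty block bar."""
    filled = round(score * width)
    return "█" * filled + "░" * (width - filled)
```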

### When to use

Not for all documents — this lens is for:
- External reports meant to be read by non-captive audiences
- Presentations and demos
- Marketing/sales documents
- Grant proposals, investment memos

Skip for: internal technical docs, API docs, reference material.

---

## 7. Automatic Spec Iteration

### How to evolve the review system spec itself

The design doc (`draft-review-design.md`) will need revision as we learn from real use. Process:

1. **After each real review**: Note what worked, what was missing, what was noisy
2. **Capture in logbook**: `vario/LOGBOOK.md` entry per review session
3. **Periodic spec review**: Run the review system on its own spec (meta-review)
4. **Prompt tuning cycle**:
   - Collect 5-10 review sessions of feedback
   - Identify patterns (e.g., "structure lens never finds anything useful" or "logic lens over-flags")
   - Adjust prompts, re-run on test corpus, compare
   - Version prompts (e.g., `logic_critic_v2`)

### Automation opportunities

- After N reviews, auto-generate a "prompt health report" showing per-lens hit rate, user acceptance rate, and severity distribution
- Flag lenses whose findings are consistently rejected (false positive rate too high)
- Flag lenses that never trigger (either the lens is too lenient or it's reviewing the wrong things)
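
The aggregation behind the health report is straightforward once findings carry an accepted/rejected flag. A sketch, assuming each finding is a dict with `lens`, `accepted`, and `severity` keys (shape assumed, not from the design doc):

```python
from collections import defaultdict

def prompt_health(findings: list[dict]) -> dict[str, dict]:
    """Aggregate per-lens totals, acceptance rate, and severity distribution."""
    stats: dict[str, dict] = defaultdict(
        lambda: {"total": 0, "accepted": 0, "severity": defaultdict(int)}
    )
    for f in findings:
        s = stats[f["lens"]]
        s["total"] += 1
        s["accepted"] += int(f["accepted"])
        s["severity"][f["severity"]] += 1
    return {
        lens: {
            "total": s["total"],
            "acceptance_rate": s["accepted"] / s["total"],
            "severity": dict(s["severity"]),
        }
        for lens, s in stats.items()
    }
```

Lenses with low acceptance rate or zero total are exactly the two flag conditions above.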

---

## 8. Auto-Calibration via Review Gym (V2+)

The existing gyms infrastructure (`lib/gym/` + `learning/gyms/`) provides the engine. A **Review Gym** follows the same Gen→Eval→Learn pattern as existing gyms (badge, claim_extraction, llm_task, apply).

### Search Space (knobs to turn)

| Dim | What | Space |
|-----|------|-------|
| **S1** | Prompt wording per lens | Text variants (logic_critic_v1 vs v2 vs v3) |
| **S2** | Model per lens | Maybe logic needs opus, language needs sonnet |
| **S3** | Staging order | Structure-first vs all-parallel vs custom |
| **S4** | Models per lens | 2 vs 4 vs all (cost/quality tradeoff) |
| **S5** | Reduce strategy per lens | synthesize vs debate vs rubric_first |
| **S6** | Temperature per lens | 0.0 vs 0.3 vs 0.7 |
| **S7** | Top-N cap | 5 vs 10 vs 20 suggestions |
| **S8** | Severity thresholds | What counts as critical vs minor |
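
Variant generation over these knobs is a Cartesian product. A sketch over an illustrative slice of S1, S2, and S6 (the dict values are placeholders):

```python
from itertools import product

# Illustrative slice of the search space (S1, S2, S6).
SPACE = {
    "prompt_version": ["logic_critic_v1", "logic_critic_v2", "logic_critic_v3"],
    "model": ["opus", "sonnet"],
    "temperature": [0.0, 0.3, 0.7],
}

def variants(space: dict[str, list]) -> list[dict]:
    """Enumerate every combination of knob settings as config dicts."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*space.values())]
```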

### Optimization Signal (what to measure)

| Signal | Source | Weight |
|--------|--------|--------|
| **O1** | User acceptance rate (applied the suggestion?) | High |
| **O2** | False positive rate (rejected findings) | High (negative) |
| **O3** | Finding actionability (specific fix vs vague) | Medium |
| **O4** | Coverage vs ground truth | High |
| **O5** | Redundancy across lenses | Medium (negative) |
| **O6** | Cost per useful finding | Medium |
| **O7** | Severity calibration accuracy | Medium |
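
The signals could be collapsed into one scalar objective per variant by weighted sum. Weights are illustrative guesses; the table marks O2 and O5 negative, and penalizing O6 (raw cost) is an added assumption here:

```python
# Illustrative weights; O2/O5 negative per the table, O6 negative by assumption.
WEIGHTS = {"O1": 1.0, "O2": -1.0, "O3": 0.5, "O4": 1.0, "O5": -0.5, "O6": -0.5, "O7": 0.5}

def objective(signals: dict[str, float]) -> float:
    """Scalar score for a review-config variant; signals normalized to 0-1."""
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)
```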

### Gym Implementation

```
learning/gyms/
  review/                      ← NEW
    gym.py                     # ReviewGym(GymBase)
    corpus/                    # Test documents with known issues
    tasks/
      prompt_sweep.yaml        # A/B test prompt variants
      model_sweep.yaml         # Which models for which lens
      staging_sweep.yaml       # Staging order variants
      ablation.yaml            # Remove one lens at a time
      degradation.yaml         # Synthetic doc degradation tests
    results/
    reports/
```

Inherits from `lib/gym/base.py` (Candidate, CorpusStore) and uses `lib/gym/judge.py` (calibrated, cached, monotonicity-tested). No new framework needed.

### Gym Pipeline

```
1. CORPUS: Collect reviewed documents + user decisions (accepted/rejected per finding)
2. GENERATE: Run review with variant configs (prompt × model × staging)
3. EVALUATE:
   - vs ground truth (known issues in test docs)
   - vs user acceptance history
   - judge scores (actionability, specificity, evidence quality)
4. LEARN: Update default configs, promote winning prompt variants
```

### Degradation Tests (sensitivity calibration)

Synthetically degrade documents along each dimension, verify the corresponding lens detects it:

```python
async def degrade_document(doc: str, dimension: str, severity: float) -> str:
    """LLM degrades doc along one dimension. severity: 0.0-1.0."""
    ...
```

| Degradation | Tests lens | Expected |
|-------------|-----------|----------|
| Inject passive voice, filler, broken parallelism | Language | Language lens flags them |
| Weaken evidence, add contradictions | Logic+Claims | Logic lens catches them |
| Shuffle sections, remove transitions | Structure | Structure lens flags it |
| Add jargon, remove definitions | Readability | Readability lens flags it |

Measures **sensitivity** (does the lens detect known problems?) and **calibration** (does severity rating match degradation degree?).
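
Calibration could be checked by testing that reported severity is monotone in injected severity. A sketch only; the real gym judge in `lib/gym/judge.py` carries its own monotonicity tests:

```python
SEVERITY_RANK = {"minor": 1, "moderate": 2, "critical": 3}

def calibration(results: list[tuple[float, str]]) -> float:
    """Fraction of adjacent degradation steps with monotone reported severity.

    results: (injected_severity 0.0-1.0, reported_severity_label) pairs,
    one per degraded variant of the same document.
    """
    ordered = sorted(results)
    ranks = [SEVERITY_RANK[label] for _, label in ordered]
    if len(ranks) < 2:
        return 1.0
    ok = sum(ranks[i] <= ranks[i + 1] for i in range(len(ranks) - 1))
    return ok / (len(ranks) - 1)
```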

### Should there be an "optimization expert" module?

No new module — the Review Gym IS the optimization expert. It sits in the existing `learning/gyms/` infrastructure alongside badge, apply, claim_extraction, etc. The gyms framework already handles:
- Corpus management (JSONL append-only)
- Variant generation (model × prompt combinations)
- Calibrated judging (with cache and monotonicity testing)
- HTML reporting
- Results persistence

The missing feedback loop (noted in MEMORY.md) between doctor→learning→gyms gets partially closed by the Review Gym recording user acceptance/rejection signals back to learning.db.

---

## 9. Annotated Document View (Critical Reader Agent)

### Margin Notes

Two-column layout: document body (~70%) on left, critique annotations (~30%) on right. Lines or color bands connect annotations to source text. Like Google Docs comments or LaTeX `\marginpar`.

```html
<div class="annotated-doc">
  <div class="doc-body">
    <p>Our system achieves state-of-the-art results on all benchmarks...</p>
  </div>
  <aside class="margin-notes">
    <div class="note severity-moderate" data-anchor="para-3">
      ⚡ Overclaiming — no benchmark comparison provided.
      Consider: "competitive results"
    </div>
  </aside>
</div>
```

Best for: the review report, where you want to see the document alongside its critiques without modifying the text.

### Inline Annotations (blue highlights)

Critiqued text highlighted in blue with tooltip/popover showing the finding:

```html
<p>Our system achieves <mark class="critique moderate"
  title="Overclaiming — no benchmark comparison">state-of-the-art results</mark>
  on all benchmarks.</p>
```

Best for: the redline/streamline view, where you want to see exactly which words are problematic.

### Rendering Strategy

| View | Style | When |
|------|-------|------|
| **Review report** | Margin notes | Default view — scannable, non-destructive |
| **Redline view** | Inline highlights + strikethrough/additions | When streamlining or applying suggestions |
| **Print view** | Inline only (margins don't print well) | Export/sharing |

All CSS-only for static rendering. Color coding by severity:
- 🔴 Critical: red margin band
- 🟡 Moderate: yellow/amber
- 🔵 Minor: blue
- 🟢 Praise: green (what's working well)

---

## 10. Design Decisions Log

Decisions made during design that could be revisited:

| Decision | Rationale | Revisit if... |
|----------|-----------|---------------|
| Lenses in Python, not YAML | Small set (4), reference existing prompts | Set grows beyond 8 |
| All lenses parallel | Independent concerns | A lens needs output from another lens |
| Per-lens reduce uses debate strategy | Best for cross-model synthesis | Too expensive for fast preset |
| `maxthink` as default preset | Quality matters for review | Users want faster/cheaper default |
| Work directory under `vario/reviews/` | Keep with Vario code | Gets large → move to `~/.vario/reviews/` |
| Single cross-lens synthesis call | Simple, one LLM call | Need more nuanced weighting across lenses |
| CSS inlined for self-contained HTML | Portability, shareable | HTML files get too large (>1MB) |

---

## References

### Key Papers
- Self-Refine (Madaan et al., NeurIPS 2023) — arxiv.org/abs/2303.17651
- CRITIC (Gou et al., ICLR 2024) — arxiv.org/abs/2305.11738
- Reflexion (Shinn et al., NeurIPS 2023) — arxiv.org/abs/2303.11366
- DeepCritic (May 2025) — arxiv.org/abs/2505.00662
- CriticGPT (OpenAI, June 2024) — arxiv.org/abs/2407.00215
- RefCritic (2025) — arxiv.org/html/2507.15024v1
- DREAM (2025) — arxiv.org/html/2602.06526
- FREE-MAD (Sept 2025) — arxiv.org/abs/2509.11035
- DMAD: Breaking Mental Set (2025) — openreview.net/forum?id=t6QHYUOQL7
- Constitutional AI (Anthropic, 2022) — arxiv.org/abs/2212.08073
- LLM-as-Judge Survey (Nov 2024) — arxiv.org/abs/2411.15594
- Agent-as-a-Judge (2025) — arxiv.org/html/2508.02994v1

### Tools & Implementations
- llm-council (Karpathy) — github.com/karpathy/llm-council
- llm-consortium — github.com/irthomasthomas/llm-consortium
- LangGraph reflection — github.com/langchain-ai/langgraph-reflection
- DeepEval — deepeval.com
- GitHub Copilot Code Review (GA April 2025)
