# Design: Unified Judge Framework, Calibration, and Stage Gyms

**Date**: 2026-02-25
**Status**: Approved
**Scope**: `lib/gym/judge.py`, `lib/gym/calibration/`, stage gyms for SemanticNet pipeline

## Problem

10 judge implementations scattered across the codebase, all doing the same thing (LLM scores output quality) with zero calibration:

- **7 use 0-100 + subscores**: sandbox replay, vario eval, badge gym, llm_task gym, kb scenario, vario strategies, learning eval
- **2 use binary verdicts**: pair judge, related-work judge
- **1 statistical (no LLM)**: finance backtest eval

Each reimplements JSON parsing, error handling, caching, concurrency. None have been tested for score distribution, monotonicity, or discrimination power.

4 SemanticNet domains (a16z, Healthy Gamer, VIC, session learning) all extract structured knowledge from unstructured content. None have quality evaluation at any pipeline stage (extract → abstract → cleanup → apply).

## Principles

1. **Calibrate before measuring** — a judge that clusters at 70-80 is useless. Fix the thermometer before taking temperatures.
2. **Diversity over repeats** — 3 different models × 1 run >> 1 model × 3 runs. Repeats are correlated (same biases); cross-model disagreement is where the signal is.
3. **Monotonicity is the fundamental test** — degrade input in known ways, verify score drops. No human labels needed.
4. **Slim by default, escalate on data** — start with the cheapest discriminating judge. Only add models where measured self-agreement is low or monotonicity fails.

## Calibration Findings (2026-02-25)

First calibration run: 15 sandbox samples × 2 models × 5 configs (repeats/temp variations).

### Within-model repeats are useless

| Model | σ across 5 repeats (temp=1.0) | Score range | Discrimination |
|-------|------|------|------|
| **Opus 4.6** | 0.39 (effectively 0) | 20-72 | Good — uses 3 quintiles |
| **Gemini 3.1 Pro** | 1.51 (trivial) | 50-100 | Broken — 12/15 samples score 100 |

Repeats of the same model add zero information. Opus gives identical answers at temp=0.0 and temp=1.0. Gemini varies by ~1.5 points.

### Cross-model disagreement is massive

| Metric | Value |
|--------|-------|
| Mean score diff (same sample) | **46.7 points** |
| Samples within 10 points | **0 / 15** |
| Kendall τ (rank agreement) | **-0.182** (anti-correlated!) |

Opus scores harshly but discriminates (20-72 spread). Gemini gives 100 to nearly everything. They don't even agree on *ranking*.
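The disagreement metrics above reduce to a few lines; a dependency-free sketch (Kendall τ-a here, with ties counting as neither concordant nor discordant — the actual run may have used a library implementation):

```python
from itertools import combinations

def agreement_stats(scores_a: list[float], scores_b: list[float]) -> dict:
    """Cross-model agreement: mean absolute score diff, samples within
    10 points, and Kendall tau-a rank correlation."""
    n = len(scores_a)
    pairs = list(zip(scores_a, scores_b))
    mean_diff = sum(abs(a - b) for a, b in pairs) / n
    within_10 = sum(abs(a - b) <= 10 for a, b in pairs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    tau = (concordant - discordant) / (n * (n - 1) / 2)
    return {"mean_diff": mean_diff, "within_10": within_10, "tau": tau}
```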

### Implications for judge design

1. **Never use repeats as the diversity mechanism** — they're correlated. Use different models.
2. **Screen models for discrimination power** before using them as judges — plot score distribution, reject models that cluster (>60% in a 20-point band).
3. **Monotonicity testing is the first gate** — can the judge tell worse from better? If not, no amount of repeats fixes it.
4. **Start with one discriminating model** (Opus scores 20-72, uses the scale). Add a second model only if monotonicity tests reveal blind spots Opus misses.
5. **Cross-model disagreement is signal** — when two discriminating models disagree on a sample, that sample is genuinely ambiguous or the rubric is underspecified. Don't paper over it with median.
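The screening rule in point 2 (reject models with >60% of scores in a 20-point band) is mechanical enough to pin down; a sketch using the thresholds proposed above:

```python
def clusters(scores: list[float], band: float = 20.0, frac: float = 0.6) -> bool:
    """True if more than `frac` of scores fall inside any `band`-wide
    window — i.e. the model clusters and should be rejected as a judge."""
    s = sorted(scores)
    n = len(s)
    for i, lo in enumerate(s):
        # count scores in the window [lo, lo + band]
        in_band = sum(1 for x in s[i:] if x <= lo + band)
        if in_band / n > frac:
            return True
    return False
```

On the calibration data above, a Gemini-style distribution (12/15 at 100) fails this screen while an Opus-style 20-72 spread passes.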

## Literature Grounding

Key findings from LLM-as-judge research (2024-2026):

| Finding | Source | Implication |
|---------|--------|-------------|
| Scores cluster 60-80 on 0-100 | Multiple papers | Must test distribution before trusting any judge |
| Binary scales: omega >0.989 reliability | "Can You Trust" (Dec 2024) | Prefer binary/3-point over fine-grained when possible |
| Single-shot reliability: 0.167-1.000 | "Can You Trust" (Dec 2024) | Average multiple runs, never trust one score |
| Position bias >10% accuracy shift | "Judging the Judges" (ACLNLP 2025) | Pairwise comparisons need order-swapping |
| Claude most bias-resistant | CALM framework (2024) | Use best model for judging |
| Isotonic regression enforces monotonicity | CJE (Dec 2025) | Mechanical test: worse → lower score |
| CALM's 12-perturbation suite | CALM (2024) | Synthetic degradation is gold standard for calibration |

## Architecture

### File Structure

```
lib/gym/
├── base.py                # existing: GymBase, Candidate, CorpusStore
├── judge.py               # NEW: unified judge classes
├── record.py              # NEW: results → learning.db (future)
├── calibration/           # NEW: judge calibration (first-class concern)
│   ├── README.md          # Methodology, how to run, settled findings
│   ├── LOGBOOK.md         # Calibration journey, runs, threshold decisions
│   ├── perturbations.py   # Synthetic degradation generators
│   ├── monotonicity.py    # Topo-order tests
│   ├── distribution.py    # Score spread analysis
│   └── report.py          # HTML calibration report
```

### `lib/gym/judge.py` — Unified Judge

```python
@dataclass
class JudgeResult:
    score: float              # 0-100 or 0/1
    reason: str
    subscores: dict[str, float] = field(default_factory=dict)
    model: str = ""
    duration_ms: int = 0

@dataclass
class RubricCriterion:
    name: str
    description: str
    weight: float = 1.0

class RubricJudge:
    """0-100 + subscores. The common pattern across 7 existing judges."""

    def __init__(
        self,
        criteria: list[RubricCriterion],
        model: str = "opus",       # calibration: Opus discriminates (20-72); cheap models don't
        system: str = "",
        cache_db: Path | None = None,
    ): ...

    async def score(self, candidate: str, context: dict) -> JudgeResult:
        """Single call. No repeats — calibration shows σ=0 for same model."""
        ...

    async def score_batch(self, items: list[tuple[str, dict]],
                          concurrency: int = 5) -> list[JudgeResult]:
        """Score multiple items with semaphore-bounded concurrency."""
        ...

class BinaryJudge:
    """Verdict-based (repair/not_repair, relevant/irrelevant)."""

    def __init__(
        self,
        verdicts: list[str],
        model: str = "opus",
    ): ...

    async def score(self, candidate: str, context: dict) -> JudgeResult:
        """Single call. Add second model only if cross-model disagreement is high."""
        ...
```

**Shared infrastructure**:
- JSON parsing: strip markdown fencing, handle arrays vs objects, retry on parse failure
- Caching: SQLite hash-based (sha256 of prompt+candidate+model → cached result)
- Cost logging: via `lib/llm/cost_log.py`
- Concurrency: configurable `asyncio.Semaphore`

**Start with one discriminating model** (calibration shows Opus uses 20-72 range, Gemini Pro gives 100 to everything). If monotonicity tests reveal blind spots the primary judge misses, add a second model — but never same-model repeats (σ≈0, zero information gain). Track cross-model disagreement as a signal that rubrics need tightening, not that you need more judges.

### `lib/gym/calibration/` — Judge Calibration

#### Perturbation Types

| Perturbation | What it tests | Method |
|---|---|---|
| **remove_evidence** | Completeness sensitivity | Strip supporting quotes/data from claims |
| **add_fluff** | Precision / noise detection | Insert plausible but irrelevant sentences |
| **vague_ify** | Specificity sensitivity | Replace concrete numbers/names with "some" / "various" |
| **inject_errors** | Accuracy sensitivity | Replace correct facts with plausible wrong ones |
| **scramble_order** | Position bias | Randomize order of claims/sections |
| **duplicate_content** | Dedup sensitivity | Repeat claims verbatim |
| **strip_actionability** | Actionability sensitivity | Keep description, remove "do this" / "use X" |
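Two of the simpler generators can be done mechanically; a sketch (semantic perturbations like `inject_errors` or `add_fluff` would need an LLM):

```python
import re

def vague_ify(text: str) -> str:
    """Replace concrete numbers (and percentages) with 'some' — tests
    whether the judge penalizes lost specificity. Toy regex version;
    names/dates would need more patterns."""
    return re.sub(r"\b\d+(\.\d+)?%?", "some", text)

def duplicate_content(claims: list[str], k: int = 2) -> list[str]:
    """Repeat the first k claims verbatim — tests dedup sensitivity."""
    return claims + claims[:k]
```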

#### Calibration Pipeline

```
Input: judge (model + rubric) + seed corpus (existing sandbox/gym outputs)
    ↓
1. Score originals (N=5 runs each for self-agreement baseline)
    ↓
2. Generate perturbations (each type applied independently)
    ↓
3. Score perturbations (same judge, same N runs)
    ↓
4. Monotonicity check:
   - For each perturbation type: original_score > perturbed_score?
   - Statistical test: paired t-test, p<0.05 across items
   - Effect size: Cohen's d (want >0.5 = medium effect)
    ↓
5. Distribution check:
   - Histogram of all scores — spread across 0-100?
   - If >60% of scores in a 20-point band → rubric needs work
    ↓
6. Self-agreement:
   - Across 5 runs of same item: std dev < 10? Agreement rate >90%?
   - If not → consider different model or binary scale
    ↓
7. Output:
   - Calibration report (HTML): histograms, monotonicity pass/fail, agreement stats
   - LOGBOOK.md entry: what we tried, what worked, threshold decisions
   - Recommended judge config per stage
```
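Step 4's statistics reduce to paired differences; a minimal sketch (the p-value itself would come from `scipy.stats.ttest_rel`, omitted here to keep the snippet dependency-free):

```python
from statistics import mean, stdev

def monotonicity_stats(original: list[float], perturbed: list[float]) -> dict:
    """Paired original-vs-perturbed comparison for one perturbation type.

    cohens_d is the paired effect size (want > 0.5); t is the paired
    t statistic with df = n - 1, to be turned into a p-value by scipy.
    """
    diffs = [o - p for o, p in zip(original, perturbed)]
    n = len(diffs)
    d = mean(diffs) / stdev(diffs)  # paired Cohen's d
    return {
        "cohens_d": d,
        "t": d * n ** 0.5,  # t = d * sqrt(n)
        "dropped_frac": sum(x > 0 for x in diffs) / n,  # perturbed scored lower
    }
```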

#### Success Criteria

| Metric | Threshold | Action if fail |
|--------|-----------|----------------|
| Monotonicity (all perturbation types) | p<0.05, Cohen's d>0.5 | Iterate rubric prompt |
| Score distribution spread | Scores in ≥3 of 5 quintiles | Add anchor examples to prompt |
| Self-agreement (5 runs) | >90% within ±10 points | Switch to binary scale or better model |
| Second-judge agreement (when trialing one) | <85% agreement with first | ≥85% agreement → drop the second (redundant); <85% → keep both |
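The distribution-spread criterion ("scores in ≥3 of 5 quintiles") is worth pinning down, since quintile edges are a common off-by-one:

```python
def quintile_spread(scores: list[float]) -> int:
    """How many of the five 20-point quintiles of 0-100 contain at least
    one score. A score of exactly 100 folds into the top quintile."""
    return len({min(int(s) // 20, 4) for s in scores})
```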

## Stage Gyms (After Calibration)

Each gym uses the calibrated `RubricJudge` or `BinaryJudge`. ~50-100 lines each.

### Priority Order

#### 1. ApplyGym — `/recall` retrieval quality (first)

- **Input**: coding context (file path, user prompt, current task)
- **Generate**: retrieve top-K learnings via different strategies (FTS, semantic, hybrid, re-ranked)
- **Evaluate**: RubricJudge with criteria [relevance, actionability, noise_level]
- **Corpus**: retroactive study episodes (72 labeled), principle_applications table
- **Connects to**: `learn find -s`, `/recall` skill, decision-point hooks
- **Location**: `learning/gyms/apply/`

#### 2. CleanupGym — admission gate accuracy

- **Input**: pairs of candidate learnings + existing principles
- **Generate**: admission gate decisions (link to existing vs create new)
- **Evaluate**: BinaryJudge (correct_link / incorrect_link)
- **Corpus**: existing admission results from 2026-02-24 retroactive cleanup (37→3)
- **Location**: `learning/gyms/cleanup/`

#### 3. AbstractGym — generalization quality

- **Input**: specific observations/claims from any domain
- **Generate**: generalizations at various abstraction levels
- **Evaluate**: RubricJudge [abstraction_level, preserves_specificity, actionable]
- **Corpus**: existing principle↔instance pairs from learning.db
- **Location**: `learning/gyms/abstract/`

#### 4. ExtractGym — claim completeness across domains

- **Input**: documents from any DomainAdapter (VIC writeups, HG transcripts, sessions)
- **Generate**: extracted claims/learnings with different models/prompts
- **Evaluate**: RubricJudge [completeness, specificity, accuracy, false_positives]
- **Corpus**: holdout documents with human-labeled claims (build during calibration)
- **Location**: `learning/gyms/extract/`

## Documentation

| Document | What | When updated |
|----------|------|-------------|
| `lib/gym/calibration/LOGBOOK.md` | Calibration journey: runs, findings, prompt iterations | Every calibration run |
| `lib/gym/calibration/README.md` | Settled methodology, how to run, current judge configs | After calibration converges |
| `lib/gym/README.md` | Judge framework API, how to add stage gyms | After judge.py ships |
| `learning/CLAUDE.md` | Point to gym framework for evals, add ApplyGym usage | After ApplyGym works |

## Execution: Iterative Approach

Don't block on perfect calibration before measuring anything. Generate directional results, spot-check, tighten, repeat.

**Cycle 1** (current):
1. Build `lib/gym/judge.py` — done
2. Build perturbation generators — done
3. Build monotonicity runner (calibration infra)
4. Build ApplyGym with uncalibrated Opus judge — directional results
5. Spot-check HTML: surface 20 most interesting cases for human review
6. Run ApplyGym, review spot-check output

**Cycle 2** (based on Cycle 1 findings):
7. Run monotonicity calibration — iterate rubric based on what spot-check revealed
8. Re-run ApplyGym with calibrated judge — compare to Cycle 1
9. Distribution analysis — verify score spread across 0-100

**Later**:
- Migrate existing gyms (badge, llm_task, sandbox) to shared judge
- Each migration: run both old and new judge on same corpus, confirm correlation >0.9
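The migration gate can be a plain Pearson r over paired old/new scores (Spearman on ranks would also work; `pearson_r` is a hypothetical helper, not an existing one):

```python
def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between old-judge and new-judge scores on the
    same corpus; a migration is accepted when r > 0.9."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```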

## Cost Budget

| Activity | Estimated cost |
|----------|---------------|
| Calibration (5 runs × ~50 items × 7 perturbations × 1 model) | ~$2-5 |
| ApplyGym (72 retroactive episodes × 5 retrieval strategies × 3 judge runs) | ~$3-5 |
| ExtractGym holdout (25 documents × 3 models × 3 judge runs) | ~$2-3 |
| Total initial | ~$7-13 |

## Autonomy: Overnight Architecture Review

An autonomous job that reviews past and upcoming architectural decisions:

- **Past decisions**: scan recent commits/sessions for library choices, pattern commitments, stack decisions
- **Alternative assessment**: for each decision, search for current alternatives, compare trade-offs
- **Upcoming decisions**: for roadmap items, research options and present comparison before the human decides
- **Artifact**: `learning/data/architecture_review_YYYY-MM-DD.md`
- **Feedback**: architectural decisions become instances in learning.db

Day sessions make decisions fast under time pressure. Overnight sessions audit whether those decisions hold up under scrutiny with proper comparison of alternatives.

## Non-Goals

- **Same-model repeats**: calibration shows σ≈0 within-model. Never use repeats as a diversity mechanism.
- **Calibration neural network** (LLM-Rubric approach): overkill for now; revisit if prompt-based calibration plateaus
- **Full sandbox campaign runs** (full-report-v1, ablation-pairs): deferred until judges are calibrated
- **Memory system** (pgvector): separate concern, not in scope
