# Judge Calibration & Stage Gyms — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Unified judge framework with iterative calibration, then stage gyms for SemanticNet pipeline evaluation.

**Architecture:** Extract the common judge pattern into `lib/gym/judge.py`. Get directional results FAST with ApplyGym using an uncalibrated judge. Surface spot-check cases for human review. Then tighten calibration with monotonicity tests. Iterate: measure → spot-check → tighten → repeat.

**Tech Stack:** Python, asyncio, lib/llm (call_llm), SQLite (caching), loguru, pytest

**Design doc:** `docs/plans/2026-02-25-judge-calibration-stage-gyms.md`

**Approach:** Iterative, not blocking. Don't get stuck on perfect calibration before measuring anything. Generate directional results with an adequate judge, surface interesting cases (disagreements, low confidence) for human spot-checking, then improve judges based on what we see.

**Task order (iterative):**
1. `lib/gym/judge.py` — core judge classes (pure code, no API)
2. Perturbation generators (pure code)
3. ApplyGym — run with uncalibrated judge for directional results
4. Spot-check HTML — surface 20 most interesting cases for human review
5. Monotonicity runner — tighten calibration based on spot-check findings
6. Distribution analysis + report
7. Run calibration, iterate rubrics
8. README documentation

---

### Task 1: `lib/gym/judge.py` — Core Judge Classes

**Files:**
- Create: `lib/gym/judge.py`
- Test: `lib/gym/tests/test_judge.py`

**Step 1: Write failing tests**

```python
# lib/gym/tests/test_judge.py
import pytest
from lib.gym.judge import JudgeResult, RubricCriterion, RubricJudge, BinaryJudge, parse_judge_json

def test_judge_result_dataclass():
    r = JudgeResult(score=75.0, reason="good", model="opus")
    assert r.score == 75.0
    assert r.subscores == {}

def test_rubric_criterion():
    c = RubricCriterion(name="relevance", description="Is it relevant?")
    assert c.weight == 1.0

def test_parse_judge_json_clean():
    assert parse_judge_json('{"score": 80, "reason": "ok"}') == {"score": 80, "reason": "ok"}

def test_parse_judge_json_markdown_fenced():
    assert parse_judge_json('```json\n{"score": 80, "reason": "ok"}\n```')["score"] == 80

def test_parse_judge_json_with_preamble():
    text = 'Here is my evaluation:\n```json\n{"score": 60, "reason": "partial"}\n```'
    assert parse_judge_json(text)["score"] == 60

def test_parse_judge_json_array_wraps():
    """Some models return [{"score": 80}] instead of {"score": 80}."""
    assert parse_judge_json('[{"score": 80, "reason": "ok"}]')["score"] == 80

def test_rubric_judge_builds_prompt():
    judge = RubricJudge(
        criteria=[
            RubricCriterion("relevance", "Is it relevant to the context?"),
            RubricCriterion("actionability", "Can you act on it?"),
        ],
        model="opus",
    )
    prompt = judge._build_prompt("candidate text", {"goal": "find bugs"})
    assert "relevance" in prompt
    assert "actionability" in prompt
    assert "candidate text" in prompt
    assert "find bugs" in prompt

def test_binary_judge_builds_prompt():
    judge = BinaryJudge(verdicts=["relevant", "irrelevant"], model="opus")
    prompt = judge._build_prompt("candidate", {"query": "gradio gotchas"})
    assert "relevant" in prompt
    assert "irrelevant" in prompt

@pytest.mark.asyncio
async def test_rubric_judge_score_mock(monkeypatch):
    """Mock call_llm to test scoring flow without API calls."""
    async def mock_llm(**kwargs):
        return '{"score": 75, "reason": "mostly good", "subscores": {"relevance": 80, "actionability": 70}}'

    import lib.gym.judge as jmod
    monkeypatch.setattr(jmod, "call_llm", mock_llm)

    judge = RubricJudge(
        criteria=[RubricCriterion("relevance", "relevant?"), RubricCriterion("actionability", "actionable?")],
        model="opus",
    )
    result = await judge.score("some candidate", {"goal": "test"})
    assert result.score == 75
    assert result.subscores["relevance"] == 80
    assert result.model == "opus"

@pytest.mark.asyncio
async def test_binary_judge_score_mock(monkeypatch):
    async def mock_llm(**kwargs):
        return '{"verdict": "relevant", "confidence": 0.9, "reason": "matches context"}'

    import lib.gym.judge as jmod
    monkeypatch.setattr(jmod, "call_llm", mock_llm)

    judge = BinaryJudge(verdicts=["relevant", "irrelevant"], model="opus")
    result = await judge.score("candidate", {"query": "test"})
    assert result.score == 1.0  # first verdict = 1.0
    assert "relevant" in result.reason
```

**Step 2: Run tests to verify they fail**

Run: `python -m pytest lib/gym/tests/test_judge.py -v`
Expected: FAIL — `ImportError: cannot import name 'JudgeResult' from 'lib.gym.judge'`

**Step 3: Implement `lib/gym/judge.py`**

```python
"""Unified LLM judge framework.

Two judge types:
- RubricJudge: 0-100 + subscores (the pattern 7/10 existing judges use)
- BinaryJudge: verdict-based (repair/not_repair, relevant/irrelevant)

Calibration findings (2026-02-25):
- Same-model repeats add near-zero information (Opus σ≈0.4 at temp=1.0)
- Cross-model diversity is where signal lives (mean diff=46.7 pts)
- Start with one discriminating model, add second only on measured need
"""

from __future__ import annotations

import hashlib
import json
import re
import sqlite3
import time
from dataclasses import dataclass, field
from pathlib import Path

from loguru import logger

from lib.llm import call_llm


@dataclass
class JudgeResult:
    score: float              # 0-100 for rubric, 0.0/1.0 for binary
    reason: str
    subscores: dict[str, float] = field(default_factory=dict)
    model: str = ""
    duration_ms: int = 0
    cached: bool = False


@dataclass
class RubricCriterion:
    name: str
    description: str
    weight: float = 1.0


def parse_judge_json(text: str) -> dict:
    """Parse JSON from LLM response. Handles markdown fencing, preamble, arrays."""
    text = text.strip()
    # Try to extract from markdown fence
    match = re.search(r"```(?:json)?\s*\n?(.*?)\n?```", text, re.DOTALL)
    if match:
        text = match.group(1).strip()
    data = json.loads(text)
    # If array, take first element
    if isinstance(data, list) and data:
        data = data[0]
    return data


class _JudgeCache:
    """SQLite hash-based cache for judge results."""

    def __init__(self, db_path: Path):
        self.db_path = db_path
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self._conn = sqlite3.connect(str(db_path))
        self._conn.execute("""CREATE TABLE IF NOT EXISTS judge_cache (
            cache_key TEXT PRIMARY KEY,
            result_json TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        )""")
        self._conn.commit()

    def get(self, key: str) -> JudgeResult | None:
        row = self._conn.execute(
            "SELECT result_json FROM judge_cache WHERE cache_key = ?", (key,)
        ).fetchone()
        if not row:
            return None
        data = json.loads(row[0])
        return JudgeResult(cached=True, **data)

    def put(self, key: str, result: JudgeResult):
        data = {"score": result.score, "reason": result.reason,
                "subscores": result.subscores, "model": result.model,
                "duration_ms": result.duration_ms}
        self._conn.execute(
            "INSERT OR REPLACE INTO judge_cache (cache_key, result_json) VALUES (?, ?)",
            (key, json.dumps(data)),
        )
        self._conn.commit()

    @staticmethod
    def hash_key(model: str, prompt: str, candidate: str) -> str:
        h = hashlib.sha256(f"{model}|{prompt}|{candidate}".encode()).hexdigest()[:32]
        return h


class RubricJudge:
    """0-100 + subscores judge. Single call per score (repeats are useless)."""

    def __init__(
        self,
        criteria: list[RubricCriterion],
        model: str = "opus",
        system: str = "",
        cache_db: Path | None = None,
    ):
        self.criteria = criteria
        self.model = model
        self.system = system or "You are a precise evaluator. Score outputs against the given criteria."
        self._cache = _JudgeCache(cache_db) if cache_db else None

    def _build_prompt(self, candidate: str, context: dict) -> str:
        criteria_text = "\n".join(
            f"- **{c.name}** (weight {c.weight}): {c.description}"
            for c in self.criteria
        )
        goal = context.get("goal", "Evaluate the quality of this output.")
        return f"""Evaluate this candidate output.

**Goal:** {goal}

**Candidate:**
{candidate[:8000]}

**Score each criterion 0-100, then give an overall weighted score.**

Criteria:
{criteria_text}

Respond in JSON:
{{"score": <int 0-100>, "reason": "<1-2 sentences>", "subscores": {{{", ".join(f'"{c.name}": <int 0-100>' for c in self.criteria)}}}}}"""

    async def score(self, candidate: str, context: dict) -> JudgeResult:
        prompt = self._build_prompt(candidate, context)

        # Check cache
        if self._cache:
            key = _JudgeCache.hash_key(self.model, prompt, candidate)
            cached = self._cache.get(key)
            if cached:
                return cached

        t0 = time.monotonic()
        try:
            raw = await call_llm(
                model=self.model, system=self.system,
                prompt=prompt, temperature=0.0,
            )
            data = parse_judge_json(raw)
            duration_ms = int((time.monotonic() - t0) * 1000)

            result = JudgeResult(
                score=float(data.get("score", 0)),
                reason=data.get("reason", ""),
                subscores={k: float(v) for k, v in data.get("subscores", {}).items()},
                model=self.model,
                duration_ms=duration_ms,
            )
        except Exception as e:
            logger.warning("Judge scoring failed: {}", e)
            result = JudgeResult(
                score=0.0, reason=f"judge_error: {e}", model=self.model,
                duration_ms=int((time.monotonic() - t0) * 1000),
            )

        if self._cache and not result.reason.startswith("judge_error"):
            self._cache.put(key, result)  # don't cache transient judge errors
        return result

    async def score_batch(
        self, items: list[tuple[str, dict]], concurrency: int = 5,
    ) -> list[JudgeResult]:
        import asyncio
        sem = asyncio.Semaphore(concurrency)

        async def _one(candidate: str, context: dict) -> JudgeResult:
            async with sem:
                return await self.score(candidate, context)

        return await asyncio.gather(*[_one(c, ctx) for c, ctx in items])


class BinaryJudge:
    """Verdict-based judge. Returns score=1.0 for first verdict, 0.0 for second."""

    def __init__(
        self,
        verdicts: list[str],
        model: str = "opus",
        system: str = "",
        cache_db: Path | None = None,
    ):
        assert len(verdicts) == 2, "BinaryJudge needs exactly 2 verdicts"
        self.verdicts = verdicts
        self.model = model
        self.system = system or "You are a precise evaluator. Give a clear verdict."
        self._cache = _JudgeCache(cache_db) if cache_db else None

    def _build_prompt(self, candidate: str, context: dict) -> str:
        goal = context.get("goal", context.get("query", "Evaluate this."))
        return f"""Evaluate this candidate.

**Context:** {goal}

**Candidate:**
{candidate[:8000]}

**Verdict:** Choose exactly one: "{self.verdicts[0]}" or "{self.verdicts[1]}"

Respond in JSON:
{{"verdict": "<{self.verdicts[0]} or {self.verdicts[1]}>", "confidence": <0.0-1.0>, "reason": "<1-2 sentences>"}}"""

    async def score(self, candidate: str, context: dict) -> JudgeResult:
        prompt = self._build_prompt(candidate, context)

        if self._cache:
            key = _JudgeCache.hash_key(self.model, prompt, candidate)
            cached = self._cache.get(key)
            if cached:
                return cached

        t0 = time.monotonic()
        try:
            raw = await call_llm(
                model=self.model, system=self.system,
                prompt=prompt, temperature=0.0,
            )
            data = parse_judge_json(raw)
            verdict = data.get("verdict", "").lower().strip()
            score = 1.0 if verdict == self.verdicts[0].lower() else 0.0
            duration_ms = int((time.monotonic() - t0) * 1000)

            result = JudgeResult(
                score=score,
                reason=f"{verdict}: {data.get('reason', '')}",
                subscores={"confidence": float(data.get("confidence", 0.5))},
                model=self.model,
                duration_ms=duration_ms,
            )
        except Exception as e:
            logger.warning("Binary judge failed: {}", e)
            result = JudgeResult(
                score=0.0, reason=f"judge_error: {e}", model=self.model,
                duration_ms=int((time.monotonic() - t0) * 1000),
            )

        if self._cache and not result.reason.startswith("judge_error"):
            self._cache.put(key, result)  # don't cache transient judge errors
        return result
```
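The cache key recipe is worth sanity-checking in isolation: it must be deterministic (so reruns hit the cache) and sensitive to the model (so switching judges doesn't reuse stale scores). A standalone sketch, with `_JudgeCache.hash_key` inlined so it runs without the package:

```python
import hashlib


def hash_key(model: str, prompt: str, candidate: str) -> str:
    # Same recipe as _JudgeCache.hash_key: pipe-joined fields, truncated SHA-256
    return hashlib.sha256(f"{model}|{prompt}|{candidate}".encode()).hexdigest()[:32]


k1 = hash_key("opus", "score this", "candidate A")
k2 = hash_key("opus", "score this", "candidate A")
k3 = hash_key("sonnet", "score this", "candidate A")

assert k1 == k2       # identical inputs -> cache hit
assert k1 != k3       # different model -> separate cache entry
assert len(k1) == 32  # truncated hex digest
```

Because the prompt already embeds the candidate, candidate appears in the key twice; that's harmless, and keeping it explicit guards against prompt templates that truncate the candidate.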

**Step 4: Run tests**

Run: `python -m pytest lib/gym/tests/test_judge.py -v`
Expected: All PASS

**Step 5: Commit**

```bash
git add lib/gym/judge.py lib/gym/tests/test_judge.py
git commit -m "feat(gym): unified judge framework — RubricJudge + BinaryJudge"
```

---

### Task 2: Calibration — Perturbation Generators

**Files:**
- Create: `lib/gym/calibration/__init__.py`
- Create: `lib/gym/calibration/perturbations.py`
- Test: `lib/gym/calibration/tests/__init__.py`
- Test: `lib/gym/calibration/tests/test_perturbations.py`

**Step 1: Write failing tests**

```python
# lib/gym/calibration/tests/test_perturbations.py
import pytest
from lib.gym.calibration.perturbations import (
    remove_evidence, add_fluff, vague_ify, inject_errors,
    scramble_order, duplicate_content, strip_actionability,
    apply_perturbation, PERTURBATION_TYPES,
)

SAMPLE = """The system uses SQLite for storage with 228 active principles.
Retrieval precision is 90.5% on the test set.
Use `learn find -s "query"` to search semantically.
The admission gate threshold is 0.55 for patterns."""

def test_remove_evidence_strips_numbers():
    result = remove_evidence(SAMPLE)
    assert "228" not in result or "90.5%" not in result  # at least some removed
    assert len(result) < len(SAMPLE)

def test_add_fluff_increases_length():
    result = add_fluff(SAMPLE)
    assert len(result) > len(SAMPLE)

def test_vague_ify_removes_specifics():
    result = vague_ify(SAMPLE)
    # Specific numbers should be replaced
    assert "228" not in result or "90.5%" not in result or "0.55" not in result

def test_scramble_order_changes_sequence():
    multiline = "Line A\nLine B\nLine C\nLine D\nLine E"
    result = scramble_order(multiline)
    assert set(result.strip().split("\n")) == set(multiline.strip().split("\n"))

def test_duplicate_content_repeats():
    result = duplicate_content(SAMPLE)
    assert len(result) > len(SAMPLE)

def test_strip_actionability_removes_imperatives():
    text = "Use semantic search. Run `learn find -s query`. Always verify results."
    result = strip_actionability(text)
    assert "Use " not in result or "Run " not in result or "Always " not in result

def test_apply_perturbation_all_types():
    for ptype in PERTURBATION_TYPES:
        result = apply_perturbation(SAMPLE, ptype)
        assert isinstance(result, str)
        assert len(result) > 0

def test_perturbation_types_complete():
    expected = {"remove_evidence", "add_fluff", "vague_ify", "inject_errors",
                "scramble_order", "duplicate_content", "strip_actionability"}
    assert set(PERTURBATION_TYPES.keys()) == expected
```

**Step 2: Run tests to verify they fail**

Run: `python -m pytest lib/gym/calibration/tests/test_perturbations.py -v`

**Step 3: Implement perturbations**

```python
# lib/gym/calibration/perturbations.py
"""Synthetic perturbation generators for judge calibration.

Each function takes text and returns a degraded version.
If the judge is calibrated, the degraded version should score lower.

Inspired by CALM framework (2024) perturbation suite.
"""

from __future__ import annotations

import random
import re
from collections.abc import Callable


def remove_evidence(text: str, *, seed: int = 42) -> str:
    """Strip supporting data: numbers, percentages, quoted code, URLs."""
    rng = random.Random(seed)
    lines = text.split("\n")
    result = []
    for line in lines:
        # Remove lines with numbers/percentages ~50% of the time
        if re.search(r'\d+\.?\d*%?', line) and rng.random() < 0.5:
            continue
        # Remove backtick-quoted content
        line = re.sub(r'`[^`]+`', '[removed]', line)
        result.append(line)
    return "\n".join(result)


def add_fluff(text: str, *, seed: int = 42) -> str:
    """Insert plausible but irrelevant filler sentences."""
    rng = random.Random(seed)
    fillers = [
        "This is an important consideration for the overall system architecture.",
        "It should be noted that many factors contribute to the final outcome.",
        "The implications of this are worth considering in a broader context.",
        "Additionally, there are several other aspects that could be explored.",
        "This approach has both advantages and disadvantages depending on the use case.",
        "Further investigation may reveal additional insights about this topic.",
        "The relationship between these components is complex and multifaceted.",
    ]
    lines = text.split("\n")
    result = []
    for line in lines:
        result.append(line)
        if rng.random() < 0.3:
            result.append(rng.choice(fillers))
    return "\n".join(result)


def vague_ify(text: str, *, seed: int = 42) -> str:
    """Replace specific numbers/names with vague language."""
    result = text
    # Replace specific numbers — decimals before bare integers, so "0.55"
    # becomes "a certain value" rather than "0.several"
    result = re.sub(r'\b\d+\.?\d*%', 'some percentage', result)
    result = re.sub(r'\b\d+\.\d+\b', 'a certain value', result)
    result = re.sub(r'\b\d{2,}\b', 'several', result)
    # Replace specific tool/function names with generic
    result = re.sub(r'`[^`]+`', 'the relevant tool', result)
    return result


def inject_errors(text: str, *, seed: int = 42) -> str:
    """Replace correct facts with plausible wrong ones."""
    rng = random.Random(seed)
    lines = text.split("\n")
    result = []
    for line in lines:
        if rng.random() < 0.3 and re.search(r'\d+', line):
            # Multiply/divide numbers randomly
            def mangle(m):
                n = float(m.group())
                return str(round(n * rng.choice([0.1, 0.5, 2.0, 10.0]), 1))
            line = re.sub(r'\b(\d+\.?\d*)\b', mangle, line, count=1)
        result.append(line)
    return "\n".join(result)


def scramble_order(text: str, *, seed: int = 42) -> str:
    """Randomize order of lines/paragraphs."""
    rng = random.Random(seed)
    # Split by double-newline (paragraphs) or single newline
    if "\n\n" in text:
        parts = text.split("\n\n")
    else:
        parts = text.split("\n")
    rng.shuffle(parts)
    sep = "\n\n" if "\n\n" in text else "\n"
    return sep.join(parts)


def duplicate_content(text: str, *, seed: int = 42) -> str:
    """Repeat some content verbatim."""
    rng = random.Random(seed)
    lines = text.split("\n")
    result = []
    for line in lines:
        result.append(line)
        if rng.random() < 0.25 and line.strip():
            result.append(line)  # duplicate
    return "\n".join(result)


def strip_actionability(text: str, *, seed: int = 42) -> str:
    """Remove imperative sentences (Use X, Run Y, Always Z)."""
    lines = text.split("\n")
    result = []
    for line in lines:
        stripped = line.lstrip("- •*").strip()
        if re.match(r'^(Use |Run |Always |Never |Add |Set |Configure |Call |Check )', stripped):
            continue
        result.append(line)
    return "\n".join(result)


PERTURBATION_TYPES: dict[str, Callable[..., str]] = {
    "remove_evidence": remove_evidence,
    "add_fluff": add_fluff,
    "vague_ify": vague_ify,
    "inject_errors": inject_errors,
    "scramble_order": scramble_order,
    "duplicate_content": duplicate_content,
    "strip_actionability": strip_actionability,
}


def apply_perturbation(text: str, perturbation_type: str, *, seed: int = 42) -> str:
    """Apply a named perturbation to text."""
    fn = PERTURBATION_TYPES[perturbation_type]
    return fn(text, seed=seed)
```
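One property worth noting: every generator threads a `seed` through `random.Random`, so a given (text, perturbation, seed) triple always yields the same output — which is what makes perturbed candidates safe to judge through the hash-keyed cache. A minimal sketch of the same pattern `scramble_order` uses:

```python
import random


def scramble(text: str, *, seed: int = 42) -> str:
    # A fresh Random(seed) per call -> identical output for identical inputs
    parts = text.split("\n")
    random.Random(seed).shuffle(parts)
    return "\n".join(parts)


sample = "Line A\nLine B\nLine C\nLine D"
out = scramble(sample)

assert out == scramble(sample)                                # deterministic
assert sorted(out.split("\n")) == sorted(sample.split("\n"))  # content preserved
```

Using module-level `random.shuffle` instead would make perturbations depend on global RNG state and break cache reuse across runs.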

**Step 4: Run tests**

Run: `python -m pytest lib/gym/calibration/tests/test_perturbations.py -v`

**Step 5: Commit**

```bash
git add lib/gym/calibration/
git commit -m "feat(calibration): synthetic perturbation generators for judge monotonicity testing"
```

---

### Task 3: Calibration — Monotonicity Runner

**Files:**
- Create: `lib/gym/calibration/monotonicity.py`
- Create: `lib/gym/calibration/LOGBOOK.md`
- Test: `lib/gym/calibration/tests/test_monotonicity.py`

**Step 1: Write failing test**

```python
# lib/gym/calibration/tests/test_monotonicity.py
import pytest
from lib.gym.calibration.monotonicity import MonotonicityResult, check_monotonicity

def test_monotonicity_result_passes():
    r = MonotonicityResult(
        perturbation_type="add_fluff",
        original_scores=[80, 75, 82],
        perturbed_scores=[60, 55, 65],
    )
    assert r.passes  # originals > perturbed
    assert r.mean_drop > 0
    assert r.effect_size > 0

def test_monotonicity_result_fails():
    r = MonotonicityResult(
        perturbation_type="add_fluff",
        original_scores=[80, 75, 82],
        perturbed_scores=[85, 80, 90],  # perturbed scored HIGHER
    )
    assert not r.passes

def test_check_monotonicity_structure():
    """Verify the output structure (actual API calls tested in integration)."""
    # Just test the result aggregation, not the LLM calls
    from lib.gym.calibration.monotonicity import _aggregate_results
    results = [
        MonotonicityResult("remove_evidence", [80, 70], [50, 40]),
        MonotonicityResult("add_fluff", [80, 70], [75, 68]),
    ]
    summary = _aggregate_results(results)
    assert summary["total_types"] == 2
    assert summary["passing_types"] == 2  # both examples drop with effect size > 0.5
    assert "add_fluff" in summary["per_type"]
```

**Step 2: Run to verify fail**

**Step 3: Implement monotonicity runner**

```python
# lib/gym/calibration/monotonicity.py
"""Monotonicity testing: degrade inputs, verify scores drop.

The fundamental calibration test. If you make something worse and the
judge doesn't notice, the judge is broken.
"""

from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from statistics import mean

from loguru import logger

from lib.gym.judge import RubricJudge, JudgeResult
from lib.gym.calibration.perturbations import PERTURBATION_TYPES, apply_perturbation


@dataclass
class MonotonicityResult:
    perturbation_type: str
    original_scores: list[float]
    perturbed_scores: list[float]
    sample_ids: list[str] = field(default_factory=list)

    @property
    def mean_drop(self) -> float:
        if not self.original_scores:
            return 0.0
        drops = [o - p for o, p in zip(self.original_scores, self.perturbed_scores)]
        return mean(drops)

    @property
    def effect_size(self) -> float:
        """Cohen's d — want >0.5 for meaningful discrimination."""
        if len(self.original_scores) < 2:
            return 0.0
        from statistics import stdev
        drops = [o - p for o, p in zip(self.original_scores, self.perturbed_scores)]
        pooled_std = stdev(self.original_scores + self.perturbed_scores)
        if pooled_std == 0:
            return float("inf") if self.mean_drop > 0 else 0.0
        return self.mean_drop / pooled_std

    @property
    def passes(self) -> bool:
        """Monotonicity holds if mean drop > 0 and effect size > 0.5."""
        return self.mean_drop > 0 and self.effect_size > 0.5

    @property
    def pct_correct(self) -> float:
        """% of samples where original > perturbed."""
        if not self.original_scores:
            return 0.0
        correct = sum(1 for o, p in zip(self.original_scores, self.perturbed_scores) if o > p)
        return correct / len(self.original_scores)


def _aggregate_results(results: list[MonotonicityResult]) -> dict:
    return {
        "total_types": len(results),
        "passing_types": sum(1 for r in results if r.passes),
        "per_type": {
            r.perturbation_type: {
                "passes": r.passes,
                "mean_drop": round(r.mean_drop, 1),
                "effect_size": round(r.effect_size, 2),
                "pct_correct": round(r.pct_correct * 100, 1),
                "n_samples": len(r.original_scores),
            }
            for r in results
        },
    }


async def check_monotonicity(
    judge: RubricJudge,
    corpus: list[dict],
    *,
    perturbation_types: list[str] | None = None,
    concurrency: int = 4,
) -> dict:
    """Run monotonicity tests on a corpus.

    Args:
        judge: The judge to calibrate.
        corpus: List of dicts with "candidate" and "context" keys.
        perturbation_types: Which perturbations to test (default: all).
        concurrency: Max concurrent judge calls.

    Returns:
        Summary dict with per-type pass/fail and aggregate stats.
    """
    types_to_test = perturbation_types or list(PERTURBATION_TYPES.keys())
    sem = asyncio.Semaphore(concurrency)

    # 1. Score originals
    logger.info("Scoring {} original samples", len(corpus))

    async def _score(candidate: str, context: dict) -> JudgeResult:
        async with sem:
            return await judge.score(candidate, context)

    original_results = await asyncio.gather(
        *[_score(item["candidate"], item["context"]) for item in corpus]
    )

    # 2. For each perturbation type, score perturbed versions
    monotonicity_results = []
    for ptype in types_to_test:
        logger.info("Testing perturbation: {}", ptype)
        perturbed_candidates = [
            apply_perturbation(item["candidate"], ptype) for item in corpus
        ]
        perturbed_results = await asyncio.gather(
            *[_score(pc, item["context"]) for pc, item in zip(perturbed_candidates, corpus)]
        )

        mr = MonotonicityResult(
            perturbation_type=ptype,
            original_scores=[r.score for r in original_results],
            perturbed_scores=[r.score for r in perturbed_results],
            sample_ids=[item.get("id", str(i)) for i, item in enumerate(corpus)],
        )
        monotonicity_results.append(mr)
        status = "PASS" if mr.passes else "FAIL"
        logger.info(
            "  {} — mean_drop={:.1f}, effect_size={:.2f}, pct_correct={:.0f}%",
            status, mr.mean_drop, mr.effect_size, mr.pct_correct * 100,
        )

    return _aggregate_results(monotonicity_results)
```
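The `effect_size` math is easy to check by hand. A worked example using the same numbers as the first test case (note the denominator is the stdev of all scores combined, a rough stand-in for a true pooled SD):

```python
from statistics import mean, stdev

original = [80.0, 75.0, 82.0]
perturbed = [60.0, 55.0, 65.0]

drops = [o - p for o, p in zip(original, perturbed)]   # [20, 20, 17]
mean_drop = mean(drops)                                # 19.0
combined_sd = stdev(original + perturbed)              # stdev of all six scores, ~11.1
d = mean_drop / combined_sd                            # ~1.7, well above the 0.5 gate

assert mean_drop == 19.0
assert d > 0.5
```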

**Step 4: Run tests**

Run: `python -m pytest lib/gym/calibration/tests/test_monotonicity.py -v`

**Step 5: Create LOGBOOK.md scaffold**

```markdown
# Judge Calibration — LOGBOOK

Track calibration runs, findings, and threshold decisions.

## 2026-02-25 — First Calibration (repeats vs diversity)

**Setup**: 15 sandbox samples × 2 models (Opus 4.6, Gemini 3.1 Pro) × 5 configs

**Key findings**:
- Within-model repeats: useless (Opus σ=0.39, Gemini σ=1.51)
- Cross-model disagreement: massive (mean diff=46.7, Kendall τ=-0.182)
- Gemini 3.1 Pro: broken thermometer (12/15 samples = 100)
- Opus: discriminates well (range 20-72, 3 quintiles)

**Decision**: Use Opus as primary judge. No repeats. Diversity comes from different models.

Raw data: `learning/session_review/data/judge_calibration/`

---

(Future entries go here as monotonicity tests are run)
```

**Step 6: Commit**

```bash
git add lib/gym/calibration/
git commit -m "feat(calibration): monotonicity runner + LOGBOOK with first findings"
```

---

### Task 4: Calibration — Distribution Analysis & HTML Report

**Files:**
- Create: `lib/gym/calibration/distribution.py`
- Create: `lib/gym/calibration/report.py`
- Create: `lib/gym/calibration/__main__.py` (CLI entry point)

**Step 1: Implement distribution analysis**

```python
# lib/gym/calibration/distribution.py
"""Score distribution analysis — detect clustering, check spread."""

from __future__ import annotations
from dataclasses import dataclass


@dataclass
class DistributionReport:
    scores: list[float]
    model: str

    @property
    def quintile_counts(self) -> dict[str, int]:
        """Count scores per quintile (0-19, 20-39, ..., 80-100)."""
        bins = {f"{i}-{i+19}": 0 for i in range(0, 100, 20)}
        for s in self.scores:
            b = min(int(s) // 20 * 20, 80)
            bins[f"{b}-{b+19}"] += 1
        return bins

    @property
    def quintiles_used(self) -> int:
        return sum(1 for v in self.quintile_counts.values() if v > 0)

    @property
    def clustered(self) -> bool:
        """True if >60% of scores fall in a single 20-point band."""
        total = len(self.scores)
        if total == 0:
            return True
        return any(v / total > 0.6 for v in self.quintile_counts.values())

    @property
    def discriminates(self) -> bool:
        """Good: scores spread across ≥3 quintiles and not clustered."""
        return self.quintiles_used >= 3 and not self.clustered
```
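The binning arithmetic can be verified by hand; the `min(..., 80)` clamp is what keeps a perfect score of 100 from creating a phantom sixth bin:

```python
scores = [5.0, 22.0, 22.0, 47.0, 91.0, 100.0]
bins = {f"{i}-{i+19}": 0 for i in range(0, 100, 20)}
for s in scores:
    b = min(int(s) // 20 * 20, 80)  # clamp so 100 lands in the 80-99 bin
    bins[f"{b}-{b+19}"] += 1

assert bins == {"0-19": 1, "20-39": 2, "40-59": 1, "60-79": 0, "80-99": 2}
assert sum(1 for v in bins.values() if v > 0) == 4  # quintiles_used would be 4
```

With six scores and a max single-bin share of 2/6 ≈ 33%, this example would also report `clustered == False` and `discriminates == True`.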

**Step 2: Implement HTML report**

```python
# lib/gym/calibration/report.py
"""HTML calibration report — histograms, monotonicity, distribution."""

from __future__ import annotations
from datetime import datetime, timezone
from lib.gym.calibration.distribution import DistributionReport
from lib.gym.calibration.monotonicity import MonotonicityResult


def generate_calibration_report(
    distribution: DistributionReport,
    monotonicity: dict,
    model: str,
    corpus_size: int,
) -> str:
    """Generate HTML calibration report."""
    now = datetime.now(tz=timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    dist = distribution

    mono_rows = ""
    for ptype, stats in monotonicity.get("per_type", {}).items():
        status_class = "good" if stats["passes"] else "bad"
        mono_rows += f"""<tr>
            <td>{ptype}</td>
            <td class="{status_class}">{"PASS" if stats["passes"] else "FAIL"}</td>
            <td>{stats["mean_drop"]}</td>
            <td>{stats["effect_size"]}</td>
            <td>{stats["pct_correct"]}%</td>
        </tr>"""

    quintile_bars = ""
    for q, count in dist.quintile_counts.items():
        pct = count / len(dist.scores) * 100 if dist.scores else 0
        quintile_bars += f'<div style="display:flex;align-items:center;gap:8px;margin:2px 0"><span style="width:50px">{q}</span><div style="background:#2563eb;height:20px;width:{pct*3}px"></div><span>{count}</span></div>'

    return f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Judge Calibration Report</title>
<style>
body {{ font-family: -apple-system, system-ui, sans-serif; max-width: 900px; margin: 2rem auto; padding: 0 1rem; }}
h1 {{ border-bottom: 3px solid #333; padding-bottom: 0.5rem; }}
table {{ border-collapse: collapse; width: 100%; margin: 1rem 0; }}
th, td {{ border: 1px solid #ddd; padding: 8px 12px; text-align: left; }}
th {{ background: #f0f0f0; }}
.good {{ color: #16a34a; font-weight: bold; }}
.bad {{ color: #dc2626; font-weight: bold; }}
.card {{ background: white; border: 1px solid #ddd; border-radius: 8px; padding: 1.5rem; margin: 1rem 0; }}
</style></head><body>
<h1>Judge Calibration Report</h1>
<p>Generated: {now} | Model: <code>{model}</code> | Corpus: {corpus_size} samples</p>

<h2>Score Distribution</h2>
<div class="card">
<p>Quintiles used: <strong>{dist.quintiles_used}/5</strong>
 | Clustered: <strong class="{"bad" if dist.clustered else "good"}">{"YES" if dist.clustered else "NO"}</strong>
 | Discriminates: <strong class="{"good" if dist.discriminates else "bad"}">{"YES" if dist.discriminates else "NO"}</strong></p>
{quintile_bars}
</div>

<h2>Monotonicity Tests</h2>
<p>Passing: {monotonicity.get("passing_types", 0)}/{monotonicity.get("total_types", 0)}</p>
<table>
<tr><th>Perturbation</th><th>Status</th><th>Mean Drop</th><th>Effect Size (d)</th><th>% Correct</th></tr>
{mono_rows}
</table>

<h2>Interpretation</h2>
<div class="card">
<p><strong>Distribution:</strong> {"Judge discriminates well — scores spread across multiple quality levels." if dist.discriminates else "Judge clusters scores — consider adding anchor examples or switching to binary scale."}</p>
<p><strong>Monotonicity:</strong> {monotonicity.get("passing_types", 0)} of {monotonicity.get("total_types", 0)} perturbation types produce statistically significant score drops.</p>
</div>
</body></html>"""
```

**Step 3: Create CLI entry point**

```python
# lib/gym/calibration/__main__.py
"""Run judge calibration.

Usage:
    python -m lib.gym.calibration --model opus --corpus sandbox
    python -m lib.gym.calibration --model opus --corpus sandbox --perturbations remove_evidence,add_fluff
"""

import asyncio
import json
import sys
from pathlib import Path

import click
from loguru import logger

logger.remove()
logger.add(sys.stderr, format="{time:HH:mm:ss} | {level:<7} | {message}")


def _load_sandbox_corpus(limit: int = 20) -> list[dict]:
    """Load corpus from sandbox_results.db."""
    import sqlite3
    db = Path(__file__).resolve().parent.parent.parent.parent / "learning" / "session_review" / "data" / "sandbox_results.db"
    conn = sqlite3.connect(str(db))
    conn.row_factory = sqlite3.Row
    rows = conn.execute("""
        SELECT id, prompt, raw_messages FROM sandbox_runs
        WHERE raw_messages IS NOT NULL AND error IS NULL
        ORDER BY id LIMIT ?
    """, (limit,)).fetchall()
    conn.close()

    corpus = []
    for row in rows:
        raw = json.loads(row["raw_messages"])
        result_text = ""
        for entry in reversed(raw):
            if isinstance(entry, dict) and entry.get("type") == "result":
                result_text = entry.get("result", "")
                break
            if isinstance(entry, dict) and entry.get("role") == "assistant":
                content = entry.get("content", "")
                if isinstance(content, str) and len(content) > 50:
                    result_text = content[:4000]
                    break
        if result_text:
            corpus.append({
                "id": str(row["id"]),
                "candidate": result_text[:4000],
                "context": {"goal": row["prompt"]},
            })
    return corpus


@click.command()
@click.option("--model", default="opus", help="Judge model")
@click.option("--corpus", default="sandbox", type=click.Choice(["sandbox"]))
@click.option("--limit", default=20, help="Max corpus samples")
@click.option("--perturbations", default=None, help="Comma-separated perturbation types")
@click.option("--output-dir", default=None, help="Output directory for report")
def main(model, corpus, limit, perturbations, output_dir):
    """Run judge calibration: distribution + monotonicity tests."""
    asyncio.run(_main(model, corpus, limit, perturbations, output_dir))


async def _main(model, corpus_type, limit, perturbations_str, output_dir_str):
    from lib.gym.judge import RubricJudge, RubricCriterion
    from lib.gym.calibration.monotonicity import check_monotonicity
    from lib.gym.calibration.distribution import DistributionReport
    from lib.gym.calibration.report import generate_calibration_report

    # Load corpus
    if corpus_type == "sandbox":
        corpus = _load_sandbox_corpus(limit)
    logger.info(f"Loaded {len(corpus)} samples from {corpus_type}")

    # Build judge with generic task-completion rubric
    judge = RubricJudge(
        criteria=[
            RubricCriterion("completeness", "Did it address all parts of the task?"),
            RubricCriterion("accuracy", "Are the facts and code correct?"),
            RubricCriterion("specificity", "Does it give concrete, actionable detail vs vague statements?"),
        ],
        model=model,
    )

    # Parse perturbation types
    ptypes = perturbations_str.split(",") if perturbations_str else None

    # Run monotonicity
    mono_results = await check_monotonicity(judge, corpus, perturbation_types=ptypes)

    # Distribution from original scores. They're already computed inside
    # check_monotonicity; until it returns them, re-score here.
    # TODO: refactor check_monotonicity to return original scores
    dist_scores = []
    for item in corpus:
        result = await judge.score(item["candidate"], item["context"])
        dist_scores.append(result.score)

    dist = DistributionReport(scores=dist_scores, model=model)

    # Report
    out_dir = Path(output_dir_str) if output_dir_str else Path(__file__).parent / "reports"
    out_dir.mkdir(parents=True, exist_ok=True)

    html = generate_calibration_report(dist, mono_results, model, len(corpus))
    report_path = out_dir / f"calibration_{model}_{len(corpus)}samples.html"
    report_path.write_text(html)

    # JSON data
    data_path = out_dir / f"calibration_{model}_{len(corpus)}samples.json"
    data_path.write_text(json.dumps({"distribution": dist.quintile_counts,
                                      "monotonicity": mono_results}, indent=2))

    # Summary
    print(f"\nModel: {model} | Samples: {len(corpus)}")
    print(f"Distribution: {'GOOD' if dist.discriminates else 'CLUSTERED'} "
          f"({dist.quintiles_used}/5 quintiles)")
    print(f"Monotonicity: {mono_results['passing_types']}/{mono_results['total_types']} passing")
    for ptype, stats in mono_results["per_type"].items():
        status = "PASS" if stats["passes"] else "FAIL"
        print(f"  {ptype:25s} {status}  drop={stats['mean_drop']:>5.1f}  d={stats['effect_size']:>4.2f}")
    print(f"\nReport: {report_path}")


if __name__ == "__main__":
    main()
```

**Step 4: Commit**

```bash
git add lib/gym/calibration/
git commit -m "feat(calibration): distribution analysis, HTML report, CLI runner"
```

---

### Task 5: Run First Monotonicity Calibration

**This is an execution task, not a coding task.**

**Step 1: Run calibration with Opus on sandbox corpus**

Run: `python -m lib.gym.calibration --model opus --limit 20`

Expected: Report showing per-perturbation monotonicity pass/fail + distribution spread.

**Step 2: Review results**

- Check distribution: does Opus spread across ≥3 quintiles? (Previous data says yes: scores ranged 20-72.)
- Check monotonicity: does each perturbation type cause a significant score drop?
- If any perturbation fails: note which one and why in LOGBOOK.md
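
The effect-size column is Cohen's d: mean score drop over pooled standard deviation. A quick reference implementation for sanity-checking the report numbers (the exact pooling used inside `check_monotonicity` is an assumption here):

```python
import math
import statistics

def cohens_d(original: list[float], perturbed: list[float]) -> float:
    """Effect size of a score drop: mean difference over pooled std dev."""
    mean_drop = statistics.mean(original) - statistics.mean(perturbed)
    pooled = math.sqrt(
        (statistics.variance(original) + statistics.variance(perturbed)) / 2
    )
    return mean_drop / pooled if pooled else 0.0

# d >= 0.8 is conventionally a "large" effect; a perturbation that
# reliably hurts scores should land well above that.
print(round(cohens_d([70, 75, 80, 72], [50, 55, 60, 52]), 2))
```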

**Step 3: Update LOGBOOK.md with findings**

Add entry with: perturbation results, which passed/failed, any rubric adjustments needed.

**Step 4: Commit results**

```bash
git add lib/gym/calibration/LOGBOOK.md lib/gym/calibration/reports/
git commit -m "docs(calibration): first monotonicity run results"
```

---

### Task 6: ApplyGym — `/recall` Retrieval Quality

**Files:**
- Create: `learning/gyms/apply/__init__.py`
- Create: `learning/gyms/apply/gym.py`
- Create: `learning/gyms/apply/config.yaml`
- Test: `learning/gyms/apply/tests/__init__.py`
- Test: `learning/gyms/apply/tests/test_apply_gym.py`

**Step 1: Write failing test**

```python
# learning/gyms/apply/tests/test_apply_gym.py
import pytest
from learning.gyms.apply.gym import ApplyGym, build_corpus_from_retroactive_study


def test_build_corpus_returns_dicts():
    """Corpus items have required keys."""
    corpus = build_corpus_from_retroactive_study(limit=5)
    assert len(corpus) > 0
    for item in corpus:
        assert "context_snippet" in item
        assert "principle_id" in item
        assert "outcome" in item  # success, failure, partial


def test_apply_gym_init():
    from pathlib import Path
    gym = ApplyGym(gym_dir=Path("learning/gyms/apply"))
    assert gym.config  # loaded from config.yaml
```

**Step 2: Implement ApplyGym**

```python
# learning/gyms/apply/gym.py
"""ApplyGym — evaluate /recall retrieval quality.

Given a coding context, does retrieval surface the right principles?
Gold corpus: retroactive study episodes with labeled principle applications.

Usage:
    python -m learning.gyms.apply.gym --limit 50
    python -m learning.gyms.apply.gym --limit 50 --report
"""

from __future__ import annotations

import asyncio
import json
import sqlite3
import sys
from pathlib import Path

import click
from loguru import logger

sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent.parent))

from lib.gym.base import GymBase, Candidate
from lib.gym.judge import RubricJudge, RubricCriterion, JudgeResult

LEARNING_DB = Path(__file__).resolve().parent.parent.parent / "data" / "learning.db"
GYM_DIR = Path(__file__).parent


def build_corpus_from_retroactive_study(
    limit: int | None = None,
    outcomes: list[str] | None = None,
) -> list[dict]:
    """Load labeled principle applications from retroactive study.

    Each item has a context_snippet (what was happening) and the
    principle_id that was labeled as relevant (success/failure/partial).
    """
    conn = sqlite3.connect(str(LEARNING_DB))
    conn.row_factory = sqlite3.Row
    outcome_filter = outcomes or ["success", "failure", "partial"]
    placeholders = ",".join("?" * len(outcome_filter))
    query = f"""
        SELECT pa.principle_id, pa.context_snippet, pa.outcome, pa.outcome_notes,
               p.slug, p.full_text
        FROM principle_applications pa
        JOIN principles p ON p.id = pa.principle_id
        WHERE pa.recorded_by IN ('retroactive_study', 'retroactive_study_v2')
          AND pa.outcome IN ({placeholders})
          AND pa.context_snippet IS NOT NULL
          AND length(pa.context_snippet) > 50
        ORDER BY pa.id
    """
    params = list(outcome_filter)
    if limit:
        query += " LIMIT ?"
        params.append(limit)
    rows = conn.execute(query, params).fetchall()
    conn.close()

    return [
        {
            "principle_id": row["principle_id"],
            "principle_slug": row["slug"],
            "principle_text": row["full_text"] or "",
            "context_snippet": row["context_snippet"],
            "outcome": row["outcome"],
            "outcome_notes": row["outcome_notes"] or "",
        }
        for row in rows
    ]


class ApplyGym(GymBase):
    """Evaluate retrieval quality: given context, does recall surface the right principle?"""

    def __init__(self, gym_dir: Path | None = None):
        super().__init__(gym_dir=gym_dir or GYM_DIR)

    async def generate(self, input_data, feedback=None) -> list[Candidate]:
        """Retrieve principles for a context snippet using learn find."""
        from learning.schema.learning_store import LearningStore

        store = LearningStore(LEARNING_DB)
        context = input_data["context_snippet"]
        target_principle = input_data["principle_slug"]

        # Retrieve top-10 via semantic search
        results = store.search_semantic(context, limit=10)
        store.close()

        candidates = []
        for i, r in enumerate(results):
            candidates.append(Candidate(
                id=f"{target_principle}_{i}",
                content=json.dumps({
                    "retrieved_slug": r.get("slug", ""),
                    "retrieved_text": r.get("full_text", r.get("text", ""))[:500],
                    "similarity": r.get("score", 0),
                    "rank": i + 1,
                }),
                variant="semantic_search",
                metadata={
                    "target_principle": target_principle,
                    "context": context[:200],
                    "rank": i + 1,
                },
            ))
        return candidates

    async def evaluate(self, candidates: list[Candidate]) -> list[Candidate]:
        """Check if the target principle appears in retrieved results."""
        # Simple rank-based scoring (no LLM needed for this metric)
        for c in candidates:
            data = json.loads(c.content)
            target = c.metadata.get("target_principle", "")
            retrieved = data.get("retrieved_slug", "")
            # Loose slug match: treat prefix/suffix overlaps as hits.
            if target in retrieved or retrieved in target:
                # Found it — score by rank (rank 1 scores highest)
                c.score = float(max(0, 100 - (data["rank"] - 1) * 10))
                c.reason = f"Target principle found at rank {data['rank']}"
            else:
                c.score = 0.0
                c.reason = f"Retrieved {retrieved}, wanted {target}"
        return candidates


@click.command()
@click.option("--limit", default=50, help="Max corpus items")
@click.option("--report", is_flag=True, help="Generate HTML report")
def main(limit, report):
    asyncio.run(_main(limit, report))


async def _main(limit, report):
    corpus = build_corpus_from_retroactive_study(limit=limit)
    logger.info(f"Loaded {len(corpus)} labeled applications")

    gym = ApplyGym()
    hits_at_1 = 0
    hits_at_5 = 0
    hits_at_10 = 0
    total = 0

    for item in corpus:
        candidates = await gym.generate(item)
        candidates = await gym.evaluate(candidates)

        hits = [c for c in candidates if c.score > 0]
        if hits:
            best_rank = min(c.metadata["rank"] for c in hits)
            if best_rank <= 1:
                hits_at_1 += 1
            if best_rank <= 5:
                hits_at_5 += 1
            if best_rank <= 10:
                hits_at_10 += 1
        total += 1

    print(f"\nRetrieval Quality ({total} queries)")
    print(f"  P@1:  {hits_at_1/total:.1%}")
    print(f"  P@5:  {hits_at_5/total:.1%}")
    print(f"  P@10: {hits_at_10/total:.1%}")


if __name__ == "__main__":
    main()
```
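
The rank-based scoring in `evaluate` maps a hit at rank r to `100 - (r - 1) * 10`, floored at 0, so rank 1 scores 100 and anything past rank 10 scores 0. A quick illustration of the mapping:

```python
def rank_score(rank: int) -> float:
    # Mirrors ApplyGym.evaluate: rank 1 -> 100, rank 10 -> 10, rank 11+ -> 0.
    return float(max(0, 100 - (rank - 1) * 10))

print([rank_score(r) for r in (1, 2, 5, 10, 11)])  # → [100.0, 90.0, 60.0, 10.0, 0.0]
```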

**Step 3: Create config.yaml**

```yaml
# learning/gyms/apply/config.yaml
name: apply
description: Evaluate /recall retrieval quality

corpus:
  source: retroactive_study
  outcomes: [success, failure, partial]
  min_context_length: 50

retrieval_strategies:
  - name: semantic_search
    method: search_semantic
    limit: 10
  # Future: fts, hybrid, re-ranked

evaluation:
  metrics: [p_at_1, p_at_5, p_at_10]

judge:
  model: opus
  criteria:
    - name: relevance
      description: "Would this principle actually help in the given context?"
    - name: actionability
      description: "Does the principle give concrete guidance for this situation?"
    - name: noise_level
      description: "Is this retrieval noise-free, or does it include irrelevant principles?"
```

**Step 4: Run test**

Run: `python -m pytest learning/gyms/apply/tests/test_apply_gym.py -v`

**Step 5: Commit**

```bash
git add learning/gyms/apply/
git commit -m "feat(gym): ApplyGym — /recall retrieval quality evaluation"
```

---

### Task 7: Run ApplyGym on Retroactive Study Corpus

**Execution task.**

**Step 1: Run ApplyGym**

Run: `python -m learning.gyms.apply.gym --limit 50`

**Step 2: Review P@1/P@5/P@10 numbers**

These are the baseline retrieval quality metrics. Record them in the LOGBOOK.

**Step 3: Update lib/gym/calibration/LOGBOOK.md**

Add ApplyGym baseline results.

**Step 4: Commit**

```bash
git add lib/gym/calibration/LOGBOOK.md learning/gyms/apply/corpus/
git commit -m "docs(apply): baseline retrieval quality metrics"
```

---

### Task 8: Documentation — README files

**Files:**
- Create: `lib/gym/README.md`
- Create: `lib/gym/calibration/README.md`

**Step 1: Write lib/gym/README.md**

Cover: judge.py API, how to add a stage gym, calibration methodology, existing gyms table.

**Step 2: Write lib/gym/calibration/README.md**

Cover: how to run calibration, perturbation types, monotonicity testing, interpreting reports.

**Step 3: Commit**

```bash
git add lib/gym/README.md lib/gym/calibration/README.md
git commit -m "docs(gym): README for judge framework and calibration"
```

---

## Summary

| Task | What | Est. time | Cost |
|------|------|-----------|------|
| 1 | `lib/gym/judge.py` + tests | 15 min | $0 |
| 2 | Perturbation generators + tests | 10 min | $0 |
| 3 | Monotonicity runner + LOGBOOK | 10 min | $0 |
| 4 | Distribution + report + CLI | 10 min | $0 |
| 5 | **Run calibration** (API calls) | 5 min | ~$3-5 |
| 6 | ApplyGym + tests | 15 min | $0 |
| 7 | **Run ApplyGym** (API calls) | 5 min | ~$1-2 |
| 8 | Documentation | 5 min | $0 |

Total: ~75 min code, ~$4-7 API cost.

Tasks 1-4 are pure code (no API calls). Task 5 is the first live calibration. Tasks 6-7 build and run the ApplyGym. Task 8 is documentation.
