# Unified Judge — `lib/eval/` Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Move the judge framework from `lib/gym/judge.py` to `lib/eval/`, split it into parse/cache/judge modules, update all consumers, and wire the vario/ng critique block to the shared parser.

**Architecture:** Split `lib/gym/judge.py` (347 lines) into 3 focused modules under `lib/eval/`: `parse.py` (JSON extraction), `cache.py` (SQLite hash cache), `judge.py` (RubricJudge + BinaryJudge). Move `lib/gym/calibration/` alongside. Update all 6 consumer imports. Add `prompt_template` to RubricJudge for custom prompts.

**Tech Stack:** Python, asyncio, SQLite, lib/llm call_llm, pytest

---

### Task 1: Create `lib/eval/parse.py` — JSON extraction

**Files:**
- Create: `lib/eval/__init__.py` (empty for now)
- Create: `lib/eval/parse.py`
- Create: `lib/eval/tests/__init__.py`
- Create: `lib/eval/tests/test_parse.py`

**Step 1: Write the test file**

```python
# lib/eval/tests/test_parse.py
"""Tests for lib.eval.parse — JSON extraction from LLM responses."""

import json

import pytest

from lib.eval.parse import parse_judge_json


class TestParseJudgeJson:
    def test_clean_json(self):
        raw = '{"score": 80, "reason": "good"}'
        result = parse_judge_json(raw)
        assert result == {"score": 80, "reason": "good"}

    def test_markdown_fenced_json(self):
        raw = '```json\n{"score": 90, "reason": "great"}\n```'
        result = parse_judge_json(raw)
        assert result == {"score": 90, "reason": "great"}

    def test_json_with_preamble(self):
        raw = 'Here is my evaluation:\n\n{"score": 50, "reason": "mediocre"}'
        result = parse_judge_json(raw)
        assert result == {"score": 50, "reason": "mediocre"}

    def test_array_wrapped_json(self):
        raw = '[{"score": 75, "reason": "ok"}]'
        result = parse_judge_json(raw)
        assert result == {"score": 75, "reason": "ok"}

    def test_markdown_fenced_no_lang(self):
        raw = '```\n{"score": 60, "reason": "meh"}\n```'
        result = parse_judge_json(raw)
        assert result == {"score": 60, "reason": "meh"}

    def test_invalid_json_raises(self):
        with pytest.raises((json.JSONDecodeError, ValueError)):
            parse_judge_json("not json at all")
```

**Step 2: Run test to verify it fails**

Run: `python -m pytest lib/eval/tests/test_parse.py -v`
Expected: FAIL (ModuleNotFoundError: No module named 'lib.eval')

**Step 3: Create parse.py**

```python
# lib/eval/parse.py
"""JSON extraction from LLM judge responses.

Handles: clean JSON, markdown-fenced JSON, JSON with preamble, array-wrapped.
"""

import json
import re


def parse_judge_json(text: str) -> dict:
    """Parse JSON from LLM response.

    Handles: clean JSON, markdown-fenced JSON, JSON with preamble, array-wrapped.
    Raises json.JSONDecodeError or ValueError on failure.
    """
    text = text.strip()

    # Try direct parse first
    try:
        parsed = json.loads(text)
        if isinstance(parsed, list) and len(parsed) == 1:
            return parsed[0]
        return parsed
    except json.JSONDecodeError:
        pass

    # Try markdown fence extraction: ```json ... ``` or ``` ... ```
    fence_match = re.search(r"```(?:json)?\s*\n(.*?)\n```", text, re.DOTALL)
    if fence_match:
        inner = fence_match.group(1).strip()
        parsed = json.loads(inner)
        if isinstance(parsed, list) and len(parsed) == 1:
            return parsed[0]
        return parsed

    # Try finding first { ... } in the text (handles preamble)
    brace_match = re.search(r"\{.*\}", text, re.DOTALL)
    if brace_match:
        parsed = json.loads(brace_match.group(0))
        return parsed

    # Try finding [ ... ] for array-wrapped
    bracket_match = re.search(r"\[.*\]", text, re.DOTALL)
    if bracket_match:
        parsed = json.loads(bracket_match.group(0))
        if isinstance(parsed, list) and len(parsed) == 1:
            return parsed[0]
        return parsed

    raise ValueError(f"Could not parse JSON from LLM response: {text[:200]}")
```
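A quick illustration of why the fence regex above needs `re.DOTALL`: without that flag, `.` stops at newlines, so the non-greedy group cannot span a multi-line JSON body. A standalone sketch (independent of the module above):

```python
import re

text = '```json\n{"score": 90,\n "reason": "great"}\n```'
pattern = r"```(?:json)?\s*\n(.*?)\n```"

# Without DOTALL, the inner group cannot cross the newline inside the JSON body.
print(re.search(pattern, text))  # None

# With DOTALL, the full multi-line body is captured.
match = re.search(pattern, text, re.DOTALL)
print(match.group(1))
```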

Also create:
```python
# lib/eval/__init__.py
"""Unified evaluation library — judges, parsing, caching, calibration."""
```

```python
# lib/eval/tests/__init__.py
```

**Step 4: Run test to verify it passes**

Run: `python -m pytest lib/eval/tests/test_parse.py -v`
Expected: 6 passed

**Step 5: Commit**

```bash
git add lib/eval/__init__.py lib/eval/parse.py lib/eval/tests/__init__.py lib/eval/tests/test_parse.py
git commit -m "feat(eval): add lib/eval/parse.py — JSON extraction from LLM judge responses"
```

---

### Task 2: Create `lib/eval/cache.py` — SQLite judge cache

**Files:**
- Create: `lib/eval/cache.py`

**Step 1: Write cache.py**

Extract `_JudgeCache` and `_CREATE_TABLE_SQL` from `lib/gym/judge.py` into `lib/eval/cache.py`, renaming the class to `JudgeCache`. The class needs the `JudgeResult` dataclass, which lives in `lib/eval/judge.py` (created in Task 3).

**Decision:** Put the `JudgeResult` and `RubricCriterion` dataclasses in `lib/eval/judge.py` (Task 3) and import from there. Cache depends on judge for the `JudgeResult` type only — this is fine since cache is always used with judge, and judge imports cache lazily inside `__init__`, so there is no circular import.

```python
# lib/eval/cache.py
"""SQLite hash-based cache for judge results."""

import hashlib
import json
import sqlite3
from pathlib import Path

from lib.eval.judge import JudgeResult

_CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS judge_cache (
    key TEXT PRIMARY KEY,
    score REAL,
    reason TEXT,
    subscores TEXT,
    model TEXT,
    duration_ms INTEGER,
    created_at TEXT DEFAULT (datetime('now'))
)
"""


class JudgeCache:
    """SQLite hash-based cache for judge results."""

    def __init__(self, db_path: Path):
        self.db_path = db_path
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self._conn = sqlite3.connect(str(db_path))
        self._conn.execute(_CREATE_TABLE_SQL)
        self._conn.commit()

    def get(self, key: str) -> JudgeResult | None:
        """Retrieve cached result by key. Returns None on miss."""
        row = self._conn.execute(
            "SELECT score, reason, subscores, model, duration_ms FROM judge_cache WHERE key = ?",
            (key,),
        ).fetchone()
        if row is None:
            return None
        return JudgeResult(
            score=row[0],
            reason=row[1],
            subscores=json.loads(row[2]) if row[2] else {},
            model=row[3] or "",
            duration_ms=row[4] or 0,
            cached=True,
        )

    def put(self, key: str, result: JudgeResult):
        """Store a result in cache. Overwrites if key exists."""
        self._conn.execute(
            """INSERT OR REPLACE INTO judge_cache (key, score, reason, subscores, model, duration_ms)
               VALUES (?, ?, ?, ?, ?, ?)""",
            (
                key,
                result.score,
                result.reason,
                json.dumps(result.subscores),
                result.model,
                result.duration_ms,
            ),
        )
        self._conn.commit()

    @staticmethod
    def hash_key(model: str, prompt: str, candidate: str) -> str:
        """Deterministic hash from model + prompt + candidate."""
        content = f"{model}|{prompt}|{candidate}"
        return hashlib.sha256(content.encode()).hexdigest()
```

Note: `cache.py` imports `JudgeResult` from `lib.eval.judge` at module level, while `judge.py` imports `JudgeCache` lazily inside `__init__`, so there is no circular dependency.

No separate test file for cache — cache tests already exist in test_judge.py and will move with it.

**Step 2: No separate commit for cache — bundled with Task 3.**

---

### Task 3: Create `lib/eval/judge.py` — RubricJudge, BinaryJudge + prompt_template

**Files:**
- Create: `lib/eval/judge.py`
- Create: `lib/eval/cache.py`
- Move: `lib/gym/tests/test_judge.py` → `lib/eval/tests/test_judge.py`

**Step 1: Write lib/eval/judge.py**

Split from `lib/gym/judge.py`: keep dataclasses + judge classes, import parse from `lib.eval.parse`, import cache inline to avoid circular deps.

```python
# lib/eval/judge.py
"""Unified LLM judge framework.

Two judge types:
- RubricJudge: 0-100 + subscores (the common pattern across existing judges)
- BinaryJudge: verdict-based (relevant/irrelevant, repair/not_repair)

Design decisions:
- No repeats — calibration shows same-model repeats add zero info (Opus sigma=0 at temp=1.0)
- Default model = opus — calibration shows it discriminates (20-72 range)
- Temperature = 0.0 for deterministic scoring
- SQLite hash-based cache — sha256(model|prompt|candidate) -> cached result
- Uses lib/llm call_llm for all LLM calls
"""

import asyncio
import time
from dataclasses import dataclass, field
from pathlib import Path

from lib.llm import call_llm

from lib.eval.parse import parse_judge_json


# ---------------------------------------------------------------------------
# Data classes
# ---------------------------------------------------------------------------


@dataclass
class JudgeResult:
    """Result from a judge evaluation."""

    score: float  # 0-100 for rubric, 0.0/1.0 for binary
    reason: str
    subscores: dict[str, float] = field(default_factory=dict)
    model: str = ""
    duration_ms: int = 0
    cached: bool = False


@dataclass
class RubricCriterion:
    """A single criterion in a rubric."""

    name: str
    description: str
    weight: float = 1.0


# ---------------------------------------------------------------------------
# Default prompt template
# ---------------------------------------------------------------------------

_DEFAULT_RUBRIC_TEMPLATE = """Evaluate the following candidate on these criteria:

{criteria}
{context}

Candidate:
{candidate}

Respond with JSON only (no markdown fencing). Format:
{{"score": <0-100 overall>, "reason": "<brief explanation>", "subscores": {{{subscores_schema}}}}}"""


# ---------------------------------------------------------------------------
# RubricJudge
# ---------------------------------------------------------------------------


class RubricJudge:
    """0-100 + subscores judge. Single LLM call per score.

    Args:
        criteria: List of RubricCriterion for scoring.
        model: LLM model alias (default "opus").
        system: Optional system prompt.
        cache_db: Optional Path to SQLite cache database.
        prompt_template: Optional custom prompt with {criteria}, {candidate},
            {context}, {subscores_schema} placeholders.
    """

    def __init__(
        self,
        criteria: list[RubricCriterion],
        model: str = "opus",
        system: str = "",
        cache_db: Path | None = None,
        prompt_template: str = "",
    ):
        self.criteria = criteria
        self.model = model
        self.system = system
        self.prompt_template = prompt_template or _DEFAULT_RUBRIC_TEMPLATE
        self._cache = None
        if cache_db:
            from lib.eval.cache import JudgeCache
            self._cache = JudgeCache(cache_db)

    def _build_prompt(self, candidate: str, context: dict) -> str:
        """Build the scoring prompt with criteria and candidate."""
        criteria_lines = []
        for c in self.criteria:
            criteria_lines.append(f"- **{c.name}** (weight {c.weight}): {c.description}")
        criteria_block = "\n".join(criteria_lines)

        context_block = ""
        if context:
            context_lines = [f"- {k}: {v}" for k, v in context.items()]
            context_block = "\n\nContext:\n" + "\n".join(context_lines)

        subscores_schema = ", ".join(f'"{c.name}": <0-100>' for c in self.criteria)

        return self.prompt_template.format(
            criteria=criteria_block,
            context=context_block,
            candidate=candidate,
            subscores_schema=subscores_schema,
        )

    async def score(self, candidate: str, context: dict | None = None) -> JudgeResult:
        """Score a single candidate. Returns JudgeResult."""
        context = context or {}
        prompt = self._build_prompt(candidate, context)

        # Check cache
        cache_key = ""
        if self._cache:
            from lib.eval.cache import JudgeCache
            cache_key = JudgeCache.hash_key(self.model, prompt, candidate)
            cached = self._cache.get(cache_key)
            if cached is not None:
                return cached

        t0 = time.monotonic()
        raw = await call_llm(
            self.model,
            prompt,
            system=self.system or None,
            temperature=0.0,
        )
        duration_ms = int((time.monotonic() - t0) * 1000)

        parsed = parse_judge_json(raw)
        result = JudgeResult(
            score=float(parsed.get("score", 0)),
            reason=parsed.get("reason", ""),
            subscores={k: float(v) for k, v in parsed.get("subscores", {}).items()},
            model=self.model,
            duration_ms=duration_ms,
            cached=False,
        )

        # Store in cache
        if self._cache and cache_key:
            self._cache.put(cache_key, result)

        return result

    async def score_batch(
        self, items: list[tuple[str, dict]], concurrency: int = 5
    ) -> list[JudgeResult]:
        """Score multiple candidates with bounded concurrency. Preserves order."""
        semaphore = asyncio.Semaphore(concurrency)
        results: list[JudgeResult | None] = [None] * len(items)

        async def _score_one(idx: int, candidate: str, context: dict):
            async with semaphore:
                results[idx] = await self.score(candidate, context)

        tasks = [
            asyncio.create_task(_score_one(i, cand, ctx))
            for i, (cand, ctx) in enumerate(items)
        ]
        await asyncio.gather(*tasks)
        return results  # type: ignore[return-value]


# ---------------------------------------------------------------------------
# BinaryJudge
# ---------------------------------------------------------------------------


class BinaryJudge:
    """Verdict-based judge. Returns score=1.0 for first verdict, 0.0 for second."""

    def __init__(
        self,
        verdicts: list[str],
        model: str = "opus",
        system: str = "",
        cache_db: Path | None = None,
    ):
        if len(verdicts) != 2:
            msg = f"BinaryJudge requires exactly 2 verdicts, got {len(verdicts)}"
            raise ValueError(msg)
        self.verdicts = verdicts
        self.model = model
        self.system = system
        self._cache = None
        if cache_db:
            from lib.eval.cache import JudgeCache
            self._cache = JudgeCache(cache_db)

    def _build_prompt(self, candidate: str, context: dict) -> str:
        """Build the verdict prompt."""
        context_block = ""
        if context:
            context_lines = [f"- {k}: {v}" for k, v in context.items()]
            context_block = "\n\nContext:\n" + "\n".join(context_lines)

        return f"""Classify the following candidate into one of these verdicts: {self.verdicts[0]} or {self.verdicts[1]}.
{context_block}

Candidate:
{candidate}

Respond with JSON only (no markdown fencing). Format:
{{"verdict": "<{self.verdicts[0]} or {self.verdicts[1]}>", "reason": "<brief explanation>"}}"""

    async def score(self, candidate: str, context: dict | None = None) -> JudgeResult:
        """Score a single candidate. Returns JudgeResult with score 1.0 or 0.0."""
        context = context or {}
        prompt = self._build_prompt(candidate, context)

        # Check cache
        cache_key = ""
        if self._cache:
            from lib.eval.cache import JudgeCache
            cache_key = JudgeCache.hash_key(self.model, prompt, candidate)
            cached = self._cache.get(cache_key)
            if cached is not None:
                return cached

        t0 = time.monotonic()
        raw = await call_llm(
            self.model,
            prompt,
            system=self.system or None,
            temperature=0.0,
        )
        duration_ms = int((time.monotonic() - t0) * 1000)

        parsed = parse_judge_json(raw)
        verdict = parsed.get("verdict", "").strip().lower()
        first_verdict = self.verdicts[0].lower()

        result = JudgeResult(
            score=1.0 if verdict == first_verdict else 0.0,
            reason=parsed.get("reason", ""),
            model=self.model,
            duration_ms=duration_ms,
            cached=False,
        )

        # Store in cache
        if self._cache and cache_key:
            self._cache.put(cache_key, result)

        return result
```
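The `score_batch` concurrency pattern in isolation: a semaphore bounds the number of in-flight calls, and results are written by index, so output order matches input order even when tasks finish out of order. A standalone sketch with a stand-in `score_one` in place of a real judge call:

```python
import asyncio
import random

async def score_one(x: int) -> int:
    # Stand-in for judge.score(): tasks finish in random order.
    await asyncio.sleep(random.random() * 0.01)
    return x * 10

async def score_batch(items: list[int], concurrency: int = 3) -> list[int]:
    sem = asyncio.Semaphore(concurrency)        # bound in-flight calls
    results: list[int | None] = [None] * len(items)

    async def run(idx: int, item: int):
        async with sem:
            results[idx] = await score_one(item)  # write by index: order preserved

    await asyncio.gather(*(run(i, x) for i, x in enumerate(items)))
    return results

print(asyncio.run(score_batch([1, 2, 3, 4])))  # [10, 20, 30, 40]
```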

**Step 2: Write cache.py** using the code from Task 2, with `from lib.eval.judge import JudgeResult` as the import.

**Step 3: Move and update test file**

Copy `lib/gym/tests/test_judge.py` → `lib/eval/tests/test_judge.py` with these import changes:
- `from lib.gym.judge import ...` → `from lib.eval.judge import ...`
- `from lib.gym.judge import parse_judge_json` → `from lib.eval.parse import parse_judge_json`
- Mock patch target: `lib.gym.judge.call_llm` → `lib.eval.judge.call_llm`
- `_JudgeCache` → import `JudgeCache` from `lib.eval.cache`

Additionally add a test for `prompt_template`:

```python
class TestRubricJudgePromptTemplate:
    def test_custom_template_used(self):
        criteria = [RubricCriterion(name="accuracy", description="Is it accurate?")]
        template = """Reference: {context}

Candidate: {candidate}

Criteria: {criteria}

Score JSON: {{"score": <0-100>, "subscores": {{{subscores_schema}}}}}"""
        judge = RubricJudge(criteria=criteria, prompt_template=template)
        prompt = judge._build_prompt("test candidate", {"reference": "gold standard"})
        assert "Reference:" in prompt
        assert "test candidate" in prompt
        assert "accuracy" in prompt
```
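One subtlety worth remembering when writing custom templates: the doubled braces in `_DEFAULT_RUBRIC_TEMPLATE` are `str.format` escapes, so `{{`/`}}` render as literal `{`/`}` and the JSON schema survives formatting. A minimal check:

```python
# {{ -> literal {, {subscores_schema} -> substituted, }} -> literal }
template = '{{"score": <0-100>, "subscores": {{{subscores_schema}}}}}'
print(template.format(subscores_schema='"accuracy": <0-100>'))
# {"score": <0-100>, "subscores": {"accuracy": <0-100>}}
```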

**Step 4: Run tests**

Run: `python -m pytest lib/eval/tests/ -v`
Expected: All pass

**Step 5: Commit**

```bash
git add lib/eval/judge.py lib/eval/cache.py lib/eval/tests/test_judge.py
git commit -m "feat(eval): add lib/eval/judge.py + cache.py — RubricJudge, BinaryJudge with prompt_template support"
```

---

### Task 4: Move calibration to `lib/eval/calibration/`

**Files:**
- Create: `lib/eval/calibration/__init__.py`
- Create: `lib/eval/calibration/monotonicity.py` (from `lib/gym/calibration/monotonicity.py`)
- Create: `lib/eval/calibration/perturbations.py` (from `lib/gym/calibration/perturbations.py`)
- Create: `lib/eval/calibration/tests/__init__.py`
- Create: `lib/eval/calibration/tests/test_monotonicity.py` (updated imports)
- Create: `lib/eval/calibration/tests/test_perturbations.py` (updated imports)

**Step 1: Copy perturbations.py** (no import changes needed — it has no lib.gym imports)

Copy `lib/gym/calibration/perturbations.py` → `lib/eval/calibration/perturbations.py` unchanged.

**Step 2: Copy and update monotonicity.py**

Change these imports:
```python
# Old
from lib.gym.calibration.perturbations import PERTURBATION_TYPES, apply_perturbation
from lib.gym.judge import RubricJudge

# New
from lib.eval.calibration.perturbations import PERTURBATION_TYPES, apply_perturbation
from lib.eval.judge import RubricJudge
```

**Step 3: Copy and update test files**

`test_perturbations.py` — change:
```python
# Old
from lib.gym.calibration.perturbations import ...

# New
from lib.eval.calibration.perturbations import ...
```

`test_monotonicity.py` — change:
```python
# Old
from lib.gym.calibration.monotonicity import MonotonicityResult, _aggregate_results

# New
from lib.eval.calibration.monotonicity import MonotonicityResult, _aggregate_results
```

**Step 4: Create empty `__init__.py` files**

```python
# lib/eval/calibration/__init__.py
# lib/eval/calibration/tests/__init__.py
```

**Step 5: Run tests**

Run: `python -m pytest lib/eval/calibration/tests/ -v`
Expected: All pass (52 tests — 18 monotonicity + 34 perturbations)

**Step 6: Commit**

```bash
git add lib/eval/calibration/
git commit -m "feat(eval): move calibration (monotonicity + perturbations) to lib/eval/calibration/"
```

---

### Task 5: Update `lib/eval/__init__.py` — public API

**Files:**
- Modify: `lib/eval/__init__.py`

**Step 1: Write the public API**

```python
# lib/eval/__init__.py
"""Unified evaluation library — judges, parsing, caching, calibration."""

from lib.eval.cache import JudgeCache
from lib.eval.judge import BinaryJudge, JudgeResult, RubricCriterion, RubricJudge
from lib.eval.parse import parse_judge_json

__all__ = [
    "BinaryJudge",
    "JudgeCache",
    "JudgeResult",
    "RubricCriterion",
    "RubricJudge",
    "parse_judge_json",
]
```

**Step 2: Verify import works**

Run: `python -c "from lib.eval import RubricJudge, BinaryJudge, JudgeResult, RubricCriterion, parse_judge_json, JudgeCache; print('OK')"`
Expected: OK

**Step 3: Commit**

```bash
git add lib/eval/__init__.py
git commit -m "feat(eval): public API exports in lib/eval/__init__.py"
```

---

### Task 6: Update all consumers — switch imports from lib.gym to lib.eval

**Files:**
- Modify: `lib/gym/__init__.py`
- Modify: `lib/tune/evaluate.py`
- Modify: `learning/gyms/apply/gym.py`

**Step 1: Update lib/gym/__init__.py**

```python
# Old
from lib.gym.judge import BinaryJudge, JudgeResult, RubricCriterion, RubricJudge

# New
from lib.eval import BinaryJudge, JudgeResult, RubricCriterion, RubricJudge
```

**Step 2: Update lib/tune/evaluate.py**

```python
# Old (line 16)
from lib.gym.judge import RubricCriterion, RubricJudge

# New
from lib.eval import RubricCriterion, RubricJudge
```

**Step 3: Update learning/gyms/apply/gym.py**

```python
# Old (line 30)
from lib.gym.judge import RubricCriterion, RubricJudge

# New
from lib.eval import RubricCriterion, RubricJudge
```

**Step 4: Run all judge-related tests**

Run: `python -m pytest lib/eval/tests/ lib/eval/calibration/tests/ -v`
Expected: All pass

**Step 5: Commit**

```bash
git add lib/gym/__init__.py lib/tune/evaluate.py learning/gyms/apply/gym.py
git commit -m "refactor: update all consumers to import from lib.eval instead of lib.gym.judge"
```

---

### Task 7: Delete old files from lib/gym/

**Files:**
- Delete: `lib/gym/judge.py`
- Delete: `lib/gym/tests/test_judge.py`
- Delete: `lib/gym/calibration/monotonicity.py`
- Delete: `lib/gym/calibration/perturbations.py`
- Delete: `lib/gym/calibration/tests/test_monotonicity.py`
- Delete: `lib/gym/calibration/tests/test_perturbations.py`
- Delete: `lib/gym/calibration/tests/__init__.py`
- Delete: `lib/gym/calibration/__init__.py`

**Step 1: Delete old files**

```bash
trash lib/gym/judge.py
trash lib/gym/tests/test_judge.py
trash lib/gym/calibration/monotonicity.py
trash lib/gym/calibration/perturbations.py
trash lib/gym/calibration/tests/test_monotonicity.py
trash lib/gym/calibration/tests/test_perturbations.py
trash lib/gym/calibration/tests/__init__.py
trash lib/gym/calibration/__init__.py
```

**Step 2: Verify nothing references old paths**

Run: `grep -r "lib.gym.judge\|lib.gym.calibration\|lib/gym/judge\|lib/gym/calibration" --include="*.py" .`
Expected: No matches (or only in docs/plans which is fine)

**Step 3: Run full test suite for affected modules**

Run: `python -m pytest lib/eval/ lib/gym/ -v`
Expected: All pass (lib/gym/ tests only have test_base.py left)

**Step 4: Commit**

```bash
git add -u lib/gym/
git commit -m "refactor: remove old lib/gym/judge.py and lib/gym/calibration/ (moved to lib/eval/)"
```

---

### Task 8: Wire vario/ng critique to lib/eval parser

**Files:**
- Modify: `vario/blocks/critique.py`

**Step 1: Replace `_parse_score_response` with `parse_judge_json`**

In `vario/blocks/critique.py`, the `_parse_score_response` function (lines 66-97) handles JSON parsing + fallbacks. Replace its JSON parsing with `parse_judge_json` but keep the score clamping and fallback logic that's specific to critique:

```python
# Add import at top
from lib.eval.parse import parse_judge_json

# Replace _parse_score_response:
def _parse_score_response(response: str) -> dict[str, Any]:
    """Parse JSON from LLM response, with fallbacks."""
    try:
        data = parse_judge_json(response)
        if isinstance(data, dict):
            return {
                k: max(0.0, min(100.0, float(v))) if isinstance(v, (int, float)) else v
                for k, v in data.items()
            }
    except (json.JSONDecodeError, ValueError):
        pass

    # Fallback: Score: N / Critique: text
    result: dict[str, Any] = {}
    score_match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", response)
    if score_match:
        result["score"] = max(0.0, min(100.0, float(score_match.group(1))))
    critique_match = re.search(r"Critique:\s*(.*?)(?=Score:|$)", response, re.DOTALL)
    if critique_match:
        result["reason"] = critique_match.group(1).strip()
    if result:
        return result

    # Last resort: any number
    num_match = re.search(r"(\d+(?:\.\d+)?)", response)
    if num_match:
        return {"score": max(0.0, min(100.0, float(num_match.group(1))))}

    return {}
```
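The fallback regexes in action on a non-JSON response (standalone, using the same patterns as the function above):

```python
import re

response = "Critique: Clear but verbose.\nScore: 85"

score_match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", response)
# DOTALL lets the critique span lines; the lookahead stops it before "Score:".
critique_match = re.search(r"Critique:\s*(.*?)(?=Score:|$)", response, re.DOTALL)

print(float(score_match.group(1)))       # 85.0
print(critique_match.group(1).strip())   # Clear but verbose.
```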

**Step 2: Run vario ng tests**

Run: `python -m pytest vario/tests/test_critique.py -v`
Expected: All pass

**Step 3: Commit**

```bash
git add vario/blocks/critique.py
git commit -m "refactor: wire vario/ng critique to lib.eval.parse for JSON extraction"
```

---

### Task 9: Update documentation

**Files:**
- Modify: `lib/CLAUDE.md` — add lib/eval/ to tables
- Modify: `learning/CLAUDE.md` — update references from lib/gym/judge to lib/eval
- Modify: `learning/TODO.md` — mark roadmap item #2 progress

**Step 1: Update lib/CLAUDE.md**

Add to the Core table:
```
| `eval/`             | Unified LLM judge: RubricJudge, BinaryJudge, scoring cache, calibration |
```

Update the Internal table entry for `gym/`:
```
| `gym/`               | Base classes for gen→eval→learn loops (Candidate, GymBase)        |
```
(Remove "judge" reference since it moved to eval/)

**Step 2: Update learning/CLAUDE.md**

Change references from `lib/gym/judge.py` to `lib/eval`:
- "The unified judge framework (`lib/eval`)"
- "`lib/eval/calibration/` — first-class concern"

**Step 3: Update learning/TODO.md**

Mark roadmap item #2 as in-progress or done:
- `lib/eval/judge.py` — extracted from lib/gym, shared with vario ng | ✅ done

**Step 4: Commit**

```bash
git add lib/CLAUDE.md learning/CLAUDE.md learning/TODO.md
git commit -m "docs: update references from lib/gym/judge to lib/eval"
```

---

## Summary

| Task | What | Files touched |
|------|------|--------------|
| 1 | `lib/eval/parse.py` + tests | 4 new |
| 2 | (merged into 3) | — |
| 3 | `lib/eval/judge.py` + `cache.py` + tests | 3 new |
| 4 | Move calibration | 6 new |
| 5 | `__init__.py` public API | 1 modified |
| 6 | Update consumer imports | 3 modified |
| 7 | Delete old files | 8 deleted |
| 8 | Wire vario/ng critique | 1 modified |
| 9 | Update docs | 3 modified |

Total: ~13 new files, 7 modified, 8 deleted. Net change: +5 files (split from 1 monolith into 3 focused modules + moved calibration).
