# Unified Judge — `lib/eval/`

**Date**: 2026-03-02
**Status**: Design approved
**Depends on**: Judge calibration (#1 ✅ done)
**Unblocks**: ApplyGym, CleanupGym, AbstractGym, ExtractGym, vario ng critique

## Problem

Judge logic is duplicated across 6 locations:
- `lib/gym/judge.py` — RubricJudge + BinaryJudge (used by ApplyGym, lib/tune)
- `vario/blocks/critique.py` — inline score/verify parsing
- `learning/gyms/claim_extraction/gym.py` — inline judge
- `learning/gyms/badge/gym.py` — inline judge
- `learning/gyms/llm_task/gym.py` — inline judge
- `learning/gyms/fetchability/fetchability_tool.py` — inline judge

Each site has its own JSON parsing, prompt building, and error handling. Four of the six have no caching, and sampling temperature is inconsistent across sites (0.0, 0.1, or 0.3).

## Decision

Move judge to `lib/eval/` — a general-purpose evaluation library shared by gyms and vario ng. No shims — move and update all imports directly.

## Structure

```
lib/eval/
  __init__.py              # public API exports
  judge.py                 # RubricJudge, BinaryJudge, JudgeResult, RubricCriterion
  parse.py                 # parse_judge_json (shared JSON extraction)
  cache.py                 # JudgeCache (SQLite hash-based)
  calibration/
    __init__.py
    monotonicity.py        # MonotonicityResult, check_monotonicity
    perturbations.py       # PERTURBATION_TYPES, apply_perturbation
    tests/
      test_monotonicity.py
      test_perturbations.py
  tests/
    test_judge.py
    test_parse.py
```
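To illustrate the split, the shared extraction in `parse.py` could be as small as the sketch below. The fence-stripping and brace-scanning behavior are assumptions about what a shared extractor needs, not a description of the existing code:

```python
import json
import re

_FENCE = "`" * 3  # a literal markdown code fence


def parse_judge_json(text: str) -> dict:
    """Extract the first JSON object from an LLM judge response.

    Handles raw JSON, fenced ```json blocks, and JSON embedded in
    surrounding prose. Raises ValueError if nothing parses.
    """
    # Prefer a fenced json block if the model wrapped its answer.
    pattern = _FENCE + r"(?:json)?\s*(\{.*?\})\s*" + _FENCE
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    # Otherwise take the outermost brace pair and try to parse it.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return json.loads(text[start : end + 1])
    raise ValueError("no JSON object found in judge response")
```

Centralizing this means every caller gets the same fallback behavior and the same failure mode, instead of six slightly different regexes.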

## Changes

### Move (delete old, create new)

| From | To |
|---|---|
| `lib/gym/judge.py` | `lib/eval/judge.py` + `lib/eval/parse.py` + `lib/eval/cache.py` |
| `lib/gym/calibration/` | `lib/eval/calibration/` |
| `lib/gym/tests/test_judge.py` | `lib/eval/tests/test_judge.py` |

### Update imports (6 files)

| File | Old import | New import |
|---|---|---|
| `lib/gym/__init__.py` | `from lib.gym.judge import ...` | `from lib.eval import ...` |
| `lib/tune/evaluate.py` | `from lib.gym.judge import ...` | `from lib.eval import ...` |
| `learning/gyms/apply/gym.py` | `from lib.gym.judge import ...` | `from lib.eval import ...` |
| `lib/eval/tests/test_judge.py` (post-move path) | `from lib.gym.judge import ...` | `from lib.eval import ...` |
| `lib/eval/calibration/tests/test_monotonicity.py` (post-move path) | `from lib.gym.calibration...` | `from lib.eval.calibration...` |
| `lib/eval/calibration/tests/test_perturbations.py` (post-move path) | `from lib.gym.calibration...` | `from lib.eval.calibration...` |

### Small additions

- **`prompt_template` param** on RubricJudge — optional custom prompt with `{criteria}`, `{candidate}`, `{context}` placeholders. Default = current generic prompt. Enables reference+candidate comparison prompts without subclassing.
- **Split parse/cache into own files** — vario ng critique imports just `parse_judge_json` without pulling in judge classes.
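
The `prompt_template` addition can reduce to a `str.format` call over the three supported placeholders. The template text and `build_prompt` helper below are illustrative, not the actual default prompt:

```python
# Illustrative default template; the real one lives in lib/eval/judge.py.
DEFAULT_TEMPLATE = """You are a strict grader.

Criteria:
{criteria}

Candidate answer:
{candidate}

Context:
{context}

Respond with JSON: {{"score": <1-5>, "reasoning": "..."}}"""

# A comparison template a caller could pass instead of subclassing:
# the reference answer rides along in {context}.
COMPARISON_TEMPLATE = """Compare the candidate against the reference.

Reference (in context):
{context}

Candidate:
{candidate}

Criteria:
{criteria}

Respond with JSON: {{"score": <1-5>, "reasoning": "..."}}"""


def build_prompt(template: str, criteria: str, candidate: str, context: str) -> str:
    """Fill the three supported placeholders."""
    return template.format(criteria=criteria, candidate=candidate, context=context)
```

Because only the template changes, reference+candidate judging needs no new class, just a different string passed at construction time.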

### Wire vario ng critique to lib/eval

Replace `_parse_score_response()` in `vario/blocks/critique.py` with `from lib.eval.parse import parse_judge_json`. Keep the verify intent's custom parsing (different output structure).

### Migrate inline gym judges (stretch)

Not required now. Each gym can migrate incrementally by importing `RubricJudge` from `lib.eval` and deleting their inline judge code. Priority order: claim_extraction (most similar to RubricJudge), llm_task, badge, fetchability (most custom).

## Not in scope

- Multi-model panel scoring
- Auto-calibration before use
- Streaming judge interface
- Cost tracking integration
- Migrating all inline gym judges (stretch goal, not required)

## Public API (`lib/eval/__init__.py`)

```python
from lib.eval.judge import BinaryJudge, JudgeResult, RubricCriterion, RubricJudge
from lib.eval.parse import parse_judge_json
from lib.eval.cache import JudgeCache
```
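One plausible shape for the hash-based SQLite cache, sketched below. Only the `JudgeCache` name comes from the API above; the table name, key derivation, and method signatures are assumptions:

```python
import hashlib
import json
import sqlite3


class JudgeCache:
    """SQLite cache keyed by a hash of (model, prompt), so identical
    judge calls are answered from disk instead of re-billed."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS judge_cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # NUL separator prevents (model, prompt) boundary collisions.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str):
        row = self.conn.execute(
            "SELECT value FROM judge_cache WHERE key = ?",
            (self._key(model, prompt),),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, model: str, prompt: str, result: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO judge_cache VALUES (?, ?)",
            (self._key(model, prompt), json.dumps(result)),
        )
        self.conn.commit()
```

Hashing the full prompt (rather than storing it) keeps the cache small and makes the key insensitive to prompt length, at the cost of not being able to inspect cached prompts later.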

## Doc updates

- `lib/CLAUDE.md` — add lib/eval/ to table, update lib/gym/ description
- `learning/CLAUDE.md` — update references from lib/gym/judge to lib/eval
- `learning/TODO.md` — update roadmap item #2
