# Assessor Gym Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Build a calibration gym for person_intel assessor prompts, backed by a ground-truth corpus

**Architecture:** Click CLI gym at `intel/people/gym/` that runs assessor functions against a JSONL corpus of people with expected score ranges, computes calibration metrics (range accuracy, ordering, spread), and renders a report.

**Tech Stack:** Python, Click, asyncio, lib/llm (call_llm_json), JSONL corpus

---

### Task 1: Corpus fixtures

**Files:**
- Create: `intel/people/gym/__init__.py`
- Create: `intel/people/gym/corpus.jsonl`

**Step 1: Create package init**

```python
"""Person intel assessor calibration gym."""
```

**Step 2: Build corpus JSONL**

Create 15 entries with realistic enrichment + research data and expected score ranges. Include a mix of:
- 5 real famous people (Hinton, LeCun, Fei-Fei Li, Karpathy, Jensen Huang) with accurate data
- 5 archetype real people (Patrick Collison, Satya Nadella, Daphne Koller, etc.)
- 5 synthetic archetypes (ML PhD, startup CTO, junior PM, sales VP, solo founder)

Each entry: `{"id", "name", "enrichment": {...}, "research": {...}, "expected": {"academic": [min, max], "prior_success": [min, max], "network_quality": [min, max], "technical_depth": [min, max]}}`
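A single corpus line might look like the sketch below. All field values are illustrative placeholders, not real data; the exact `enrichment`/`research` sub-fields should mirror whatever the real pipeline emits.

```python
import json

# Illustrative corpus entry -- every value here is a placeholder.
entry = {
    "id": "synthetic-ml-phd",
    "name": "Synthetic ML PhD",
    "enrichment": {
        "title": "PhD Candidate, Machine Learning",
        "company": "Example University",  # hypothetical field contents
    },
    "research": {
        "summary": "Three first-author papers at top ML venues.",
    },
    "expected": {
        "academic": [7, 9],
        "prior_success": [2, 4],
        "network_quality": [4, 6],
        "technical_depth": [7, 9],
    },
}

# Each corpus record is one JSON object serialized as a single JSONL line.
line = json.dumps(entry)
```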

**Step 3: Commit**

```bash
git add intel/people/gym/
git commit -m "feat(gym): assessor calibration corpus with 15 ground truth entries"
```

### Task 2: Gym core + run command

**Files:**
- Create: `intel/people/gym/gym.py`
- Create: `intel/people/gym/__main__.py`

**Step 1: Write gym.py**

AssessorGym class with:
- `load_corpus()` — load JSONL
- `run_assessor(name, model)` — run one assessor on all corpus entries
- `run_scorer(model)` — run core 3-dimension scoring on all corpus entries
- `compute_metrics(actual, expected)` — range accuracy, Kendall τ, spread

The key step: build the `context` dict that assessors expect from the corpus entry data, then call each assessor function directly (no pipeline involved).
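The metrics half of `compute_metrics` can be sketched in pure Python, assuming `actual` maps person id to score and `expected` maps person id to `[lo, hi]`. The Kendall τ here is the tie-free τ-a variant; `scipy.stats.kendalltau` could replace it if scipy is available.

```python
from itertools import combinations
from statistics import pstdev

def kendall_tau(xs, ys):
    """Tau-a rank correlation between two equal-length score lists (no tie handling)."""
    pairs = list(combinations(range(len(xs)), 2))
    concordant = sum(1 for i, j in pairs if (xs[i] - xs[j]) * (ys[i] - ys[j]) > 0)
    discordant = sum(1 for i, j in pairs if (xs[i] - xs[j]) * (ys[i] - ys[j]) < 0)
    return (concordant - discordant) / len(pairs) if pairs else 0.0

def compute_metrics(actual, expected):
    """actual: {person_id: score}; expected: {person_id: [lo, hi]}."""
    ids = sorted(expected)
    scores = [actual[i] for i in ids]
    # Ordering is checked against the midpoints of the expected ranges.
    midpoints = [(expected[i][0] + expected[i][1]) / 2 for i in ids]
    in_range = [expected[i][0] <= actual[i] <= expected[i][1] for i in ids]
    return {
        "range_accuracy": sum(in_range) / len(ids),
        "kendall_tau": kendall_tau(scores, midpoints),
        "spread": pstdev(scores),
    }
```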

**Step 2: Write __main__.py with run command**

Click CLI:
- `run` — runs all assessors + scorer, saves results to `results/` dir
- `--assessor NAME` — run single assessor
- `--model MODEL` — override model (default: gemini-flash)
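A minimal `__main__.py` skeleton for the options above might look like this (the `AssessorGym` wiring is elided, since it lands in gym.py):

```python
import click

@click.group()
def cli():
    """Assessor calibration gym."""

@cli.command()
@click.option("--assessor", default=None, help="Run a single assessor by name.")
@click.option("--model", default="gemini-flash", help="Override the LLM model.")
def run(assessor, model):
    """Run assessors against the corpus and save results to results/."""
    # Hypothetical wiring: instantiate AssessorGym (Task 2, Step 1) here
    # and dispatch to run_assessor / run_scorer via asyncio.run(...).
    click.echo(f"running assessor={assessor or 'all'} model={model}")

if __name__ == "__main__":
    cli()
```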

**Step 3: Test it**

```bash
python -m intel.people.gym run --assessor academic
```

**Step 4: Commit**

```bash
git add intel/people/gym/
git commit -m "feat(gym): assessor gym run command"
```

### Task 3: Calibration report

**Files:**
- Create: `intel/people/gym/report.py`
- Modify: `intel/people/gym/__main__.py`

**Step 1: Write report.py**

Compute and display:
- Per-assessor table: person, expected range, actual score, in-range?
- Summary metrics: range accuracy %, Kendall τ, score spread (σ), cluster %
- Highlight violations (score outside expected range)
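A plain-text rendering of the per-assessor table could look like the sketch below; the row tuple shape is an assumption about what the cached results hold.

```python
def render_report(rows):
    """rows: list of (name, lo, hi, score) tuples from cached results."""
    lines = [f"{'person':<20} {'expected':<10} {'actual':>6}  in-range?"]
    hits = 0
    for name, lo, hi, score in rows:
        ok = lo <= score <= hi
        hits += ok
        # Violations (score outside the expected range) are flagged loudly.
        flag = "yes" if ok else "NO  <-- violation"
        lines.append(f"{name:<20} {f'{lo}-{hi}':<10} {score:>6}  {flag}")
    lines.append(f"range accuracy: {hits / len(rows):.0%}")
    return "\n".join(lines)
```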

**Step 2: Add report command to CLI**

`python -m intel.people.gym report` — reads cached results, shows report

**Step 3: Commit**

```bash
git add intel/people/gym/
git commit -m "feat(gym): calibration report with range accuracy + ordering metrics"
```

### Task 4: Cross-model command

**Files:**
- Modify: `intel/people/gym/__main__.py`
- Modify: `intel/people/gym/gym.py`

**Step 1: Add cross_model method to AssessorGym**

Run the assessor with multiple models and compute pairwise Kendall τ between the models' score orderings.

**Step 2: Add cross-model CLI command**

`python -m intel.people.gym cross-model --models gemini-flash,haiku,sonnet`

**Step 3: Commit**

### Task 5: Harvest command

**Files:**
- Modify: `intel/people/gym/__main__.py`

**Step 1: Add harvest command**

Reads real pipeline output from `intel/people/data/{slug}/` plus the jobs DB, and packages it into a corpus JSONL entry with user-supplied expected ranges.

```bash
python -m intel.people.gym harvest yann-lecun --expected-academic 9-10
```
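The harvest step might be sketched as below. The `enrichment.json` / `research.json` filenames are assumptions about the pipeline's on-disk layout, and the jobs-DB lookup is elided.

```python
import json
from pathlib import Path

def parse_range(spec):
    """Parse a CLI range like '9-10' into [min, max]."""
    lo, hi = spec.split("-", 1)
    return [int(lo), int(hi)]

def harvest(slug, expected, data_dir="intel/people/data"):
    """Package a real pipeline run into a corpus entry (sketch).

    Assumes enrichment.json / research.json filenames under the slug dir;
    the real pipeline layout may differ.
    """
    base = Path(data_dir) / slug
    return {
        "id": slug,
        "name": slug.replace("-", " ").title(),
        "enrichment": json.loads((base / "enrichment.json").read_text()),
        "research": json.loads((base / "research.json").read_text()),
        "expected": {dim: parse_range(spec) for dim, spec in expected.items()},
    }
```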

**Step 2: Commit**
