# Assessor Calibration Gym — Design

**Goal**: Systematically calibrate person_intel scoring and assessor prompts against ground truth people with known expected score ranges.

**Problem**: The person_intel handler scores people on 3 core dimensions (prior_success, network_quality, technical_depth) and runs prowess assessors (academic today; engineering, publication, and media planned). Without calibration data, we don't know whether:
- A serial entrepreneur gets 7 or 4 on prior_success
- An h-index-40 researcher gets 9 or 6 on academic
- The scoring prompts use the full 0-10 range or cluster in 5-8

**Location**: `intel/people/gym/`

## Corpus Design

JSONL file (`intel/people/gym/corpus.jsonl`), one JSON object per line, ~15-20 entries. Each entry (pretty-printed here for readability) contains:

```json
{
  "id": "yann-lecun",
  "name": "Yann LeCun",
  "enrichment": { ... realistic enrichment.json data ... },
  "research": { ... realistic research stage output ... },
  "expected": {
    "academic": [9, 10],
    "prior_success": [7, 9],
    "network_quality": [8, 10],
    "technical_depth": [9, 10]
  }
}
```

The `enrichment` and `research` fields mirror real pipeline output structure — the gym feeds them directly to the assessor/scoring functions, bypassing web search and API calls.
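Loading the corpus is a one-liner per entry; a minimal sketch (the path is the one above, the blank-line skip is a convenience assumption for hand-edited files):

```python
import json
from pathlib import Path

def load_corpus(path="intel/people/gym/corpus.jsonl"):
    """Parse one JSON object per non-blank line of the corpus file."""
    return [
        json.loads(line)
        for line in Path(path).read_text().splitlines()
        if line.strip()
    ]
```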

### Corpus Composition

15 people spanning the full range across all dimensions:

| Person | Academic | Prior Success | Network | Technical |
|--------|----------|---------------|---------|-----------|
| Geoffrey Hinton | 10 | 8-9 | 9-10 | 9-10 |
| Yann LeCun | 9-10 | 7-9 | 8-10 | 9-10 |
| Fei-Fei Li | 9-10 | 7-8 | 8-9 | 9-10 |
| Andrej Karpathy | 7-8 | 6-8 | 8-9 | 8-9 |
| Daphne Koller | 8-9 | 8-9 | 8-9 | 8-9 |
| Jensen Huang | 3-4 | 9-10 | 9-10 | 6-7 |
| Patrick Collison | 2-3 | 9-10 | 9-10 | 6-8 |
| Satya Nadella | 3-4 | 8-9 | 9-10 | 5-6 |
| Typical ML PhD (synthetic) | 5-6 | 2-3 | 3-4 | 5-6 |
| Startup CTO (synthetic) | 3-4 | 4-5 | 4-5 | 6-7 |
| Mid-level eng (synthetic) | 1-2 | 1-2 | 2-3 | 4-5 |
| Junior PM (synthetic) | 0-1 | 0-1 | 1-2 | 0-1 |
| Sales VP (synthetic) | 0-1 | 3-4 | 5-6 | 0-1 |
| Solo founder (synthetic) | 1-2 | 2-3 | 2-3 | 3-4 |
| Research scientist (synthetic) | 6-7 | 3-4 | 4-5 | 7-8 |

Real people have realistic but hand-curated enrichment/research data (we know their h-indices, company histories, etc.). Synthetic people have fabricated but plausible data.

## Calibration Metrics

Per assessor/dimension:

| Metric | Computation | Target |
|--------|-------------|--------|
| Range accuracy | % entries where score ∈ [expected_min, expected_max] | >80% |
| Ordering (Kendall τ) | Rank correlation: expected midpoint vs actual score | τ > 0.8 |
| Spread (σ) | Standard deviation of actual scores | σ > 2.0 |
| Cluster % | % of scores in densest 2-point band | <50% |

Cross-model: run with 2+ models, compute pairwise Kendall τ on the ordering. Target: τ > 0.7.
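The four per-dimension metrics can be computed with the stdlib alone; a sketch (naive O(n²) Kendall τ-a — the real report might use `scipy.stats.kendalltau` instead; "densest 2-point band" is interpreted as a sliding width-2 window over the 0-10 scale):

```python
from itertools import combinations
from statistics import pstdev

def kendall_tau(xs, ys):
    """Naive tau-a: (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    c = d = 0
    for i, j in pairs:
        sx = (xs[i] > xs[j]) - (xs[i] < xs[j])
        sy = (ys[i] > ys[j]) - (ys[i] < ys[j])
        if sx * sy > 0:
            c += 1
        elif sx * sy < 0:
            d += 1
    return (c - d) / len(pairs)

def calibration_metrics(entries):
    """entries: list of (expected_min, expected_max, actual_score) tuples."""
    n = len(entries)
    in_range = sum(1 for lo, hi, s in entries if lo <= s <= hi)
    midpoints = [(lo + hi) / 2 for lo, hi, _ in entries]
    actuals = [s for _, _, s in entries]
    # Densest 2-point band: slide a [b, b+2] window over the 0-10 scale.
    cluster = max(sum(1 for s in actuals if b <= s <= b + 2) for b in range(9))
    return {
        "range_accuracy": in_range / n,   # target > 0.8
        "kendall_tau": kendall_tau(midpoints, actuals),  # target > 0.8
        "spread": pstdev(actuals),        # target > 2.0
        "cluster_pct": cluster / n,       # target < 0.5
    }
```

The same `kendall_tau` helper serves the cross-model check: rank each model's scores and compute pairwise τ between models.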

## Commands (Click CLI)

```bash
python -m intel.people.gym run                    # Run all assessors on corpus
python -m intel.people.gym run --assessor academic # Run single assessor
python -m intel.people.gym run --model opus        # Override model
python -m intel.people.gym report                  # Show calibration report
python -m intel.people.gym cross-model --models gemini-flash,sonnet,haiku
python -m intel.people.gym harvest SLUG            # Capture live pipeline output as fixture
```
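The CLI surface above maps onto a Click group; a minimal `__main__.py` sketch (command bodies are placeholders — the real ones would delegate to `AssessorGym` and `report.py`):

```python
import click

@click.group()
def cli():
    """Assessor calibration gym."""

@cli.command()
@click.option("--assessor", default=None, help="Run a single assessor")
@click.option("--model", default=None, help="Override the scoring model")
def run(assessor, model):
    # Placeholder: the real command would run the gym over the corpus.
    click.echo(f"run assessor={assessor} model={model}")

@cli.command()
def report():
    click.echo("report")

@cli.command(name="cross-model")
@click.option("--models", required=True, help="Comma-separated model list")
def cross_model(models):
    click.echo(f"models={models.split(',')}")

if __name__ == "__main__":
    cli()
```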

## Architecture

```
intel/people/gym/
├── __init__.py
├── __main__.py      # click CLI
├── gym.py           # AssessorGym class
├── corpus.jsonl     # Ground truth corpus
├── results/         # Cached run results (gitignored)
└── report.py        # Calibration report rendering

The gym bypasses the job framework entirely and directly imports the scoring functions:
- `_assess_academic` (and future assessors), fed crafted context dicts
- `call_llm_json` for the core score stage prompt
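A sketch of the `AssessorGym` core loop, assuming each assessor is a callable taking a context dict and returning a `{"score": int, ...}` result (that shape is an assumption about the assessor interface, not confirmed by this doc):

```python
class AssessorGym:
    """Runs scoring functions directly against cached corpus data."""

    def __init__(self, assessors):
        # assessors: name -> callable(context_dict) -> {"score": int, ...}
        self.assessors = assessors

    def run(self, corpus):
        """Score every corpus entry with every registered assessor."""
        results = {}
        for entry in corpus:
            context = {
                "enrichment": entry["enrichment"],
                "research": entry["research"],
            }
            results[entry["id"]] = {
                name: fn(context)["score"] for name, fn in self.assessors.items()
            }
        return results
```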

## Integration with Prompt Iteration

The workflow:
1. Run gym → see calibration report
2. Tweak assessor prompt (rubric, examples, anchoring)
3. Re-run gym → see if calibration improved
4. Iteration stays cheap: only LLM calls, no web search

This matches the research→analyze split philosophy: the corpus provides cached "research" data, and the gym only exercises the cheap scoring step.

## `harvest` Command

Captures real pipeline output as corpus fixtures:

```bash
# After running person_intel on a real person:
python -m intel.people.gym harvest yann-lecun --expected-academic 9-10 --expected-prior 7-9
```

Reads `intel/people/data/{slug}/enrichment.json` plus the jobs results for the research stage, and packages them into a corpus entry with the given expected ranges.
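The packaging step can be sketched as follows (the `research` field from jobs results is elided here, since its exact retrieval isn't specified; `parse_range` converts the CLI's `9-10` option syntax into the corpus's `[9, 10]` form):

```python
import json
from pathlib import Path

def parse_range(spec):
    """'9-10' -> [9, 10]; a bare '7' -> [7, 7]."""
    lo, _, hi = spec.partition("-")
    return [int(lo), int(hi or lo)]

def harvest(slug, expected, data_root="intel/people/data"):
    """Package live pipeline output into a corpus entry (research elided)."""
    enrichment = json.loads(Path(data_root, slug, "enrichment.json").read_text())
    return {
        "id": slug,
        "name": enrichment.get("name", slug),
        "enrichment": enrichment,
        "expected": {dim: parse_range(spec) for dim, spec in expected.items()},
    }
```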
