First run — 2026-03-09 · Model: gemini-flash · 15 corpus entries (8 real, 7 synthetic)
The person_intel job scores people on 4 pre-chosen dimensions using LLM prompts. The gym doesn't question whether these are the right dimensions — it checks whether the scoring prompts produce well-calibrated numbers for the dimensions we've committed to:
| Dimension | Prompt Location | What It Scores |
|---|---|---|
| prior_success | _stage_score | Exits, company outcomes, leadership trajectory |
| network_quality | _stage_score | VC connections, operator network, ecosystem centrality |
| technical_depth | _stage_score | Patents, papers, open source, technical roles |
| academic | _assess_academic | h-index, publications, citations, institution pedigree |
These prompts were written once and never tested against known examples. Without calibration data, we had no idea whether "Jensen Huang = 3 on academic" or "Jensen Huang = 7 on academic." The scoring rubrics existed but the LLM's interpretation of them was unchecked.
15 people (8 real, 7 synthetic) spanning the full 0–10 range. Real people (Hinton, LeCun, Karpathy, Jensen Huang, etc.) have accurate enrichment data drawn from public sources. Synthetic people (fictional names, fabricated-but-plausible profiles) fill the lower tiers where no famous person would naturally land. Each entry has hand-built `enrichmentData` (gathered from 12+ APIs: Apollo, Semantic Scholar, USPTO Patents, SEC EDGAR, Wikipedia, arXiv, OpenAlex, DBLP, PubMed, etc.) and research data (LLM web-search-grounded evidence: career history, exits, network signals, technical contributions, education), plus expected score ranges per dimension:
| Tier | People | Example Expected (Academic) |
|---|---|---|
| 9–10 | Geoffrey Hinton, Yann LeCun, Fei-Fei Li, Daphne Koller | [9, 10] |
| 7–8 | Andrej Karpathy, Dr. Priya Sharma (synthetic) | [6, 8] |
| 5–6 | Dr. Sarah Chen (synthetic ML PhD) | [5, 7] |
| 2–4 | Jensen Huang, Satya Nadella, Mike Rodriguez (synthetic CTO) | [2, 4] |
| 0–2 | Alex Park, Taylor Kim, Karen O'Brien, David Okonkwo (all synthetic) | [0, 2] |
Real people have accurate data (Hinton's h-index of 168, LeCun's 25 patents, Koller's Coursera IPO at $5.9B). Synthetic people have fabricated but plausible profiles.
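A corpus entry might be shaped roughly like this — a hypothetical sketch: only `enrichmentData` and the expected-range idea come from the writeup; `slug`, `synthetic`, `researchData`, and `expected` are assumed field names, and the values shown are taken from this document.

```python
# Hypothetical shape of one corpus entry. Only enrichmentData and the
# expected score ranges are documented; other field names are assumptions.
entry = {
    "slug": "geoffrey-hinton",
    "synthetic": False,                 # real person: data from public sources
    "enrichmentData": {
        "semantic_scholar": {"h_index": 168},
        "wikipedia": {"summary": "..."},
    },
    "researchData": {
        "career_history": ["University of Toronto", "Google Brain"],
    },
    "expected": {                       # ground-truth range per dimension
        "academic": [9, 10],
    },
}

print(entry["expected"]["academic"])
```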
The gym directly calls the assessor functions from jobs/handlers/person_intel.py with corpus data, bypassing the jobs framework, web search, and API enrichment. This means each iteration exercises only the LLM scoring prompt — nothing else — at ~$0.001 per call.
```bash
# Run all assessors + core scorer on the corpus
python -m intel.people.gym run

# Run just the academic assessor
python -m intel.people.gym run --assessor academic

# Show cached calibration report
python -m intel.people.gym report

# Test with multiple models
python -m intel.people.gym cross-model --models gemini-flash,haiku,sonnet

# Capture real pipeline output as a new corpus entry
python -m intel.people.gym harvest yann-lecun --expected-academic 9-10
```
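Conceptually, the gym's inner loop is just a direct function call over cached data. A minimal sketch — field names and the `run_assessor` helper are my own assumptions, with a stub standing in for the real LLM-backed assessor:

```python
def run_assessor(assessor_fn, corpus, dimension):
    """Score each cached corpus entry and collect expected-range violations."""
    violations = []
    for entry in corpus:
        score = assessor_fn(entry["enrichmentData"])  # the one LLM call (~$0.001)
        lo, hi = entry["expected"][dimension]
        if not (lo <= score <= hi):
            violations.append((entry["name"], score, (lo, hi)))
    return violations

# Stub standing in for the real prompt call; it "hallucinates" a high
# academic score for a weak profile, like the Karen O'Brien failure.
def stub_assessor(data):
    return 10 if data.get("h_index", 0) > 100 else 9

corpus = [
    {"name": "Geoffrey Hinton", "enrichmentData": {"h_index": 168},
     "expected": {"academic": (9, 10)}},
    {"name": "Karen O'Brien", "enrichmentData": {"education": "Boston College"},
     "expected": {"academic": (0, 1)}},
]

print(run_assessor(stub_assessor, corpus, "academic"))
# -> [("Karen O'Brien", 9, (0, 1))]
```

Because nothing here touches the jobs framework, web search, or enrichment APIs, a failing entry can be re-scored as fast as the LLM responds.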
Four metrics per dimension, computed from corpus results:
| Metric | What It Measures | Target |
|---|---|---|
| Range Accuracy | % of corpus entries whose score falls within the expected [min, max] range set by ground truth | >80% |
| Kendall τ | Rank correlation (−1 to +1) with the ground-truth ordering: does the LLM rank people in the same relative order? 1.0 = perfect agreement | >0.8 |
| Spread (σ) | Standard deviation — does it use the full 0–10 scale? | >2.0 |
| Cluster % | % of scores in the densest 2-point band (lower = better spread) | <50% |
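All four metrics are computable in a few lines. The sketch below uses my own implementations, not the gym's code — in particular, the τ here is the tau-a variant with no tie correction, so it may differ slightly from the gym's reported numbers:

```python
from statistics import pstdev

def range_accuracy(scores, expected):
    """Fraction of scores inside their expected [min, max] range."""
    return sum(lo <= s <= hi for s, (lo, hi) in zip(scores, expected)) / len(scores)

def kendall_tau(xs, ys):
    """Kendall rank correlation, tau-a variant (ties ignored)."""
    pairs = [(i, j) for i in range(len(xs)) for j in range(i + 1, len(xs))]
    c = sum((xs[i] - xs[j]) * (ys[i] - ys[j]) > 0 for i, j in pairs)
    d = sum((xs[i] - xs[j]) * (ys[i] - ys[j]) < 0 for i, j in pairs)
    return (c - d) / len(pairs)

def cluster_pct(scores):
    """Share of scores in the densest 2-point band of the 0-10 scale."""
    return max(sum(lo <= s <= lo + 2 for s in scores) for lo in range(9)) / len(scores)

scores   = [10, 8, 6, 4, 2]                           # actual LLM scores
truth    = [9.5, 7, 6, 3, 1]                          # midpoints of expected ranges
expected = [(9, 10), (6, 8), (5, 7), (2, 4), (0, 2)]

print(range_accuracy(scores, expected))   # 1.0
print(kendall_tau(scores, truth))         # 1.0 (perfect ordering)
print(round(pstdev(scores), 2))           # 2.83 -- spread (sigma)
print(cluster_pct(scores))                # 0.4
```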
First run of the academic assessor (before any prompt changes):
5 violations out of 15 entries. Two were catastrophic:
**Karen O'Brien (synthetic, expected [0, 1]).** The LLM was told she had one academic signal: "Education: Boston College." It scored her 9/10.
Root cause: The LLM ignored the provided signal and used its training data to find a real Karen O'Brien — an actual climate change researcher at the University of Oslo who co-won the Nobel Peace Prize.
**David Okonkwo (synthetic, expected [0, 2]).** Given one signal — "Education: Duke University" — the LLM scored him 10/10.
Root cause: Same bug. The real David Okonkwo is a prominent neurosurgeon at the University of Pittsburgh.
The prompt said "Score this person's academic prowess based on these signals" but never said "only" based on these signals. The LLM treated the signals as hints and supplemented with its own knowledge — which happened to match a different person with the same name.
Added two constraints to the academic assessor prompt in jobs/handlers/person_intel.py:463:
```text
# Before:
Score this person's academic prowess from 0-10 based on these signals:

# After:
Score this person's academic prowess from 0-10 based ONLY on the signals below.

IMPORTANT: Use ONLY the data provided. Do NOT use your own knowledge about this
person or anyone with a similar name. If the signals show limited academic
evidence, the score should be low regardless of what you may know about people
with this name.
```
Also tightened the bottom of the rubric from "0-2: No academic background" to "0-2: No academic background (no PhD, no publications, no patents, just bachelor's degree)" to anchor the low end more concretely.
| Metric | Run 1 | Run 2 | Target | Status |
|---|---|---|---|---|
| Range Accuracy | 67% | 80% | >80% | PASS |
| Ordering (τ) | 0.647 | 0.977 | >0.8 | PASS |
| Spread (σ) | 2.92 | 3.24 | >2.0 | PASS |
| Cluster % | 53% | 53% | <50% | WARN |
Karen O'Brien: 9 → 2. David Okonkwo: 10 → 3. The LLM now scores based on provided signals only.
Per-person detail after the fix:
| Person | Expected | Run 1 | Run 2 | In Range (Run 2)? |
|---|---|---|---|---|
| Geoffrey Hinton | [9, 10] | 10 | 10 | Y |
| Yann LeCun | [9, 10] | 10 | 10 | Y |
| Fei-Fei Li | [9, 10] | 10 | 10 | Y |
| Daphne Koller | [8, 10] | 10 | 10 | Y |
| Andrej Karpathy | [6, 8] | 8 | 8 | Y |
| Dr. Priya Sharma | [6, 8] | 8 | 8 | Y |
| Dr. Sarah Chen | [5, 7] | 6 | 6 | Y |
| Jensen Huang | [2, 4] | 4 | 4 | Y |
| Satya Nadella | [2, 4] | 4 | 3 | Y |
| Mike Rodriguez | [2, 4] | 4 | 4 | Y |
| Patrick Collison | [1, 3] | 4 | 3 | Y |
| Alex Park | [0, 2] | 3 | 2 | Y |
| David Okonkwo | [0, 2] | 10 | 3 | N |
| Taylor Kim | [0, 1] | 3 | 3 | N |
| Karen O'Brien | [0, 1] | 9 | 2 | N |
3 remaining violations are minor (1–2 points above expected max). The catastrophic name-hallucination errors are gone.
Also ran the core 3-dimension scorer (_stage_score) on the same corpus. This scores prior_success, network_quality, and technical_depth from research data:
| Dimension | Range Accuracy | Ordering (τ) | Spread (σ) | Cluster % | Assessment |
|---|---|---|---|---|---|
| prior_success | 93% | 1.000 | 3.25 | 53% | Best calibrated. Perfect ordering with only 1 violation (Sales VP, 1 point over). |
| network_quality | 73% | 1.000 | 2.95 | 53% | Perfect ordering but slight inflation. Fei-Fei Li and Koller both scored 10 (expected 8–9). |
| technical_depth | 60% | 0.925 | 3.00 | 73% | Systematic inflation. 6 violations, all scores 1–2 points above expected. Prompt needs tightening. |
The pattern is consistent: ordering is excellent across all dimensions (τ ≥ 0.925 everywhere). The LLM gets relative ranking right. The problem is absolute calibration — it's too generous, especially on technical_depth where 73% of scores cluster in a 2-point band near the top of the scale.
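One way to quantify that generosity — my own sketch, not the gym's reporting code — is the mean overshoot above the expected ceiling, which isolates inflation from ordering errors:

```python
def mean_overshoot(scores, expected):
    """Average points scored above the expected max (0 when in or under range)."""
    overs = [max(0, s - hi) for s, (_, hi) in zip(scores, expected)]
    return sum(overs) / len(overs)

# Illustrative numbers in the spirit of the technical_depth finding:
# violations all sit 1-2 points above the expected ceiling.
print(mean_overshoot([5, 4, 8], [(2, 4), (0, 2), (6, 8)]))  # (1 + 2 + 0) / 3 = 1.0
```

A dimension with high τ but non-zero overshoot needs rubric tightening, not reordering — which is exactly the technical_depth situation.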
| Person | Overall | Prior Success | Network | Technical |
|---|---|---|---|---|
| Jensen Huang | 10.0 | 10 | 10 | 10 |
| Patrick Collison | 9.7 | 10 | 10 | 9 |
| Daphne Koller | 9.6 | 10 | 10 | 9 |
| Geoffrey Hinton | 9.6 | 9 | 10 | 10 |
| Yann LeCun | 9.6 | 9 | 10 | 10 |
| Satya Nadella | 9.4 | 10 | 10 | 8 |
| Fei-Fei Li | 9.2 | 8 | 10 | 10 |
| Andrej Karpathy | 9.2 | 9 | 9 | 10 |
| Mike Rodriguez | 6.9 | 6 | 7 | 8 |
| Dr. Sarah Chen | 5.1 | 3 | 5 | 8 |
| Dr. Priya Sharma | 5.1 | 4 | 5 | 7 |
| Karen O'Brien | 4.2 | 6 | 5 | 1 |
| David Okonkwo | 3.9 | 3 | 4 | 5 |
| Alex Park | 3.2 | 2 | 3 | 5 |
| Taylor Kim | 1.9 | 1 | 3 | 2 |
| What | Cost | Time |
|---|---|---|
| One gym run (15 entries × 1 assessor) | ~$0.01 | ~40s |
| One gym run (15 entries × 1 scorer, 3 dimensions) | ~$0.01 | ~90s |
| Full calibration (scorer + all assessors) | ~$0.02 | ~2 min |
| Cross-model run (3 models) | ~$0.06 | ~6 min |
This matches the research→analyze split philosophy: the corpus provides cached "research" data, and the gym only exercises the cheap scoring step. Prompt iteration becomes a tight loop: edit prompt → run gym → check report → repeat. No web search, no API calls, no job framework overhead.
| Task | Why |
|---|---|
| Tighten technical_depth rubric | 60% accuracy, 73% cluster. Same "ONLY use provided data" fix + stronger low-end anchoring should help. |
| Add engineering, publication, media assessors | Each is an `@assessor("name")` function + prompt. Gym calibrates them automatically once added. |
| Cross-model validation | Run `cross-model --models gemini-flash,haiku,sonnet` to check whether ordering holds across models. |
| Harvest real pipeline output | `gym harvest SLUG --expected-academic 9-10` captures real enrichment data as corpus entries, replacing synthetic fixtures over time. |
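Adding a new assessor, per the decorator pattern named above, might look like the following — a hypothetical sketch in which the registry, decorator internals, and stub scoring are my assumptions; the real prompt plumbing lives in the jobs code:

```python
# Minimal registry sketch: a decorator that maps dimension names to
# scoring functions, so the gym can discover and calibrate them.
ASSESSORS = {}

def assessor(name):
    """Register a scoring function under a dimension name."""
    def register(fn):
        ASSESSORS[name] = fn
        return fn
    return register

@assessor("engineering")
def assess_engineering(signals):
    # The real version would format `signals` into a scoring prompt and
    # call the LLM; a fixed stub keeps the registration mechanics testable.
    return 5

print(ASSESSORS["engineering"]({"github_repos": 12}))  # 5
```

Once registered, a new dimension would ride the same corpus, metrics, and report machinery with no extra wiring.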