Assessor Calibration Gym

First run — 2026-03-09 · Model: gemini-flash · 15 corpus entries (8 real, 7 synthetic)

TL;DR This gym tunes scoring prompts for a given set of dimensions — it assumes you already know what dimensions to score on. Identifying the right dimensions to assess comes first (and is a separate problem). Here we calibrate how well the LLM applies rubrics it's been given.

First run caught a name-hallucination bug where the LLM ignored provided data and scored based on real people who happened to share a name with synthetic corpus entries (a Sales VP scored 9/10 on academic prowess because the LLM found a real Nobel-winning climate researcher named Karen O'Brien). One prompt fix later: academic accuracy went from 67% to 80%, ordering from 0.647 to 0.977. Total iteration cost: ~$0.02.

1. The Problem

The person_intel job scores people on 4 pre-chosen dimensions using LLM prompts. The gym doesn't question whether these are the right dimensions — it checks whether the scoring prompts produce well-calibrated numbers for the dimensions we've committed to:

| Dimension | Prompt Location | What It Scores |
|---|---|---|
| prior_success | `_stage_score` | Exits, company outcomes, leadership trajectory |
| network_quality | `_stage_score` | VC connections, operator network, ecosystem centrality |
| technical_depth | `_stage_score` | Patents, papers, open source, technical roles |
| academic | `_assess_academic` | h-index, publications, citations, institution pedigree |

These prompts were written once and never tested against known examples. Without calibration data, we had no idea whether "Jensen Huang = 3 on academic" or "Jensen Huang = 7 on academic." The scoring rubrics existed but the LLM's interpretation of them was unchecked.

2. What Was Built

Ground Truth Corpus

15 people — 8 real, 7 synthetic — spanning the full 0–10 range. Real people (Hinton, LeCun, Karpathy, Jensen Huang, etc.) have accurate enrichment data drawn from public sources. Synthetic people (fictional names, fabricated-but-plausible profiles) fill the lower tiers where no famous person would naturally land. Each entry has hand-built enrichment data (gathered from 12+ APIs: Apollo, Semantic Scholar, USPTO Patents, SEC EDGAR, Wikipedia, arXiv, OpenAlex, DBLP, PubMed, etc.) and research data (LLM web-search-grounded evidence: career history, exits, network signals, technical contributions, education), plus expected score ranges per dimension:

| Tier | People | Example Expected (Academic) |
|---|---|---|
| 9–10 | Geoffrey Hinton, Yann LeCun, Fei-Fei Li, Daphne Koller | [9, 10] |
| 7–8 | Andrej Karpathy, Dr. Priya Sharma (synthetic) | [6, 8] |
| 5–6 | Dr. Sarah Chen (synthetic ML PhD) | [5, 7] |
| 2–4 | Jensen Huang, Satya Nadella, Mike Rodriguez (synthetic CTO) | [2, 4] |
| 0–2 | Alex Park, Taylor Kim, Karen O'Brien, David Okonkwo (all synthetic) | [0, 2] |

Real entries are anchored to verifiable facts (Hinton's h-index of 168, LeCun's 25 patents, Koller's Coursera IPO at $5.9B); synthetic entries are fabricated but plausible.
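For concreteness, a single corpus entry might be shaped like this. The field names here are illustrative, not the gym's actual schema; only Hinton's h-index figure comes from the text above:

```python
# Illustrative corpus entry; field names are hypothetical, not the gym's
# real schema. Enrichment holds API-derived facts, research holds
# LLM web-search evidence, expected holds ground-truth score ranges.
corpus_entry = {
    "slug": "geoffrey-hinton",
    "synthetic": False,
    "enrichment": {        # hand-built from the 12+ APIs
        "h_index": 168,
        "affiliations": ["University of Toronto"],
    },
    "research": {          # LLM web-search-grounded evidence
        "career_history": ["Backprop pioneer", "Turing Award 2018"],
    },
    "expected": {          # ground-truth score ranges per dimension
        "academic": [9, 10],
    },
}
```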

Gym Architecture

The gym directly calls the assessor functions from jobs/handlers/person_intel.py with corpus data, bypassing the jobs framework, web search, and API enrichment. This means each iteration exercises only the LLM scoring prompt — nothing else — at ~$0.001 per call.

# Run all assessors + core scorer on the corpus
python -m intel.people.gym run

# Run just the academic assessor
python -m intel.people.gym run --assessor academic

# Show cached calibration report
python -m intel.people.gym report

# Test with multiple models
python -m intel.people.gym cross-model --models gemini-flash,haiku,sonnet

# Capture real pipeline output as a new corpus entry
python -m intel.people.gym harvest yann-lecun --expected-academic 9-10
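Under the hood, a run is conceptually just the following loop — a minimal sketch, assuming the corpus is a list of dicts and the scorer is injected as a callable (the real assessors live in jobs/handlers/person_intel.py and call the LLM):

```python
def run_gym(corpus, assess):
    """Score every cached corpus entry with one assessor and collect
    range violations. No web search, no API calls, no job framework:
    each iteration exercises only the scoring step.

    corpus: list of {"name": str, "enrichment": dict, "expected": (lo, hi)}
    assess: callable(enrichment_dict) -> int score in 0-10
    """
    violations = []
    for entry in corpus:
        score = assess(entry["enrichment"])
        lo, hi = entry["expected"]
        if not lo <= score <= hi:
            violations.append((entry["name"], entry["expected"], score))
    return violations
```

Because `assess` is just a callable, swapping models or prompt versions means swapping the callable — the corpus and the violation check never change.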

Calibration Metrics

Four metrics per dimension, computed from corpus results:

| Metric | What It Measures | Target |
|---|---|---|
| Range Accuracy | % of people scored within their expected [min, max] range | >80% |
| Kendall τ | Does the LLM get the relative ordering right? | >0.8 |
| Spread (σ) | Standard deviation: does it use the full 0–10 scale? | >2.0 |
| Cluster % | % of scores in the densest 2-point band (lower = better spread) | <50% |
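All four metrics can be computed with nothing but the standard library. A minimal sketch, assuming each entry carries an expected range and an actual score (the τ here skips tie corrections, so it is slightly simpler than a full Kendall τ-b):

```python
from statistics import pstdev

def calibration_metrics(entries):
    """Compute the four gym metrics.

    entries: list of {"expected": (lo, hi), "score": int}
    Simplified sketch: tau is computed only over pairs whose expected
    midpoints differ, with no tie correction.
    """
    scores = [e["score"] for e in entries]
    n = len(entries)

    # Range accuracy: fraction of scores inside the expected [lo, hi] band
    in_range = sum(e["expected"][0] <= e["score"] <= e["expected"][1]
                   for e in entries)
    range_accuracy = in_range / n

    # Kendall tau vs. expected-range midpoints
    mids = [(lo + hi) / 2 for lo, hi in (e["expected"] for e in entries)]
    concordant = discordant = comparable = 0
    for i in range(n):
        for j in range(i + 1, n):
            if mids[i] == mids[j]:
                continue  # skip pairs with tied expectations
            comparable += 1
            agree = (mids[i] - mids[j]) * (scores[i] - scores[j])
            if agree > 0:
                concordant += 1
            elif agree < 0:
                discordant += 1
    tau = (concordant - discordant) / comparable if comparable else 0.0

    # Spread: population standard deviation of the scores
    spread = pstdev(scores)

    # Cluster %: share of scores in the densest 2-point band [b, b+2]
    cluster = max(sum(b <= s <= b + 2 for s in scores)
                  for b in range(9)) / n

    return {"range_accuracy": range_accuracy, "tau": tau,
            "spread": spread, "cluster_pct": cluster}
```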

3. Run 1: The Hallucination Bug

First run of the academic assessor (before any prompt changes):

| Range Accuracy | Kendall τ | Spread (σ) | Cluster % |
|---|---|---|---|
| 67% | 0.647 | 2.92 | 53% |

5 violations out of 15 entries. Two were catastrophic:

Karen O'Brien (Sales VP) — Expected: [0, 1] — Actual: 9

The LLM was told she had one academic signal: "Education: Boston College." It scored her 9/10 regardless.

Root cause: The LLM ignored the provided signal and used its training data to find a real Karen O'Brien — an actual climate change researcher at the University of Oslo who co-won the Nobel Peace Prize.

David Okonkwo (Solo Founder) — Expected: [0, 2] — Actual: 10

Given one signal: "Education: Duke University." The LLM scored him 10/10.

Root cause: Same bug. The real David Okonkwo is a prominent neurosurgeon at the University of Pittsburgh.

The prompt said "Score this person's academic prowess based on these signals" but never said "only" based on these signals. The LLM treated the signals as hints and supplemented with its own knowledge — which happened to match a different person with the same name.

4. The Fix

Added two constraints to the academic assessor prompt in jobs/handlers/person_intel.py:463:

# Before:
Score this person's academic prowess from 0-10 based on these signals:

# After:
Score this person's academic prowess from 0-10 based ONLY on the signals below.

IMPORTANT: Use ONLY the data provided. Do NOT use your own knowledge
about this person or anyone with a similar name. If the signals show
limited academic evidence, the score should be low regardless of what
you may know about people with this name.

Also tightened the bottom of the rubric from "0-2: No academic background" to "0-2: No academic background (no PhD, no publications, no patents, just bachelor's degree)" to anchor the low end more concretely.
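Wired into a prompt builder, the fix might look like this (hypothetical helper; the actual prompt lives at jobs/handlers/person_intel.py:463):

```python
def build_academic_prompt(signals):
    """Assemble the constrained academic prompt. The ONLY-clause is what
    stops the model from supplementing the provided signals with its
    training-data knowledge of same-named people."""
    header = (
        "Score this person's academic prowess from 0-10 based ONLY on the "
        "signals below.\n\n"
        "IMPORTANT: Use ONLY the data provided. Do NOT use your own knowledge "
        "about this person or anyone with a similar name. If the signals show "
        "limited academic evidence, the score should be low regardless of what "
        "you may know about people with this name.\n\n"
        "Signals:\n"
    )
    return header + "\n".join(f"- {s}" for s in signals)
```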

5. Run 2: After the Fix

| Metric | Run 1 | Run 2 | Target | Status |
|---|---|---|---|---|
| Range Accuracy | 67% | 80% | >80% | PASS |
| Ordering (τ) | 0.647 | 0.977 | >0.8 | PASS |
| Spread (σ) | 2.92 | 3.24 | >2.0 | PASS |
| Cluster % | 53% | 53% | <50% | WARN |
Hallucination eliminated

Karen O'Brien: 9 → 2. David Okonkwo: 10 → 3. The LLM now scores based on provided signals only.

Per-person detail after the fix:

| Person | Expected | Run 1 | Run 2 | In Range? |
|---|---|---|---|---|
| Geoffrey Hinton | [9, 10] | 10 | 10 | Y |
| Yann LeCun | [9, 10] | 10 | 10 | Y |
| Fei-Fei Li | [9, 10] | 10 | 10 | Y |
| Daphne Koller | [8, 10] | 10 | 10 | Y |
| Andrej Karpathy | [6, 8] | 8 | 8 | Y |
| Dr. Priya Sharma | [6, 8] | 8 | 8 | Y |
| Dr. Sarah Chen | [5, 7] | 6 | 6 | Y |
| Jensen Huang | [2, 4] | 4 | 4 | Y |
| Satya Nadella | [2, 4] | 4 | 3 | Y |
| Mike Rodriguez | [2, 4] | 4 | 4 | Y |
| Patrick Collison | [1, 3] | 4 | 3 | Y |
| Alex Park | [0, 2] | 3 | 2 | Y |
| David Okonkwo | [0, 2] | 10 | 3 | N |
| Taylor Kim | [0, 1] | 3 | 3 | N |
| Karen O'Brien | [0, 1] | 9 | 2 | N |

The 3 remaining violations are minor (1–2 points above the expected max). The catastrophic name-hallucination errors are gone.

6. Core Scorer Results

Also ran the core 3-dimension scorer (_stage_score) on the same corpus. This scores prior_success, network_quality, and technical_depth from research data:

| Dimension | Range Accuracy | Ordering (τ) | Spread (σ) | Cluster % | Assessment |
|---|---|---|---|---|---|
| prior_success | 93% | 1.000 | 3.25 | 53% | Best calibrated. Perfect ordering with only 1 violation (Sales VP, 1 point over). |
| network_quality | 73% | 1.000 | 2.95 | 53% | Perfect ordering but slight inflation. Fei-Fei Li and Koller both scored 10 (expected 8–9). |
| technical_depth | 60% | 0.925 | 3.00 | 73% | Systematic inflation. 6 violations, all scores 1–2 points above expected. Prompt needs tightening. |

The pattern is consistent: ordering is excellent across all dimensions (τ ≥ 0.925 everywhere). The LLM gets relative ranking right. The problem is absolute calibration — it's too generous, especially on technical_depth where 73% of scores cluster in a 2-point band near the top of the scale.

Full core scorer detail (15 people × 3 dimensions):

| Person | Overall | Prior Success | Network | Technical |
|---|---|---|---|---|
| Jensen Huang | 10.0 | 10 | 10 | 10 |
| Patrick Collison | 9.7 | 10 | 10 | 9 |
| Daphne Koller | 9.6 | 10 | 10 | 9 |
| Geoffrey Hinton | 9.6 | 9 | 10 | 10 |
| Yann LeCun | 9.6 | 9 | 10 | 10 |
| Satya Nadella | 9.4 | 10 | 10 | 8 |
| Fei-Fei Li | 9.2 | 8 | 10 | 10 |
| Andrej Karpathy | 9.2 | 9 | 9 | 10 |
| Mike Rodriguez | 6.9 | 6 | 7 | 8 |
| Dr. Sarah Chen | 5.1 | 3 | 5 | 8 |
| Dr. Priya Sharma | 5.1 | 4 | 5 | 7 |
| Karen O'Brien | 4.2 | 6 | 5 | 1 |
| David Okonkwo | 3.9 | 3 | 4 | 5 |
| Alex Park | 3.2 | 2 | 3 | 5 |
| Taylor Kim | 1.9 | 1 | 3 | 2 |

7. Cost & Iteration Speed

| What | Cost | Time |
|---|---|---|
| One gym run (15 entries × 1 assessor) | ~$0.01 | ~40s |
| One gym run (15 entries × 1 scorer, 3 dimensions) | ~$0.01 | ~90s |
| Full calibration (scorer + all assessors) | ~$0.02 | ~2 min |
| Cross-model run (3 models) | ~$0.06 | ~6 min |

This matches the research→analyze split philosophy: the corpus provides cached "research" data, and the gym only exercises the cheap scoring step. Prompt iteration becomes a tight loop: edit prompt → run gym → check report → repeat. No web search, no API calls, no job framework overhead.

8. What's Next

| Task | Why |
|---|---|
| Tighten technical_depth rubric | 60% accuracy, 73% cluster. Same "ONLY use provided data" fix + stronger low-end anchoring should help. |
| Add engineering, publication, media assessors | Each is an `@assessor("name")` function + prompt. Gym calibrates them automatically once added. |
| Cross-model validation | Run `cross-model --models gemini-flash,haiku,sonnet` to check if ordering holds across models. |
| Harvest real pipeline output | `gym harvest SLUG --expected-academic 9-10` captures real enrichment data as corpus entries, replacing synthetic fixtures over time. |
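The table above says a new dimension is just an `@assessor("name")` function plus a prompt. One plausible registry shape (hypothetical sketch, not the project's actual code):

```python
# Hypothetical assessor registry: decorating a function under a dimension
# name lets the gym discover and calibrate it automatically.
ASSESSORS = {}

def assessor(name):
    """Register a dimension's scoring function under `name`."""
    def register(fn):
        ASSESSORS[name] = fn
        return fn
    return register

@assessor("academic")
def assess_academic(signals):
    # Real version: render the rubric prompt, call the LLM, parse a 0-10
    # score. Stubbed here since the LLM call is out of scope.
    raise NotImplementedError
```

The gym can then iterate `ASSESSORS.items()` and run calibration on every registered dimension without any per-assessor wiring.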

Glossary

Enrichment
Structured data gathered from 12+ free APIs: Apollo (professional profile), Semantic Scholar (publications, h-index), USPTO (patents), SEC EDGAR (filings), Wikipedia, arXiv, OpenAlex, DBLP, PubMed, Wikidata. Saved as enrichment.json per person.
Research
LLM web-search-grounded evidence gathering. An LLM searches the web for a person and returns structured JSON: career history, exits, network signals, technical contributions, education. Costs ~$0.05/person due to web search. Cached in the jobs results table.
Range Accuracy
Percentage of corpus entries where the LLM's actual score falls within the expected [min, max] range set by ground truth. Example: if expected is [9, 10] and actual is 10, that's in-range. If actual is 8, that's a violation.
Kendall τ (tau)
Rank correlation coefficient from -1 to +1. Measures whether the LLM ranks people in the same relative order as ground truth, regardless of absolute scores. τ = 1.0 means perfect agreement on ordering. More important than range accuracy — if the ordering is right, absolute calibration can be fixed by adjusting the rubric.