Assessor Calibration Gym

First run — 2026-03-09 · Model: gemini-flash · 15 corpus entries (8 real, 7 synthetic)

TL;DR This gym tunes scoring prompts for a given set of dimensions — it assumes you already know what dimensions to score on. Identifying the right dimensions to assess comes first (and is a separate problem). Here we calibrate how well the LLM applies rubrics it's been given.

First run caught a name-hallucination bug where the LLM ignored provided data and scored based on real people who happened to share a name with synthetic corpus entries (a Sales VP scored 9/10 on academic prowess because the LLM found a real Nobel-winning climate researcher named Karen O'Brien). One prompt fix later: academic accuracy went from 67% to 80%, ordering from 0.647 to 0.977. Total iteration cost: ~$0.02.

1. The Problem

The person_intel job scores people on 4 pre-chosen dimensions using LLM prompts. The gym doesn't question whether these are the right dimensions — it checks whether the scoring prompts produce well-calibrated numbers for the dimensions we've committed to:

| Dimension | Prompt Location | What It Scores |
|---|---|---|
| prior_success | `_stage_score` | Exits, company outcomes, leadership trajectory |
| network_quality | `_stage_score` | VC connections, operator network, ecosystem centrality |
| technical_depth | `_stage_score` | Patents, papers, open source, technical roles |
| academic | `_assess_academic` | h-index, publications, citations, institution pedigree |

These prompts were written once and never tested against known examples. Without calibration data, we had no idea whether "Jensen Huang = 3 on academic" or "Jensen Huang = 7 on academic." The scoring rubrics existed but the LLM's interpretation of them was unchecked.

2. What Was Built

Ground Truth Corpus

15 people — 8 real, 7 synthetic — spanning the full 0–10 range. Real people (Hinton, LeCun, Karpathy, Jensen Huang, etc.) have accurate enrichment data drawn from public sources. Synthetic people (fictional names, fabricated-but-plausible profiles) fill the lower tiers where no famous person would naturally land. Each entry has hand-built enrichment data (gathered from 12+ APIs: Apollo, Semantic Scholar, USPTO Patents, SEC EDGAR, Wikipedia, arXiv, OpenAlex, DBLP, PubMed, etc.) and research data (LLM web-search-grounded evidence: career history, exits, network signals, technical contributions, education), plus expected score ranges per dimension:

| Tier | People | Example Expected (Academic) |
|---|---|---|
| 9–10 | Geoffrey Hinton, Yann LeCun, Fei-Fei Li, Daphne Koller | [9, 10] |
| 7–8 | Andrej Karpathy, Dr. Priya Sharma (synthetic) | [6, 8] |
| 5–6 | Dr. Sarah Chen (synthetic ML PhD) | [5, 7] |
| 2–4 | Jensen Huang, Satya Nadella, Mike Rodriguez (synthetic CTO) | [2, 4] |
| 0–2 | Alex Park, Taylor Kim, Karen O'Brien, David Okonkwo (all synthetic) | [0, 2] |

Real entries are anchored to verifiable facts (Hinton's h-index of 168, LeCun's 25 patents, Koller's Coursera IPO at $5.9B); synthetic entries are fabricated but plausible.
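For concreteness, a single corpus entry might be shaped like this. The field names here are illustrative, not the gym's actual schema; only Hinton's h-index figure comes from the text above:

```python
# Illustrative corpus entry; field names are hypothetical, not the gym's
# real schema. Enrichment holds API-derived facts, research holds
# LLM web-search evidence, expected holds ground-truth score ranges.
corpus_entry = {
    "slug": "geoffrey-hinton",
    "synthetic": False,
    "enrichment": {        # hand-built from the 12+ APIs
        "h_index": 168,
        "affiliations": ["University of Toronto"],
    },
    "research": {          # LLM web-search-grounded evidence
        "career_history": ["Backprop pioneer", "Turing Award 2018"],
    },
    "expected": {          # ground-truth score ranges per dimension
        "academic": [9, 10],
    },
}
```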

Gym Architecture

The gym directly calls the assessor functions from jobs/handlers/person_intel.py with corpus data, bypassing the jobs framework, web search, and API enrichment. This means each iteration exercises only the LLM scoring prompt — nothing else — at ~$0.001 per call.

# Run all assessors + core scorer on the corpus
python -m intel.people.gym run

# Run just the academic assessor
python -m intel.people.gym run --assessor academic

# Show cached calibration report
python -m intel.people.gym report

# Test with multiple models
python -m intel.people.gym cross-model --models gemini-flash,haiku,sonnet

# Capture real pipeline output as a new corpus entry
python -m intel.people.gym harvest yann-lecun --expected-academic 9-10
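Under the hood, a run is conceptually just the following loop — a minimal sketch, assuming the corpus is a list of dicts and the scorer is injected as a callable (the real assessors live in jobs/handlers/person_intel.py and call the LLM):

```python
def run_gym(corpus, assess):
    """Score every cached corpus entry with one assessor and collect
    range violations. No web search, no API calls, no job framework:
    each iteration exercises only the scoring step.

    corpus: list of {"name": str, "enrichment": dict, "expected": (lo, hi)}
    assess: callable(enrichment_dict) -> int score in 0-10
    """
    violations = []
    for entry in corpus:
        score = assess(entry["enrichment"])
        lo, hi = entry["expected"]
        if not lo <= score <= hi:
            violations.append((entry["name"], entry["expected"], score))
    return violations
```

Because `assess` is just a callable, swapping models or prompt versions means swapping the callable — the corpus and the violation check never change.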

Calibration Metrics

Four metrics per dimension, computed from corpus results:

| Metric | What It Measures | Target |
|---|---|---|
| Range Accuracy | % of people scored within their expected [min, max] range | >80% |
| Kendall τ | Does the LLM get the relative ordering right? | >0.8 |
| Spread (σ) | Standard deviation: does it use the full 0–10 scale? | >2.0 |
| Cluster % | % of scores in the densest 2-point band (lower = better spread) | <50% |
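All four metrics can be computed with nothing but the standard library. A minimal sketch, assuming each entry carries an expected range and an actual score (the τ here skips tie corrections, so it is slightly simpler than a full Kendall τ-b):

```python
from statistics import pstdev

def calibration_metrics(entries):
    """Compute the four gym metrics.

    entries: list of {"expected": (lo, hi), "score": int}
    Simplified sketch: tau is computed only over pairs whose expected
    midpoints differ, with no tie correction.
    """
    scores = [e["score"] for e in entries]
    n = len(entries)

    # Range accuracy: fraction of scores inside the expected [lo, hi] band
    in_range = sum(e["expected"][0] <= e["score"] <= e["expected"][1]
                   for e in entries)
    range_accuracy = in_range / n

    # Kendall tau vs. expected-range midpoints
    mids = [(lo + hi) / 2 for lo, hi in (e["expected"] for e in entries)]
    concordant = discordant = comparable = 0
    for i in range(n):
        for j in range(i + 1, n):
            if mids[i] == mids[j]:
                continue  # skip pairs with tied expectations
            comparable += 1
            agree = (mids[i] - mids[j]) * (scores[i] - scores[j])
            if agree > 0:
                concordant += 1
            elif agree < 0:
                discordant += 1
    tau = (concordant - discordant) / comparable if comparable else 0.0

    # Spread: population standard deviation of the scores
    spread = pstdev(scores)

    # Cluster %: share of scores in the densest 2-point band [b, b+2]
    cluster = max(sum(b <= s <= b + 2 for s in scores)
                  for b in range(9)) / n

    return {"range_accuracy": range_accuracy, "tau": tau,
            "spread": spread, "cluster_pct": cluster}
```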

3. Run 1: The Hallucination Bug

First run of the academic assessor (before any prompt changes):

| Range Accuracy | Kendall τ | Spread (σ) | Cluster % |
|---|---|---|---|
| 67% | 0.647 | 2.92 | 53% |

5 violations out of 15 entries. Two were catastrophic:

Karen O'Brien (Sales VP) — Expected: [0, 1] — Actual: 9

The LLM was told she had one academic signal: "Education: Boston College." It scored her 9/10 regardless.

Root cause: The LLM ignored the provided signal and used its training data to find a real Karen O'Brien — an actual climate change researcher at the University of Oslo who co-won the Nobel Peace Prize.

David Okonkwo (Solo Founder) — Expected: [0, 2] — Actual: 10

Given one signal: "Education: Duke University." The LLM scored him 10/10.

Root cause: Same bug. The real David Okonkwo is a prominent neurosurgeon at the University of Pittsburgh.

The prompt said "Score this person's academic prowess based on these signals" but never said "only" based on these signals. The LLM treated the signals as hints and supplemented with its own knowledge — which happened to match a different person with the same name.

4. The Fix

Added two constraints to the academic assessor prompt in jobs/handlers/person_intel.py:463:

# Before:
Score this person's academic prowess from 0-10 based on these signals:

# After:
Score this person's academic prowess from 0-10 based ONLY on the signals below.

IMPORTANT: Use ONLY the data provided. Do NOT use your own knowledge
about this person or anyone with a similar name. If the signals show
limited academic evidence, the score should be low regardless of what
you may know about people with this name.

Also tightened the bottom of the rubric from "0-2: No academic background" to "0-2: No academic background (no PhD, no publications, no patents, just bachelor's degree)" to anchor the low end more concretely.
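Wired into a prompt builder, the fix might look like this (hypothetical helper; the actual prompt lives at jobs/handlers/person_intel.py:463):

```python
def build_academic_prompt(signals):
    """Assemble the constrained academic prompt. The ONLY-clause is what
    stops the model from supplementing the provided signals with its
    training-data knowledge of same-named people."""
    header = (
        "Score this person's academic prowess from 0-10 based ONLY on the "
        "signals below.\n\n"
        "IMPORTANT: Use ONLY the data provided. Do NOT use your own knowledge "
        "about this person or anyone with a similar name. If the signals show "
        "limited academic evidence, the score should be low regardless of what "
        "you may know about people with this name.\n\n"
        "Signals:\n"
    )
    return header + "\n".join(f"- {s}" for s in signals)
```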

5. Run 2: After the Fix

| Metric | Run 1 | Run 2 | Target | Status |
|---|---|---|---|---|
| Range Accuracy | 67% | 80% | >80% | PASS |
| Ordering (τ) | 0.647 | 0.977 | >0.8 | PASS |
| Spread (σ) | 2.92 | 3.24 | >2.0 | PASS |
| Cluster % | 53% | 53% | <50% | WARN |
Hallucination eliminated

Karen O'Brien: 9 → 2. David Okonkwo: 10 → 3. The LLM now scores based on provided signals only.

Per-person detail after the fix:

| Person | Expected | Run 1 | Run 2 | In Range? |
|---|---|---|---|---|
| Geoffrey Hinton | [9, 10] | 10 | 10 | Y |
| Yann LeCun | [9, 10] | 10 | 10 | Y |
| Fei-Fei Li | [9, 10] | 10 | 10 | Y |
| Daphne Koller | [8, 10] | 10 | 10 | Y |
| Andrej Karpathy | [6, 8] | 8 | 8 | Y |
| Dr. Priya Sharma | [6, 8] | 8 | 8 | Y |
| Dr. Sarah Chen | [5, 7] | 6 | 6 | Y |
| Jensen Huang | [2, 4] | 4 | 4 | Y |
| Satya Nadella | [2, 4] | 4 | 3 | Y |
| Mike Rodriguez | [2, 4] | 4 | 4 | Y |
| Patrick Collison | [1, 3] | 4 | 3 | Y |
| Alex Park | [0, 2] | 3 | 2 | Y |
| David Okonkwo | [0, 2] | 10 | 3 | N |
| Taylor Kim | [0, 1] | 3 | 3 | N |
| Karen O'Brien | [0, 1] | 9 | 2 | N |

The 3 remaining violations are minor (1–2 points above the expected max). The catastrophic name-hallucination errors are gone.

6. Core Scorer Results

Also ran the core 3-dimension scorer (_stage_score) on the same corpus. This scores prior_success, network_quality, and technical_depth from research data:

| Dimension | Range Accuracy | Ordering (τ) | Spread (σ) | Cluster % | Assessment |
|---|---|---|---|---|---|
| prior_success | 93% | 1.000 | 3.25 | 53% | Best calibrated. Perfect ordering with only 1 violation (Sales VP, 1 point over). |
| network_quality | 73% | 1.000 | 2.95 | 53% | Perfect ordering but slight inflation. Fei-Fei Li and Koller both scored 10 (expected 8–9). |
| technical_depth | 60% | 0.925 | 3.00 | 73% | Systematic inflation. 6 violations, all scores 1–2 points above expected. Prompt needs tightening. |

The pattern is consistent: ordering is excellent across all dimensions (τ ≥ 0.925 everywhere). The LLM gets relative ranking right. The problem is absolute calibration — it's too generous, especially on technical_depth where 73% of scores cluster in a 2-point band near the top of the scale.

Full core scorer detail (15 people × 3 dimensions):

| Person | Overall | Prior Success | Network | Technical |
|---|---|---|---|---|
| Jensen Huang | 10.0 | 10 | 10 | 10 |
| Patrick Collison | 9.7 | 10 | 10 | 9 |
| Daphne Koller | 9.6 | 10 | 10 | 9 |
| Geoffrey Hinton | 9.6 | 9 | 10 | 10 |
| Yann LeCun | 9.6 | 9 | 10 | 10 |
| Satya Nadella | 9.4 | 10 | 10 | 8 |
| Fei-Fei Li | 9.2 | 8 | 10 | 10 |
| Andrej Karpathy | 9.2 | 9 | 9 | 10 |
| Mike Rodriguez | 6.9 | 6 | 7 | 8 |
| Dr. Sarah Chen | 5.1 | 3 | 5 | 8 |
| Dr. Priya Sharma | 5.1 | 4 | 5 | 7 |
| Karen O'Brien | 4.2 | 6 | 5 | 1 |
| David Okonkwo | 3.9 | 3 | 4 | 5 |
| Alex Park | 3.2 | 2 | 3 | 5 |
| Taylor Kim | 1.9 | 1 | 3 | 2 |

7. Cost & Iteration Speed

| What | Cost | Time |
|---|---|---|
| One gym run (15 entries × 1 assessor) | ~$0.01 | ~40s |
| One gym run (15 entries × 1 scorer, 3 dimensions) | ~$0.01 | ~90s |
| Full calibration (scorer + all assessors) | ~$0.02 | ~2 min |
| Cross-model run (3 models) | ~$0.06 | ~6 min |

This matches the research→analyze split philosophy: the corpus provides cached "research" data, and the gym only exercises the cheap scoring step. Prompt iteration becomes a tight loop: edit prompt → run gym → check report → repeat. No web search, no API calls, no job framework overhead.

8. What's Next

| Task | Why |
|---|---|
| Tighten technical_depth rubric | 60% accuracy, 73% cluster. Same "ONLY use provided data" fix + stronger low-end anchoring should help. |
| Add engineering, publication, media assessors | Each is an `@assessor("name")` function + prompt. Gym calibrates them automatically once added. |
| Cross-model validation | Run `cross-model --models gemini-flash,haiku,sonnet` to check if ordering holds across models. |
| Harvest real pipeline output | `gym harvest SLUG --expected-academic 9-10` captures real enrichment data as corpus entries, replacing synthetic fixtures over time. |
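The table above says a new dimension is just an `@assessor("name")` function plus a prompt. One plausible registry shape (hypothetical sketch, not the project's actual code):

```python
# Hypothetical assessor registry: decorating a function under a dimension
# name lets the gym discover and calibrate it automatically.
ASSESSORS = {}

def assessor(name):
    """Register a dimension's scoring function under `name`."""
    def register(fn):
        ASSESSORS[name] = fn
        return fn
    return register

@assessor("academic")
def assess_academic(signals):
    # Real version: render the rubric prompt, call the LLM, parse a 0-10
    # score. Stubbed here since the LLM call is out of scope.
    raise NotImplementedError
```

The gym can then iterate `ASSESSORS.items()` and run calibration on every registered dimension without any per-assessor wiring.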

Glossary

Enrichment
Structured data gathered from 12+ free APIs: Apollo (professional profile), Semantic Scholar (publications, h-index), USPTO (patents), SEC EDGAR (filings), Wikipedia, arXiv, OpenAlex, DBLP, PubMed, Wikidata. Saved as enrichment.json per person.
Research
LLM web-search-grounded evidence gathering. An LLM searches the web for a person and returns structured JSON: career history, exits, network signals, technical contributions, education. Costs ~$0.05/person due to web search. Cached in the jobs results table.
Range Accuracy
Percentage of corpus entries where the LLM's actual score falls within the expected [min, max] range set by ground truth. Example: if expected is [9, 10] and actual is 10, that's in-range. If actual is 8, that's a violation.
Kendall τ (tau)
Rank correlation coefficient from -1 to +1. Measures whether the LLM ranks people in the same relative order as ground truth, regardless of absolute scores. τ = 1.0 means perfect agreement on ordering. More important than range accuracy — if the ordering is right, absolute calibration can be fixed by adjusting the rubric.