# Person Intel Scoring Dimension Redesign

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Replace the 3-dimension scoring system (prior_success, network_quality, technical_depth) + separate academic assessor with a unified 6-dimension model grounded in VC research findings. Each dimension gets both a score (0-10) and an evidence_strength rating (0-3) so downstream consumers can distinguish "low score" from "no data."

**Architecture:** The score stage in `jobs/handlers/person_intel.py` gets a new prompt and schema. The assess stage's academic assessor is dissolved — its useful signals fold into `technical_breadth` and `network_position`. The `person_scores` table gets a schema migration. The gym corpus gets updated expected ranges for all 6 dimensions. Each dimension carries an `evidence_strength` (0-3: none/inferred/partial/strong) that the LLM self-reports based on available data.

**Key files:**
- `jobs/handlers/person_intel.py` — scoring prompt, weights, schema, save logic
- `intel/people/gym/gym.py` — gym runner (imports scoring constants)
- `intel/people/gym/corpus.jsonl` — 15-entry ground truth
- `intel/people/data/people.db` — `person_scores` table

**Research basis:** Lazear (2005) on breadth > depth, Gompers et al. on founder-market fit, Kaplan/Sensoy/Stromberg on team complementarity, Ewens/Nanda on early-stage predictors.

---

## Dimension Map: Old to New

| #  | New Dimension               | Weight | Origin                    | What Changed                                                            |
|----|-----------------------------|--------|---------------------------|-------------------------------------------------------------------------|
| 1  | `founder_market_fit`        | 25%    | NEW                       | Relational dimension: person x market. See Task 7 for market resolution |
| 2  | `prior_operational_evidence`| 20%    | prior_success (evolved)   | Granular outcomes, not binary. Scope managed. Zero-to-one experience    |
| 3  | `team_quality`              | 20%    | NEW                       | Co-founder complementarity, prior working relationships, hire quality   |
| 4  | `network_position`          | 15%    | network_quality (evolved) | Ecosystem centrality, talent flow position, not just name-dropping      |
| 5  | `technical_breadth`         | 10%    | technical_depth (evolved) | Breadth > depth (Lazear). Absorbs academic signals. Sector-dependent    |
| 6  | `leadership_magnetism`      | 10%    | NEW                       | Team growth track record, talent attraction, communication skill        |

**Dropped:** `academic_prowess` assessor — useful signals redistributed to `technical_breadth` (papers, patents, h-index) and `network_position` (academic network centrality).

### Evidence Strength (per-dimension confidence signal)

Every dimension score is paired with an `evidence_strength` rating (0-3) that the LLM self-reports. This disambiguates "low score because weak candidate" from "low score because no data available."

| Level | Label      | Meaning                                                       | Example                                      |
|-------|------------|---------------------------------------------------------------|----------------------------------------------|
| 0     | `none`     | No relevant data — score is a default/guess                   | team_quality for someone with zero web presence |
| 1     | `inferred` | Score based on indirect signals only (job titles, org size)   | network_position from employer prestige alone |
| 2     | `partial`  | Some direct evidence, one source or a few data points         | technical_breadth from GitHub + one patent    |
| 3     | `strong`   | Multiple corroborating sources, concrete quantitative data    | prior_operational_evidence with exit amounts + revenue |

**Why this matters:**
- **Downstream trust**: A `team_quality: 3, evidence_strength: 0` means "we don't know" — very different from `team_quality: 3, evidence_strength: 3` ("bad team, well-documented")
- **Data collection prioritization**: Dimensions with low average evidence_strength across the corpus indicate where the pipeline needs better enrichment sources
- **Gym calibration**: The gym can measure whether the model correctly self-reports confidence — flag cases where evidence_strength is high but the score misses the expected range (bad rubric) vs evidence_strength is low (data gap)
- **Score weighting**: Downstream consumers can discount dimensions with evidence_strength < 2 when computing overall scores
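
The discounting described in the last bullet can be sketched as follows. The 0.5 factor and the `evidence_strength < 2` threshold are illustrative choices, not part of the spec; the weights match Task 2.

```python
# Sketch of downstream evidence-aware scoring. The discount factor (0.5)
# and threshold (evidence_strength < 2) are illustrative, not specified.
DIMENSION_WEIGHTS = {
    "founder_market_fit": 0.25,
    "prior_operational_evidence": 0.20,
    "team_quality": 0.20,
    "network_position": 0.15,
    "technical_breadth": 0.10,
    "leadership_magnetism": 0.10,
}

def discounted_overall(dimensions: dict, *, min_strength: int = 2,
                       discount: float = 0.5) -> float:
    """Weighted mean of dimension scores, down-weighting weak-evidence ones."""
    total = weight_sum = 0.0
    for dim, weight in DIMENSION_WEIGHTS.items():
        d = dimensions[dim]
        w = weight * (discount if d["evidence_strength"] < min_strength else 1.0)
        total += d["score"] * w
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

Renormalizing by `weight_sum` keeps the result on the same 0-10 scale even when several dimensions are discounted.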

---

## Task 1: Schema migration for person_scores table

**Files:**
- Modify: `jobs/handlers/person_intel.py` — `_SCHEMA_DDL`, `_save_score()`

**Step 1: Design the migration**

The `person_scores` table currently has columns: `prior_success`, `network_quality`, `technical_depth`. The new schema needs 6 dimension columns plus `dimension_version` to distinguish old vs new scores.

New DDL:

```sql
CREATE TABLE IF NOT EXISTS person_scores (
    person_slug       TEXT PRIMARY KEY,
    overall           REAL NOT NULL,
    -- v2 dimension scores (0-10)
    founder_market_fit        REAL,
    prior_operational_evidence REAL,
    team_quality              REAL,
    network_position          REAL,
    technical_breadth         REAL,
    leadership_magnetism      REAL,
    -- v2 evidence strength per dimension (0-3: none/inferred/partial/strong)
    es_founder_market_fit        INTEGER,
    es_prior_operational_evidence INTEGER,
    es_team_quality              INTEGER,
    es_network_position          INTEGER,
    es_technical_breadth         INTEGER,
    es_leadership_magnetism      INTEGER,
    -- v1 dimensions (kept for backward compat, NULLed on new scores)
    prior_success     REAL,
    network_quality   REAL,
    technical_depth   REAL,
    -- metadata
    evidence_json     TEXT,
    model             TEXT,
    dimension_version INTEGER DEFAULT 2,
    scored_at         TEXT DEFAULT (datetime('now')),
    handler_version   TEXT
);
```

**Step 2: Write ALTER TABLE migration**

SQLite's `ALTER TABLE` support is limited (`DROP COLUMN` only landed in 3.35, with restrictions), so use an additive migration:

```sql
-- dimension scores
ALTER TABLE person_scores ADD COLUMN founder_market_fit REAL;
ALTER TABLE person_scores ADD COLUMN prior_operational_evidence REAL;
ALTER TABLE person_scores ADD COLUMN team_quality REAL;
ALTER TABLE person_scores ADD COLUMN network_position REAL;
ALTER TABLE person_scores ADD COLUMN technical_breadth REAL;
ALTER TABLE person_scores ADD COLUMN leadership_magnetism REAL;
-- evidence strength (0-3: none/inferred/partial/strong)
ALTER TABLE person_scores ADD COLUMN es_founder_market_fit INTEGER;
ALTER TABLE person_scores ADD COLUMN es_prior_operational_evidence INTEGER;
ALTER TABLE person_scores ADD COLUMN es_team_quality INTEGER;
ALTER TABLE person_scores ADD COLUMN es_network_position INTEGER;
ALTER TABLE person_scores ADD COLUMN es_technical_breadth INTEGER;
ALTER TABLE person_scores ADD COLUMN es_leadership_magnetism INTEGER;
-- version tracking
ALTER TABLE person_scores ADD COLUMN dimension_version INTEGER DEFAULT 1;
```

Wrap each in try/except (idempotent — "duplicate column" is fine). Set `dimension_version=1` for existing rows, `dimension_version=2` for new scores.
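
A minimal sketch of that idempotent wrapper; `migrate_person_scores` is a hypothetical helper name and the column list is abbreviated (the real migration adds all 13 columns above):

```python
import sqlite3

# Abbreviated — the real migration adds all 13 columns from the DDL above.
_V2_COLUMNS = [
    ("founder_market_fit", "REAL"),
    ("es_founder_market_fit", "INTEGER"),
    ("dimension_version", "INTEGER DEFAULT 1"),
]

def migrate_person_scores(conn: sqlite3.Connection) -> None:
    """Additive, idempotent migration: 'duplicate column' is expected on re-runs."""
    for name, coltype in _V2_COLUMNS:
        try:
            conn.execute(f"ALTER TABLE person_scores ADD COLUMN {name} {coltype}")
        except sqlite3.OperationalError as exc:
            if "duplicate column" not in str(exc).lower():
                raise  # a real error, don't swallow it
    conn.commit()
```

Note that a constant `DEFAULT` on `ADD COLUMN` backfills existing rows in SQLite, so pre-migration rows read back as `dimension_version=1` without a separate `UPDATE`.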

**Step 3: Update `_save_score()` to write new columns**

The function should detect which dimension set is present in the result dict and write accordingly. New scores set `dimension_version=2` and NULL the v1 columns.
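
A sketch of that branch, abbreviated to two of the six v2 dimensions; the real `_save_score()` also writes `evidence_json`, `model`, and timestamps:

```python
import sqlite3

def save_score_row(conn: sqlite3.Connection, slug: str, result: dict) -> None:
    """Write a score row, branching on the dimension set present in `result`.
    Sketch: only two of the six v2 dimensions shown, to keep it short."""
    if result.get("dimension_version") == 2:
        dims = result["dimensions"]
        conn.execute(
            """INSERT OR REPLACE INTO person_scores
               (person_slug, overall,
                founder_market_fit, es_founder_market_fit,
                team_quality, es_team_quality,
                dimension_version, prior_success, network_quality, technical_depth)
               VALUES (?, ?, ?, ?, ?, ?, 2, NULL, NULL, NULL)""",
            (slug, result["overall"],
             dims["founder_market_fit"]["score"],
             dims["founder_market_fit"]["evidence_strength"],
             dims["team_quality"]["score"],
             dims["team_quality"]["evidence_strength"]),
        )
    else:  # legacy v1 result dict — keep writing the old columns
        conn.execute(
            "INSERT OR REPLACE INTO person_scores"
            " (person_slug, overall, dimension_version) VALUES (?, ?, 1)",
            (slug, result["overall"]),
        )
```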

**Step 4: Test migration is idempotent**

Run `_ensure_schema()` twice. Verify existing v1 scores still readable. Verify new v2 scores write correctly.

**Step 5: Commit**

```bash
git commit -m "feat(person_intel): schema migration for 6-dimension scoring model v2"
```

---

## Task 2: Parallel per-dimension scoring architecture + rubrics

**Architecture decision (from review + user feedback):** Do NOT put all 6 dimensions in one prompt. Instead, use `cached_content` for the shared research data and fire 6 parallel per-dimension scoring calls. Each call has one focused rubric, one return schema, and can be calibrated independently.

**Why parallel per-dimension beats single wide prompt:**
- **Prompt quality**: Each prompt is short and focused — no rubric interference or context dilution
- **Independent calibration**: Tune each dimension's prompt in the gym without affecting others
- **Cost efficiency**: `cached_content` means research data is sent once and cached server-side; 6 small scoring prompts are cheaper than 1 mega-prompt with 6 rubrics
- **Parallelism**: All 6 calls fire concurrently via `asyncio.gather` — same wall time as 1 call
- **Granular retry**: If one dimension fails, retry just that one
- **Independent model choice**: Could use different models per dimension (e.g., cheaper model for `technical_breadth` which has clearer signals)

**Files:**
- Modify: `jobs/handlers/person_intel.py` — `_stage_score()`, dimension prompts, `DIMENSION_WEIGHTS`

**Step 1: Define weights**

```python
DIMENSION_WEIGHTS = {
    "founder_market_fit": 0.25,
    "prior_operational_evidence": 0.20,
    "team_quality": 0.20,
    "network_position": 0.15,
    "technical_breadth": 0.10,
    "leadership_magnetism": 0.10,
}
```

**Step 2: Write per-dimension rubric prompts**

Each dimension gets its own system prompt + rubric. Shared elements:
- "Use ONLY the provided data" instruction (prevents name-hallucination)
- Calibration anchors specific to the dimension
- evidence_strength rubric (shared across all)

**evidence_strength rubric** (included in every dimension prompt):
- **0 (none)**: No data available for this dimension. Score is a default.
- **1 (inferred)**: Score based on indirect signals only (job titles, org prestige, timing).
- **2 (partial)**: Some direct evidence from one source or a few data points.
- **3 (strong)**: Multiple corroborating sources with concrete, quantitative data.

**Overconfident flag** (from review): If the LLM assigns evidence_strength=3 but the evidence list has < 3 concrete items, downstream validation should flag it. Add a gym check for this.
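
That validation can be sketched as a small gym helper; the function name and the threshold default are illustrative:

```python
from typing import Optional

def flag_overconfident(dim: str, result: dict, *, min_items: int = 3) -> Optional[str]:
    """Return a warning when evidence_strength=3 isn't backed by enough
    concrete evidence items; None when the claim looks consistent."""
    items = result.get("evidence", [])
    if result.get("evidence_strength") == 3 and len(items) < min_items:
        return f"{dim}: evidence_strength=3 but only {len(items)} evidence item(s)"
    return None
```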

Per-dimension return schema (same for all):

```json
{
  "score": 0-10,
  "evidence_strength": 0-3,
  "evidence": ["specific evidence item 1", "..."],
  "notes": "optional dimension-specific context"
}
```

Plus dimension-specific extra fields:
- `founder_market_fit`: `"inferred_market"`, `"market_resolution"` (explicit/inferred/agnostic)
- `team_quality`: `"co_founders_identified": [...]`
- `technical_breadth`: `"domains": [...]`

**Rubrics** (one per dimension):

```
prior_operational_evidence (weight: 0.20)
Evaluates the depth and quality of operational track record.
- 0-2: No operational leadership. Individual contributor roles only.
- 3-4: Team lead / manager at established company. No founding experience.
- 5-6: One founding attempt OR significant scope at large co (P&L, 100+ reports, $100M+ budget).
- 7-8: Successful exit (acquisition >$50M or IPO) OR scaled a company past Series B. Zero-to-one capability.
- 9-10: Multiple successful exits with increasing scale. Built and sold billion-dollar companies.
Anchors: Jensen Huang=10, Patrick Collison=9, typical ML PhD=2

network_position (weight: 0.15)
Evaluates where the person sits in talent and capital flow networks.
- 0-2: Isolated — no visible connections to VCs, operators, or ecosystem nodes.
- 3-4: Connected within one organization or community. Alumni network only.
- 5-6: Known in regional ecosystem. Some VC connections. Conference speaker.
- 7-8: Central in a specific domain network. Multiple tier-1 VC relationships.
- 9-10: Ecosystem kingmaker. Multiple major firm relationships. Talent and capital flow through them.
Anchors: Geoffrey Hinton=10, Jensen Huang=10

technical_breadth (weight: 0.10)
Evaluates range of technical competence. Breadth > depth for founders (Lazear 2005).
- 0-2: Non-technical background. No engineering/science education.
- 3-4: Single-domain technical competence. One specialization.
- 5-6: Multi-domain technical skills OR deep expertise with some breadth.
- 7-8: Cross-functional technical leadership. Patents or papers showing range.
- 9-10: Polymathic technical leader. Contributions spanning multiple fields.
Anchors: Geoffrey Hinton=8 (deep but narrow), Daphne Koller=9 (PGMs + biology + edtech)

founder_market_fit (weight: 0.25)
Evaluates alignment between the person's experience and their target market.
IMPORTANT: This is relational. If target market unknown, infer from current company,
recent roles, and stated focus areas. Report market_resolution: explicit/inferred/agnostic.
- 0-2: No relevant industry experience. Entering market cold.
- 3-4: Tangential experience. Related industry but different function or segment.
- 5-6: 3-5 years in target industry. Relevant but not in the specific problem space.
- 7-8: 5-10 years in target space. Domain publications/patents. Built products in the space.
- 9-10: 10+ years as a practitioner/builder in the exact problem space. Known industry expert.
Anchors: Jensen Huang (chips)=10, Satya Nadella (cloud)=8

team_quality (weight: 0.20)
Evaluates the founding team as a unit, not just the individual.
If solo founder with no visible co-founders, score reflects that risk.
NOTE: Solo evaluation with no team data = low score is CORRECT signal, not missing data.
- 0-2: Solo founder, no visible team. No co-founder search signals.
- 3-4: Solo founder with some early hires OR co-founder with limited complementarity.
- 5-6: Co-founding team with some complementarity. Haven't worked together before.
- 7-8: Complementary co-founders who have worked together. Cover tech + business.
- 9-10: Dream team. Repeat co-founders with prior exit together. Deep complementarity.
Anchors: Patrick + John Collison=9, Daphne Koller + Andrew Ng (Coursera)=9

leadership_magnetism (weight: 0.10)
Evaluates ability to attract and retain talent and attention.
NOTE: Academic leaders count — research group quality, lab alumni success, academic influence
are valid evidence (not just startup team size).
- 0-2: No public presence. No evidence of team-building ability.
- 3-4: Some management experience. Small teams (< 10).
- 5-6: Managed 20+ people. Some evidence of quality hires. Modest public presence.
- 7-8: Track record of attracting strong talent. People follow them across companies.
- 9-10: Magnetic leader. Major talent follows them. Thousands-level audience.
Anchors: Andrej Karpathy=10 (YouTube, Tesla brand), Jensen Huang=10
```

**Step 3: Implement parallel scoring in `_stage_score()`**

```python
async def _stage_score(*, item_key: str, data: dict, job) -> dict:
    # Load cached research data
    research_data = get_result(conn, job.id, item_key, "research")
    research_json = json.dumps(research_data, indent=2, default=str)

    # Fire all 6 dimension calls in parallel with shared cached_content
    async def _score_dimension(dim: str) -> dict:
        # Research data travels via cached_content only, so each per-dimension
        # prompt stays small: rubric + instructions, no embedded person data
        result = await call_llm(
            model=MODEL,
            prompt=DIMENSION_PROMPTS[dim],
            system=DIMENSION_SYSTEMS[dim],
            cached_content=research_json,  # cached server-side, shared by all 6 calls
            temperature=0,
        )
        return json.loads(result.text)

    dim_results = await asyncio.gather(
        *[_score_dimension(dim) for dim in DIMENSION_WEIGHTS],
        return_exceptions=True,
    )

    # Assemble composite score
    scores = {}
    for dim, result in zip(DIMENSION_WEIGHTS, dim_results):
        if isinstance(result, Exception):
            logger.warning(f"dimension {dim} failed: {result}")
            scores[dim] = {"score": 0, "evidence_strength": 0, "evidence": [], "error": str(result)}
        else:
            scores[dim] = result

    # Weights sum to 1.0, but dividing by the sum stays correct if they change.
    # NOTE: a failed dimension contributes 0 at full weight, dragging overall down.
    overall = sum(
        scores[dim]["score"] * weight
        for dim, weight in DIMENSION_WEIGHTS.items()
    ) / sum(DIMENSION_WEIGHTS.values())

    return {"dimensions": scores, "overall": overall, "dimension_version": 2}
```

**Step 4: Add summary/synthesis call (optional, post-calibration)**

After all 6 dimensions are scored, an optional 7th call synthesizes `summary`, `strengths`, `risks` from the 6 results. This is cheap (no research data needed, just the 6 score dicts) and can be skipped during gym calibration.

**Step 5: Commit**

```bash
git commit -m "feat(person_intel): parallel per-dimension scoring with cached_content + all 6 rubrics"
```

---

## Task 4: Update gym corpus expected ranges

All 15 corpus entries need `expected` ranges for the 6 new dimensions. This is the most labor-intensive task — each person needs careful calibration.

**Files:**
- Modify: `intel/people/gym/corpus.jsonl`

**Step 1: Update expected ranges for all entries**

Below are the proposed expected ranges per person, per dimension. The key for each entry:
- `fmf` = founder_market_fit
- `poe` = prior_operational_evidence
- `tq` = team_quality
- `np` = network_position
- `tb` = technical_breadth
- `lm` = leadership_magnetism

| ID                  | fmf    | poe    | tq     | np     | tb     | lm     | Notes                                               |
|---------------------|--------|--------|--------|--------|--------|--------|-----------------------------------------------------|
| geoffrey-hinton     | [7,9]  | [7,9]  | [7,9]  | [9,10] | [7,9]  | [8,10] | FMF high for AI; team=Hinton lab quality             |
| yann-lecun          | [8,10] | [7,9]  | [7,9]  | [9,10] | [8,10] | [8,10] | Built FAIR; FMF=AI native; breadth across vision+NLP |
| fei-fei-li          | [8,10] | [6,8]  | [6,8]  | [8,9]  | [8,10] | [7,9]  | FMF=AI/vision; less operational evidence than founders|
| andrej-karpathy     | [8,10] | [7,8]  | [5,7]  | [8,10] | [7,9]  | [9,10] | Solo at Eureka; huge magnetism (YouTube, Tesla brand)|
| daphne-koller       | [8,10] | [8,10] | [8,10] | [8,9]  | [8,10] | [7,9]  | FMF=edtech+biotech; Coursera IPO; insitro team      |
| jensen-huang        | [9,10] | [9,10] | [8,10] | [9,10] | [5,7]  | [9,10] | FMF=chips; breadth moderate (EE focused); max magnet |
| patrick-collison    | [8,10] | [9,10] | [8,10] | [9,10] | [5,7]  | [8,10] | FMF=fintech; team=John (brother); breadth moderate   |
| satya-nadella       | [7,9]  | [9,10] | [8,10] | [9,10] | [4,6]  | [9,10] | FMF=cloud/enterprise; breadth limited (MBA not eng)  |
| typical-ml-phd      | [3,5]  | [2,3]  | [2,4]  | [3,5]  | [5,7]  | [1,3]  | FMF lowered: edge ML niche, no clear market yet      |
| startup-cto         | [7,9]  | [4,6]  | [5,7]  | [4,6]  | [6,8]  | [3,5]  | FMF=data infra; has co-founder; Uber→startup path    |
| mid-level-eng       | [2,4]  | [1,2]  | [1,2]  | [1,3]  | [3,5]  | [0,2]  | No clear market; no team; no magnetism               |
| junior-pm           | [1,3]  | [0,1]  | [0,2]  | [1,2]  | [0,1]  | [0,1]  | Too early for any dimension                          |
| sales-vp            | [5,7]  | [3,5]  | [3,5]  | [4,6]  | [0,1]  | [5,7]  | FMF=SaaS sales; manages 200; some magnetism          |
| solo-founder        | [6,8]  | [1,3]  | [1,3]  | [2,4]  | [3,5]  | [1,3]  | FMF=logistics (Flexport exp); solo; early            |
| research-scientist  | [6,8]  | [3,5]  | [3,5]  | [4,6]  | [7,8]  | [3,5]  | FMF=drug discovery; Genentech team; rising star      |

**Step 2: Rewrite corpus entries**

Replace the `expected` object in each JSONL line. Keep the old dimension keys alongside the new ones during transition (gym can test either set). New format:

```json
{
  "expected": {
    "founder_market_fit": [7, 9],
    "prior_operational_evidence": [7, 9],
    "team_quality": [7, 9],
    "network_position": [9, 10],
    "technical_breadth": [7, 9],
    "leadership_magnetism": [8, 10],
    "_v1": {
      "academic": [9, 10],
      "prior_success": [8, 10],
      "network_quality": [9, 10],
      "technical_depth": [9, 10]
    }
  }
}
```

The `_v1` sub-key preserves old ranges so the gym can validate both dimension sets during transition.

**Step 3: Add missing research data for new dimensions**

Some corpus entries lack data needed by the new dimensions:
- `team_quality` needs co-founder information. Add `"co_founders"` array to research data for entries that have them (Collison → John, Koller → Andrew Ng for Coursera, DataSync CTO → unnamed co-founder).
- `leadership_magnetism` needs team size / growth signals. Add `"team_signals"` to research data where known (Nadella → 200K employees, Karen O'Brien → 200-person org).
- `founder_market_fit` needs `"target_market"` field in research data. Add where inferable (Karpathy → AI education, Okonkwo → climate logistics, Rodriguez → real-time data infra).

**Step 4: Commit**

```bash
git commit -m "feat(gym): corpus v2 with 6-dimension expected ranges + enriched research data"
```

---

## Task 5: Update gym runner for 6 dimensions

**Files:**
- Modify: `intel/people/gym/gym.py` — `run_scorer()`, `compute_metrics()`

**Step 1: Update `run_scorer()` to extract 6 dimensions**

Replace the 3-dimension extraction with:

```python
DIMENSIONS_V2 = [
    "founder_market_fit", "prior_operational_evidence", "team_quality",
    "network_position", "technical_breadth", "leadership_magnetism",
]
```

The runner should import `DIMENSION_WEIGHTS` from `person_intel` (already does) and adapt to whichever set is active.

**Step 2: Update `compute_metrics()` dimension detection**

Line 191 currently hardcodes v1 dimension names:
```python
if dimension in ("prior_success", "network_quality", "technical_depth"):
```

Replace with dynamic detection from `DIMENSION_WEIGHTS.keys()`.

**Step 3: Add `--version` flag to CLI**

Allow `python -m intel.people.gym run --version v1` vs `--version v2` to test either dimension set against the corpus.
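
A minimal argparse sketch of the flag; the v1 names come from the current runner, and the exact CLI wiring inside `gym.py` may differ:

```python
import argparse

DIMENSIONS_V1 = ["prior_success", "network_quality", "technical_depth"]
DIMENSIONS_V2 = [
    "founder_market_fit", "prior_operational_evidence", "team_quality",
    "network_position", "technical_breadth", "leadership_magnetism",
]

def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="intel.people.gym")
    parser.add_argument("command", choices=["run"])
    parser.add_argument("--version", choices=["v1", "v2"], default="v2",
                        help="which dimension set to test against the corpus")
    args = parser.parse_args(argv)
    # Resolve the flag to the concrete dimension list the runner iterates over
    args.dimensions = DIMENSIONS_V1 if args.version == "v1" else DIMENSIONS_V2
    return args
```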

**Step 4: Commit**

```bash
git commit -m "feat(gym): support v2 6-dimension scoring in gym runner"
```

---

## Task 6: Dissolve the academic assessor

The `assess` stage's `@assessor("academic")` function is no longer needed. Its signals are absorbed into `technical_breadth` (papers, patents, h-index) and `network_position` (academic collaborator networks).

**Files:**
- Modify: `jobs/handlers/person_intel.py` — remove `_assess_academic()`, update `ASSESSORS` dict

**Step 1: Verify no other code depends on academic assessor output**

Search for references to `assessments.json`, `academic` score in:
- `intel/people/` — dashboard, reports, any UI
- `jobs/` — any downstream consumers

**Step 2: Remove `_assess_academic()` function and `@assessor("academic")` decorator**

The `ASSESSORS` dict becomes empty. The `_stage_assess()` function still exists as infrastructure for future assessors (e.g., engineering_depth, media_presence) but runs zero assessors.

**Step 3: Document what moved where**

Add a comment in the scoring prompt noting that academic signals (h-index, paper count, patents) are now inputs to `technical_breadth`, not a separate dimension.

**Step 4: Commit**

```bash
git commit -m "refactor(person_intel): dissolve academic assessor — signals absorbed into technical_breadth"
```

---

## Task 7: The founder_market_fit problem — market resolution strategy

This is the hardest design problem. `founder_market_fit` is relational: it measures person x market. But we don't always know what market the person is targeting.

**Files:**
- Modify: `jobs/handlers/person_intel.py` — research prompt, scoring prompt

**Three-tier market resolution:**

### Tier A: Explicit market (best case)
The work item's `data` dict contains a `target_market` field (e.g., from a VC deal flow where the startup's market is known). The scoring prompt uses this directly.

```python
target_market = data.get("target_market", "")
```

### Tier B: Inferred market (common case)
No explicit market, but the person's current company/role implies one. The research stage already collects career history. Add an instruction to the research prompt:

```
Also identify: What market/industry is this person most likely to start a company in?
Return as "inferred_target_market": "description" based on their career trajectory,
domain expertise, and stated interests.
```

The scoring prompt then uses this inferred market.

### Tier C: Market-agnostic fallback (worst case)
Neither explicit nor inferable. Score `founder_market_fit` as "domain expertise depth": how deep their expertise runs in ANY domain, regardless of whether we know the target market. Whichever tier applies, the output records it alongside the evidence. For example, an inferred-market case:
```json
{
  "founder_market_fit": {
    "score": 5,
    "evidence_strength": 1,
    "evidence": ["Strong logistics expertise from Flexport"],
    "market_resolution": "inferred",
    "inferred_market": "supply chain / logistics optimization"
  }
}
```

The `market_resolution` field (`explicit` | `inferred` | `agnostic`) allows downstream consumers to discount the score appropriately.
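
One way a consumer might apply that discount; the factors below are illustrative, not specified by this plan:

```python
# Illustrative discount factors per market resolution tier — tune downstream.
RESOLUTION_DISCOUNT = {"explicit": 1.0, "inferred": 0.8, "agnostic": 0.5}

def effective_fmf(fmf: dict) -> float:
    """founder_market_fit score after discounting by market resolution quality.
    Unknown/missing resolution is treated as agnostic (most conservative tier)."""
    factor = RESOLUTION_DISCOUNT.get(fmf.get("market_resolution"), 0.5)
    return fmf["score"] * factor
```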

**Implementation:**

1. Add `inferred_target_market` to the research prompt's return schema
2. Add `target_market` parameter to the scoring prompt (may be empty)
3. Add `market_resolution` to the scoring return schema
4. The gym corpus should include test cases for all 3 tiers:
   - Explicit: Karpathy (AI education), Okonkwo (climate logistics)
   - Inferred: Rodriguez (data infra), Sharma (drug discovery)
   - Agnostic: mid-level-eng, junior-pm (no clear market)

**Step 1: Update research prompt**

Add to `_RESEARCH_PROMPT` return schema:
```json
"inferred_target_market": "Based on career trajectory, the most likely market for a startup"
```

**Step 2: Update scoring prompt**

Add market context block before the founder_market_fit rubric:
```
Target market context:
{market_context}
(If no target market is provided, infer from career trajectory and score domain depth.)
```

**Step 3: Commit**

```bash
git commit -m "feat(person_intel): 3-tier market resolution for founder_market_fit dimension"
```

---

## Task 8: Calibration sequence

Run the gym iteratively to tune prompts. Order matters — start with dimensions that have the most ground truth data.

**Calibration order:**

| Phase | Dimension                    | Why first                                                          | Target metrics              |
|-------|------------------------------|--------------------------------------------------------------------|-----------------------------|
| 1     | prior_operational_evidence   | Cleanest evolution from prior_success. Most objective signals.     | range_accuracy >= 0.80      |
| 2     | network_position             | Evolution from network_quality. Good corpus coverage.              | range_accuracy >= 0.75      |
| 3     | technical_breadth            | Evolution from technical_depth + academic. Absorbs most signals.   | range_accuracy >= 0.75      |
| 4     | leadership_magnetism         | New but uses existing signals (team size, public presence).        | range_accuracy >= 0.70      |
| 5     | founder_market_fit           | Hardest — relational. Needs market resolution working first.       | range_accuracy >= 0.65      |
| 6     | team_quality                 | Least data in current enrichment. Most corpus entries lack co-founder data. | range_accuracy >= 0.60 |

**Per-phase process:**

1. Run gym: `python -m intel.people.gym run --version v2`
2. Check metrics: range_accuracy, Kendall tau, cluster %
3. If range_accuracy < target: examine violations, adjust rubric anchors or prompt wording
4. Re-run. Max 3 iterations per dimension before moving on.
5. After all 6 pass individually, run full gym and check overall weighted score accuracy.

**Acceptance criteria for go-live:**
- All 6 dimensions: range_accuracy >= 0.60
- At least 4 of 6: range_accuracy >= 0.75
- Kendall tau >= 0.50 for all dimensions (ordering is more important than absolute values)
- No single corpus entry has > 2 dimensions out of range
- Evidence strength calibration: for corpus entries with rich data (Hinton, Collison, Nadella), evidence_strength should be >= 2 on most dimensions. For sparse entries (junior-pm, mid-level-eng), evidence_strength should be <= 1 on most dimensions.
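
The last criterion can be checked mechanically. The entry groupings follow the corpus table in Task 4, and "most dimensions" is interpreted here as strictly more than half (an assumption):

```python
RICH_ENTRIES = {"geoffrey-hinton", "patrick-collison", "satya-nadella"}
SPARSE_ENTRIES = {"junior-pm", "mid-level-eng"}

def evidence_calibration_ok(entry_id: str, dims: dict) -> bool:
    """Rich entries should mostly report evidence_strength >= 2;
    sparse entries mostly <= 1. 'Mostly' = strictly more than half."""
    strengths = [d["evidence_strength"] for d in dims.values()]
    if entry_id in RICH_ENTRIES:
        hits = sum(s >= 2 for s in strengths)
    elif entry_id in SPARSE_ENTRIES:
        hits = sum(s <= 1 for s in strengths)
    else:
        return True  # no calibration expectation for this entry
    return hits > len(strengths) / 2
```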

---

## Task 9: Data availability audit

Which new dimensions can be scored from data the pipeline already collects?

| Dimension                    | Data sources available                                             | Gap?                                                              | Expected avg evidence_strength |
|------------------------------|-------------------------------------------------------------------|-------------------------------------------------------------------|-------------------------------|
| `founder_market_fit`         | career_history, education, patents (domain), papers (field)       | Need `inferred_target_market` from research stage (Task 7)        | 1-2 (inferred/partial)        |
| `prior_operational_evidence` | career_history, exits_and_outcomes, notable_facts                 | **Good.** Same data as prior_success, just scored differently.    | 2-3 (partial/strong)          |
| `team_quality`               | network_signals (some), notable_facts                             | **Major gap.** No co-founder data collected. No team roster.      | 0-1 (none/inferred)           |
| `network_position`           | network_signals, enrichment (Wikipedia, SEC board seats)          | **Moderate.** Need to extract board co-memberships, co-investment.| 1-2 (inferred/partial)        |
| `technical_breadth`          | technical_contributions, education, patents, papers, enrichment   | **Good.** Combines existing technical_depth + academic signals.   | 2-3 (partial/strong)          |
| `leadership_magnetism`       | career_history (team sizes implied), notable_facts                | **Moderate gap.** No explicit team size/growth data collected.    | 1 (inferred)                  |

**New data to collect (from V2 validation):**

1. **Co-founder information** — **No prompt change needed.** Already structured in `associates` field (filter where relationship contains "co-founder"). Add a post-processing function in `_stage_score()` that extracts co-founders from existing data.
2. **Target market inference** — Add to `DEEP_RESEARCH_PROMPT` YAML schema: `"inferred_target_market": {"sectors": [...], "reasoning": "...", "confidence": "high/medium/low"}`. The LLM already sees career data during web search — just needs the output slot.
3. **Team size signals** — Add to prompt schema but expect mostly null. No enrichment source currently collects headcount/hire data. Format: `"team_signals": {"largest_team": "~200 (context)", "notable_hires": [...], "confidence": "low"}`. **Defer until team-specific enrichment sources are identified.**
4. **Board positions** — Already partially in career_history but not tagged. Add: `"board_positions": [{"org", "role", "period"}]`

Items 1 and 2 are the highest-value changes. Item 3 is low ROI — most profiles will return null.
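
Item 1 can be sketched as a post-processing filter over existing research output; the associate entry shape (`name`, `relationship` keys) is an assumption about that output:

```python
def extract_co_founders(research: dict) -> list:
    """Post-processing filter over the already-collected associates list.
    Assumes each associate dict carries 'name' and 'relationship' keys."""
    return [
        a for a in research.get("associates", [])
        if "co-founder" in a.get("relationship", "").lower()
    ]
```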

**Step 1: Update `_RESEARCH_PROMPT` return schema**

Add the 4 new fields above to the JSON schema in the prompt.

**Step 2: The team_quality data problem**

Team quality is the most data-starved dimension. For people evaluated as individuals (not as part of a known startup), the research stage may find little about their team. Mitigations:
- Accept that solo evaluation will produce low team_quality scores (which is correct — it's a real signal)
- When item_data includes `co_founder_slugs`, look up those people's profiles and include their data in the scoring context
- For now, score based on whatever team signals the research stage finds. Don't over-engineer.

**Step 3: Commit**

```bash
git commit -m "feat(person_intel): expand research prompt for team, market, and leadership signals"
```

---

## Task 10: Version the handler and mark existing scores stale

**Files:**
- Modify: `jobs/handlers/person_intel.py` — add `HANDLER_VERSION`

**Step 1: Set handler version**

```python
HANDLER_VERSION = "2.0.0"  # 6-dimension scoring model
```

This triggers automatic staleness detection for the `score` stage on all existing items. They'll be reprocessed with the new prompt on the next runner cycle.

**Step 2: Decide on research stage staleness**

The research prompt also changed (new return fields for team, market, leadership). But research is expensive (~$0.05/item, web search). Options:

- **A: Force research re-run** — Set version on research stage too. Expensive but gets the new data fields.
- **B: Score with what you have** — Only force score stage re-run. New fields will be null. Accept lower accuracy on team_quality and founder_market_fit until items are naturally re-researched.
- **Recommendation: B for now.** Score with available data. Queue re-research for high-value items (overall > 5) as a separate batch.
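The "queue re-research for high-value items" batch under option B could be selected with a query like the one below. The `slug` and `overall` column names are assumptions about the `person_scores` table, not a confirmed schema:

```python
import sqlite3

# Sketch of the high-value batch selection for option B. Column names
# (slug, overall) are assumed, not taken from a confirmed schema.
def high_value_slugs(db_path: str, threshold: float = 5.0) -> list[str]:
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT slug FROM person_scores WHERE overall > ?",
            (threshold,),
        ).fetchall()
    return [slug for (slug,) in rows]
```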

**Step 3: Commit**

```bash
git commit -m "feat(person_intel): handler version 2.0.0 — triggers score stage re-run"
```

---

## Risks and Open Questions

### R1: Prompt length — RESOLVED
~~6 dimensions with full rubrics may push the scoring prompt past the sweet spot.~~

**Resolution:** Parallel per-dimension architecture (Task 2) eliminates this risk. Each prompt has exactly one rubric. Research data shared via `cached_content`.

### R2: team_quality data sparsity
Most people in the pipeline are evaluated individually, not as teams. Team quality scores will be low across the board, compressing the range and reducing the dimension's discriminative power.

**Mitigation:** Accept this for now. The dimension becomes more useful when the system evaluates startups (founder + co-founders together). Add a note in the rubric that a solo founder scores 0-3, which is a correct signal.

### R3: founder_market_fit instability
Market inference by the LLM may be inconsistent across runs (different inferred markets for the same person). This makes the dimension noisy.

**Mitigation:** Pin `temperature=0` (already set). Add `inferred_target_market` to the research stage output (cached), so the scoring stage uses a stable market inference rather than inferring fresh each time.

### R4: Academic community may score low on leadership_magnetism
Researchers like Hinton and Li are clearly magnetic leaders but their signals look different from startup founders (lab size vs company headcount, paper citations vs revenue growth). The rubric needs academic-friendly anchors.

**Mitigation:** Include "research group quality" and "academic influence" as valid evidence for leadership_magnetism. Calibrate against Hinton/LeCun/Li corpus entries.

### R5: Backward compatibility of overall score
The weighted overall score will change for every person because weights and dimensions changed. Any downstream consumer comparing scores across versions will get misleading results.

**Mitigation:** The `dimension_version` column distinguishes v1 vs v2 scores. Add a utility function `is_v2_score(row)` that checks the version. The dashboard should display a version badge.
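A sketch of the version guard, assuming `dimension_version` is stored as an integer (the exact storage type is an assumption):

```python
# Version-guard utilities for mixed v1/v2 score tables.
# dimension_version is assumed to be an integer column.
def is_v2_score(row: dict) -> bool:
    return row.get("dimension_version") == 2

def same_version(row_a: dict, row_b: dict) -> bool:
    # Overall scores are only comparable within one dimension_version.
    return row_a.get("dimension_version") == row_b.get("dimension_version")
```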

### R6: No rollback plan (from variong review)
If gym calibration fails across multiple dimensions, there's no defined rollback.

**Mitigation:** Keep v1 scoring logic behind a `dimension_version` flag. If v2 calibration fails acceptance criteria after 3 iterations, revert to v1 and re-examine rubrics. The `dimension_version` column already supports mixed v1/v2 scores in the same table.

### R7: Overconfident evidence_strength (from variong review)
The LLM may assign evidence_strength=3 despite thin data.

**Mitigation:** Post-scoring validation: if evidence_strength >= 2 but `len(evidence) < 2`, flag as potentially overconfident. Add this as a gym metric (evidence_strength_calibration).
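The post-scoring check can be sketched directly from the rule above. The per-dimension dict shape (`evidence_strength` plus an `evidence` list) follows the plan's schema description:

```python
# Post-scoring validation sketch for R7: flag dimensions whose
# self-reported evidence_strength outruns the evidence actually cited.
def overconfident_dimensions(dimensions: dict) -> list[str]:
    flagged = []
    for name, dim in dimensions.items():
        strength = dim.get("evidence_strength", 0)
        cited = dim.get("evidence", [])
        if strength >= 2 and len(cited) < 2:
            flagged.append(name)
    return flagged
```

The flagged list can feed the `evidence_strength_calibration` gym metric without blocking the scoring run itself.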

### R8: Corpus range realism (from variong review)
Some corpus expected ranges may be too generous (e.g., `typical-ml-phd` at `[5,7]` for founder_market_fit). Ranges set before rubric realism check may need adjustment.

**Mitigation:** Before gym calibration (Task 8), manually score 3 corpus entries (one high, one mid, one low) against rubrics. Adjust ranges if manual scores disagree with expected ranges.

---

## Execution Order Summary

| Task | Description                                  | Depends On | Effort |
|------|----------------------------------------------|------------|--------|
| 1    | Schema migration                             | —          | S      |
| 2    | Parallel per-dimension scoring + all rubrics | 1          | L      |
| 4    | Gym corpus expected ranges                   | —          | L      |
| 5    | Gym runner v2 support                        | 4          | S      |
| 6    | Dissolve academic assessor                   | 2          | S      |
| 7    | Market resolution for founder_market_fit     | 9          | M      |
| 8    | Calibration sequence (iterative)             | 2,4,5      | L      |
| 9    | Research prompt data expansion               | —          | M      |
| 10   | Handler versioning + staleness               | 8          | S      |

---

## Pre-Implementation Validation (do before coding)

These checks validate the plan's assumptions offline. Each can be done in a subagent.

### V1: Rubric realism check — DONE
Manually scored 3 corpus entries (Jensen Huang, startup-cto, junior-pm) across all 6 dimensions. **All 18 manual scores fall within expected ranges.** No adjustments needed.

Key findings:
- Rubric anchors provide good calibration guidance across all tiers
- Data sparsity handled well (Jensen's team_quality [8,10] accounts for unnamed co-founders in corpus)
- Some ranges could be tightened (junior-pm/team_quality [0,2] → likely always 0-1) but wider bands are prudent for LLM tolerance

### V2: Research schema prototype — DONE
Tested extraction of 3 new fields across 3 profiles (gilelbaz=rich, rohangupta=medium, txzhuo=sparse).

| Field                    | Feasibility | Source                                      | Decision                          |
|--------------------------|-------------|---------------------------------------------|-----------------------------------|
| `co_founders`            | Easy        | Already in `associates` + `career` + `hint` | **No prompt change needed** — derive from existing `associates` field where relationship contains "co-founder". Post-processing function, not LLM extraction. |
| `inferred_target_market` | Possible    | `career` + `goals` + `interests`            | **Add to research prompt** — LLM already sees the data, just needs the output slot. High confidence for rich profiles, low for sparse. |
| `team_signals`           | Unlikely    | No current source has headcount/hire data   | **Defer** — add to prompt schema but expect mostly null. No enrichment source collects team size. Low ROI until interview transcripts or press releases are ingested. |

All 3 fields should be **optional** in the research schema. `co_founders` is the highest-value, lowest-effort win.
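The `co_founders` post-processing decision from the table can be sketched as below. The `associates` item shape (`{"name", "relationship"}`) is an assumption about the existing field:

```python
# Post-processing sketch for the V2 decision: derive co_founders from
# the existing associates field rather than adding an LLM output slot.
# The {"name", "relationship"} item shape is assumed.
def derive_co_founders(associates: list[dict]) -> list[str]:
    return [
        a["name"]
        for a in associates
        if "co-founder" in (a.get("relationship") or "").lower()
    ]
```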

### V3: Success criteria definition
**Go-live criteria** (updated from variong review):
- Per-dimension: range_accuracy >= 0.60 for all 6, >= 0.75 for at least 4
- Per-dimension: Kendall tau >= 0.50 (ordering matters more than absolute values)
- **Cross-version**: Weighted overall Kendall tau >= 0.60 vs v1 scores (ensures v2 doesn't destroy ordering that v1 got right)
- Evidence strength calibration: for rich-data corpus entries (Hinton, Collison, Nadella), avg evidence_strength >= 2. For sparse entries (junior-pm, mid-level-eng), avg evidence_strength <= 1.
- No single corpus entry has > 2 dimensions out of range

**Rollback trigger:** If after 3 calibration iterations, < 3 dimensions meet range_accuracy >= 0.75, revert to v1 and redesign rubrics.
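The two core metrics above can be computed without external dependencies. This sketch uses tau-a (no tie correction), which is a simplification that is adequate only when scores rarely tie; the gym runner's actual metric implementation is assumed, not confirmed:

```python
from itertools import combinations

# Self-contained sketch of the two go-live metrics. kendall_tau is the
# simple tau-a variant (no tie correction) for illustration.
def range_accuracy(scores, expected_ranges):
    hits = sum(lo <= s <= hi for s, (lo, hi) in zip(scores, expected_ranges))
    return hits / len(scores)

def kendall_tau(a, b):
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum((a[i] - a[j]) * (b[i] - b[j]) > 0 for i, j in pairs)
    discordant = sum((a[i] - a[j]) * (b[i] - b[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)
```

For the cross-version criterion, `kendall_tau` would be applied to the v1 and v2 weighted overall scores of the same corpus entries.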

### V4: Team data gap plan
**Decision:** Accept that solo evaluations will produce chronic low `team_quality` scores (0-3). This is correct signal — a solo founder with no visible team IS a team quality risk.

When the system has startup-context (e.g., evaluating a YC company with known co-founders):
- `co_founder_slugs` in item_data triggers lookup of co-founder profiles
- Their data is injected into the `team_quality` scoring context
- `team_quality` evidence_strength rises from 0 (solo eval) to 2-3 (team eval)

For now (v2 launch): solo eval is the norm. Don't over-engineer team lookup. The evidence_strength signal tells downstream consumers to discount team_quality when evidence is thin.

---

**Changes from v1 plan:**
- Tasks 2+3 merged → single Task 2 (parallel architecture means all 6 rubrics are independent, no sequencing needed)
- Task 9 moved before Task 7 (research schema expansion needed before market resolution can use it — per variong review)

**Parallelizable:** Tasks 1+4+9 can all run in parallel. Task 2 depends only on Task 1. Task 6 after Task 2. Task 7 after Task 9.

**Critical path:** Task 1 → Task 2 → Task 8 → Task 10 (schema → prompts → calibration → go-live). Tasks 4, 5, 9 can run in parallel alongside 1→2.
