# Entity Schema Redesign — people.db

**Date**: 2026-03-10
**Status**: Approved
**Goal**: Migrate from per-person YAML files to SQLite as single source of truth. Universal entity model for people, orgs, and their relationships.

## Schema (5 tables)

### `entities` — Base identity table
```sql
CREATE TABLE entities (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL CHECK(type IN ('person', 'org')),
    slug TEXT UNIQUE NOT NULL,
    created_at TEXT DEFAULT (datetime('now'))
);
```

### `people` — Person-specific attributes
```sql
CREATE TABLE people (
    entity_id INTEGER PRIMARY KEY REFERENCES entities(id),
    name TEXT NOT NULL,
    photo_url TEXT,
    hometown TEXT,
    location TEXT,
    headline TEXT,
    current_company TEXT,
    current_title TEXT,
    linkedin_url TEXT,
    bio TEXT,
    dossier_text TEXT,
    dossier_model TEXT,
    profile_json TEXT,  -- {education, social_links, writings, focus_areas, metrics}
    source TEXT,
    updated_at TEXT DEFAULT (datetime('now'))
);
```

### `orgs` — Organization-specific attributes (VC firms, startups, public companies)
```sql
CREATE TABLE orgs (
    entity_id INTEGER PRIMARY KEY REFERENCES entities(id),
    name TEXT NOT NULL,
    name_normalized TEXT,
    website TEXT,
    industry TEXT,
    founded_year INTEGER,
    city TEXT,
    state TEXT,
    country TEXT DEFAULT 'US',
    org_type TEXT,  -- 'vc_firm', 'startup', 'public', 'nonprofit', 'university'
    profile_json TEXT,  -- {fund_size, stages, focus_tags, ...}
    source TEXT,
    updated_at TEXT DEFAULT (datetime('now'))
);
```

### `entity_links` — Universal relationships
```sql
CREATE TABLE entity_links (
    id INTEGER PRIMARY KEY,
    source_entity_id INTEGER NOT NULL REFERENCES entities(id),
    target_entity_id INTEGER NOT NULL REFERENCES entities(id),
    relationship TEXT NOT NULL,
    -- employee, founder, co_founder, investor, board_member, advisor,
    -- venture_partner, mentor, portfolio_company, acquirer, lp, talent_partner
    role_title TEXT,
    start_date TEXT,
    end_date TEXT,
    period_text TEXT,
    is_current INTEGER DEFAULT 0,
    notes TEXT,
    source TEXT,
    created_at TEXT DEFAULT (datetime('now')),
    UNIQUE(source_entity_id, target_entity_id, relationship, start_date)
);
CREATE INDEX idx_el_source ON entity_links(source_entity_id);
CREATE INDEX idx_el_target ON entity_links(target_entity_id);
CREATE INDEX idx_el_rel ON entity_links(relationship);
CREATE INDEX idx_el_current ON entity_links(is_current) WHERE is_current = 1;
```

### `entity_raw` — Raw enrichment/discovery data per source
```sql
CREATE TABLE entity_raw (
    id INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL REFERENCES entities(id),
    source TEXT NOT NULL,
    -- fec, arxiv, semantic_scholar, apollo, wikidata, wikipedia,
    -- openalex, pubmed, sec, grokipedia, discovery, dblp
    data_json TEXT NOT NULL,
    fetched_at TEXT NOT NULL,
    UNIQUE(entity_id, source)
);
CREATE INDEX idx_er_entity ON entity_raw(entity_id);
CREATE INDEX idx_er_source ON entity_raw(source);
```

## Migration Plan

### Phase 1: Create new schema
- Add new tables alongside existing ones (non-destructive)
- Keep old tables until migration verified

### Phase 2: Migrate data
1. Parse each `data/{slug}/profile_quick.yaml` → entities + people + entity_raw (enrichment, discovery)
2. Parse each `data/{slug}/profile_deep.yaml` → merge into people (dossier_text, career)
3. Parse each `data/{slug}/dossier.xml.md` → people.dossier_text
4. Career entries from YAML → entity_links (create org entities as needed)
5. Migrate `person_roles` → entity_links
6. Migrate `vc_firms_archived` → entities + orgs
7. Derive metrics from enrichment data → people.profile_json.metrics

### Phase 3: Update code
- `lib/discover.py`: save_profile_quick() → write to DB
- `profile_data.py`: load_person() → read from DB + entity_links
- `render.py`: no change (receives dict from profile_data)
- `common.py`: hint_for() → read from DB
- `app.py`: transparent (uses profile_data)

### Phase 4: Cleanup
- Trash `data/{slug}/` YAML/JSON/MD files
- Drop old tables: `person_roles`, `vc_people`
- Delete `cleanup_db.py` (replaced by new migration)

## What stays unchanged
- `person_scores` table (separate concern — founder evaluation)
- `lookup_cache` table (transient search cache)
- `linkedin_profiles`, `linkedin_scrape_jobs`, `linkedin_campaigns` (LinkedIn scraping infra)
- `web_presence` table (transient)

## Key decisions
- `entity_raw` replaces per-source YAML sections — one row per (entity, source) pair, UPSERT on re-fetch
- `profile_json` on people/orgs is a convenience blob for structured-but-rarely-queried data (education, social links, writings). High-value fields stay as columns.
- VC firm vs startup is `orgs.org_type`, not a separate entity type
- Unique constraint on entity_links prevents duplicate relationship ingestion
- Foreign keys enforced via entities base table