# Supply Chain Graph: Mentions-First Architecture

Date: 2026-03-09, revised 2026-03-10
Supersedes: 2026-03-06-supplychain-pipeline-stages-design.md

## Problem

The current supply chain pipeline writes LLM-extracted company names directly into the canonical `companies` table. This conflates two fundamentally different things:

1. **"The LLM said 'Foo Corp' is a customer of TSMC"** — a raw observation with provenance
2. **"Foo Corp (NASDAQ: FOO) is a real company with market cap $X"** — a canonical entity

Result: ~14,300 entries, of which 9,205 (~64%) have no roles, and banks/utilities/mining companies pollute a semiconductor supply chain graph. Hallucinated divisions ("Google TPU Division"), garbled names, and vague placeholders ("Various Chinese Suppliers") were all treated as real companies.

## Design Principles

1. **Mentions are cheap, canonical entities are earned.** Record everything the LLM says, but don't promote to the graph until confirmed.
2. **Provenance is non-negotiable.** Every fact traces back to: who said it, in what context, from what source.
3. **Evidence accumulates.** Multiple independent sightings of "Foo Corp" as a supplier strengthen the case for promotion.
4. **The graph expands deliberately.** Next-wave candidates are scored for relevance before spending LLM calls.
5. **Novel discovery is the goal.** Track what we find that other sources miss — that's the value.

## Architecture: Three Layers

```
Layer 1: Mentions (raw observations)
    ↓ canonicalize
Layer 2: Canonical Entities (confirmed companies in the graph)
    ↓ expand
Layer 3: Graph (entity_links, roles, scores)
```

### Layer 1: Mentions

A new `mentions` table stores raw LLM extractions before any cleanup:

```sql
CREATE TABLE mentions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    raw_name TEXT NOT NULL,           -- exactly as the LLM said it
    normalized_name TEXT,             -- computed from raw_name at insert via normalize_name()
    seen_as TEXT NOT NULL,            -- 'supplier', 'customer', 'competitor'
    seen_as_detail TEXT,              -- structured context: "photoresist chemicals", "EUV lithography"
    seen_with_company_id INTEGER,     -- the company being researched (canonical)
    context TEXT,                     -- freetext LLM context about this mention
    source_url TEXT,                  -- URL the LLM cited
    source_content_id TEXT,           -- FK to lib/ingest content.db (if cached)
    llm_ticker TEXT,                  -- ticker the LLM suggested (may be wrong)
    llm_exchange TEXT,                -- exchange the LLM suggested
    llm_country TEXT,
    llm_market_cap_m REAL,            -- LLM-stated value, no currency guarantee
    llm_is_public INTEGER,
    confidence TEXT CHECK(confidence IN ('high', 'medium', 'low')),
    importance TEXT CHECK(importance IN ('critical', 'major', 'minor')),
    resolved_company_id INTEGER,      -- NULL until canonicalized
    resolution_method TEXT,           -- 'ticker_match', 'finnhub', 'name_match', 'alias', 'evidence', 'manual'
    resolution_score REAL,            -- confidence in the resolution (0-1)
    resolution_attempts INTEGER DEFAULT 0, -- how many times resolution was attempted
    rejected INTEGER DEFAULT 0,       -- 1 if determined to be garbage/hallucination
    rejection_reason TEXT,            -- 'garbage_name', 'not_in_scope', 'hallucination', etc.
    batch_id TEXT,                    -- which pipeline run produced this
    created_at TEXT DEFAULT (datetime('now')),
    resolved_at TEXT,                 -- when resolution succeeded or mention was rejected
    updated_at TEXT DEFAULT (datetime('now')),
    FOREIGN KEY (seen_with_company_id) REFERENCES companies(id),
    FOREIGN KEY (resolved_company_id) REFERENCES companies(id),
    UNIQUE(raw_name, seen_as, seen_with_company_id, batch_id)
);

CREATE INDEX idx_mentions_raw_name ON mentions(raw_name);
CREATE INDEX idx_mentions_norm_name ON mentions(normalized_name) WHERE resolved_company_id IS NULL AND rejected = 0;
CREATE INDEX idx_mentions_resolved ON mentions(resolved_company_id);
CREATE INDEX idx_mentions_unresolved ON mentions(resolved_company_id) WHERE resolved_company_id IS NULL AND rejected = 0;
CREATE INDEX idx_mentions_seen_with ON mentions(seen_with_company_id);
```

**Key properties:**
- Raw name preserved exactly as extracted — no normalization
- `normalized_name` computed at insert time for fast evidence grouping
- `UNIQUE` constraint prevents duplicate mentions within a batch
- `resolved_at` + `updated_at` timestamps for state transition tracking
- `resolution_attempts` counter prevents infinite retry loops
- `confidence`/`importance` constrained to valid values; LLM outputs normalized on insert
- `seen_as_detail` captures structured "supplies: X" data separately from freetext `context`
- Links to `lib/ingest/data/content.db` via `source_content_id` for cached web content
- `rejected` flag preserves garbage for learning (not deleted)
- `batch_id` ties back to the job run for reproducibility
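The `normalize_name()` helper referenced above is not specified in this doc; a minimal sketch of what it might do (lowercasing, stripping punctuation and legal suffixes, all of which are assumptions about the real implementation):

```python
import re

# Hypothetical sketch of normalize_name(); the actual helper in the codebase
# may strip more suffixes or handle unicode differently.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "co",
                  "gmbh", "ag", "sa", "plc"}

def normalize_name(raw_name: str) -> str:
    name = raw_name.lower()
    name = re.sub(r"[^a-z0-9\s]", " ", name)  # drop punctuation
    tokens = [t for t in name.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

print(normalize_name("Foo Corp."))  # foo
print(normalize_name("TSMC"))       # tsmc
```

Whatever the real implementation, it must be applied identically when populating `mentions.normalized_name`, `companies.name_normalized`, and `company_aliases.alias_normalized`, or the index joins silently miss.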

### Resolution Cache

Avoid re-resolving the same name across batches:

```sql
CREATE TABLE mention_resolutions (
    normalized_name TEXT PRIMARY KEY,
    resolved_company_id INTEGER REFERENCES companies(id),
    resolution_method TEXT,
    resolution_score REAL,
    resolved_at TEXT DEFAULT (datetime('now'))
);
```

Resolve "TSMC" once → cached. Next 49 mentions resolve in O(1) instead of hitting APIs.
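The cache lookup is a single indexed read against the table above (the helper function here is illustrative):

```python
import sqlite3

# In-memory copy of the mention_resolutions schema for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mention_resolutions (
        normalized_name TEXT PRIMARY KEY,
        resolved_company_id INTEGER,
        resolution_method TEXT,
        resolution_score REAL,
        resolved_at TEXT DEFAULT (datetime('now'))
    )""")

def cached_resolution(conn, normalized_name):
    """Return (company_id, method, score) or None on a cache miss."""
    return conn.execute(
        "SELECT resolved_company_id, resolution_method, resolution_score "
        "FROM mention_resolutions WHERE normalized_name = ?",
        (normalized_name,)).fetchone()

# Resolve once (expensive path), record the outcome...
conn.execute(
    "INSERT INTO mention_resolutions (normalized_name, resolved_company_id, "
    "resolution_method, resolution_score) VALUES (?, ?, ?, ?)",
    ("tsmc", 42, "ticker_match", 0.98))

# ...every later mention of the same normalized name skips the APIs.
print(cached_resolution(conn, "tsmc"))  # (42, 'ticker_match', 0.98)
```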

### Layer 2: Canonicalization Pipeline

Mentions flow through a resolution pipeline before entering the graph:

```
mention.raw_name
    ↓
0. Preprocess: compute normalized_name, strip divisions
    ↓
1. Garbage filter (GARBAGE_PREFIXES, GARBAGE_SUBSTRINGS) → reject
    ↓
2. Resolution cache lookup (normalized_name → mention_resolutions) → instant resolve
    ↓
3. Ticker match + name similarity gate (if LLM provided ticker) → DB lookup by UPPER(ticker)
    ↓
4. Alias lookup (company_aliases table) → cheap DB match
    ↓
5. Name normalization match → normalized_name against companies.name_normalized index
    ↓
6. Finnhub validation → stock/profile2 for canonical name, market cap (Phase 2)
    ↓
7. GLEIF LEI lookup → fuzzy match for private companies (Phase 2.5)
    ↓
8. PermID lookup → 3.5M orgs (Phase 2.5)
    ↓
9. Evidence check → multiple unresolved mentions with same normalized name?
    ↓
10. Website discovery → if name is sensible, search for company website (Phase 2.5)
    ↓
resolved_company_id OR stays unresolved
```

**Key changes from original design (post-review):**
- **Preprocessing first**: strip divisions + compute normalized name before any resolution
- **Resolution cache** (step 2): avoids redundant API calls for previously-resolved names
- **Alias lookup** (step 4): `company_aliases` table already exists in `canonicalize.py` — promote to main schema and resolution path
- **Name match before external APIs** (step 5 before 6-8): resolves majority of mentions for free using `companies.name_normalized` index
- **Ticker match includes name similarity gate**: prevents LLM-hallucinated tickers (e.g., "AI" matching C3.ai instead of Air Liquide) using `merge_confidence()` from dedup.py
- **GLEIF/PermID deferred to Phase 2.5**: no implementation exists, system works without them
- On resolution success: update `mention_resolutions` cache for future lookups

Resolution writes back to `mentions.resolved_company_id`, `mentions.resolution_method`, and `mentions.resolved_at`. If a new canonical entity is created, it enters `companies` with full provenance.
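The ticker gate in step 3 can be sketched as follows. `merge_confidence()` from dedup.py is not reproduced here; `difflib.SequenceMatcher` stands in as a hypothetical similarity measure, and the threshold is an assumed value:

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.6  # assumed cutoff; tune against real data

def ticker_match(mention_name, llm_ticker, candidates):
    """Accept a ticker match only if the names also look alike.

    candidates: dict mapping UPPER(ticker) -> canonical company name,
    standing in for the DB lookup by UPPER(ticker).
    """
    canonical = candidates.get((llm_ticker or "").upper())
    if canonical is None:
        return None
    score = SequenceMatcher(None, mention_name.lower(),
                            canonical.lower()).ratio()
    return canonical if score >= SIMILARITY_THRESHOLD else None

companies = {"AI": "C3.ai"}
# A hallucinated ticker must not override the name: "Air Liquide" with
# ticker "AI" should not resolve to C3.ai.
print(ticker_match("Air Liquide", "AI", companies))  # None
print(ticker_match("C3.ai Inc", "AI", companies))    # C3.ai
```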

### Layer 3: Graph (consolidated)

**`entity_links`** is the single relationship table (consolidation in progress — see Phase 0). It supports:
- company → company (supplier, customer, competitor, partner)
- person → company (CEO, founder, partner)
- company → url (portfolio pages, filings)

The `relationships` table is being retired and merged into `entity_links`.

Companies only enter `companies` after earning their place through canonicalization. The `company_aliases` table is promoted to the main schema:

```sql
CREATE TABLE IF NOT EXISTS company_aliases (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    company_id INTEGER REFERENCES companies(id),
    alias TEXT NOT NULL,
    alias_normalized TEXT NOT NULL,
    source TEXT,              -- 'georgetown', 'finnhub', 'manual', 'llm'
    created_at TEXT DEFAULT (datetime('now')),
    UNIQUE(company_id, alias_normalized)
);

CREATE INDEX idx_aliases_normalized ON company_aliases(alias_normalized);
```

**Indexed name matching** on `companies`:

```sql
ALTER TABLE companies ADD COLUMN name_normalized TEXT;
CREATE INDEX idx_companies_name_norm ON companies(name_normalized);
-- Populated via Python normalize_name() on insert/update (SQLite generated columns
-- can't handle the full regex)
```

## Provenance Status (replaces needs_confirmation)

Instead of a boolean flag, companies have a provenance lifecycle:

```sql
ALTER TABLE companies ADD COLUMN provenance_status TEXT DEFAULT 'unconfirmed'
    CHECK(provenance_status IN ('anchor', 'confirmed', 'unconfirmed', 'disputed'));
```

- **`anchor`**: Wave 0 seed companies — confirmed by definition
- **`confirmed`**: Evidence threshold met (see below)
- **`unconfirmed`**: Exists but lacks sufficient evidence
- **`disputed`**: Conflicting resolution evidence

**Confirmation threshold** (defined as a queryable view):

```sql
CREATE VIEW confirmed_companies AS
SELECT c.id, c.name FROM companies c
WHERE c.provenance_status = 'anchor'
   OR (SELECT COUNT(DISTINCT m.seen_with_company_id)
       FROM mentions m
       WHERE m.resolved_company_id = c.id AND m.rejected = 0) >= 2
   OR (c.ticker IS NOT NULL AND EXISTS (
       SELECT 1 FROM mentions m
       WHERE m.resolved_company_id = c.id
         AND m.resolution_method = 'finnhub'));
```

A company is confirmed when: it's an anchor, OR its resolved mentions were seen alongside >=2 distinct companies, OR it has a ticker and a Finnhub-validated mention.
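In application code the same rule reads directly (a sketch; the view above remains the source of truth):

```python
def is_confirmed(provenance_status, distinct_seen_with,
                 has_finnhub_mention, has_ticker):
    """Mirror of the confirmed_companies view logic."""
    if provenance_status == "anchor":
        return True
    if distinct_seen_with >= 2:
        return True
    return has_ticker and has_finnhub_mention

print(is_confirmed("unconfirmed", 2, False, False))  # True
print(is_confirmed("unconfirmed", 1, True, True))    # True
print(is_confirmed("unconfirmed", 1, False, True))   # False
```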

## Pipeline Stages (revised)

```
profile → extract → canonicalize → expand → enrich → score
```

| Stage           | Call           | What it does                                                |
|-----------------|----------------|-------------------------------------------------------------|
| `profile`       | 1 LLM + web    | Company description, ticker, roles, segments (for anchors)  |
| `extract`       | 1 LLM + web    | All suppliers, customers, competitors → writes to `mentions`|
| `canonicalize`  | DB + Finnhub   | Resolve mentions → canonical entities, reject garbage       |
| `expand`        | none (DB)      | Score candidates for next wave, prioritize semi-relevant    |
| `enrich`        | Finnhub API    | Ground-truth ticker/market_cap/exchange for canonical cos   |
| `score`         | none (SQL)     | Quality score 0-100 on data completeness                    |

### `extract` stage (replaces suppliers/customers/competitors)

One LLM call returns all relationship types. Output goes to `mentions` table, NOT directly to `companies`:

```json
{
  "suppliers": [
    {"name": "...", "ticker": "...", "exchange": "...", "country": "...",
     "is_public": true, "market_cap_m": 1234, "supplies": "...",
     "category": "equipment|materials|EDA|...", "importance": "critical|major|minor",
     "confidence": "high|medium|low", "source_url": "..."}
  ],
  "customers": [...],
  "competitors": [...]
}
```

Soft cap of 100 per relationship type per company — beyond this, log a warning and track garbage-to-signal ratio per batch. Everything goes into `mentions`. The canonicalization stage decides what's real.

On insert, normalize confidence/importance to constrained values (map "very high" → "high", "moderate" → "medium", etc.). Store `supplies`/`buys`/`competes_in` in `seen_as_detail`.
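A sketch of that insert-time normalization; the synonym lists and defaults are assumptions that would grow from observed LLM outputs:

```python
# Hypothetical synonym maps onto the CHECK-constrained values.
CONFIDENCE_MAP = {
    "high": "high", "very high": "high",
    "medium": "medium", "moderate": "medium",
    "low": "low", "very low": "low",
}
IMPORTANCE_MAP = {
    "critical": "critical",
    "major": "major", "significant": "major",
    "minor": "minor", "negligible": "minor",
}

def normalize_enum(value, mapping, default):
    """Map a raw LLM string onto an allowed value, falling back to default."""
    if value is None:
        return default
    return mapping.get(value.strip().lower(), default)

print(normalize_enum("Very High", CONFIDENCE_MAP, "low"))  # high
print(normalize_enum("moderate", CONFIDENCE_MAP, "low"))   # medium
print(normalize_enum(None, IMPORTANCE_MAP, "minor"))       # minor
```

Normalizing before insert keeps the CHECK constraints as a backstop rather than a source of failed batches.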

### `canonicalize` stage (new)

Per-mention resolution (see pipeline in Layer 2 above):
1. Garbage name filter → `rejected = 1, rejection_reason = 'garbage_name'`
2. Resolution cache → instant resolve for previously-seen names
3. Ticker + alias + name resolution → match existing company
4. Finnhub validation → for remaining with tickers (Phase 2)
5. If resolved: set `resolved_company_id`, create entity_link in graph, update cache
6. If not resolved but name is sensible: leave unresolved, increment `resolution_attempts`
7. If not resolved and name is nonsensical: `rejected = 1`
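Steps 5-7 form a small state machine over the mention row; an in-memory sketch (field names mirror the `mentions` schema, but the real stage writes to SQLite and also creates the entity_link and cache entry on success):

```python
def apply_resolution(mention, company_id, name_is_sensible):
    """Apply the per-mention outcome: resolve, defer, or reject."""
    if company_id is not None:
        mention["resolved_company_id"] = company_id  # step 5: resolved
    elif name_is_sensible:
        mention["resolution_attempts"] += 1          # step 6: retry later
    else:
        mention["rejected"] = 1                      # step 7: nonsensical
        mention["rejection_reason"] = "garbage_name"
    return mention

m = {"raw_name": "Various Chinese Suppliers",
     "resolution_attempts": 0, "rejected": 0}
print(apply_resolution(m, None, name_is_sensible=False)["rejection_reason"])
# garbage_name
```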

### `expand` stage (revised)

Before expanding to next wave, score unprocessed companies on expansion value:

- Has roles assigned? (+)
- Is in semiconductor/adjacent industry? (+)
- Multiple independent mentions? (+)
- Has relationships to core companies? (+)
- Is a bank/utility/mining company? (-)
- Provenance status is 'confirmed'? (+)

Only expand companies above a relevance threshold. This prevents wasting LLM calls on banks-as-customers.
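The signals above can be combined additively; the weights and threshold here are illustrative placeholders, not tuned values:

```python
# Illustrative weights; real values would be tuned against expansion yield.
def expansion_score(company):
    score = 0
    score += 2 if company.get("roles") else 0
    score += 2 if company.get("semiconductor_adjacent") else 0
    score += 1 if company.get("mention_count", 0) >= 2 else 0
    score += 1 if company.get("linked_to_core") else 0
    score -= 3 if company.get("out_of_scope_industry") else 0  # banks, utilities, mining
    score += 1 if company.get("provenance_status") == "confirmed" else 0
    return score

RELEVANCE_THRESHOLD = 3  # assumed cutoff

bank = {"roles": ["customer"], "out_of_scope_industry": True}
fab_supplier = {"roles": ["supplier"], "semiconductor_adjacent": True,
                "mention_count": 3, "provenance_status": "confirmed"}
print(expansion_score(bank) >= RELEVANCE_THRESHOLD)          # False
print(expansion_score(fab_supplier) >= RELEVANCE_THRESHOLD)  # True
```

The bank scores below threshold despite having a role, which is exactly the banks-as-customers case the stage is meant to filter.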

## Migration: Existing Data

### Correct baseline

The DB has ~14,300 companies (not 9,400 as previously stated). Wave breakdown: wave 0 = 5,890 (includes anchors + Finnhub-imported), wave 1 = 8,395 (LLM-discovered), wave 99 = 28 (bulk imports), wave -1 = 3.

Note: Wave 0 has 5,890 companies, far more than the ~33 seed anchors. Companies imported via Finnhub screener also landed in wave 0. Only the original seed companies in `SEED_COMPANIES` get `provenance_status = 'anchor'`; the rest of wave 0 get `'unconfirmed'` like everyone else.

### Migration steps

1. Add `provenance_status` column: seed companies → `'anchor'`, all others → `'unconfirmed'`
2. Add `name_normalized` column, populate from `normalize_name()`, index it
3. Promote `company_aliases` to main schema in `db.py`
4. Consolidate `entity_links` / `relationships` (Phase 0, in progress)
5. Add `mentions` + `mention_resolutions` tables

### Legacy data treatment

Existing companies and relationships are **grandfathered** — they stay as-is without back-populating into `mentions`. Creating synthetic mention records with fake provenance adds complexity for no value. Instead:
- Legacy companies start as `'unconfirmed'` and get confirmed organically as new mentions resolve to them
- The `quality_score` column (already populated) provides a secondary quality signal
- Companies with 0 roles (9,205 of them) are flagged for priority re-evaluation

## Extraction Gym (future — Phase 4)

Infrastructure for tuning the extraction process:

1. **Ground truth set**: Manually curated list of "correct" extractions for known companies
2. **Precision/recall tracking**: For each LLM run, how many mentions resolved to real companies vs. garbage?
3. **Prompt comparison**: Run different prompts against the same companies, compare extraction quality
4. **Model comparison**: Track extraction quality across models (grok-4-1-fast vs. others)
5. **A/B testing**: New prompts run on a subset before rolling out
6. **Garbage-to-signal ratio**: Track per batch, alert on regression
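Metrics 2 and 6 fall directly out of mention outcomes; a sketch (recall would additionally need the ground truth set from item 1):

```python
def batch_metrics(mentions):
    """Per-batch extraction quality from mention outcomes.

    mentions: list of dicts with 'resolved' and 'rejected' booleans.
    """
    total = len(mentions)
    resolved = sum(1 for m in mentions if m["resolved"])
    rejected = sum(1 for m in mentions if m["rejected"])
    return {
        "precision": resolved / total if total else 0.0,      # resolved / extracted
        "garbage_ratio": rejected / total if total else 0.0,  # rejected / extracted
        "unresolved": total - resolved - rejected,
    }

batch = ([{"resolved": True, "rejected": False}] * 7
         + [{"resolved": False, "rejected": True}] * 2
         + [{"resolved": False, "rejected": False}] * 1)
print(batch_metrics(batch))
# {'precision': 0.7, 'garbage_ratio': 0.2, 'unresolved': 1}
```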

This uses `lib/tune` (wraptune) for decision tracking and comparison.

## Source Discovery (future — Phase 5)

Track what we find that other sources miss:

1. **Novelty scoring**: For each canonical company, is it in Georgetown ChipExplorer? In Finnhub's semiconductor category? In our seed list? Companies found ONLY through our LLM extraction are novel discoveries.
2. **Source-type tracking**: When we discover that certain companies file pollution permits, or appear in customs data, or show up in patent filings — record the source type as a potential expansion strategy.
3. **Expansion strategy proposals**: When a pattern emerges (e.g., "companies that supply photoresist also appear in EPA TSCA filings"), propose it to the user as a new source to explore.

## Content Cache Integration

Web content fetched during extraction is already cached in `lib/ingest/data/content.db` (46MB). The `mentions.source_content_id` field links to this cache, preserving the full source text for:
- Re-extraction with improved prompts (gym)
- Provenance verification
- Context retrieval for unresolved mentions

No changes needed to `lib/ingest` — it already handles URL fetching, caching, and text extraction.

## What Stays as Standalone Scripts

These don't fit as per-item pipeline stages:

- `cleanup_garbage.py` `WAVE99_DELETE_NAMES` — curated one-time blocklist (but its pattern lists feed canonicalize)
- `dedup.py` full run — bulk historical dedup (but its logic feeds canonicalize)
- `finnhub_screener.py` search/screen commands — discovery tool, not per-item enrichment
- `export_universe.py` — export/ranking tool, runs when needed

## Implementation Order

### Phase 0: Consolidate entity_links / relationships (in progress)
- Migrate `relationships` rows into `entity_links`
- Migrate `relationship_topics` references
- Update all code to use `entity_links`
- Drop `relationships` table

### Phase 1: Mentions table + extract stage
- Add `mentions` + `mention_resolutions` tables to schema
- Add `name_normalized` column + index to `companies`
- Promote `company_aliases` to main schema in `db.py`
- Add `provenance_status` column to `companies`
- Modify `extract` stage to write mentions instead of companies
- Flag existing companies: seed → `'anchor'`, rest → `'unconfirmed'`

### Phase 2: Canonicalization pipeline
- Build resolution pipeline: cache → garbage → ticker+alias → name match → Finnhub → evidence
- Absorb garbage detection from `cleanup_garbage.py`
- Absorb dedup logic from `dedup.py`
- Wire up as `canonicalize` pipeline stage
- Add `_merge_company` cascade for mentions table

### Phase 2.5: External resolution APIs
- GLEIF LEI integration (rate-limited, cached)
- PermID integration (requires Thomson Reuters account)
- Website discovery for unresolved mentions

### Phase 3: Prioritized expansion
- Score expansion candidates on relevance
- Filter expansion to semiconductor-relevant companies
- Track expansion yield (% of expanded companies that produce useful mentions)

### Phase 4: Extraction gym
- Ground truth curation
- Precision/recall metrics per batch
- Garbage-to-signal ratio monitoring
- Prompt comparison infrastructure via `lib/tune`

### Phase 5: Source discovery
- Novelty scoring vs. external sources
- Source-type tracking
- Expansion strategy proposals

## Review Log

- **2026-03-10**: Multi-model review (Claude Opus + GPT). Key changes:
  - Added `UNIQUE` constraint, `resolved_at`, `updated_at`, `resolution_attempts` to mentions
  - Added `normalized_name` indexed column to mentions and companies
  - Added `mention_resolutions` cache table
  - Reordered pipeline: name match before external APIs
  - Promoted `company_aliases` to main schema
  - Replaced `needs_confirmation` boolean with `provenance_status` enum + confirmation view
  - Added Phase 0: consolidate `entity_links` / `relationships` (dual tables → single)
  - Deferred GLEIF/PermID to Phase 2.5
  - Dropped legacy back-population into mentions (grandfather instead)
  - Added soft cap (100/type) on extract list sizes with garbage-ratio monitoring
  - Fixed migration assumptions (14,300 companies, wave 0 = 5,890 not 33)
  - Added `seen_as_detail` column for structured relationship context
  - Added CHECK constraints on confidence/importance
