# Multi-Source Provenance Pattern

How a job evolves from single-source to multi-source with quality-aware merging.

## When to Use

A job needs this pattern when:

- **Multiple sources** can provide data for the same entity (e.g., direct scrape + Wayback archive + manual import)
- **Sources have different reliability** — one may return junk while another has good data
- **You need to avoid clobbering** — a bad source overwriting a good one loses data
- **You want auditability** — know which source provided which data, and when

## Architecture

```
source A ──→ entity_sources ──→ compute_quality() ──→ merge_entity() ──→ main table
source B ──→ entity_sources ──→ compute_quality() ──↗
source C ──→ entity_sources ──→ compute_quality() ──↗
```

### 1. Source Table

One row per (entity_id, source). Stores raw data + extracted fields + quality score.

```sql
CREATE TABLE entity_sources (
    entity_id    TEXT NOT NULL,
    source       TEXT NOT NULL,    -- 'direct', 'wayback', 'manual', ...
    raw_data     TEXT,             -- original content (HTML, JSON, etc.)
    -- All extracted fields (same as parser output)
    field_a TEXT, field_b TEXT, ...
    -- Source metadata
    quality      REAL,             -- computed score (0-100)
    fetched_at   TEXT NOT NULL,
    PRIMARY KEY (entity_id, source)
);
```
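The per-source upsert can be sketched with SQLite's `ON CONFLICT` clause. A minimal sketch with the schema trimmed to one extracted field; `upsert_source` here is illustrative, not the project's actual helper:

```python
import sqlite3
from datetime import datetime, timezone

def upsert_source(conn, entity_id, source, raw_data=None, field_a=None, quality=None):
    # Each source owns its row: a conflict on the (entity_id, source) PK
    # updates that row in place, so source A never clobbers source B.
    conn.execute(
        """
        INSERT INTO entity_sources
            (entity_id, source, raw_data, field_a, quality, fetched_at)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (entity_id, source) DO UPDATE SET
            raw_data   = excluded.raw_data,
            field_a    = excluded.field_a,
            quality    = excluded.quality,
            fetched_at = excluded.fetched_at
        """,
        (entity_id, source, raw_data, field_a, quality,
         datetime.now(timezone.utc).isoformat()),
    )

# In-memory demo: re-upserting 'direct' updates only its own row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_sources ("
    " entity_id TEXT NOT NULL, source TEXT NOT NULL,"
    " raw_data TEXT, field_a TEXT, quality REAL, fetched_at TEXT NOT NULL,"
    " PRIMARY KEY (entity_id, source))"
)
upsert_source(conn, "e1", "direct", field_a="login wall junk", quality=5.0)
upsert_source(conn, "e1", "wayback", field_a="archived text", quality=80.0)
upsert_source(conn, "e1", "direct", field_a="retried scrape", quality=60.0)
rows = conn.execute(
    "SELECT source, field_a FROM entity_sources ORDER BY source"
).fetchall()
```

Note that `ON CONFLICT ... DO UPDATE` (available in SQLite 3.24+) is what makes each source's writes idempotent: retrying a fetch overwrites only that source's row.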

### 2. Quality Scoring

A deterministic formula, no LLM needed. The score is based on:
- Content completeness (description length, number of populated fields)
- Domain-specific signals (financial data present, author attributed)
- Penalties for known-bad patterns (too-short content, missing key fields)
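The signals above can be folded into a quick deterministic scorer. A minimal sketch; the field names, weights, and thresholds here are illustrative, not the real job's:

```python
def compute_quality(fields: dict) -> float:
    """Deterministic 0-100 quality score from content signals."""
    score = 0.0
    desc = fields.get("description") or ""
    # Completeness: longer descriptions and more populated fields score higher.
    score += min(len(desc) / 50, 40)                      # up to 40 pts
    filled = sum(1 for v in fields.values() if v not in (None, ""))
    score += min(filled * 5, 30)                          # up to 30 pts
    # Domain-specific signals.
    if fields.get("financials"):
        score += 15
    if fields.get("author"):
        score += 15
    # Penalize known-bad patterns: login-wall stubs are very short.
    if len(desc) < 50:
        score -= 25
    return max(0.0, min(score, 100.0))
```

Because the formula is pure and cheap, all sources can be re-scored in one pass whenever the weights change.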

### 3. Merge Function

Field-level best-of-breed: for each field, pick the value from the highest-quality source that has it.

```python
def merge_entity(conn, entity_id):
    # Source rows ordered best-first by their quality score.
    sources = get_sources_by_quality_desc(conn, entity_id)
    merged = {}
    for field in FIELDS:
        # First (highest-quality) source with a usable value wins.
        for src in sources:
            # Empty string and zero are treated as missing; adjust the
            # sentinel set if 0 is a legitimate value for some field.
            if src[field] not in (None, "", 0):
                merged[field] = src[field]
                break
    upsert_main_table(conn, entity_id, merged)
```
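The best-of-breed selection can be demonstrated standalone. A sketch with plain dicts standing in for source rows, already sorted best-quality-first; field names and values are illustrative:

```python
FIELDS = ("title", "description", "author")

def merge_fields(sources):
    """sources: rows as dicts, sorted best-quality-first."""
    merged = {}
    for field in FIELDS:
        # First source in quality order that has the field wins.
        for src in sources:
            if src.get(field) not in (None, "", 0):
                merged[field] = src[field]
                break
    return merged

# Wayback (quality 80) has the real text but no author; the direct
# scrape (quality 40) hit a login wall but did capture the author.
wayback = {"title": "Long Idea: ACME", "description": "Full thesis text", "author": None}
direct = {"title": "Log in to view", "description": "", "author": "analyst42"}
best = merge_fields([wayback, direct])  # already sorted by quality desc
```

The merged row takes title and description from Wayback but fills the author from the otherwise-junk direct scrape, which is exactly what a row-level "pick the best source" rule would miss.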

### 4. Main Table = Materialized Best-of View

The main table is always derivable from the source table. Enrichment columns (LLM-derived summaries, scores) live only on the main table and are not touched by the merge.

## Migration Path

1. **Create source table** alongside existing main table (additive, no breakage)
2. **Backfill** from existing data: infer source label from timestamps/flags, compute quality
3. **Switch handlers** to write sources first, then merge. One handler at a time.
4. **Downstream unchanged**: enrichment/analysis stages still read from main table

## Key Principles

| Principle | Rationale |
|-----------|-----------|
| Each source owns its row (upsert by PK) | Sources never interfere with each other |
| Main table is always derivable from sources | Can rebuild with different merge rules |
| Enrichment stays on main table only | Not per-source; runs on merged best data |
| Quality scoring is fast and deterministic | Can re-score anytime, no LLM cost |
| Raw data preserved per source | Re-extract with improved parsers later |

## Real Example: VIC Ideas

**Problem**: Two jobs (`vic_ideas` direct scrape, `vic_wayback` Wayback Machine) wrote to the same `ideas` table. Direct scrape could overwrite good Wayback data with login wall junk.

**Solution**: `idea_sources` table with `(idea_id, source)` PK. Both handlers write to `idea_sources`, then `merge_idea()` picks best per field.

**Bonus**: Added `idea_snapshots` to track all Wayback CDX snapshots with paywall status (tried/paywalled/open). Enables paywall date discovery — binary search for when VIC went behind a paywall, then skip post-paywall snapshots automatically.

**Files**: `projects/vic/db.py` (shared DB ops), handlers import `upsert_source()` + `merge_idea()`.

## Extending to New Sources

Adding a third source (e.g., `signup` for VIC account-based access):

1. Handler calls `upsert_source(conn, idea_id, "signup", raw_html=html, parsed=parsed)`
2. Quality scoring and merge handle it automatically
3. No changes to existing handlers or enrichment pipeline

## Anti-Patterns

- **Don't use COALESCE on the main table directly** — depending on argument order, it either lets non-NULL junk overwrite good data, or blocks a better source arriving later from replacing an existing value
- **Don't duplicate raw data** — store once in source table, main table gets best raw for enrichment
- **Don't score quality with LLM** — too slow, too expensive, non-deterministic. Use content signals.
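The COALESCE failure mode is easy to reproduce. A minimal SQLite sketch with an illustrative table and values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ideas (idea_id TEXT PRIMARY KEY, description TEXT)")
conn.execute("INSERT INTO ideas VALUES ('i1', 'Full thesis recovered from Wayback')")

# A later direct scrape hits a login wall and returns non-NULL junk.
junk = "Please log in to view this idea"
conn.execute(
    # COALESCE only guards against NULL, so non-NULL junk wins.
    "UPDATE ideas SET description = COALESCE(?, description)"
    " WHERE idea_id = 'i1'",
    (junk,),
)
(desc,) = conn.execute(
    "SELECT description FROM ideas WHERE idea_id = 'i1'"
).fetchone()
# The good Wayback text is gone; without per-source rows there is no way back.
```

With the source-table pattern, the same junk would land in its own low-quality row and the merge would keep serving the Wayback text.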
