# Versioned Content Cache Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Add snapshot versioning to ContentStore so every `put()` preserves a historical copy, enabling diff detection and time-travel queries for any content store consumer (URL cache, future stores).

**Architecture:** Add a `{table}_snapshots` sibling table to ContentStore with composite PK `(id, fetched_at)`. On every `put()`, if the content_hash changed (or no prior snapshot exists), append a row to snapshots. The main table continues to work as-is (latest-wins for fast lookups). New query methods: `get_history()`, `get_snapshot()`, `diff_snapshots()`. The ingest content_store.py wrapper gets thin convenience methods.
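
The append-on-change semantics above can be sketched with a tiny in-memory model (illustrative only — `SnapshotModel` and its field names are placeholders for this plan, not the real `ContentStore` API):

```python
class SnapshotModel:
    """Illustrative in-memory model: latest-wins main table + append-on-change snapshots."""

    def __init__(self):
        self.latest = {}     # id -> row (main table: latest wins)
        self.snapshots = []  # append-only, conceptually keyed by (id, fetched_at)

    def put(self, id_value, fetched_at, content_hash, **fields):
        row = {"id": id_value, "fetched_at": fetched_at,
               "content_hash": content_hash, **fields}
        self.latest[id_value] = row  # main table always takes the newest row
        history = self.get_history(id_value)
        # Append a snapshot only if no prior snapshot exists or the hash changed
        if not history or history[0]["content_hash"] != content_hash:
            self.snapshots.append(row)

    def get_history(self, id_value, limit=None):
        rows = sorted((r for r in self.snapshots if r["id"] == id_value),
                      key=lambda r: r["fetched_at"], reverse=True)
        return rows[:limit] if limit is not None else rows
```

Three puts with two distinct hashes yield one latest row and two snapshots — the same invariant the tasks below test against SQLite.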

**Tech Stack:** SQLite, zstandard compression (reuse existing), hashlib for content_hash

---

### Task 1: Snapshot table creation in ContentStore

**Files:**
- Modify: `lib/content_store/store.py:107-158` (`_create_schema`)
- Test: `lib/content_store/tests/test_snapshots.py` (create)

**Step 1: Write the failing test**

```python
"""Tests for ContentStore snapshot versioning."""
from pathlib import Path

from lib.content_store.store import ContentStore


def _make_store(tmp_path: Path, **kwargs) -> ContentStore:
    return ContentStore(
        db_path=str(tmp_path / "test.db"),
        table="docs",
        id_field="doc_id",
        fields={
            "doc_id": "TEXT PRIMARY KEY",
            "title": "TEXT",
            "body": "TEXT",
            "fetched_at": "TEXT NOT NULL",
            "content_hash": "TEXT",
        },
        compress_fields=["body"],
        **kwargs,
    )


def test_snapshots_table_created(tmp_path):
    """Snapshot table exists when enable_snapshots=True."""
    store = _make_store(tmp_path, enable_snapshots=True)
    tables = store._conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    table_names = [t["name"] for t in tables]
    assert "docs_snapshots" in table_names


def test_snapshots_table_not_created_by_default(tmp_path):
    """No snapshot table when enable_snapshots not set."""
    store = _make_store(tmp_path)
    tables = store._conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    table_names = [t["name"] for t in tables]
    assert "docs_snapshots" not in table_names
```

**Step 2: Run test to verify it fails**

Run: `python -m pytest lib/content_store/tests/test_snapshots.py -v`
Expected: FAIL — `__init__` doesn't accept `enable_snapshots`

**Step 3: Implement snapshot table creation**

In `lib/content_store/store.py`:

1. Add `enable_snapshots: bool = False` parameter to `__init__`
2. Store as `self._enable_snapshots`
3. In `_create_schema`, after main table creation, add:

```python
if self._enable_snapshots:
    # Snapshot table: same fields as main, but composite PK (id, fetched_at)
    snap_cols = []
    for col_name, col_type in cols:
        typedef = col_type
        # Strip PRIMARY KEY from id field — it becomes part of composite
        orig_name = col_name.removesuffix("_zstd")
        orig_typedef = self._fields.get(orig_name, "")
        if "PRIMARY KEY" in orig_typedef:
            typedef = col_type.replace("PRIMARY KEY", "").strip() or "TEXT"
        snap_cols.append(f"{col_name} {typedef}")
    snap_col_str = ", ".join(snap_cols)
    snap_table = f"{self._table}_snapshots"
    db.execute(
        f"CREATE TABLE IF NOT EXISTS {snap_table} ("
        f"{snap_col_str}, "
        f"PRIMARY KEY ({self._id_field}, fetched_at))"
    )
    db.execute(
        f"CREATE INDEX IF NOT EXISTS idx_{snap_table}_fetched "
        f"ON {snap_table}(fetched_at)"
    )
```

**Step 4: Run test to verify it passes**

Run: `python -m pytest lib/content_store/tests/test_snapshots.py -v`
Expected: PASS

**Step 5: Commit**

```bash
git add lib/content_store/store.py lib/content_store/tests/test_snapshots.py
git commit -m "feat(content_store): add snapshot table creation with enable_snapshots flag"
```

---

### Task 2: Append snapshot on put() when content changes

**Files:**
- Modify: `lib/content_store/store.py:331-421` (`put` method)
- Test: `lib/content_store/tests/test_snapshots.py`

**Step 1: Write the failing test**

```python
def test_put_creates_snapshot(tmp_path):
    """First put() creates a snapshot."""
    store = _make_store(tmp_path, enable_snapshots=True)
    store.put(doc_id="d1", title="V1", body="first", fetched_at="2026-01-01T00:00:00Z", content_hash="aaa")

    snaps = store._conn.execute("SELECT * FROM docs_snapshots WHERE doc_id = 'd1'").fetchall()
    assert len(snaps) == 1


def test_put_appends_snapshot_on_change(tmp_path):
    """Second put() with different content_hash appends a new snapshot."""
    store = _make_store(tmp_path, enable_snapshots=True)
    store.put(doc_id="d1", title="V1", body="first", fetched_at="2026-01-01T00:00:00Z", content_hash="aaa")
    store.put(doc_id="d1", title="V2", body="second", fetched_at="2026-01-02T00:00:00Z", content_hash="bbb")

    snaps = store._conn.execute(
        "SELECT * FROM docs_snapshots WHERE doc_id = 'd1' ORDER BY fetched_at"
    ).fetchall()
    assert len(snaps) == 2


def test_put_skips_snapshot_on_same_hash(tmp_path):
    """put() with same content_hash does NOT create a new snapshot."""
    store = _make_store(tmp_path, enable_snapshots=True)
    store.put(doc_id="d1", title="V1", body="first", fetched_at="2026-01-01T00:00:00Z", content_hash="aaa")
    store.put(doc_id="d1", title="V1b", body="first", fetched_at="2026-01-02T00:00:00Z", content_hash="aaa")

    snaps = store._conn.execute("SELECT * FROM docs_snapshots WHERE doc_id = 'd1'").fetchall()
    assert len(snaps) == 1  # same hash, no new snapshot


def test_put_without_snapshots_unchanged(tmp_path):
    """put() works normally when snapshots disabled."""
    store = _make_store(tmp_path, enable_snapshots=False)
    store.put(doc_id="d1", title="V1", body="first", fetched_at="2026-01-01T00:00:00Z", content_hash="aaa")
    result = store.get("d1")
    assert result["title"] == "V1"
```

**Step 2: Run test to verify it fails**

Run: `python -m pytest lib/content_store/tests/test_snapshots.py::test_put_creates_snapshot -v`
Expected: FAIL — no snapshot insertion in `put()`

**Step 3: Implement snapshot append in put()**

At the end of `put()`, after the main table upsert and FTS update, add:

```python
# Snapshot: append if content changed
if self._enable_snapshots:
    self._maybe_append_snapshot(kwargs, col_names, values)
```

Add new method:

```python
def _maybe_append_snapshot(self, kwargs: dict, col_names: list[str], values: list) -> None:
    """Append a snapshot row if content_hash changed or no prior snapshot exists."""
    db = self._get_db()
    id_value = kwargs[self._id_field]
    new_hash = kwargs.get("content_hash")
    snap_table = f"{self._table}_snapshots"

    # Check last snapshot hash
    last = db.execute(
        f"SELECT content_hash FROM {snap_table} "
        f"WHERE {self._id_field} = ? ORDER BY fetched_at DESC LIMIT 1",
        (id_value,),
    ).fetchone()

    if new_hash is not None and last and last["content_hash"] == new_hash:
        return  # Content unchanged — skip snapshot

    # Insert snapshot with same columns/values as main table
    placeholders = ", ".join(["?"] * len(col_names))
    col_str = ", ".join(col_names)
    db.execute(
        f"INSERT OR REPLACE INTO {snap_table} ({col_str}) VALUES ({placeholders})",
        values,
    )
    db.commit()
```

**Step 4: Run tests to verify they pass**

Run: `python -m pytest lib/content_store/tests/test_snapshots.py -v`
Expected: All PASS

**Step 5: Commit**

```bash
git add lib/content_store/store.py lib/content_store/tests/test_snapshots.py
git commit -m "feat(content_store): append snapshot on put() when content changes"
```

---

### Task 3: Query methods — get_history(), get_snapshot()

**Files:**
- Modify: `lib/content_store/store.py`
- Test: `lib/content_store/tests/test_snapshots.py`

**Step 1: Write the failing test**

```python
def test_get_history(tmp_path):
    """get_history returns all snapshots for an ID, newest first."""
    store = _make_store(tmp_path, enable_snapshots=True)
    store.put(doc_id="d1", title="V1", body="first", fetched_at="2026-01-01T00:00:00Z", content_hash="aaa")
    store.put(doc_id="d1", title="V2", body="second", fetched_at="2026-01-02T00:00:00Z", content_hash="bbb")
    store.put(doc_id="d1", title="V3", body="third", fetched_at="2026-01-03T00:00:00Z", content_hash="ccc")

    history = store.get_history("d1")
    assert len(history) == 3
    assert history[0]["title"] == "V3"  # newest first
    assert history[2]["title"] == "V1"  # oldest last
    # Compressed body is decompressed
    assert history[0]["body"] == "third"


def test_get_history_with_limit(tmp_path):
    """get_history respects limit parameter."""
    store = _make_store(tmp_path, enable_snapshots=True)
    for i in range(5):
        store.put(doc_id="d1", title=f"V{i}", body=f"body{i}",
                  fetched_at=f"2026-01-0{i+1}T00:00:00Z", content_hash=f"h{i}")

    history = store.get_history("d1", limit=2)
    assert len(history) == 2


def test_get_snapshot_by_timestamp(tmp_path):
    """get_snapshot retrieves a specific version by timestamp."""
    store = _make_store(tmp_path, enable_snapshots=True)
    store.put(doc_id="d1", title="V1", body="first", fetched_at="2026-01-01T00:00:00Z", content_hash="aaa")
    store.put(doc_id="d1", title="V2", body="second", fetched_at="2026-01-02T00:00:00Z", content_hash="bbb")

    snap = store.get_snapshot("d1", "2026-01-01T00:00:00Z")
    assert snap is not None
    assert snap["title"] == "V1"
    assert snap["body"] == "first"


def test_get_history_empty(tmp_path):
    """get_history returns empty list for unknown ID."""
    store = _make_store(tmp_path, enable_snapshots=True)
    assert store.get_history("nope") == []


def test_get_history_disabled(tmp_path):
    """get_history returns empty list when snapshots disabled."""
    store = _make_store(tmp_path, enable_snapshots=False)
    store.put(doc_id="d1", title="V1", body="first", fetched_at="2026-01-01T00:00:00Z", content_hash="aaa")
    assert store.get_history("d1") == []
```

**Step 2: Run test to verify it fails**

Run: `python -m pytest lib/content_store/tests/test_snapshots.py::test_get_history -v`
Expected: FAIL — `get_history` not defined

**Step 3: Implement query methods**

```python
def get_history(
    self, id_value: str, *, limit: int | None = None
) -> list[dict]:
    """Return all snapshots for an ID, newest first.

    Returns empty list if snapshots disabled or no snapshots exist.
    """
    if not self._enable_snapshots:
        return []
    db = self._get_db()
    snap_table = f"{self._table}_snapshots"
    sql = (
        f"SELECT * FROM {snap_table} "
        f"WHERE {self._id_field} = ? ORDER BY fetched_at DESC"
    )
    params: list = [id_value]
    if limit is not None:
        sql += " LIMIT ?"
        params.append(limit)
    rows = db.execute(sql, params).fetchall()
    return [self._row_to_dict(r) for r in rows]

def get_snapshot(self, id_value: str, fetched_at: str) -> dict | None:
    """Retrieve a specific snapshot by ID and timestamp."""
    if not self._enable_snapshots:
        return None
    db = self._get_db()
    snap_table = f"{self._table}_snapshots"
    row = db.execute(
        f"SELECT * FROM {snap_table} "
        f"WHERE {self._id_field} = ? AND fetched_at = ?",
        (id_value, fetched_at),
    ).fetchone()
    return self._row_to_dict(row) if row else None
```

**Step 4: Run tests to verify they pass**

Run: `python -m pytest lib/content_store/tests/test_snapshots.py -v`
Expected: All PASS

**Step 5: Commit**

```bash
git add lib/content_store/store.py lib/content_store/tests/test_snapshots.py
git commit -m "feat(content_store): add get_history() and get_snapshot() query methods"
```

---

### Task 4: Enable snapshots in ingest content_store.py + convenience methods

**Files:**
- Modify: `lib/ingest/content_store.py:22-51` (store init)
- Test: `lib/ingest/tests/test_content_store.py`

**Step 1: Write the failing test**

Add to existing `test_content_store.py`:

```python
def test_snapshot_history(tmp_store):
    """Content store tracks snapshots with different content."""
    cs = tmp_store
    cs.cache_store(
        url="https://example.com/portfolio",
        html="<html>V1</html>",
        content_hash="hash1",
    )
    cs.cache_store(
        url="https://example.com/portfolio",
        html="<html>V2</html>",
        content_hash="hash2",
    )
    history = cs.cache_history("https://example.com/portfolio")
    assert len(history) == 2
    assert history[0]["raw_html"] == "<html>V2</html>"  # newest first
    assert history[1]["raw_html"] == "<html>V1</html>"


def test_snapshot_same_content_no_dup(tmp_store):
    """Same content_hash does not create duplicate snapshot."""
    cs = tmp_store
    cs.cache_store(url="https://example.com/same", html="<html>Same</html>", content_hash="xxx")
    cs.cache_store(url="https://example.com/same", html="<html>Same</html>", content_hash="xxx")
    history = cs.cache_history("https://example.com/same")
    assert len(history) == 1
```

**Step 2: Run test to verify it fails**

Run: `python -m pytest lib/ingest/tests/test_content_store.py::test_snapshot_history -v`
Expected: FAIL — `cache_history` not defined

**Step 3: Implement**

In `lib/ingest/content_store.py`:

1. Add `enable_snapshots=True` to the `ContentStore(...)` call in `_get_store()`
2. Add convenience methods:

```python
def cache_history(url: str, limit: int | None = None) -> list[dict]:
    """Return all snapshots for a URL, newest first."""
    return _get_store().get_history(url, limit=limit)


def cache_snapshot(url: str, fetched_at: str) -> dict | None:
    """Retrieve a specific cached snapshot by URL and timestamp."""
    return _get_store().get_snapshot(url, fetched_at)
```

**Step 4: Run tests to verify they pass**

Run: `python -m pytest lib/ingest/tests/test_content_store.py -v`
Expected: All PASS (old tests + new tests)

**Step 5: Commit**

```bash
git add lib/ingest/content_store.py lib/ingest/tests/test_content_store.py
git commit -m "feat(ingest): enable snapshots in URL cache, add cache_history/cache_snapshot"
```

---

### Task 5: Snapshot count in stats + migration of existing data

**Files:**
- Modify: `lib/content_store/store.py` (stats method)
- Modify: `lib/ingest/content_store.py` (stats)
- Test: `lib/content_store/tests/test_snapshots.py`

**Step 1: Write the failing test**

```python
def test_stats_includes_snapshot_count(tmp_path):
    """stats() reports snapshot count when snapshots enabled."""
    store = _make_store(tmp_path, enable_snapshots=True)
    store.put(doc_id="d1", title="V1", body="first", fetched_at="2026-01-01T00:00:00Z", content_hash="aaa")
    store.put(doc_id="d1", title="V2", body="second", fetched_at="2026-01-02T00:00:00Z", content_hash="bbb")

    s = store.stats()
    assert s["count"] == 1  # main table: 1 doc
    assert s["snapshot_count"] == 2  # snapshot table: 2 versions
```

**Step 2: Run test to verify it fails**

Run: `python -m pytest lib/content_store/tests/test_snapshots.py::test_stats_includes_snapshot_count -v`
Expected: FAIL — no `snapshot_count` in stats

**Step 3: Implement**

In `ContentStore.stats()`, after computing `result`, add:

```python
if self._enable_snapshots:
    snap_table = f"{self._table}_snapshots"
    snap_row = db.execute(f"SELECT COUNT(*) as count FROM {snap_table}").fetchone()
    result["snapshot_count"] = snap_row["count"]
```

**Step 4: Run tests to verify they pass**

Run: `python -m pytest lib/content_store/tests/test_snapshots.py -v`
Expected: All PASS

**Step 5: Commit**

```bash
git add lib/content_store/store.py lib/content_store/tests/test_snapshots.py
git commit -m "feat(content_store): include snapshot_count in stats()"
```

---

### Notes

**What this does NOT do (YAGNI):**
- No automatic pruning/retention policy — add when storage becomes an issue
- No FTS on snapshots — search the main table, then get_history() for the match
- No vector embeddings on snapshots — same reasoning
- No diff_snapshots() method — consumers can diff the returned dicts themselves
- No migration of existing `url_content` rows into snapshots — existing rows have no prior versions to reconstruct; new snapshots start accumulating from the moment `enable_snapshots=True` takes effect
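
Since `diff_snapshots()` is deliberately out of scope, the consumer-side diff is only a few lines. A sketch (the `diff_rows` helper and its `ignore` default are hypothetical, not part of the plan's API):

```python
def diff_rows(old: dict, new: dict,
              ignore: tuple[str, ...] = ("fetched_at", "content_hash")) -> dict:
    """Field-level diff between two snapshot rows.

    Returns {field: (old_value, new_value)} for fields whose values differ,
    skipping bookkeeping fields that change on every snapshot.
    """
    keys = set(old) | set(new)
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(keys)
        if k not in ignore and old.get(k) != new.get(k)
    }
```

Typical usage: take two entries from `get_history()` and call `diff_rows(history[1], history[0])` to see what changed between consecutive versions.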

**Schema evolution:** When ContentStore opens an existing DB and `enable_snapshots=True`, `_create_schema` will `CREATE TABLE IF NOT EXISTS` the snapshot table. Existing main-table data is untouched. Future `put()` calls start appending snapshots.

**Storage efficiency:** Snapshots use the same zstd compression as the main table. The content_hash dedup means identical re-fetches don't balloon storage. For a VC portfolio page re-scraped weekly: ~52 fetches/year, but maybe 4-6 actual changes → 4-6 snapshots stored.
