# Retroactive Principle Application Study — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Scan the past 30 days of Claude Code sessions to identify where learning.db principles would have applied, populate `principle_applications`, and generate a utility report.

**Architecture:** Single CLI script (`learning/session_review/retroactive_study.py`) that: (1) loads principles + their embeddings, (2) mines error contexts from session transcripts via existing `failure_mining`, (3) uses embedding similarity to shortlist candidate principles per error, (4) calls Gemini Flash as judge to determine applicability, (5) records matches to `principle_applications` table, (6) generates HTML summary report.
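At a glance, the pipeline reduces to the nested loop below. This is a minimal dataflow sketch only: the stub functions stand in for the real helpers (`failure_mining`, embedding retrieval, the Gemini judge), and their names and signatures here are illustrative, not the actual APIs.

```python
# Dataflow sketch of the study pipeline. Stubs return canned data;
# only the control flow mirrors the plan's steps (2)-(5).

def mine_errors(session):
    # real version: failure_mining.mine_from_transcript(path)
    return session["errors"]

def shortlist(error, principles, k=2):
    # real version: embedding cosine similarity, top-K candidates
    return principles[:k]

def judge(error, principle):
    # real version: Gemini Flash call returning a JSON verdict
    return {"outcome": "would_have_prevented"}

def run(sessions, principles):
    applications = []
    for session in sessions:
        for error in mine_errors(session):
            for principle in shortlist(error, principles):
                verdict = judge(error, principle)
                if verdict["outcome"] != "irrelevant":
                    applications.append((error, principle["id"], verdict["outcome"]))
    return applications

apps = run(
    sessions=[{"errors": ["TypeError in tool call"]}],
    principles=[{"id": "p-1"}, {"id": "p-2"}, {"id": "p-3"}],
)
print(len(apps))  # 2: one error judged against the top-2 candidates
```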

**Tech Stack:** Python, sqlite3, `lib.llm.embed` (Gemini embedding-001), `lib.llm.call_llm` (Gemini Flash for judging), existing `failure_mining.py` infrastructure, `learning_store.py` for DB writes.

---

## Inventory

| Item | Count | Notes |
|------|-------|-------|
| Active principles | 137 | 98 have embeddings, 39 need embedding first |
| Session files (30d) | 3,403 | `~/.claude/projects/**/*.jsonl` |
| Existing failure_pairs | 1,032 | 841 confirmed repairs |
| Existing principle_applications | 785 | 0 from retroactive study |
| Embedding model | `gemini/gemini-embedding-001` | Already used for principles |
| Judge model | `gemini/gemini-3-flash` | Cheap, fast, good enough for binary judgement |

## Key Reuse

- `failure_mining.find_transcripts(days=30)` — session discovery
- `failure_mining.parse_transcript(path)` — JSONL loading
- `failure_mining.mine_from_transcript(path)` — error extraction (returns `FailurePair` objects)
- `failure_mining.extract_session_metadata(entries)` — session ID, project
- `LearningStore.get_embeddings("principles")` — pre-computed vectors
- `LearningStore.record_application()` — DB write
- `lib.llm.embed.embed_texts()` — embed error contexts
- `lib.llm.call_llm()` — Gemini Flash judge calls
- `_cosine_similarity()` from `learning/cli.py` — vector similarity (reimplemented locally in the script to avoid importing the CLI module)

---

### Task 1: Ensure All Principles Have Embeddings

**Files:**
- Run: `learning/cli.py` (existing `learn embed` command)

**Step 1: Check current embedding coverage**

Run: `python -m learning.cli embed --stats-only`
Expected: Shows 98/137 principles embedded

**Step 2: Generate missing embeddings**

Run: `python -m learning.cli embed`
Expected: Embeds the remaining 39 principles; all 137 now covered.

**Step 3: Verify**

Run: `sqlite3 learning/data/learning.db "SELECT COUNT(*) FROM principles WHERE embedding IS NOT NULL AND status='active'"`
Expected: 137

---

### Task 2: Create the Retroactive Study Script — Core Structure

**Files:**
- Create: `learning/session_review/retroactive_study.py`

**Step 1: Write the script skeleton with imports and CLI**

```python
#!/usr/bin/env python
"""Retroactive principle application study.

Scans past N days of Claude Code session transcripts to identify where
learning.db principles would have applied — cases where following a principle
would have prevented an error.

Usage:
    python -m learning.session_review.retroactive_study --days 30
    python -m learning.session_review.retroactive_study --days 30 --dry-run
    python -m learning.session_review.retroactive_study --report
"""

import asyncio
import json
import math
import sys
import uuid
from datetime import datetime
from pathlib import Path

import click
from loguru import logger

# Add rivus root to path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))

from learning.schema.learning_store import (
    ApplicationOutcome,
    LearningStore,
    PrincipleApplication,
)
from learning.session_review.failure_mining import (
    find_transcripts,
    mine_from_transcript,
    parse_transcript,
    extract_session_metadata,
)
from lib.llm import call_llm
from lib.llm.embed import embed_texts


# ── Constants ──────────────────────────────────────────────────────
TOP_K_PRINCIPLES = 10       # candidate principles per error context
BATCH_SIZE = 5              # concurrent judge calls
JUDGE_MODEL = "gemini/gemini-3-flash"
RECORDED_BY = "retroactive_study"
MAX_ERRORS_PER_SESSION = 10  # cap to avoid runaway sessions


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```
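The helper can be sanity-checked standalone (function copied verbatim from the skeleton so the snippet runs on its own, no project imports needed):

```python
import math

def cosine_similarity(a, b):
    # Same implementation as the script skeleton above.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0: identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal
print(cosine_similarity([1.0, 0.0], [0.0, 0.0]))  # 0.0: zero-vector guard
```

The zero-vector guard matters in practice: a failed or empty embedding should rank last, not crash the retrieval loop.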

**Step 2: Run to verify imports work**

Run: `python -c "import learning.session_review.retroactive_study"`
Expected: No import errors

---

### Task 3: Principle Loading and Embedding Index

**Files:**
- Modify: `learning/session_review/retroactive_study.py`

**Step 1: Add principle loading with embeddings**

```python
def load_principle_index(store: LearningStore) -> list[dict]:
    """Load all active principles with their embeddings.

    Returns list of {id, name, text, rationale, anti_pattern, embedding}.
    """
    principles = store.list_principles(status="active", limit=500)
    embeddings = dict(store.get_embeddings("principles"))

    index = []
    skipped = 0
    for p in principles:
        emb = embeddings.get(p.id)
        if emb is None:
            skipped += 1
            continue
        index.append({
            "id": p.id,
            "name": p.name,
            "text": p.text or "",
            "rationale": p.rationale or "",
            "anti_pattern": p.anti_pattern or "",
            "embedding": emb,
        })
    if skipped:
        logger.warning(f"Skipped {skipped} principles without embeddings")
    logger.info(f"Loaded {len(index)} principles with embeddings")
    return index
```

**Step 2: Add error-to-embedding function**

```python
async def embed_error_context(error_text: str, context_prompt: str) -> list[float]:
    """Embed an error context for principle retrieval."""
    combined = f"Error: {error_text[:300]}\nContext: {context_prompt[:200]}"
    vectors = await embed_texts([combined], task_type="RETRIEVAL_QUERY")
    return vectors[0]
```

**Step 3: Add top-K retrieval**

```python
def find_candidate_principles(
    error_embedding: list[float],
    principle_index: list[dict],
    top_k: int = TOP_K_PRINCIPLES,
) -> list[dict]:
    """Find top-K most similar principles to an error context."""
    scored = []
    for p in principle_index:
        sim = cosine_similarity(error_embedding, p["embedding"])
        scored.append({**p, "similarity": sim})
    scored.sort(key=lambda x: x["similarity"], reverse=True)
    return scored[:top_k]
```
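Expected behaviour on toy vectors (self-contained sketch; real principle embeddings are Gemini vectors with hundreds of dimensions, these 2-dim stand-ins just illustrate the ranking):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def find_candidate_principles(error_embedding, principle_index, top_k=2):
    # Same score-sort-slice logic as the plan's function.
    scored = [
        {**p, "similarity": cosine_similarity(error_embedding, p["embedding"])}
        for p in principle_index
    ]
    scored.sort(key=lambda x: x["similarity"], reverse=True)
    return scored[:top_k]

# Hypothetical principle IDs, for illustration only.
index = [
    {"id": "p-verify", "embedding": [1.0, 0.0]},
    {"id": "p-tests", "embedding": [0.7, 0.7]},
    {"id": "p-docs", "embedding": [0.0, 1.0]},
]
top = find_candidate_principles([0.9, 0.1], index, top_k=2)
print([p["id"] for p in top])  # ['p-verify', 'p-tests']
```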

**Step 4: Verify loading works**

Run: `python -c "from learning.schema.learning_store import LearningStore; from learning.session_review.retroactive_study import load_principle_index; idx = load_principle_index(LearningStore()); print(f'{len(idx)} principles loaded')"`
Expected: ~137 principles loaded

---

### Task 4: LLM Judge Prompt and Caller

**Files:**
- Modify: `learning/session_review/retroactive_study.py`

**Step 1: Add the judge prompt**

```python
JUDGE_PROMPT = """\
You are evaluating whether a software development principle would have prevented an error in a Claude Code session.

## Error Context
Tool: {tool_name}
Error: {error_text}
Session context: {context_prompt}
{repair_info}

## Candidate Principle
ID: {principle_id}
Name: {principle_name}
Description: {principle_text}
Anti-pattern: {anti_pattern}

## Task
Would following this principle have PREVENTED this specific error?

Answer with JSON:
{{
  "applicable": true/false,
  "confidence": "high" | "medium" | "low",
  "reasoning": "1-2 sentence explanation",
  "outcome": "would_have_prevented" | "partially_relevant" | "irrelevant"
}}

Be strict: only mark "would_have_prevented" if the principle DIRECTLY addresses the root cause.
"partially_relevant" means the principle is related but wouldn't have prevented THIS specific error.
"""


async def judge_principle_applicability(
    error: dict,
    principle: dict,
) -> dict | None:
    """Ask LLM judge whether a principle would have prevented an error.

    Returns parsed JSON response or None on failure.
    """
    repair_info = ""
    if error.get("repair_text"):
        repair_info = f"Repair action: {error['repair_text'][:200]}"

    prompt = JUDGE_PROMPT.format(
        tool_name=error["tool_name"],
        error_text=error["error_text"][:500],
        context_prompt=error["context_prompt"][:300],
        repair_info=repair_info,
        principle_id=principle["id"],
        principle_name=principle["name"],
        principle_text=principle["text"][:300],
        anti_pattern=principle.get("anti_pattern", "N/A")[:200],
    )

    try:
        response = await call_llm(
            prompt,
            model=JUDGE_MODEL,
            temperature=0.0,
            response_format={"type": "json_object"},
        )
        return json.loads(response)
    except Exception as e:
        logger.warning(f"Judge call failed for {principle['id']}: {e}")
        return None
```

**Step 2: Smoke-test the judge with a synthetic error**

Run: `python -c "..."` (quick `asyncio.run` test with one hardcoded error + principle)
Expected: Returns valid JSON with applicable/confidence/reasoning fields
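Before wiring up the live judge, the expected response shape can also be validated offline. A minimal sketch of what Task 5's pipeline expects back, using a canned verdict (the sample JSON is illustrative, not real model output):

```python
import json

# A well-formed verdict per the JUDGE_PROMPT schema (hand-written sample).
sample = """{
  "applicable": true,
  "confidence": "high",
  "reasoning": "The principle mandates verifying file state before editing.",
  "outcome": "would_have_prevented"
}"""

verdict = json.loads(sample)

REQUIRED_FIELDS = {"applicable", "confidence", "reasoning", "outcome"}
VALID_OUTCOMES = {"would_have_prevented", "partially_relevant", "irrelevant"}

assert REQUIRED_FIELDS <= verdict.keys(), "judge response missing fields"
assert verdict["outcome"] in VALID_OUTCOMES, "unexpected outcome label"
print(verdict["outcome"])  # would_have_prevented
```

Folding checks like these into `judge_principle_applicability` would let malformed responses fail fast instead of propagating `None`-ish values into the outcome mapping.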

---

### Task 5: Session Processing Pipeline

**Files:**
- Modify: `learning/session_review/retroactive_study.py`

**Step 1: Add error extraction from sessions**

```python
def extract_errors_from_session(path: Path) -> list[dict]:
    """Extract error contexts from a session transcript.

    Returns list of {tool_name, error_text, context_prompt, repair_text,
    session_id, project, error_category}.
    """
    entries = parse_transcript(path)
    if not entries:
        return []

    metadata = extract_session_metadata(entries)
    failure_pairs, _ = mine_from_transcript(path)

    errors = []
    for fp in failure_pairs[:MAX_ERRORS_PER_SESSION]:
        first = fp.first_failure  # dict with tool_name, input_summary, error_text
        repair_text = ""
        if fp.repair_candidates:
            best = fp.repair_candidates[0]
            repair_text = f"{best.get('tool_name', '')}: {(best.get('result') or '')[:200]}"

        errors.append({
            "tool_name": first.get("tool_name", "unknown"),
            "error_text": first.get("error_text", "")[:500],
            "context_prompt": fp.context_prompt or "",
            "repair_text": repair_text,
            "session_id": metadata.get("session_id", ""),
            "project": metadata.get("project", ""),
            "error_category": fp.error_category,
        })
    return errors
```

**Step 2: Add the main async pipeline**

```python
async def process_session(
    path: Path,
    principle_index: list[dict],
    store: LearningStore,
    dry_run: bool = False,
) -> dict:
    """Process one session: extract errors, find candidate principles, judge.

    Returns {session_id, errors_found, judgements_made, applications_recorded}.
    """
    errors = extract_errors_from_session(path)
    if not errors:
        return {"session_id": "", "errors_found": 0, "judgements_made": 0, "applications_recorded": 0}

    stats = {
        "session_id": errors[0]["session_id"],
        "errors_found": len(errors),
        "judgements_made": 0,
        "applications_recorded": 0,
    }

    for error in errors:
        # Embed error context and find candidate principles
        error_emb = await embed_error_context(error["error_text"], error["context_prompt"])
        candidates = find_candidate_principles(error_emb, principle_index)

        # Judge each candidate
        for principle in candidates:
            judgement = await judge_principle_applicability(error, principle)
            stats["judgements_made"] += 1

            if judgement is None:
                continue

            if judgement.get("outcome") == "irrelevant":
                continue

            outcome_map = {
                "would_have_prevented": ApplicationOutcome.SUCCESS,
                "partially_relevant": ApplicationOutcome.PARTIAL,
            }
            outcome = outcome_map.get(judgement.get("outcome"), ApplicationOutcome.UNKNOWN)

            if dry_run:
                logger.info(
                    f"  [DRY] {principle['id']}: {judgement['outcome']} "
                    f"({judgement.get('confidence', '?')})"
                )
                stats["applications_recorded"] += 1
                continue

            app = PrincipleApplication(
                id=str(uuid.uuid4()),
                principle_id=principle["id"],
                session_id=error["session_id"],
                project=error["project"],
                context_snippet=f"{error['tool_name']}: {error['error_text'][:200]}",
                outcome=outcome,
                outcome_notes=judgement.get("reasoning", ""),
                prevented_error=error["error_category"],
                recorded_by=RECORDED_BY,
            )
            try:
                store.record_application(app)
                stats["applications_recorded"] += 1
            except Exception as e:
                logger.warning(f"Failed to record application: {e}")

    return stats
```

**Step 3: Verify with one session**

Run: `python -c "import asyncio; from learning.session_review.retroactive_study import *; from learning.schema.learning_store import LearningStore; from learning.session_review.failure_mining import find_transcripts; store = LearningStore(); idx = load_principle_index(store); paths = find_transcripts(days=30); print(f'{len(paths)} sessions'); result = asyncio.run(process_session(paths[0], idx, store, dry_run=True)); print(result)"`
Expected: Shows dry-run judgements for first session

---

### Task 6: CLI Entry Point and Session Sampling

**Files:**
- Modify: `learning/session_review/retroactive_study.py`

**Step 1: Add session sampling (prefer sessions with errors)**

```python
def sample_sessions_with_errors(
    paths: list[Path],
    target: int = 30,
) -> list[Path]:
    """Sample sessions that have errors, up to target count.

    Scans sessions, keeps those with >=1 failure pair.
    Stops early once target reached.
    """
    selected = []
    scanned = 0
    for path in reversed(paths):  # newest first
        scanned += 1
        try:
            failures, _ = mine_from_transcript(path)
            if failures:
                selected.append(path)
                if len(selected) >= target:
                    break
        except Exception as e:
            logger.debug(f"Skip {path.name}: {e}")
            continue
    logger.info(f"Scanned {scanned} sessions, found {len(selected)} with errors")
    return selected
```

**Step 2: Add the CLI with click**

```python
async def run_study(days: int, target_sessions: int, dry_run: bool):
    """Main study entry point."""
    store = LearningStore()
    principle_index = load_principle_index(store)

    paths = find_transcripts(days=days)
    logger.info(f"Found {len(paths)} sessions from last {days} days")

    selected = sample_sessions_with_errors(paths, target=target_sessions)

    all_stats = []
    for i, path in enumerate(selected):
        logger.info(f"[{i+1}/{len(selected)}] Processing {path.name}")
        stats = await process_session(path, principle_index, store, dry_run=dry_run)
        all_stats.append(stats)
        if stats["applications_recorded"] > 0:
            logger.info(
                f"  → {stats['errors_found']} errors, "
                f"{stats['applications_recorded']} applications recorded"
            )

    # Summary
    total_errors = sum(s["errors_found"] for s in all_stats)
    total_judgements = sum(s["judgements_made"] for s in all_stats)
    total_apps = sum(s["applications_recorded"] for s in all_stats)
    logger.info(f"\nDone. {len(selected)} sessions, {total_errors} errors, "
                f"{total_judgements} judgements, {total_apps} applications recorded")
    return all_stats


@click.command()
@click.option("--days", default=30, help="Look back N days")
@click.option("--sessions", default=30, help="Target number of sessions with errors")
@click.option("--dry-run", is_flag=True, help="Don't write to DB, just log")
@click.option("--report", is_flag=True, help="Generate report from existing data")
def main(days: int, sessions: int, dry_run: bool, report: bool):
    """Retroactive principle application study."""
    if report:
        generate_report()
        return
    asyncio.run(run_study(days, sessions, dry_run))


if __name__ == "__main__":
    main()
```

**Step 3: Test CLI help**

Run: `python -m learning.session_review.retroactive_study --help`
Expected: Shows usage with --days, --sessions, --dry-run, --report options

---

### Task 7: Report Generator

**Files:**
- Modify: `learning/session_review/retroactive_study.py`

**Step 1: Add report generation function**

```python
def generate_report():
    """Generate HTML summary report from retroactive study data."""
    store = LearningStore()
    db_path = store.db_path

    import sqlite3
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row

    # Principles ranked by retroactive hit rate
    rows = conn.execute("""
        SELECT
            p.id, p.name, p.text,
            COUNT(pa.id) as hit_count,
            SUM(CASE WHEN pa.outcome = 'success' THEN 1 ELSE 0 END) as prevented_count,
            SUM(CASE WHEN pa.outcome = 'partial' THEN 1 ELSE 0 END) as partial_count,
            GROUP_CONCAT(DISTINCT pa.prevented_error) as error_types
        FROM principles p
        LEFT JOIN principle_applications pa
            ON pa.principle_id = p.id AND pa.recorded_by = 'retroactive_study'
        WHERE p.status = 'active'
        GROUP BY p.id
        ORDER BY hit_count DESC
    """).fetchall()

    # Zero-hit principles (prune candidates)
    zero_hit = [r for r in rows if r["hit_count"] == 0]
    has_hits = [r for r in rows if r["hit_count"] > 0]

    # Error category breakdown
    category_rows = conn.execute("""
        SELECT prevented_error as category, COUNT(*) as cnt
        FROM principle_applications
        WHERE recorded_by = 'retroactive_study'
        GROUP BY prevented_error
        ORDER BY cnt DESC
    """).fetchall()

    conn.close()

    # Build HTML
    html_parts = [
        "<!DOCTYPE html><html><head>",
        "<title>Retroactive Principle Application Study</title>",
        "<style>",
        "body { font-family: system-ui; max-width: 1000px; margin: 2em auto; padding: 0 1em; }",
        "table { border-collapse: collapse; width: 100%; margin: 1em 0; }",
        "th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }",
        "th { background: #f5f5f5; }",
        "tr:nth-child(even) { background: #fafafa; }",
        ".stat { font-size: 2em; font-weight: bold; color: #333; }",
        ".stat-label { color: #666; font-size: 0.9em; }",
        ".stats-row { display: flex; gap: 2em; margin: 1em 0; }",
        ".zero-hit { color: #999; }",
        "</style>",
        "</head><body>",
        "<h1>Retroactive Principle Application Study</h1>",
        f"<p>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p>",

        '<div class="stats-row">',
        f'<div><div class="stat">{len(has_hits)}</div><div class="stat-label">Principles with hits</div></div>',
        f'<div><div class="stat">{len(zero_hit)}</div><div class="stat-label">Zero-hit (prune candidates)</div></div>',
        f'<div><div class="stat">{sum(r["hit_count"] for r in rows)}</div><div class="stat-label">Total applications</div></div>',
        '</div>',

        "<h2>Principles Ranked by Retroactive Hits</h2>",
        "<table><tr><th>Principle</th><th>Hits</th><th>Prevented</th><th>Partial</th><th>Error Types</th></tr>",
    ]
    import html  # escape DB text before interpolating into HTML (names may contain < or &)

    for r in has_hits:
        html_parts.append(
            f"<tr><td><b>{html.escape(r['name'])}</b><br><small>{html.escape(r['id'])}</small></td>"
            f"<td>{r['hit_count']}</td><td>{r['prevented_count']}</td>"
            f"<td>{r['partial_count']}</td>"
            f"<td><small>{html.escape(r['error_types'] or '')}</small></td></tr>"
        )
    html_parts.append("</table>")

    html_parts.append("<h2>Error Category Breakdown</h2>")
    html_parts.append("<table><tr><th>Category</th><th>Applications</th></tr>")
    for r in category_rows:
        html_parts.append(f"<tr><td>{html.escape(r['category'] or '')}</td><td>{r['cnt']}</td></tr>")
    html_parts.append("</table>")

    html_parts.append(f"<h2>Zero-Hit Principles ({len(zero_hit)} — prune candidates)</h2>")
    html_parts.append("<table><tr><th>Principle</th><th>Description</th></tr>")
    for r in zero_hit[:50]:
        html_parts.append(
            f'<tr class="zero-hit"><td>{html.escape(r["name"])}<br><small>{html.escape(r["id"])}</small></td>'
            f'<td><small>{html.escape((r["text"] or "")[:150])}</small></td></tr>'
        )
    html_parts.append("</table></body></html>")

    out_path = Path(__file__).parent / "data" / "retroactive_study_report.html"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(html_parts))
    logger.info(f"Report written to {out_path}")
    click.echo(f"Report: {out_path}")
```

**Step 2: Test report generation (empty initially)**

Run: `python -m learning.session_review.retroactive_study --report`
Expected: Generates HTML file (with 0 hits since we haven't run the study yet)

---

### Task 8: Dry Run on 5 Sessions

**Step 1: Run dry-run on small sample**

Run: `python -m learning.session_review.retroactive_study --days 30 --sessions 5 --dry-run`
Expected: Shows principle matches for ~5 sessions without writing to DB. Verify output makes sense.

**Step 2: Review output quality**

Check that:
- Candidate principles are semantically relevant to the errors
- Judge verdicts seem reasonable (not matching everything or nothing)
- No crashes on edge cases

---

### Task 9: Full Run (30 Sessions)

**Step 1: Run the full study**

Run: `python -m learning.session_review.retroactive_study --days 30 --sessions 30`
Expected: Processes 30 sessions, records applications to DB.

**Step 2: Verify DB entries**

Run: `sqlite3 learning/data/learning.db "SELECT COUNT(*), outcome FROM principle_applications WHERE recorded_by='retroactive_study' GROUP BY outcome"`
Expected: Non-zero counts for success/partial outcomes

**Step 3: Generate the report**

Run: `python -m learning.session_review.retroactive_study --report`
Expected: HTML report with ranked principles, error breakdown, and prune candidates.

---

### Task 10: Commit

**Step 1: Commit new script and report**

```bash
git add learning/session_review/retroactive_study.py
git commit -m "feat(learning): retroactive principle application study

Scans recent Claude Code sessions for errors and judges which principles
would have prevented them. Uses embedding similarity for candidate
retrieval and Gemini Flash as judge.

Populates principle_applications table with recorded_by='retroactive_study'."
```

---

## Cost Estimate

| Operation | Count | Unit Cost | Total |
|-----------|-------|-----------|-------|
| Embed error contexts | ~150 errors | ~free (Gemini embed) | ~$0.01 |
| Judge calls (Gemini Flash) | ~1,500 (150 errors × 10 principles) | ~$0.002/call | ~$3 |
| **Total** | | | **~$3** |
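A back-of-envelope check of the table, using the unit costs assumed above:

```python
errors = 150
judge_calls = errors * 10          # top-10 candidate principles per error
judge_cost = judge_calls * 0.002   # ~$0.002 per Gemini Flash judge call
embed_cost = 0.01                  # embedding is effectively free at this scale
print(judge_calls, round(judge_cost + embed_cost, 2))  # 1500 3.01
```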

## Risk Mitigations

- **Dry-run first**: `--dry-run` flag lets us inspect judgements before committing
- **`recorded_by` tag**: All entries tagged `retroactive_study` — easy to delete if quality is poor: `DELETE FROM principle_applications WHERE recorded_by='retroactive_study'`
- **Cap per session**: `MAX_ERRORS_PER_SESSION=10` prevents runaway on noisy sessions
- **Embedding pre-filter**: Only 10 candidates per error (not all 137) keeps costs bounded
