# Learning Eval System — Design

**Date**: 2026-02-24
**Status**: Approved design
**Supersedes**: `2026-02-24-learning-eval-vision.md` (brainstorm)

## Core Idea

The learning system's value is measured by one question: **are the principles actually helping?** Everything else — extraction, generalization, linking — serves that question.

The eval system consolidates existing tools (`retroactive_study.py`, `sandbox_replay.py`, `search_eval.py`) into a unified pipeline that writes to `learning.db`, with a UI driven by the questions users need answered.

## Architecture

### Two-tier eval pipeline

```
         Tier 1: Retroactive Annotation          Tier 2: Sandbox Execution
         ~$0.001/episode, hundreds of eps         ~$0.05/run, tens of runs
         "Would this principle have helped?"      "Did injecting it actually help?"
                        │                                      │
                        └──────────┬───────────────────────────┘
                                   ▼
                             learning.db
                           (single truth store)
                                   │
                           ┌───────┴───────┐
                           ▼               ▼
                     Gradio UI        materialize.py
                   (eval dashboard)    → principles/
```

**Tier 1** (retroactive annotation): Cheap, broad. Scans session episodes against all principles. Writes `eval_annotations` to learning.db. Evolves from existing `retroactive_study.py`.

**Tier 2** (sandbox execution): Expensive, targeted. Runs Claude Code in Docker with/without principles injected. Measures real wall-clock time, tool counts, and quality. Builds on the existing `sandbox_replay.py` — integration deferred until it is better understood; component tests only for now.

Tier 1 filters what goes to Tier 2 — only sandbox-test principle×error combinations that Tier 1 flags as relevant.
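A plausible sketch of that filter, treating high-confidence `violated` annotations as the relevance signal. The dict keys mirror `eval_annotations` columns; `tier2_candidates` and the 0.8 threshold are illustrative, not the real interface:

```python
def tier2_candidates(annotations, min_confidence=0.8):
    """Pick principle×episode combos worth a ~$0.05 sandbox run:
    only those Tier 1 flagged as 'violated' with high confidence."""
    return [
        (a["principle_id"], a["session_id"], a["episode_index"])
        for a in annotations
        if a["annotation"] == "violated" and a["confidence"] >= min_confidence
    ]
```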

### Pipeline-centric design (reconstructability)

The pipeline is the product, not the schema. Learning should be reconstructable into a new schema by replaying the same processes with different readers/writers.

```
Readers              Processors            Writers

┌────────────────┐
│ Session JSONL  │      ┌───────────────────┐
│ YouTube        │      │  extract()        │     ┌────────────────┐
│ Papers         │─────▶│  generalize()     │────▶│ future store   │
│ Code review    │      │  link()           │     │ learning.db    │
│ Manual         │      │  evaluate()       │     │ semanticnet    │
└────────────────┘      └───────────────────┘     └────────────────┘

                                                  ┌────────────────┐
                                                  │ principles/    │
                                                  │ YAML export    │
                                                  └────────────────┘
```

Processors are stateless — take input, produce output, no DB coupling in the transform logic. This enables SemanticNet unification later: swap readers and writers, keep processors.

## Data Integrity

### Principle: never delete, always deprecate

Principles are the curated, high-value layer (239 today). They are never deleted.

**Lifecycle:**
```
proposed → active → updated (changelog tracks diffs)
                  → deprecated (reason + superseded_by)
                  → rejected (if proven harmful)
```

**For corrections:** Update principle text in place — changelog captures the diff, old version lives in the changelog. If the correction is substantial enough to be a different principle, deprecate the old one with `superseded_by` pointing to the new.

### Three layers of protection

**Layer 1: Changelog table** (append-only, populated by triggers)

```sql
CREATE TABLE IF NOT EXISTS principle_changelog (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    principle_id TEXT NOT NULL,
    changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    action TEXT NOT NULL,      -- created | updated | deprecated | rejected | restored
    field_changed TEXT,        -- null for 'created', specific field for 'updated'
    old_value TEXT,
    new_value TEXT,
    reason TEXT,
    changed_by TEXT            -- 'claude-code' | 'manual' | session_id
);
```

Triggers on `principles`: AFTER INSERT logs `created`, AFTER UPDATE logs per-field diffs for `text`, `name`, `status`, `full_text`, `rationale`, `anti_pattern`.
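The trigger pattern can be demonstrated against a cut-down `principles` table (the real table has more fields; each tracked field gets one `AFTER UPDATE OF` trigger of the same shape):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE principles (id TEXT PRIMARY KEY, text TEXT, status TEXT);
CREATE TABLE principle_changelog (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    principle_id TEXT NOT NULL,
    action TEXT NOT NULL,
    field_changed TEXT,
    old_value TEXT,
    new_value TEXT
);
-- AFTER INSERT logs 'created'
CREATE TRIGGER principles_insert AFTER INSERT ON principles
BEGIN
    INSERT INTO principle_changelog (principle_id, action)
    VALUES (NEW.id, 'created');
END;
-- One per-field trigger; IS NOT handles NULLs correctly
CREATE TRIGGER principles_update_text AFTER UPDATE OF text ON principles
WHEN OLD.text IS NOT NEW.text
BEGIN
    INSERT INTO principle_changelog
        (principle_id, action, field_changed, old_value, new_value)
    VALUES (NEW.id, 'updated', 'text', OLD.text, NEW.text);
END;
""")
db.execute("INSERT INTO principles VALUES ('dev/read-before-edit', 'Read first.', 'active')")
db.execute("UPDATE principles SET text = 'Always read first.' WHERE id = 'dev/read-before-edit'")
rows = db.execute("SELECT action, field_changed FROM principle_changelog ORDER BY id").fetchall()
# rows == [('created', None), ('updated', 'text')]
```

Callers never touch the changelog directly; any code path that inserts or updates a principle gets history for free.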

**Layer 2: Git-tracked YAML export**

`learn export` → `learning/data/principles_export.yaml`. Full principle records with links, human-readable, diffable. 239 principles ≈ 50KB YAML. Run after any batch operation.

```yaml
exported_at: "2026-02-24T10:30:00-08:00"
schema_version: 1
principles:
  - id: "dev/read-before-edit"
    name: "Read Before Edit"
    text: "Always read a file before editing it..."
    status: active
    learning_type: principle
    abstraction_level: 4
    links:
      - instance_id: "inst-abc123"
        link_type: supports
    application_count: 12
    success_rate: 0.83
```

**Layer 3: SQLite WAL mode + periodic backups**

Verify WAL is enabled. Backups already happening (`learning.db.bak-{date}`). Formalize cadence.

### Enforcement

- Remove any `DELETE FROM principles` code paths from `learning_store.py`
- Add `deprecate_principle(id, reason, superseded_by=None)` method
- `principle_refine.py` writes to learning.db (not directly to `~/.claude/principles/`)

## Data Model Changes

### New: `principle_changelog`

See Layer 1 above. Populated by triggers — zero effort from callers.

### New: `eval_annotations`

Replaces standalone `retroactive_study_results.json`.

```sql
CREATE TABLE IF NOT EXISTS eval_annotations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    principle_id TEXT NOT NULL REFERENCES principles(id),
    session_id TEXT,
    episode_index INTEGER,
    episode_text TEXT,          -- truncated context
    annotation TEXT NOT NULL,   -- followed | violated | not_applicable
    confidence REAL,
    annotator_model TEXT,
    annotated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata TEXT               -- JSON: refinement suggestions, new_patterns
);
```

### No changes to existing tables

`principle_applications` already captures outcome data. `eval_annotations` is the "would this have applied?" signal that feeds into application tracking.

## UI: Driven by Questions

The UI answers six questions, ordered by urgency.

### Q1: Are we learning useful things?

New instances this period → how many linked to principles vs orphaned vs contradicting? High orphan rate = extraction is noisy or principles have gaps. Contradictions are the most interesting signal.

### Q2: Are the learnings helping?

Prevention rate: % of recent errors where a relevant principle existed and following it would have avoided the error. The single headline metric.
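One way to compute this from `eval_annotations` rows, assuming the error episodes come from the error→repair corpus — of the episodes that errored, what fraction had a `violated` annotation (a relevant principle existed and was not followed)? Function name and shapes are illustrative:

```python
def prevention_rate(error_episodes, annotations):
    """error_episodes: set of (session_id, episode_index) that errored.
    Returns the fraction of those with a 'violated' annotation."""
    violated = {
        (a["session_id"], a["episode_index"])
        for a in annotations
        if a["annotation"] == "violated"
    }
    if not error_episodes:
        return 0.0
    return len(error_episodes & violated) / len(error_episodes)
```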

### Q3: Most/least useful principles?

Ranked by `application_count × success_rate`. Top = proven valuable. Bottom = either untested (98 with zero applications — unknown, not bad) or tested and failing (actively harmful).

### Q4: Merge candidates?

Principles with embedding similarity > 0.85. Example: "dev" ↔ "development" (0.91), "Fail Loud" ↔ "No Silent Failures" (0.88). Action: side-by-side diff → merge into one, deprecate the other.
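Detection is a pairwise cosine-similarity pass over principle embeddings (assumed precomputed); the 0.85 threshold matches the UI section. A minimal sketch:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_candidates(embeddings, threshold=0.85):
    """embeddings: {principle_id: vector}. Returns (id_a, id_b, score)
    for every pair above the threshold."""
    ids = sorted(embeddings)
    return [
        (a, b, round(cosine(embeddings[a], embeddings[b]), 2))
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine(embeddings[a], embeddings[b]) > threshold
    ]
```

With 239 principles the O(n²) pass is ~28k comparisons — cheap enough to run on every dashboard load.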

### Q5: Tweak candidates?

Low average link strength between a principle and its instances. The instances don't clearly map to the principle text — it's too vague or too abstract. Action: refine the principle text to better match its evidence.

### Q6: Overgeneralizing? Where is applying the principle making things worse?

Principles with high application count but low success rate. The system retrieves them, follows them, and gets worse results. **This is the most urgent signal** — an actively harmful principle. Action: narrow scope, add caveats, or deprecate.

### Layout

```
┌─ Landing: Learning Health ─────────────────────────────┐
│                                                        │
│  This week: 34 new instances                           │
│    → 21 linked to principles (62%)                     │
│    → 8 orphaned (no principle match)                   │
│    → 5 contradicting existing principles ⚠️            │
│                                                        │
│  Prevention rate: 23%                                  │
│                                                        │
├─ Ranked: Most → Least Useful ──────────────────────────┤
│                                                        │
│  Principle              Apps  Success  Trend           │
│  Read Before Edit         12    83%    ↑               │
│  Parallel Tool Calls       7    71%    →               │
│  ...                                                   │
│  Match Log to Intent       3    33%    ↓ ⚠️ hurting?   │
│  Bootstrap via Meta-Doc    2     0%    ↓ ⚠️ harmful    │
│                                                        │
├─ Merge Candidates (similarity > 0.85) ─────────────────┤
│                                                        │
│  "dev" ↔ "development"  (0.91)                         │
│  "Fail Loud" ↔ "No Silent Failures" (0.88)             │
│                                                        │
├─ Tweak Candidates (weak links) ────────────────────────┤
│                                                        │
│  "Importance Ordering" — 4 instances, avg strength 0.3 │
│  Text too abstract? Instances don't clearly map.       │
│                                                        │
├─ Overgeneralizing? (applied but failing) ──────────────┤
│                                                        │
│  "Match Log to Intent" — followed 3×, helped 1×,       │
│    hurt 2×. Over-verbose logging obscured errors.      │
│                                                        │
└─ Drill-down (click any principle) ─────────────────────┘
│  Instances: 4 supporting, 1 contradicting              │
│  Applications: 12 total, 83% success                   │
│  Changelog: created Feb 1 → text updated Feb 15 → ...  │
│  Recent episodes where this was relevant               │
└────────────────────────────────────────────────────────┘
```

### Actionable outputs per section

| Section | Action |
|---|---|
| Orphaned instances | Link to principle, or propose new one |
| Contradicting instances | Review — principle wrong, or edge case? |
| Low success rate | Narrow scope, add caveats, or deprecate |
| Merge candidates | Side-by-side diff → merge, deprecate duplicate |
| Weak links | Refine principle text to match evidence |
| Harmful principles | Most urgent — deprecate or restrict domain |

## Eval Pipeline (Tier 1)

Evolve `retroactive_study.py` to write to `eval_annotations`:

1. Load active principles from learning.db
2. Find recent sessions (`--days`, `--sessions`)
3. Break into episodes (reuse `session_extract` chunking)
4. For each (episode, principle): LLM annotate → `eval_annotations` row
5. Update `principle_applications` with new evidence
6. Update `principles.application_count` and `success_rate`

Key change: use `session_extract`'s episode chunking instead of the simpler "user prompt → tool sequence" splitting.
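The six steps above can be sketched as a driver loop. `chunk`, `annotate`, and `store` are stand-ins for the real `session_extract` chunker, the LLM annotator, and the `learning_store.py` API:

```python
def run_tier1_eval(principles, sessions, chunk, annotate, store):
    """For each (episode, principle) pair, record one annotation row."""
    for session in sessions:
        for idx, episode in enumerate(chunk(session)):
            for principle in principles:
                result = annotate(episode, principle)  # LLM call in the real pipeline
                store.append({
                    "principle_id": principle["id"],
                    "session_id": session["id"],
                    "episode_index": idx,
                    "annotation": result["annotation"],
                    "confidence": result["confidence"],
                })
```

Steps 5–6 (updating `principle_applications` and the rollup counters) would follow as a separate aggregation pass over the stored rows.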

## CLI

```bash
learn eval                    # Run Tier 1 on recent sessions (default: 30 days)
learn eval --days 7           # Scope to last week
learn eval --sessions 10      # Limit session count
learn eval --summary          # Print headline stats to stdout
learn export                  # YAML export of principles + links
```

## Testing

| Test | What | Dependencies |
|---|---|---|
| `test_changelog_triggers` | INSERT/UPDATE principles → changelog populated | SQLite in-memory |
| `test_yaml_export_roundtrip` | Export → reimport → identical principles | SQLite in-memory |
| `test_no_hard_deletes` | No `DELETE FROM principles` in codebase | Grep (static) |
| `test_eval_annotation_pipeline` | Fixture session → episodes → annotations → DB | LLM (flash) |
| `test_docker_sandbox_builds` | Image builds, CC runs trivial prompt | Docker |
| `test_docker_sandbox_repo_clone` | Mount repo, clone at SHA, verify files | Docker |
| `test_merge_candidate_detection` | Similar principles → flagged | Embeddings |

Docker tests tagged `@pytest.mark.docker` — skipped in CI, run explicitly.

## Deferred

| Item | Why | When |
|---|---|---|
| Sandbox replay integration into unified pipeline | Need to understand it better first | After Tier 1 is working |
| SemanticNet unification | Processors designed reusable; shared interface later | When a second domain needs eval |
| Beam search over extraction configs | Needs both tiers working | After Tier 2 integration |
| Full CLI report formatting | Gradio dashboard is the primary interface | If needed |

## Existing Infrastructure Reused

| Component | Location | Role |
|---|---|---|
| `retroactive_study.py` | `learning/session_review/` | Evolves into Tier 1 |
| `sandbox_replay.py` | `learning/session_review/` | Tier 2 (deferred integration) |
| `search_eval.py` | `learning/` | Retrieval quality measurement |
| `session_extract` | `learning/session_extract/` | Episode chunking for Tier 1 |
| `learning_store.py` | `learning/schema/` | DB API — add changelog + annotations |
| `app.py` | `learning/schema/` | Gradio UI — add eval dashboard tab |
| `materialize.py` | `learning/schema/` | Export principles to `~/.claude/principles/` |
| `link_instances.py` | `learning/schema/` | Auto-linking instances to principles |
| `failures.db` | `learning/session_review/data/` | 1,032 error→repair pairs as eval corpus |
