Learning System Guide

How the system captures knowledge from sessions, stores it in a unified DB, materializes it as principles Claude can act on, and runs gyms to evaluate and improve components.

Contents

  Part I — What & Why
    1. Why This Exists
    2. Core Concepts
    3. Walkthrough
    4. Self-Improvement Loops
    5. Status: What Works & What Doesn't
  Part II — Technical Reference
    6. Quick Start
    7. How Data Flows
    8. Runtime Components
    9. CLI Reference
    10. Principle Files
    11. Session Review
    12. Gyms (Detailed)
    13. Skills
    14. Directory Map
    15. DB Schema
    16. Design Decisions
    17. Troubleshooting
Part I
What & Why
For anyone new to the project — understand the problem, the approach, and current status before diving into implementation details.

1. Why This Exists

Every Claude Code session starts with no memory of previous ones. Hard-won insights from yesterday's debugging session are gone.

The problem, concretely

On Feb 3, Claude spent 12 minutes debugging why a Gradio app's head= parameter wasn't injecting CSS. Root cause: Gradio 6 silently ignores head= on gr.Blocks() — it must be passed to launch(). On Feb 8, the same bug. On Feb 14, again. Three sessions, same gotcha, zero memory between them.

With the learning system: the first occurrence was captured as a learning instance, linked to the "No Silent Failures" principle, materialized into ~/.claude/principles/dev.md, and injected into every subsequent session that touched Gradio code. The second and third occurrences never happened.

The learning system closes this loop. It mines sessions for patterns, stores them as evidence-backed principles, and injects relevant ones into future sessions before the same mistakes are repeated.

The gap isn't intelligence — it's institutional memory. A senior engineer doesn't just know more facts; they've internalized principles about when complexity is warranted, when to verify assumptions, when a workaround signals a wrong abstraction. This system builds that kind of accumulated judgment for an AI agent.
By the numbers: 1,436 learning instances · 192 active principles · 3,117 principle applications · 10 principle files.

2. Core Concepts

Four nouns you need to understand:

Instance: a single observation. "Gradio 6 ignores head= on gr.Blocks" — one concrete thing that happened.
Principle: a generalized rule backed by instances. "No Silent Failures" — the pattern across many observations.
Application: a record of a principle being used in a session, with outcome (followed, prevented error, violated).
Materialized file: a read-only .md file generated from the DB. What Claude actually sees in its context.

Observation → Instance → linked to Principle → materialized to .md file → injected into Claude's context

3. Walkthrough: Observation to Behavior Change

Tracing one real example through the entire system:

1. Observe — During a session, you discover that gr.Blocks(head="<style>...") silently ignores the CSS in Gradio 6.
2. Capture — Run: learn add "Gradio 6 silently ignores head= on gr.Blocks — must pass to launch()"
3. Classify — Gemini Flash (~0.3s) classifies it as type=pattern, project=rivus, and links it to the existing "No Silent Failures" principle with strength=0.8
4. Store — Instance goes into learning_instances, link goes into instance_principle_links, embedding auto-generated
5. Materialize — The next run of materialize.py writes the updated principle (with +1 instance count) to ~/.claude/principles/dev.md
6. Apply — In a future session, when Claude is about to edit a Gradio app, principles_worker.py matches the "No Silent Failures" principle and injects it as context. Claude uses launch(head=...) instead of gr.Blocks(head=...).
What the CLI output actually looks like
$ learn add "Gradio 6 silently ignores head= on gr.Blocks — must pass to launch()"
Classifying with gemini-2.5-flash...
  Type: pattern | Project: rivus | Tags: gradio, css, gotcha
  Linked to: No Silent Failures (strength: 0.8)
  Linked to: Prefer Context-Native Mechanisms (strength: 0.6)
  Instance ID: inst_a8f3c2
  Embedding generated ✓

$ learn show no-silent-failures
  Principle: No Silent Failures
  Status: active | Type: principle | File: dev.md
  Instances: 31 | Applications: 219
  Full text: "Never let a function silently ignore, drop, or swallow input.
  If something can't be processed, raise an error or log a warning..."

4. Self-Improvement Loops

Beyond capturing knowledge, the system actively improves itself through two mechanisms: auto-advise (learning what to do proactively) and gyms (testing and optimizing components).

Auto-Advise: Learning Proactive Behavior

Mines follow-up patterns from sessions — cases where the user's next request was a predictable action Claude should have done proactively.

Example pattern: After implementing a new CLI tool in a project with invoke tasks → Claude should proactively run it to verify, add an inv task, and commit. Instead of waiting for three separate user requests.

The pipeline: SessionEnd hook → Opus extracts follow-up patterns → stored as principles in auto-advise.md → injected into future sessions like any other principle. Currently 8 auto-advise patterns active.

Gyms: Generate → Evaluate → Learn

Offline evaluation environments that test and optimize system sub-components using historical data. Each gym tries variants of a prompt or method, scores them with LLM judges, and picks the winner.

| Gym | Status | What it improves |
|---|---|---|
| Badge | Done | iTerm2 session badge text quality |
| Fetchability | Done | Proxy/fetch method selection per site |
| KB | Done | Knowledge extraction from web sources |
| Code Cleanup | Planned | Dead code, boilerplate, convention drift |
| Sidekick | Planned | Intervention timing & helpfulness |

Gym findings feed back into the main learning loop as new instances or principle refinements. See Section 12 for detailed gym documentation.

5. Status: What Works & What Doesn't

An honest assessment of where the system stands, drawn from production stats, the ENHANCE markers in the presentation report, and learning/TODO.md.

What works well

Done Knowledge capture: 1,436 instances from sessions + manual input
Done Principle abstraction: 192 active principles across 10 domains
Done Materialization pipeline: DB → .md files → Claude context
Done Auto-extraction: SessionEnd hook mines learnings + follow-ups automatically
Done CLI: 13 commands covering full CRUD + search + health checks
Done Gyms: Badge, Fetchability, KB gyms completed with gen→eval→learn loops
Done Multi-model repair scoring: 66% unanimous across 6 models on 997 pairs
Done Skills: /learn, /reflect, /recall consolidated

In progress

WIP Hybrid retrieval: FTS + semantic search not yet combined (learn find uses one or the other)
WIP LLM re-ranker: Negation queries plateau at P@1=0.769 — needs re-ranking pass
WIP Principle category cleanup: dev/ and development/ need consolidation
WIP Retroactive study: 72 episodes analyzed but judge accuracy needs calibration (false positives)

Planned / known gaps

Planned Decision-point hooks: auto-inject learnings at UserPromptSubmit + PreToolCall
Planned Learning ↔ Skillz unification: shared DB, pluggable sources, gap analysis
Planned Hypothesis & UX experiments: forward-looking A/B testing of UX choices
Planned Embedding-based principle retrieval: needed past ~200 principles
Planned Task detection → Vario routing: auto-fan-out design tasks to multi-model exploration
Planned Sidekick & Code Cleanup gyms

Measurement gaps (from report ENHANCE markers)

The presentation report flags 9 areas where the system lacks measurement:

  1. No causal metrics — stats are descriptive (1,436 instances), not causal (error rates before/after)
  2. No before/after session comparison — the killer metric ("sessions with principles had X% fewer errors") doesn't exist yet
  3. Application tracking sparse — most principles have few tracked applications; automation needed
  4. Sandbox results not aggregated — per-principle A/B data exists in sandbox_results.db but isn't summarized
  5. No worked failure→repair example — the 8-candidate scoring pipeline needs a concrete trace
  6. Gym results under-documented — Badge gym ran 5 variants but results aren't in the report
  7. Verification levels unproven — L0-L1 claimed but evidence not shown
  8. Prior work comparison too high-level — needs specific paper references (Voyager, Reflexion, LATS)
  9. KB gym design choice rationale missing — 5-case test and 70% threshold need justification
Part II
Technical Reference
For developers working on or debugging the system — data flows, CLI commands, runtime components, DB schema, and troubleshooting.

6. Quick Start

Note: learn add calls the LLM hot server (Gemini Flash) for classification. If no LLM server is running, use --raw to store without classification. Start the server with ops restart llm.

Add knowledge

# LLM auto-classifies, links to existing principles
learn add "Gradio 6 silently ignores head= on gr.Blocks"

# Store directly, no LLM classification
learn add "Always use python not python3" --raw --type convention

# Add with project context
learn add "Redis TimeSeries keys use ts:{SYMBOL}:price:raw" --project moneygun

Search and browse

# Keyword search
learn find "gradio gotchas"

# Semantic search (uses embeddings — finds conceptual matches)
learn find -s "silent failures"

# List principles with full text
learn principles -v

# Show a specific principle + evidence chain
learn show no-silent-failures

Regenerate materialized files

# Preview changes without writing
python learning/schema/materialize.py --principles-only --dry-run

# Write all outputs
python learning/schema/materialize.py

Health check

learn health    # embedding coverage, orphans, staleness
learn stats     # summary statistics

7. How Data Flows

There are two main data paths: automatic (from sessions) and manual (from the CLI).

Automatic: Session → Principle

Runs automatically on every session end, as a fire-and-forget subprocess.

1. SessionEnd hook triggers learning_worker.py
2. Parse — reads the JSONL transcript, extracts user messages, assistant reasoning, errors, thinking blocks
3. Condense (Haiku) — compresses the raw narrative to ~3K words, keeping decisions and dead ends, dropping routine tool output
4. Extract learnings (Opus) — identifies 0-3 learnings worth recording: non-obvious bugs, abandoned approaches, user corrections, gotchas
5. Extract follow-ups (Opus) — identifies predictable follow-up patterns: actions Claude should do proactively next time (e.g., "test after implementation")
6. Store — learnings via learn add --raw; follow-up patterns written directly as Principle entries with file_path=auto-advise.md
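The six steps amount to a linear pipeline with one early exit. A hedged sketch of the control flow, with the LLM stages passed in as callables so it can be exercised without a model (function names are illustrative, not the real learning_worker.py API):

```python
import json

def parse_jsonl(path):
    """Read one transcript event per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def process_session(transcript_path, condense, extract_learnings,
                    extract_follow_ups, store):
    """SessionEnd pipeline: parse -> condense -> extract -> store."""
    events = parse_jsonl(transcript_path)                       # step 2
    if len([e for e in events if e["role"] == "user"]) < 3:
        return []                           # skip heuristic: too short
    condensed = condense(events)                                # step 3: Haiku
    learnings = extract_learnings(condensed)                    # step 4: Opus
    follow_ups = extract_follow_ups(condensed)                  # step 5: Opus
    for item in learnings + follow_ups:
        store(item)                                             # step 6
    return learnings + follow_ups
```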

Manual: Observation → Classified Instance

1. User runs learn add "observation"
2. LLM classifies (Gemini Flash, ~$0.0003/call) — assigns type, project, domain tags
3. LLM links — matches to existing principles or proposes new ones
4. Store — instance in learning_instances, links in instance_principle_links, embeddings auto-generated
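The auto-generated embeddings are what later power learn find -s. A minimal sketch of semantic lookup over stored vectors, using pure-Python cosine similarity (the real implementation sits behind the embedding tables and an embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_find(query_vec, embeddings, top_k=3):
    """Rank stored instance embeddings by similarity to the query.

    embeddings: {instance_id: vector}
    """
    scored = [(cosine(query_vec, vec), iid) for iid, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [iid for _, iid in scored[:top_k]]
```

This is why semantic search finds conceptual matches ("silent failures" retrieving the Gradio head= gotcha) that keyword FTS would miss.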

Materialization → Claude's Context

1. materialize.py reads learning.db and generates ~/.claude/learnings.md and ~/.claude/principles/*.md
2. The UserPromptSubmit hook fires principles_worker.py, which matches the user prompt against principles
3. Relevant principles are injected into Claude's context; applications are logged to the principle_applications table
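A hedged sketch of the matching step, using a simple keyword-overlap score (the real principles_worker.py matching is more involved; the scoring here is purely illustrative):

```python
def match_principles(prompt, principles, threshold=2):
    """Return principle titles whose keywords overlap the prompt enough.

    principles: {title: set_of_keywords}. Principles scoring at or
    above `threshold` shared words are injected, best match first.
    """
    words = set(prompt.lower().split())
    hits = []
    for title, keywords in principles.items():
        overlap = len(words & keywords)
        if overlap >= threshold:
            hits.append((overlap, title))
    hits.sort(reverse=True)
    return [title for _, title in hits]
```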
Sources                        Store            Outputs
─────────────────────          ─────            ───────
Manual (learn add)     ──→                  ┌─→ ~/.claude/learnings.md
SessionEnd hook        ──→   learning.db    ├─→ ~/.claude/principles/*.md
  learning_worker.py        (single source  ├─→ principle_applications table
  - extract learnings         of truth)     └─→ Gym reports
  - extract follow-ups ──→
Session review miners  ──→
Gyms (badge, fetchability) ──→

8. Runtime Components

Two background workers handle the automation. Understanding them is key to debugging.

learning_worker.py

Trigger: SessionEnd hook (fire-and-forget subprocess)

Location: supervisor/sidekick/hooks/learning_worker.py

What it does: Parse JSONL transcript → condense (Haiku) → extract learnings (Opus) → extract follow-ups (Opus) → store

Logs: supervisor/logs/learning_worker.log

Manual re-run: python -m supervisor.sidekick.hooks.learning_worker <session-id>
Skip heuristics: <3 user messages, <2 min duration, or wrapup already ran

principles_worker.py

Trigger: UserPromptSubmit hook (async, non-blocking)

Location: supervisor/sidekick/hooks/principles_worker.py

What it does: Matches user prompt against materialized principles → injects relevant ones into Claude's context

Logs: helm/logs/hooks.log

Applications tracked in principle_applications table for observability
Hook registered in ~/.claude/settings.json under hooks.UserPromptSubmit

Why multiple LLMs?

The system uses 3 different models for cost/quality tradeoffs: Haiku ($0.25/M tok) for cheap, fast transcript condensation. Opus ($15/M tok) for high-reasoning extraction where quality matters. Gemini Flash (~$0.08/M tok) for sub-cent classification on every learn add. Total cost per session: ~$0.05-0.15.
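Those per-token prices make the per-session cost easy to estimate. A rough worked example; the token counts are illustrative assumptions (and output-token costs are ignored), not measured values:

```python
# Input-token price per million tokens, per the figures above.
PRICE_PER_MTOK = {"haiku": 0.25, "opus": 15.0, "gemini-flash": 0.08}

def stage_cost(model, tokens):
    """Dollar cost of feeding `tokens` input tokens to `model`."""
    return PRICE_PER_MTOK[model] * tokens / 1_000_000

# Hypothetical session: a 40K-token transcript condensed by Haiku,
# then two Opus extraction passes over the ~4K-token summary.
cost = stage_cost("haiku", 40_000) + 2 * stage_cost("opus", 4_000)
print(f"${cost:.3f}")   # ≈ $0.130, inside the quoted $0.05-0.15 range
```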

9. CLI Reference

The learn command (~/.local/bin/learntools/bin/learn) is the primary interface. All commands operate on learning/data/learning.db.

| Command | What it does | Key flags |
|---|---|---|
| learn add | Add an observation; LLM classifies + links to principles | --raw (skip LLM), --type, --project, --tags |
| learn find | Search by keyword (FTS) or semantically | -s (semantic), --type, --project |
| learn list | List instances with filters | --type, --project, --source, --limit |
| learn show <id> | Show instance or principle details + evidence chain | |
| learn principles | List all principles | -v (full text), --type |
| learn apply | Record a principle application + outcome | --outcome (followed/prevented_error/etc.) |
| learn link | Link an instance to a principle | --strength, --link-type |
| learn link-parent | Set parent-child between principles | |
| learn rename | Rename principle category prefix | |
| learn embed | Generate/refresh embeddings for all entries | |
| learn health | Embedding coverage, orphans, staleness | |
| learn stats | Summary statistics | |
| learn provenance | Full provenance chain for an instance | |

10. Principle Files

materialize.py generates one .md file per domain from the DB. These live at ~/.claude/principles/ and are read-only views — edit via the CLI or DB, then re-materialize.
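Materialization is conceptually simple: group active principles by domain file and render read-only markdown. A minimal sketch, assuming a list of principle dicts (materialize.py's real record types and output format differ):

```python
from collections import defaultdict

def materialize(principles):
    """Render one markdown document per domain file.

    principles: list of dicts with file_path, title, text, n_instances,
    status. Returns {file_path: markdown_text}; deprecated/proposed
    principles are excluded from the materialized views.
    """
    by_file = defaultdict(list)
    for p in principles:
        if p.get("status", "active") == "active":
            by_file[p["file_path"]].append(p)
    docs = {}
    for path, ps in by_file.items():
        lines = ["<!-- generated; edit via `learn`, then re-materialize -->"]
        for p in sorted(ps, key=lambda p: -p["n_instances"]):
            lines.append(f"\n## {p['title']} ({p['n_instances']} instances)")
            lines.append(p["text"])
        docs[path] = "\n".join(lines)
    return docs
```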

| File | Principles | Instances | Applications | Description |
|---|---|---|---|---|
| dev.md | 95 | 269 | 1,944 | Development practices (largest — covers architecture, error handling, API usage, tooling) |
| ux.md | 22 | 32 | 311 | UX design: layout, information hierarchy, user signals |
| batch-jobs.md | 15 | 45 | 357 | Observability, fault isolation, checkpointing at scale |
| testing.md | 15 | 17 | 93 | Testing practices and verification patterns |
| observability.md | 14 | 25 | 279 | Logging, monitoring, debugging visibility |
| data-quality.md | 9 | 32 | 116 | Data validation and pipeline robustness |
| auto-advise.md | 8 | 1 | 0 | Follow-up patterns mined from sessions (new) |
| parallelism.md | 7 | 3 | 12 | When and how to parallelize work |
| knowledge-accumulation.md | 4 | 4 | 5 | Meta-principles about the learning system itself |
| backtesting.md | 2 | 0 | 0 | No future snooping, detection vs prediction |

Top 10 Principles by Evidence

| Principle | Instances | Applications |
|---|---|---|
| No Silent Failures | 31 | 219 |
| Prefer Context-Native Mechanisms | 24 | 50 |
| Verify API Details, Don't Fabricate | 19 | 164 |
| Respect Abstraction Boundaries | 13 | 28 |
| Look One Layer Out | 13 | 45 |
| Errors Compound Downstream | 12 | 23 |
| Verifying Early Beats Correcting Later | 12 | 29 |
| Fail Loud, Never Fake | 10 | 42 |
| Workarounds Piling Up = Wrong Abstraction | 10 | 20 |
| Recognize Poor Fit, Find Better Tools | 10 | 57 |

11. Session Review

Analyzes how work was done (efficiency, errors, patterns). Complements doctor/chronicle which analyzes what was done (topics, accomplishments). All tools read from the same ~/.claude/projects/ JSONL transcripts.

| Tool | What it measures | Data |
|---|---|---|
| failure_mining.py | Error → repair pairs from transcripts. Collects next 8 successes as candidates. | data/failures.db |
| pair_judge_compare.py | Multi-model scoring of repair pairs (Gemini Flash + Grok 4.1, ~$2/1K pairs). 66% unanimous across 6 models. | data/failures.db |
| tool_error_analysis.py | Categorizes tool errors by type, frequency, and time cost. Found 90%+ tool success rate. | data/tool_errors.db |
| parallelism_analysis.py | Detects missed parallelism. Found 661 opportunities, ~497 min wasted. | Stdout report |
| sandbox_replay.py | A/B tests principles in Docker sandboxes. Runs Claude Code at a specific commit with/without principles. | data/sandbox_results.db |
| retroactive_study.py | Scans existing transcripts for where principles would have applied (fills principle_applications). | DB direct |
Common session review commands
# Mine failures from last 30 days
python -m learning.session_review.failure_mining --clear --days 30

# Score with defaults (flash + grok_think, majority vote)
python -m learning.session_review.pair_judge_compare

# Analyze tool errors
python -m learning.session_review.tool_error_analysis --days 3

# Run principle A/B test in Docker
python -m learning.session_review.sandbox_replay \
  --prompt "Find all Gradio apps" --commit HEAD

12. Gyms (Detailed)

Offline evaluation environments that test and optimize system sub-components using historical data. Each gym runs a generate → evaluate → learn loop: try variants of a prompt/method, score them with LLM judges, pick the winner. Gym findings can become new learning instances or principle refinements, feeding back into the main loop. All gyms extend lib/gym/GymBase.

| Gym | Location | Status | What it improves | Method |
|---|---|---|---|---|
| Badge | gyms/badge/ | Done | Badge text quality (iTerm2 session badges) | Replay sessions → score prompt variants → pick best |
| Fetchability | gyms/fetchability/ | Done | Proxy/fetch method selection per site | Probe URLs with 3 methods → classify outcomes → pick winner |
| KB | kb/scenario.py | Done | Knowledge extraction from web sources | Extract → score quality → iterate |
| Code Cleanup | gyms/code_cleanup/ | Planned | Dead code, boilerplate, convention drift | Scan → generate fixes → evaluate acceptance |
| Sidekick | — | Planned | Intervention timing & helpfulness | Replay sessions → test policies → measure signal-to-noise |
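All of these gyms share one skeleton. A hedged sketch of the generate → evaluate → learn loop; the callables stand in for a variant generator and an LLM judge, and the names are illustrative rather than the real lib/gym/GymBase API:

```python
def run_gym(variants, cases, generate, judge):
    """Score each variant across historical cases; return the winner.

    generate(variant, case) -> candidate output for one case
    judge(case, output)     -> score in [0, 1] (an LLM judge in practice)
    Returns (best_variant, {variant: mean_score}).
    """
    totals = {}
    for variant in variants:
        scores = [judge(case, generate(variant, case)) for case in cases]
        totals[variant] = sum(scores) / len(scores)
    winner = max(totals, key=totals.get)
    return winner, totals
```

The "learn" half of the loop is then writing the winning variant (or a refined principle about why it won) back into learning.db.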
Badge gym example
# Quick test (5 sessions)
python -m learning.gyms.badge.gym --max-sessions 5

# Full run with HTML report
python -m learning.gyms.badge.gym --max-sessions 20 --report

Scores badge variants on: topic accuracy, stability across prompts, abstraction level, transition quality.

Fetchability gym example
# Bulk URL comparison
python learning/gyms/fetchability/fetchability_tool.py \
  --url-file /tmp/urls.txt --parallelism 8

Probes each URL with httpx_proxy, browser_proxy, and browser_unlocker. LLM judge verifies content quality. Picks winner method per URL.
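A sketch of the per-URL probe-and-pick logic. The method names come from the description above; the cheapest-first fallback ordering and the judge interface are illustrative assumptions, not the gym's actual policy:

```python
METHODS = ["httpx_proxy", "browser_proxy", "browser_unlocker"]

def pick_fetch_method(url, probe, judge, methods=METHODS):
    """Try each fetch method in order; return the first acceptable one.

    probe(url, method) -> page text, or None on failure
    judge(text)        -> True if content quality passes (LLM in practice)
    """
    for method in methods:
        text = probe(url, method)
        if text is not None and judge(text):
            return method
    return None   # no method produced acceptable content
```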

Auto-Advise Pipeline (detailed)

How it works:

  1. SessionEnd → learning_worker.py runs Stage 3: extract_follow_ups(condensed)
  2. Opus identifies 0-3 follow-up patterns from the condensed transcript
  3. Patterns stored as Principle entries directly in learning.db (with dedup by auto-advise/{slug} ID)
  4. materialize.py generates ~/.claude/principles/auto-advise.md
  5. principles_worker.py (unchanged) matches auto-advise principles like any other principle

Why direct DB write? The learn add --type principle path always triggers LLM re-classification, which is wasteful since Opus already classified the pattern. store_follow_ups() writes Principle entries via LearningStore.add_principle() directly.
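A sketch of the dedup-by-slug behavior described above; the slugging rule and the store callable are simplified assumptions about how store_follow_ups() and LearningStore.add_principle() fit together:

```python
import re

def slugify(title):
    """Lowercase, hyphen-separated ID fragment, e.g. 'test-after-implementation'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def store_follow_ups(patterns, existing_ids, add_principle):
    """Write follow-up patterns as principles, skipping slugs already stored.

    existing_ids:  set of principle IDs already in learning.db
    add_principle: persists one row (stands in for LearningStore.add_principle)
    """
    written = []
    for title in patterns:
        pid = f"auto-advise/{slugify(title)}"
        if pid in existing_ids:
            continue                  # dedup: pattern already captured
        add_principle(pid, title, file_path="auto-advise.md")
        existing_ids.add(pid)
        written.append(pid)
    return written
```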

13. Skills

Three Claude Code skills cover the learning loop:

| Skill | Trigger | What it does |
|---|---|---|
| /learn | Encountering gotchas, discovering patterns | Add, lookup, list, rename knowledge |
| /reflect | After completing a task or fixing a bug | Step back → extract principles → review against existing |
| /recall | Before starting work in a familiar domain | Retrieve relevant past learnings for current context |

14. Directory Map

learning/
├── cli.py                       # `learn` CLI — 13 commands
├── search_eval.py               # Retrieval quality evaluation (FTS vs semantic vs hybrid)
├── tasks.py                     # Invoke tasks for the learning module
│
├── schema/
│   ├── learning_store.py        # Core: LearningStore, Principle, LearningInstance, enums
│   ├── init.sql                 # DB schema (tables, indexes, triggers)
│   ├── materialize.py           # DB → ~/.claude/principles/*.md + learnings.md
│   ├── import_principles.py     # Bulk import from YAML/markdown
│   ├── link_instances.py        # Auto-link instances to principles
│   ├── app.py                   # Gradio UI for browsing the DB
│   └── rebuild.py               # Full DB rebuild
│
├── session_review/
│   ├── failure_mining.py        # Mine error→repair pairs
│   ├── pair_judge.py            # LLM judge prompts + helpers
│   ├── pair_judge_compare.py    # Multi-model scoring
│   ├── tool_error_analysis.py   # Error categorization + impact
│   ├── parallelism_analysis.py  # Missed parallelism detection
│   ├── sandbox_replay.py        # Docker-based principle A/B tests
│   ├── retroactive_study.py     # Principle adherence scanning
│   ├── principle_propose.py     # Propose new principles from evidence
│   └── principle_refine.py      # Refine existing principle text
│
├── gyms/
│   ├── badge/                   # Badge prompt variant testing
│   ├── fetchability/            # Proxy/method comparison per site
│   └── code_cleanup/            # (planned) Codebase hygiene
│
├── memory/                      # pgvector-based knowledge store (hybrid retrieval)
├── pond/                        # Visual concept generation experiments
│
├── data/
│   └── learning.db              # The single source of truth
│
├── CLAUDE.md                    # Instructions for Claude
├── HOWTO.md                     # Process improvement methodology
└── TODO.md                      # Roadmap

15. DB Schema

| Table | Rows | Purpose |
|---|---|---|
| learning_instances | 1,436 | Raw observations: content, type, project, source, tags |
| principles | 194 | Generalized rules with status (active/proposed/deprecated), full text, evidence counts |
| instance_principle_links | 428 | Evidence chain: which instances support which principles (with strength, link type) |
| principle_applications | 3,117 | When a principle was applied + outcome (followed, prevented_error, violated, etc.) |
| principle_embeddings | — | Vector embeddings for semantic search over principles |
| instance_embeddings | — | Vector embeddings for semantic search over instances |

Key relationships

learning_instances ──(instance_principle_links)──→ principles
                                                       │
                                           principle_applications
                                      (tracked by principles_worker)

Instance sources: 1,324 from session_reflection (automatic), 112 from manual (CLI).

16. Design Decisions

DB-native principles

learning.db is the single source of truth. The ~/.claude/principles/*.md files are materialized views — generated by materialize.py. Never edit .md files directly; use the CLI or DB, then re-materialize.

LLM classification on add

learn add uses Gemini Flash (~$0.0003/call) to classify observations and link them to existing principles. Flash agrees with frontier models 90%+ on structured classification. Use --raw to skip when classification is predetermined.

Dedup by abstraction

Before proposing a new principle, check if an existing one covers the case at a higher abstraction level. "Order tabs by user frequency" is an instance of "Importance Ordering & Attention Budget" — link the instance, don't create a near-duplicate principle.

Auto-advise: direct DB write

Follow-up patterns bypass learn add --type principle (which re-classifies via LLM) and write Principle entries directly. Opus already classified them during extraction — re-classification would be wasteful.

Session skipping heuristics

learning_worker.py skips sessions with <3 user messages or <2 minutes duration. Also checks whether a manual wrapup already extracted learnings for the same session.
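The heuristics reduce to a few lines. A sketch with thresholds taken from the text; the argument names are illustrative:

```python
def should_skip(n_user_messages, duration_minutes, wrapup_ran):
    """Return a skip reason, or None if the session is worth mining."""
    if n_user_messages < 3:
        return "too few user messages"
    if duration_minutes < 2:
        return "session too short"
    if wrapup_ran:
        return "wrapup already extracted learnings"
    return None
```

The returned reason is what you would expect to find in supervisor/logs/learning_worker.log when a session is skipped.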

17. Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| learn add hangs | LLM hot server not running | ops restart llm, or pass --raw to skip LLM |
| Principles not updating in sessions | Materialized files stale | python learning/schema/materialize.py |
| Semantic search returns nothing | Embeddings missing | learn embed, then learn health to verify |
| SessionEnd not extracting learnings | Session too short or too few messages | Check supervisor/logs/learning_worker.log for skip reason |
| Auto-advise patterns not appearing | Not yet materialized after extraction | python learning/schema/materialize.py --principles-only |
| Duplicate principles | Similar observations classified separately | learn link-parent to merge, or mark duplicate as deprecated |
| learn find -s slow | Large embedding table or missing index | learn health to check; consider learn embed --refresh |

Generated Feb 2026. Stats from learning/data/learning.db. Source: present/learning/system_guide.html.

See also: Learning System Presentation (the "why" for external audiences) • learning/TODO.md (roadmap) • learning/HOWTO.md (process improvement methodology)