How the system captures knowledge from sessions, stores it in a unified DB, materializes it as principles Claude can act on, and runs gyms to evaluate and improve components.
Every Claude Code session starts with limited memory. Hard-won insights from yesterday's debugging session are gone.
On Feb 3, Claude spent 12 minutes debugging why a Gradio app's head= parameter wasn't injecting CSS. Root cause: Gradio 6 silently ignores head= on gr.Blocks() — it must be passed to launch(). On Feb 8, the same bug. On Feb 14, again. Three sessions, same gotcha, zero memory between them.
With the learning system: the first occurrence was captured as a learning instance, linked to the "No Silent Failures" principle, materialized into ~/.claude/principles/dev.md, and injected into every subsequent session that touched Gradio code. The second and third occurrences never happened.
The learning system closes this loop. It mines sessions for patterns, stores them as evidence-backed principles, and injects relevant ones into future sessions before the same mistakes are repeated.
Four nouns you need to understand:
- **Instance** — one concrete thing that happened, e.g. "Gradio 6 silently ignores `head=` on gr.Blocks".
- **Principle** — a generalized rule distilled from instances, e.g. "No Silent Failures".
- **Link** — the evidence connecting an instance to the principles it supports.
- **Materialized file** — a .md file generated from the DB. What Claude actually sees in its context.

Tracing one real example through the entire system:
1. The bug: `gr.Blocks(head="<style>...")` silently ignores the CSS in Gradio 6.
2. Capture: `learn add "Gradio 6 silently ignores head= on gr.Blocks — must pass to launch()"`
3. Classify: the LLM tags it as type `pattern`, project=rivus, and links it to the existing "No Silent Failures" principle with strength=0.8.
4. Store: the instance goes into `learning_instances`, the link into `instance_principle_links`, and an embedding is auto-generated.
5. Materialize: `materialize.py` writes the updated principle (with +1 instance count) to `~/.claude/principles/dev.md`.
6. Inject: on the next Gradio-related prompt, `principles_worker.py` matches the "No Silent Failures" principle and injects it as context. Claude uses `launch(head=...)` instead of `gr.Blocks(head=...)`.

```
$ learn add "Gradio 6 silently ignores head= on gr.Blocks — must pass to launch()"
Classifying with gemini-2.5-flash...
Type: pattern | Project: rivus | Tags: gradio, css, gotcha
Linked to: No Silent Failures (strength: 0.8)
Linked to: Prefer Context-Native Mechanisms (strength: 0.6)
Instance ID: inst_a8f3c2
Embedding generated ✓
```
```
$ learn show no-silent-failures
Principle: No Silent Failures
Status: active | Type: principle | File: dev.md
Instances: 31 | Applications: 219
Full text: "Never let a function silently ignore, drop, or swallow input.
If something can't be processed, raise an error or log a warning..."
```
Beyond capturing knowledge, the system actively improves itself through two mechanisms: auto-advise (learning what to do proactively) and gyms (testing and optimizing components).
Auto-advise mines follow-up patterns from sessions — cases where the user's next request was a predictable action Claude should have done proactively.
Example: after finishing a change, proactively run the tests, the relevant `inv` task, and the commit — instead of waiting for three separate user requests.
The pipeline: SessionEnd hook → Opus extracts follow-up patterns → stored as principles in auto-advise.md → injected into future sessions like any other principle. Currently 8 auto-advise patterns active.
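The direct-write step can be sketched in a few lines of SQLite. This is a minimal stand-in, not the real implementation: the table shape and field names here are assumptions, and the actual logic lives in `learning_worker.py`'s `store_follow_ups()`.

```python
import sqlite3

def store_follow_ups(db_path: str, patterns: list[dict]) -> int:
    """Write follow-up patterns straight into the principles table,
    deduplicated by an auto-advise/{slug} primary key. Returns the
    number of newly stored patterns."""
    conn = sqlite3.connect(db_path)
    # Hypothetical minimal shape of the principles table
    conn.execute(
        """CREATE TABLE IF NOT EXISTS principles (
               id TEXT PRIMARY KEY,
               title TEXT, full_text TEXT, file_path TEXT, status TEXT)"""
    )
    stored = 0
    for p in patterns:
        slug = p["title"].lower().replace(" ", "-")
        before = conn.total_changes
        # INSERT OR IGNORE makes re-runs idempotent: existing slugs are skipped
        conn.execute(
            "INSERT OR IGNORE INTO principles VALUES (?, ?, ?, ?, ?)",
            (f"auto-advise/{slug}", p["title"], p["advice"],
             "auto-advise.md", "active"),
        )
        stored += conn.total_changes - before
    conn.commit()
    conn.close()
    return stored
```

Running the same extraction twice stores nothing new the second time, which is what makes fire-and-forget SessionEnd processing safe.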
Offline evaluation environments that test and optimize system sub-components using historical data. Each gym tries variants of a prompt or method, scores them with LLM judges, and picks the winner.
| Gym | Status | What it improves |
|---|---|---|
| Badge | Done | iTerm2 session badge text quality |
| Fetchability | Done | Proxy/fetch method selection per site |
| KB | Done | Knowledge extraction from web sources |
| Code Cleanup | Planned | Dead code, boilerplate, convention drift |
| Sidekick | Planned | Intervention timing & helpfulness |
Gym findings feed back into the main learning loop as new instances or principle refinements. See Section 12 for detailed gym documentation.
An honest assessment of where the system stands, drawn from production stats, the presentation report ENHANCE markers, and learning/TODO.md.
What's working:
- Materialization: DB → .md files → Claude context
- Skills: `/learn`, `/reflect`, `/recall` consolidated

What needs work:
- Keyword and semantic search remain separate paths (`learn find` uses one or the other)
- `dev/` and `development/` need consolidation
- Hook coverage: UserPromptSubmit + PreToolCall

The presentation report flags 9 areas where the system lacks measurement — for example, sandbox replay output lands in sandbox_results.db but isn't summarized.

Note: `learn add` calls the LLM hot server (Gemini Flash) for classification. If no LLM server is running, use `--raw` to store without classification. Start the server with `ops restart llm`.
```
# LLM auto-classifies, links to existing principles
learn add "Gradio 6 silently ignores head= on gr.Blocks"

# Store directly, no LLM classification
learn add "Always use python not python3" --raw --type convention

# Add with project context
learn add "Redis TimeSeries keys use ts:{SYMBOL}:price:raw" --project moneygun

# Keyword search
learn find "gradio gotchas"

# Semantic search (uses embeddings — finds conceptual matches)
learn find -s "silent failures"

# List principles with full text
learn principles -v

# Show a specific principle + evidence chain
learn show no-silent-failures

# Preview changes without writing
python learning/schema/materialize.py --principles-only --dry-run

# Write all outputs
python learning/schema/materialize.py

learn health   # embedding coverage, orphans, staleness
learn stats    # summary statistics
```
There are two main data paths: automatic (from sessions) and manual (from the CLI).
Automatic path: runs on every session end as a fire-and-forget subprocess (`learning_worker.py`). Extracted learnings are stored via the `learn add --raw` path; follow-up patterns are written directly as Principle entries with `file_path=auto-advise.md`.

Manual path: starts with `learn add "observation"` from the CLI.

Either way, instances land in `learning_instances`, links in `instance_principle_links`, and embeddings are auto-generated. Materialization reads `learning.db` and generates `~/.claude/learnings.md` and `~/.claude/principles/*.md`; at prompt time, `principles_worker.py` matches the user prompt against principles and records each match in the `principle_applications` table.

Two background workers handle the automation. Understanding them is key to debugging.
Trigger: SessionEnd hook (fire-and-forget subprocess)
Location: supervisor/sidekick/hooks/learning_worker.py
What it does: Parse JSONL transcript → condense (Haiku) → extract learnings (Opus) → extract follow-ups (Opus) → store
Logs: supervisor/logs/learning_worker.log
Manual run: `python -m supervisor.sidekick.hooks.learning_worker <session-id>`

Trigger: UserPromptSubmit hook (async, non-blocking)
Location: supervisor/sidekick/hooks/principles_worker.py
What it does: Matches user prompt against materialized principles → injects relevant ones into Claude's context
Logs: helm/logs/hooks.log
Applications are recorded in the `principle_applications` table for observability. The hook is registered in `~/.claude/settings.json` under `hooks.UserPromptSubmit`.
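As a rough illustration of the matching step — the real worker matches against vector embeddings, so a simple token-overlap (Jaccard) score stands in here, and all names below are illustrative:

```python
def match_principles(prompt: str, principles: dict[str, str],
                     top_k: int = 3, min_score: float = 0.1) -> list[str]:
    """Return the names of the principles most relevant to the prompt,
    scored by token overlap (a toy stand-in for embedding similarity)."""
    prompt_tokens = set(prompt.lower().split())
    scored = []
    for name, text in principles.items():
        tokens = set(text.lower().split())
        overlap = len(prompt_tokens & tokens)
        score = overlap / (len(prompt_tokens | tokens) or 1)  # Jaccard index
        if score >= min_score:
            scored.append((score, name))
    # Highest-scoring principles first, capped at top_k
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]
```

Whatever the scoring function, the shape is the same: score every active principle against the prompt, inject only those above a relevance threshold, and log each injection as an application.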
Per-session LLM cost (condense, extraction, and classification via `learn add`): ~$0.05-0.15.
The learn command (~/.local/bin/learn → tools/bin/learn) is the primary interface. All commands operate on learning/data/learning.db.
| Command | What it does | Key flags |
|---|---|---|
| `learn add` | Add an observation; LLM classifies + links to principles | `--raw` (skip LLM), `--type`, `--project`, `--tags` |
| `learn find` | Search by keyword (FTS) or semantically | `-s` (semantic), `--type`, `--project` |
| `learn list` | List instances with filters | `--type`, `--project`, `--source`, `--limit` |
| `learn show <id>` | Show instance or principle details + evidence chain | |
| `learn principles` | List all principles | `-v` (full text), `--type` |
| `learn apply` | Record a principle application + outcome | `--outcome` (followed/prevented_error/etc.) |
| `learn link` | Link an instance to a principle | `--strength`, `--link-type` |
| `learn link-parent` | Set parent-child between principles | |
| `learn rename` | Rename principle category prefix | |
| `learn embed` | Generate/refresh embeddings for all entries | |
| `learn health` | Embedding coverage, orphans, staleness | |
| `learn stats` | Summary statistics | |
| `learn provenance` | Full provenance chain for an instance | |
materialize.py generates one .md file per domain from the DB. These live at ~/.claude/principles/ and are read-only views — edit via the CLI or DB, then re-materialize.
| File | Principles | Instances | Applications | Description |
|---|---|---|---|---|
| `dev.md` | 95 | 269 | 1,944 | Development practices (largest — covers architecture, error handling, API usage, tooling) |
| `ux.md` | 22 | 32 | 311 | UX design: layout, information hierarchy, user signals |
| `batch-jobs.md` | 15 | 45 | 357 | Observability, fault isolation, checkpointing at scale |
| `testing.md` | 15 | 17 | 93 | Testing practices and verification patterns |
| `observability.md` | 14 | 25 | 279 | Logging, monitoring, debugging visibility |
| `data-quality.md` | 9 | 32 | 116 | Data validation and pipeline robustness |
| `auto-advise.md` | 8 | 1 | 0 | Follow-up patterns mined from sessions (new) |
| `parallelism.md` | 7 | 3 | 12 | When and how to parallelize work |
| `knowledge-accumulation.md` | 4 | 4 | 5 | Meta-principles about the learning system itself |
| `backtesting.md` | 2 | 0 | 0 | No future snooping, detection vs prediction |
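The per-domain generation step can be sketched like this. It is a minimal stand-in for `materialize.py`, and the input dict shape is an assumption, not the real schema:

```python
from collections import defaultdict
from pathlib import Path

def materialize(principles: list[dict], out_dir: str) -> list[str]:
    """Group principles by their domain file and render one .md view
    per domain, most-evidenced principles first."""
    by_file: dict[str, list[dict]] = defaultdict(list)
    for p in principles:
        by_file[p["file_path"]].append(p)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for fname, group in sorted(by_file.items()):
        lines = ["<!-- generated view — edit via CLI/DB, then re-materialize -->"]
        for p in sorted(group, key=lambda p: -p["instances"]):
            lines.append(f"\n## {p['title']} ({p['instances']} instances)")
            lines.append(p["full_text"])
        (out / fname).write_text("\n".join(lines))
        written.append(fname)
    return written
```

The generated-view comment at the top of each file is the key design point: the .md files carry no state of their own and can always be regenerated from the DB.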
| Principle | Instances | Applications |
|---|---|---|
| No Silent Failures | 31 | 219 |
| Prefer Context-Native Mechanisms | 24 | 50 |
| Verify API Details, Don't Fabricate | 19 | 164 |
| Respect Abstraction Boundaries | 13 | 28 |
| Look One Layer Out | 13 | 45 |
| Errors Compound Downstream | 12 | 23 |
| Verifying Early Beats Correcting Later | 12 | 29 |
| Fail Loud, Never Fake | 10 | 42 |
| Workarounds Piling Up = Wrong Abstraction | 10 | 20 |
| Recognize Poor Fit, Find Better Tools | 10 | 57 |
Analyzes how work was done (efficiency, errors, patterns). Complements doctor/chronicle which analyzes what was done (topics, accomplishments). All tools read from the same ~/.claude/projects/ JSONL transcripts.
| Tool | What it measures | Data |
|---|---|---|
| `failure_mining.py` | Error → repair pairs from transcripts. Collects next 8 successes as candidates. | `data/failures.db` |
| `pair_judge_compare.py` | Multi-model scoring of repair pairs (Gemini Flash + Grok 4.1, ~$2/1K pairs). 66% unanimous across 6 models. | `data/failures.db` |
| `tool_error_analysis.py` | Categorizes tool errors by type, frequency, and time cost. Found 90%+ tool success rate. | `data/tool_errors.db` |
| `parallelism_analysis.py` | Detects missed parallelism. Found 661 opportunities, ~497 min wasted. | Stdout report |
| `sandbox_replay.py` | A/B tests principles in Docker sandboxes. Runs Claude Code at a specific commit with/without principles. | `data/sandbox_results.db` |
| `retroactive_study.py` | Scans existing transcripts for where principles would have applied (fills `principle_applications`). | DB direct |
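The core of the failure-mining pass can be approximated as below. The record fields (`type`, `is_error`) are assumptions about the transcript shape; the real logic in `failure_mining.py` handles far more cases:

```python
import json

def mine_pairs(jsonl_lines: list[str], window: int = 8) -> list[dict]:
    """Pair each failed tool call with the successful calls that follow
    it within a fixed window, as repair candidates."""
    events = [json.loads(line) for line in jsonl_lines]
    pairs = []
    for i, ev in enumerate(events):
        if ev.get("type") == "tool_result" and ev.get("is_error"):
            # Collect the next `window` events' successes as candidates
            repairs = [e for e in events[i + 1 : i + 1 + window]
                       if e.get("type") == "tool_result" and not e.get("is_error")]
            pairs.append({"failure": ev, "repair_candidates": repairs})
    return pairs
```

A judging pass (here, the multi-model scoring in `pair_judge_compare.py`) then decides which candidate actually repaired the failure.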
```
# Mine failures from last 30 days
python -m learning.session_review.failure_mining --clear --days 30

# Score with defaults (flash + grok_think, majority vote)
python -m learning.session_review.pair_judge_compare

# Analyze tool errors
python -m learning.session_review.tool_error_analysis --days 3

# Run principle A/B test in Docker
python -m learning.session_review.sandbox_replay \
    --prompt "Find all Gradio apps" --commit HEAD
```
Offline evaluation environments that test and optimize system sub-components using historical data. Each gym runs a generate → evaluate → learn loop: try variants of a prompt/method, score them with LLM judges, pick the winner. Gym findings can become new learning instances or principle refinements, feeding back into the main loop. All gyms extend lib/gym/GymBase.
| Gym | Status | What it improves | Method |
|---|---|---|---|
| Badge (`gyms/badge/`) | Done | Badge text quality (iTerm2 session badges) | Replay sessions → score prompt variants → pick best |
| Fetchability (`gyms/fetchability/`) | Done | Proxy/fetch method selection per site | Probe URLs with 3 methods → classify outcomes → pick winner |
| KB (`kb/scenario.py`) | Done | Knowledge extraction from web sources | Extract → score quality → iterate |
| Code Cleanup (`gyms/code_cleanup/`) | Planned | Dead code, boilerplate, convention drift | Scan → generate fixes → evaluate acceptance |
| Sidekick | Planned | Intervention timing & helpfulness | Replay sessions → test policies → measure signal-to-noise |
```
# Quick test (5 sessions)
python -m learning.gyms.badge.gym --max-sessions 5

# Full run with HTML report
python -m learning.gyms.badge.gym --max-sessions 20 --report
```
Scores badge variants on: topic accuracy, stability across prompts, abstraction level, transition quality.
```
# Bulk URL comparison
python learning/gyms/fetchability/fetchability_tool.py \
    --url-file /tmp/urls.txt --parallelism 8
```
Probes each URL with httpx_proxy, browser_proxy, and browser_unlocker. LLM judge verifies content quality. Picks winner method per URL.
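All the Done gyms share the same skeleton. A minimal sketch of the generate → evaluate → pick loop — the scorer here is a stub, whereas real gyms extend `lib/gym/GymBase` and use LLM judges:

```python
from typing import Callable, TypeVar

V = TypeVar("V")

def run_gym(variants: list[V], score: Callable[[V], float]) -> tuple[V, float]:
    """Score every variant with the judge function and return the
    winning variant together with its score."""
    results = [(score(v), v) for v in variants]
    best_score, best = max(results, key=lambda r: r[0])
    return best, best_score
```

For the fetchability gym, `variants` would be the three fetch methods for a given URL and `score` an LLM judgment of the fetched content's quality; for the badge gym, prompt variants scored against replayed sessions.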
How it works:
1. `learning_worker.py` runs Stage 3: `extract_follow_ups(condensed)`
2. `store_follow_ups()` writes Principle entries directly in `learning.db` (with dedup by `auto-advise/{slug}` ID)
3. `materialize.py` generates `~/.claude/principles/auto-advise.md`
4. `principles_worker.py` (unchanged) matches auto-advise principles like any other principle

Why direct DB write? The `learn add --type principle` path always triggers LLM re-classification, which is wasteful since Opus already classified the pattern. `store_follow_ups()` writes Principle entries via `LearningStore.add_principle()` directly.
Three Claude Code skills cover the learning loop:
| Skill | Trigger | What it does |
|---|---|---|
| `/learn` | Encountering gotchas, discovering patterns | Add, lookup, list, rename knowledge |
| `/reflect` | After completing a task or fixing a bug | Step back → extract principles → review against existing |
| `/recall` | Before starting work in a familiar domain | Retrieve relevant past learnings for current context |
| Table | Rows | Purpose |
|---|---|---|
| `learning_instances` | 1,436 | Raw observations: content, type, project, source, tags |
| `principles` | 194 | Generalized rules with status (active/proposed/deprecated), full text, evidence counts |
| `instance_principle_links` | 428 | Evidence chain: which instances support which principles (with strength, link type) |
| `principle_applications` | 3,117 | When a principle was applied + outcome (followed, prevented_error, violated, etc.) |
| `principle_embeddings` | — | Vector embeddings for semantic search over principles |
| `instance_embeddings` | — | Vector embeddings for semantic search over instances |
Instance sources: 1,324 from session_reflection (automatic), 112 from manual (CLI).
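A query along these lines (column names are assumptions based on the schema table above) reconstructs the evidence chain that `learn show` displays:

```python
import sqlite3

# Join links to instances to list the evidence behind one principle,
# strongest support first (column names are illustrative).
EVIDENCE_SQL = """
SELECT i.content, l.strength
FROM instance_principle_links AS l
JOIN learning_instances AS i ON i.id = l.instance_id
WHERE l.principle_id = ?
ORDER BY l.strength DESC
"""

def evidence_chain(db_path: str, principle_id: str) -> list[tuple[str, float]]:
    """Return (instance content, link strength) pairs for a principle."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(EVIDENCE_SQL, (principle_id,)).fetchall()
```

This is the join that makes principles evidence-backed rather than free-floating rules: every row traces back to a concrete observation.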
learning.db is the single source of truth. The ~/.claude/principles/*.md files are materialized views — generated by materialize.py. Never edit .md files directly; use the CLI or DB, then re-materialize.
learn add uses Gemini Flash (~$0.0003/call) to classify observations and link them to existing principles. Flash agrees with frontier models 90%+ on structured classification. Use --raw to skip when classification is predetermined.
Before proposing a new principle, check if an existing one covers the case at a higher abstraction level. "Order tabs by user frequency" is an instance of "Importance Ordering & Attention Budget" — link the instance, don't create a near-duplicate principle.
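A sketch of how that near-duplicate check might work against the principle embeddings — the vectors here are toy stand-ins for real embedding-model output, and the threshold is illustrative:

```python
import math

def most_similar(candidate: list[float],
                 existing: dict[str, list[float]]) -> tuple[str, float]:
    """Return the existing principle closest to the candidate embedding,
    by cosine similarity, together with its score."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    name = max(existing, key=lambda k: cos(candidate, existing[k]))
    return name, cos(candidate, existing[name])

# If the best score clears a threshold (say 0.85), link the new instance
# to that principle instead of creating a near-duplicate.
```

The example in the text fits this shape: "Order tabs by user frequency" would score high against "Importance Ordering & Attention Budget" and be linked as an instance rather than stored as a new principle.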
Follow-up patterns bypass learn add --type principle (which re-classifies via LLM) and write Principle entries directly. Opus already classified them during extraction — re-classification would be wasteful.
learning_worker.py skips sessions with <3 user messages or <2 minutes duration. Also checks whether a manual wrapup already extracted learnings for the same session.
| Symptom | Cause | Fix |
|---|---|---|
| `learn add` hangs | LLM hot server not running | `ops restart llm` or pass `--raw` to skip LLM |
| Principles not updating in sessions | Materialized files stale | `python learning/schema/materialize.py` |
| Semantic search returns nothing | Embeddings missing | `learn embed`, then `learn health` to verify |
| SessionEnd not extracting learnings | Session too short or too few messages | Check `supervisor/logs/learning_worker.log` for skip reason |
| Auto-advise patterns not appearing | Not yet materialized after extraction | `python learning/schema/materialize.py --principles-only` |
| Duplicate principles | Similar observations classified separately | `learn link-parent` to merge, or mark duplicate as deprecated |
| `learn find -s` slow | Large embedding table or missing index | `learn health` to check; consider `learn embed --refresh` |
Generated Feb 2026. Stats from learning/data/learning.db. Source: present/learning/system_guide.html.
See also: Learning System Presentation (the "why" for external audiences) • learning/TODO.md (roadmap) • learning/HOWTO.md (process improvement methodology)