Learning System Guide

How the system captures knowledge from sessions, stores it in a unified DB, materializes it as principles Claude can act on, and runs gyms to evaluate and improve components.

Contents

  Part I — What & Why
    1. Why This Exists
    2. Core Concepts
    3. Walkthrough
    4. Self-Improvement Loops
    5. Status: What Works & What Doesn't
  Part II — Technical Reference
    6. Quick Start
    7. How Data Flows
    8. Runtime Components
    9. CLI Reference
    10. Principle Files
    11. Session Review
    12. Gyms (Detailed)
    13. Skills
    14. Directory Map
    15. DB Schema
    16. Design Decisions
    17. Troubleshooting
Part I
What & Why
For anyone new to the project — understand the problem, the approach, and current status before diving into implementation details.

1. Why This Exists

Every Claude Code session starts with no memory of previous ones. Hard-won insights from yesterday's debugging session are gone.

The problem, concretely

On Feb 3, Claude spent 12 minutes debugging why a Gradio app's head= parameter wasn't injecting CSS. Root cause: Gradio 6 silently ignores head= on gr.Blocks() — it must be passed to launch(). On Feb 8, the same bug. On Feb 14, again. Three sessions, same gotcha, zero memory between them.

With the learning system: the first occurrence was captured as a learning instance, linked to the "No Silent Failures" principle, materialized into ~/.claude/principles/dev.md, and injected into every subsequent session that touched Gradio code. The second and third occurrences never happened.

The learning system closes this loop. It mines sessions for patterns, stores them as evidence-backed principles, and injects relevant ones into future sessions before the same mistakes are repeated.

The gap isn't intelligence — it's institutional memory. A senior engineer doesn't just know more facts; they've internalized principles about when complexity is warranted, when to verify assumptions, when a workaround signals a wrong abstraction. This system builds that kind of accumulated judgment for an AI agent.
By the numbers: 1,436 learning instances · 192 active principles · 3,117 principle applications · 10 principle files.

2. Core Concepts

Four nouns you need to understand:

Instance: a single observation. "Gradio 6 ignores head= on gr.Blocks" — one concrete thing that happened.
Principle: a generalized rule backed by instances. "No Silent Failures" — the pattern across many observations.
Application: a record of a principle being used in a session, with outcome (followed, prevented error, violated).
Materialized file: a read-only .md file generated from the DB. What Claude actually sees in its context.

Observation → Instance → linked to Principle → materialized to .md file → injected into Claude's context

3. Walkthrough: Observation to Behavior Change

Tracing one real example through the entire system:

1. Observe — During a session, you discover that gr.Blocks(head="<style>...") silently ignores the CSS in Gradio 6.
2. Capture — Run: learn add "Gradio 6 silently ignores head= on gr.Blocks — must pass to launch()"
3. Classify — Gemini Flash (~0.3s) classifies it as type=pattern, project=rivus, and links it to the existing "No Silent Failures" principle with strength=0.8
4. Store — Instance goes into learning_instances, link goes into instance_principle_links, embedding auto-generated
5. Materialize — The next run of materialize.py writes the updated principle (with +1 instance count) to ~/.claude/principles/dev.md
6. Apply — In a future session, when Claude is about to edit a Gradio app, principles_worker.py matches the "No Silent Failures" principle and injects it as context. Claude uses launch(head=...) instead of gr.Blocks(head=...).
What the CLI output actually looks like
$ learn add "Gradio 6 silently ignores head= on gr.Blocks — must pass to launch()"
Classifying with gemini-2.5-flash...
  Type: pattern | Project: rivus | Tags: gradio, css, gotcha
  Linked to: No Silent Failures (strength: 0.8)
  Linked to: Prefer Context-Native Mechanisms (strength: 0.6)
  Instance ID: inst_a8f3c2
  Embedding generated ✓

$ learn show no-silent-failures
  Principle: No Silent Failures
  Status: active | Type: principle | File: dev.md
  Instances: 31 | Applications: 219
  Full text: "Never let a function silently ignore, drop, or swallow input.
  If something can't be processed, raise an error or log a warning..."

4. Self-Improvement Loops

Beyond capturing knowledge, the system actively improves itself through two mechanisms: auto-advise (learning what to do proactively) and gyms (testing and optimizing components).

Auto-Advise: Learning Proactive Behavior

Mines follow-up patterns from sessions — cases where the user's next request was a predictable action Claude should have done proactively.

Example pattern: After implementing a new CLI tool in a project with invoke tasks → Claude should proactively run it to verify, add an inv task, and commit. Instead of waiting for three separate user requests.

The pipeline: SessionEnd hook → Opus extracts follow-up patterns → stored as principles in auto-advise.md → injected into future sessions like any other principle. Currently 8 auto-advise patterns active.

Gyms: Generate → Evaluate → Learn

Offline evaluation environments that test and optimize system sub-components using historical data. Each gym tries variants of a prompt or method, scores them with LLM judges, and picks the winner.

| Gym | Status | What it improves |
|---|---|---|
| Badge | Done | iTerm2 session badge text quality |
| Fetchability | Done | Proxy/fetch method selection per site |
| KB | Done | Knowledge extraction from web sources |
| Code Cleanup | Planned | Dead code, boilerplate, convention drift |
| Sidekick | Planned | Intervention timing & helpfulness |

Gym findings feed back into the main learning loop as new instances or principle refinements. See Section 12 for detailed gym documentation.

5. Status: What Works & What Doesn't

An honest assessment of where the system stands, drawn from production stats, the ENHANCE markers in the presentation report, and learning/TODO.md.

What works well

Done Knowledge capture: 1,436 instances from sessions + manual input
Done Principle abstraction: 192 active principles across 10 domains
Done Materialization pipeline: DB → .md files → Claude context
Done Auto-extraction: SessionEnd hook mines learnings + follow-ups automatically
Done CLI: 13 commands covering full CRUD + search + health checks
Done Gyms: Badge, Fetchability, KB gyms completed with gen→eval→learn loops
Done Multi-model repair scoring: 66% unanimous across 6 models on 997 pairs
Done Skills: /learn, /reflect, /recall consolidated

In progress

WIP Hybrid retrieval: FTS + semantic search not yet combined (learn find uses one or the other)
WIP LLM re-ranker: Negation queries plateau at P@1=0.769 — needs re-ranking pass
WIP Principle category cleanup: dev/ and development/ need consolidation
WIP Retroactive study: 72 episodes analyzed but judge accuracy needs calibration (false positives)

Planned / known gaps

Planned Decision-point hooks: auto-inject learnings at UserPromptSubmit + PreToolCall
Planned Learning ↔ Skillz unification: shared DB, pluggable sources, gap analysis
Planned Hypothesis & UX experiments: forward-looking A/B testing of UX choices
Planned Embedding-based principle retrieval: needed past ~200 principles
Planned Task detection → Vario routing: auto-fan-out design tasks to multi-model exploration
Planned Sidekick & Code Cleanup gyms

Measurement gaps (from report ENHANCE markers)

The presentation report flags 9 areas where the system lacks measurement:

  1. No causal metrics — stats are descriptive (1,436 instances), not causal (error rates before/after)
  2. No before/after session comparison — the killer metric ("sessions with principles had X% fewer errors") doesn't exist yet
  3. Application tracking sparse — most principles have few tracked applications; automation needed
  4. Sandbox results not aggregated — per-principle A/B data exists in sandbox_results.db but isn't summarized
  5. No worked failure→repair example — the 8-candidate scoring pipeline needs a concrete trace
  6. Gym results under-documented — Badge gym ran 5 variants but results aren't in the report
  7. Verification levels unproven — L0-L1 claimed but evidence not shown
  8. Prior work comparison too high-level — needs specific paper references (Voyager, Reflexion, LATS)
  9. KB gym design choice rationale missing — 5-case test and 70% threshold need justification
Part II
Technical Reference
For developers working on or debugging the system — data flows, CLI commands, runtime components, DB schema, and troubleshooting.

6. Quick Start

Note: learn add calls the LLM hot server (Gemini Flash) for classification. If no LLM server is running, use --raw to store without classification. Start the server with ops restart llm.

Add knowledge

# LLM auto-classifies, links to existing principles
learn add "Gradio 6 silently ignores head= on gr.Blocks"

# Store directly, no LLM classification
learn add "Always use python not python3" --raw --type convention

# Add with project context
learn add "Redis TimeSeries keys use ts:{SYMBOL}:price:raw" --project moneygun

Search and browse

# Keyword search
learn find "gradio gotchas"

# Semantic search (uses embeddings — finds conceptual matches)
learn find -s "silent failures"

# List principles with full text
learn principles -v

# Show a specific principle + evidence chain
learn show no-silent-failures

Regenerate materialized files

# Preview changes without writing
python learning/schema/materialize.py --principles-only --dry-run

# Write all outputs
python learning/schema/materialize.py

Health check

learn health    # embedding coverage, orphans, staleness
learn stats     # summary statistics

7. How Data Flows

There are two main data paths: automatic (from sessions) and manual (from the CLI).

Automatic: Session → Principle

Runs automatically on every session end, as a fire-and-forget subprocess.

1. SessionEnd hook triggers learning_worker.py
2. Parse — reads the JSONL transcript, extracts user messages, assistant reasoning, errors, thinking blocks
3. Condense (Haiku) — compresses the raw narrative to ~3K words, keeping decisions and dead ends, dropping routine tool output
4. Extract learnings (Opus) — identifies 0-3 learnings worth recording: non-obvious bugs, abandoned approaches, user corrections, gotchas
5. Extract follow-ups (Opus) — identifies predictable follow-up patterns: actions Claude should do proactively next time (e.g., "test after implementation")
6. Store — learnings via learn add --raw; follow-up patterns written directly as Principle entries with file_path=auto-advise.md
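The six steps amount to a linear pipeline with one early exit. A hedged sketch of the control flow, with the LLM stages passed in as callables so it can be exercised without a model (function names are illustrative, not the real learning_worker.py API):

```python
import json

def parse_jsonl(path):
    """Read one transcript event per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def process_session(transcript_path, condense, extract_learnings,
                    extract_follow_ups, store):
    """SessionEnd pipeline: parse -> condense -> extract -> store."""
    events = parse_jsonl(transcript_path)                       # step 2
    if len([e for e in events if e["role"] == "user"]) < 3:
        return []                           # skip heuristic: too short
    condensed = condense(events)                                # step 3: Haiku
    learnings = extract_learnings(condensed)                    # step 4: Opus
    follow_ups = extract_follow_ups(condensed)                  # step 5: Opus
    for item in learnings + follow_ups:
        store(item)                                             # step 6
    return learnings + follow_ups
```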

Manual: Observation → Classified Instance

1. User runs learn add "observation"
2. LLM classifies (Gemini Flash, ~$0.0003/call) — assigns type, project, domain tags
3. LLM links — matches to existing principles or proposes new ones
4. Store — instance in learning_instances, links in instance_principle_links, embeddings auto-generated
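The auto-generated embeddings are what later power learn find -s. A minimal sketch of semantic lookup over stored vectors, using pure-Python cosine similarity (the real implementation sits behind the embedding tables and an embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_find(query_vec, embeddings, top_k=3):
    """Rank stored instance embeddings by similarity to the query.

    embeddings: {instance_id: vector}
    """
    scored = [(cosine(query_vec, vec), iid) for iid, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [iid for _, iid in scored[:top_k]]
```

This is why semantic search finds conceptual matches ("silent failures" retrieving the Gradio head= gotcha) that keyword FTS would miss.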

Materialization → Claude's Context

1. materialize.py reads learning.db and generates ~/.claude/learnings.md and ~/.claude/principles/*.md
2. The UserPromptSubmit hook fires principles_worker.py, which matches the user prompt against principles
3. Relevant principles are injected into Claude's context; applications are logged to the principle_applications table
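A hedged sketch of the matching step, using a simple keyword-overlap score (the real principles_worker.py matching is more involved; the scoring here is purely illustrative):

```python
def match_principles(prompt, principles, threshold=2):
    """Return principle titles whose keywords overlap the prompt enough.

    principles: {title: set_of_keywords}. Principles scoring at or
    above `threshold` shared words are injected, best match first.
    """
    words = set(prompt.lower().split())
    hits = []
    for title, keywords in principles.items():
        overlap = len(words & keywords)
        if overlap >= threshold:
            hits.append((overlap, title))
    hits.sort(reverse=True)
    return [title for _, title in hits]
```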
Sources                        Store            Outputs
─────────────────────          ─────            ───────
Manual (learn add)     ──→                  ┌─→ ~/.claude/learnings.md
SessionEnd hook        ──→   learning.db    ├─→ ~/.claude/principles/*.md
  learning_worker.py        (single source  ├─→ principle_applications table
  - extract learnings         of truth)     └─→ Gym reports
  - extract follow-ups ──→
Session review miners  ──→
Gyms (badge, fetchability) ──→

8. Runtime Components

Two background workers handle the automation. Understanding them is key to debugging.

learning_worker.py

Trigger: SessionEnd hook (fire-and-forget subprocess)

Location: supervisor/sidekick/hooks/learning_worker.py

What it does: Parse JSONL transcript → condense (Haiku) → extract learnings (Opus) → extract follow-ups (Opus) → store

Logs: supervisor/logs/learning_worker.log

Manual re-run: python -m supervisor.sidekick.hooks.learning_worker <session-id>
Skip heuristics: <3 user messages, <2 min duration, or wrapup already ran

principles_worker.py

Trigger: UserPromptSubmit hook (async, non-blocking)

Location: supervisor/sidekick/hooks/principles_worker.py

What it does: Matches user prompt against materialized principles → injects relevant ones into Claude's context

Logs: helm/logs/hooks.log

Applications tracked in principle_applications table for observability
Hook registered in ~/.claude/settings.json under hooks.UserPromptSubmit

Why multiple LLMs?

The system uses 3 different models for cost/quality tradeoffs: Haiku ($0.25/M tok) for cheap, fast transcript condensation. Opus ($15/M tok) for high-reasoning extraction where quality matters. Gemini Flash (~$0.08/M tok) for sub-cent classification on every learn add. Total cost per session: ~$0.05-0.15.
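Those per-token prices make the per-session cost easy to estimate. A rough worked example; the token counts are illustrative assumptions (and output-token costs are ignored), not measured values:

```python
# Input-token price per million tokens, per the figures above.
PRICE_PER_MTOK = {"haiku": 0.25, "opus": 15.0, "gemini-flash": 0.08}

def stage_cost(model, tokens):
    """Dollar cost of feeding `tokens` input tokens to `model`."""
    return PRICE_PER_MTOK[model] * tokens / 1_000_000

# Hypothetical session: a 40K-token transcript condensed by Haiku,
# then two Opus extraction passes over the ~4K-token summary.
cost = stage_cost("haiku", 40_000) + 2 * stage_cost("opus", 4_000)
print(f"${cost:.3f}")   # ≈ $0.130, inside the quoted $0.05-0.15 range
```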

9. CLI Reference

The learn command (~/.local/bin/learntools/bin/learn) is the primary interface. All commands operate on learning/data/learning.db.

| Command | What it does | Key flags |
|---|---|---|
| learn add | Add an observation; LLM classifies + links to principles | --raw (skip LLM), --type, --project, --tags |
| learn find | Search by keyword (FTS) or semantically | -s (semantic), --type, --project |
| learn list | List instances with filters | --type, --project, --source, --limit |
| learn show <id> | Show instance or principle details + evidence chain | |
| learn principles | List all principles | -v (full text), --type |
| learn apply | Record a principle application + outcome | --outcome (followed/prevented_error/etc.) |
| learn link | Link an instance to a principle | --strength, --link-type |
| learn link-parent | Set parent-child between principles | |
| learn rename | Rename principle category prefix | |
| learn embed | Generate/refresh embeddings for all entries | |
| learn health | Embedding coverage, orphans, staleness | |
| learn stats | Summary statistics | |
| learn provenance | Full provenance chain for an instance | |

10. Principle Files

materialize.py generates one .md file per domain from the DB. These live at ~/.claude/principles/ and are read-only views — edit via the CLI or DB, then re-materialize.
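Materialization is conceptually simple: group active principles by domain file and render read-only markdown. A minimal sketch, assuming a list of principle dicts (materialize.py's real record types and output format differ):

```python
from collections import defaultdict

def materialize(principles):
    """Render one markdown document per domain file.

    principles: list of dicts with file_path, title, text, n_instances,
    status. Returns {file_path: markdown_text}; deprecated/proposed
    principles are excluded from the materialized views.
    """
    by_file = defaultdict(list)
    for p in principles:
        if p.get("status", "active") == "active":
            by_file[p["file_path"]].append(p)
    docs = {}
    for path, ps in by_file.items():
        lines = ["<!-- generated; edit via `learn`, then re-materialize -->"]
        for p in sorted(ps, key=lambda p: -p["n_instances"]):
            lines.append(f"\n## {p['title']} ({p['n_instances']} instances)")
            lines.append(p["text"])
        docs[path] = "\n".join(lines)
    return docs
```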

| File | Principles | Instances | Applications | Description |
|---|---|---|---|---|
| dev.md | 95 | 269 | 1,944 | Development practices (largest — covers architecture, error handling, API usage, tooling) |
| ux.md | 22 | 32 | 311 | UX design: layout, information hierarchy, user signals |
| batch-jobs.md | 15 | 45 | 357 | Observability, fault isolation, checkpointing at scale |
| testing.md | 15 | 17 | 93 | Testing practices and verification patterns |
| observability.md | 14 | 25 | 279 | Logging, monitoring, debugging visibility |
| data-quality.md | 9 | 32 | 116 | Data validation and pipeline robustness |
| auto-advise.md | 8 | 1 | 0 | Follow-up patterns mined from sessions (new) |
| parallelism.md | 7 | 3 | 12 | When and how to parallelize work |
| knowledge-accumulation.md | 4 | 4 | 5 | Meta-principles about the learning system itself |
| backtesting.md | 2 | 0 | 0 | No future snooping, detection vs prediction |

Top 10 Principles by Evidence

| Principle | Instances | Applications |
|---|---|---|
| No Silent Failures | 31 | 219 |
| Prefer Context-Native Mechanisms | 24 | 50 |
| Verify API Details, Don't Fabricate | 19 | 164 |
| Respect Abstraction Boundaries | 13 | 28 |
| Look One Layer Out | 13 | 45 |
| Errors Compound Downstream | 12 | 23 |
| Verifying Early Beats Correcting Later | 12 | 29 |
| Fail Loud, Never Fake | 10 | 42 |
| Workarounds Piling Up = Wrong Abstraction | 10 | 20 |
| Recognize Poor Fit, Find Better Tools | 10 | 57 |

11. Session Review

Analyzes how work was done (efficiency, errors, patterns). Complements doctor/chronicle which analyzes what was done (topics, accomplishments). All tools read from the same ~/.claude/projects/ JSONL transcripts.

| Tool | What it measures | Data |
|---|---|---|
| failure_mining.py | Error → repair pairs from transcripts. Collects next 8 successes as candidates. | data/failures.db |
| pair_judge_compare.py | Multi-model scoring of repair pairs (Gemini Flash + Grok 4.1, ~$2/1K pairs). 66% unanimous across 6 models. | data/failures.db |
| tool_error_analysis.py | Categorizes tool errors by type, frequency, and time cost. Found 90%+ tool success rate. | data/tool_errors.db |
| parallelism_analysis.py | Detects missed parallelism. Found 661 opportunities, ~497 min wasted. | Stdout report |
| sandbox_replay.py | A/B tests principles in Docker sandboxes. Runs Claude Code at a specific commit with/without principles. | data/sandbox_results.db |
| retroactive_study.py | Scans existing transcripts for where principles would have applied (fills principle_applications). | DB direct |
Common session review commands
# Mine failures from last 30 days
python -m learning.session_review.failure_mining --clear --days 30

# Score with defaults (flash + grok_think, majority vote)
python -m learning.session_review.pair_judge_compare

# Analyze tool errors
python -m learning.session_review.tool_error_analysis --days 3

# Run principle A/B test in Docker
python -m learning.session_review.sandbox_replay \
  --prompt "Find all Gradio apps" --commit HEAD

12. Gyms (Detailed)

Offline evaluation environments that test and optimize system sub-components using historical data. Each gym runs a generate → evaluate → learn loop: try variants of a prompt/method, score them with LLM judges, pick the winner. Gym findings can become new learning instances or principle refinements, feeding back into the main loop. All gyms extend lib/gym/GymBase.

| Gym | Location | Status | What it improves | Method |
|---|---|---|---|---|
| Badge | gyms/badge/ | Done | Badge text quality (iTerm2 session badges) | Replay sessions → score prompt variants → pick best |
| Fetchability | gyms/fetchability/ | Done | Proxy/fetch method selection per site | Probe URLs with 3 methods → classify outcomes → pick winner |
| KB | kb/scenario.py | Done | Knowledge extraction from web sources | Extract → score quality → iterate |
| Code Cleanup | gyms/code_cleanup/ | Planned | Dead code, boilerplate, convention drift | Scan → generate fixes → evaluate acceptance |
| Sidekick | — | Planned | Intervention timing & helpfulness | Replay sessions → test policies → measure signal-to-noise |
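All of these gyms share one skeleton. A hedged sketch of the generate → evaluate → learn loop; the callables stand in for a variant generator and an LLM judge, and the names are illustrative rather than the real lib/gym/GymBase API:

```python
def run_gym(variants, cases, generate, judge):
    """Score each variant across historical cases; return the winner.

    generate(variant, case) -> candidate output for one case
    judge(case, output)     -> score in [0, 1] (an LLM judge in practice)
    Returns (best_variant, {variant: mean_score}).
    """
    totals = {}
    for variant in variants:
        scores = [judge(case, generate(variant, case)) for case in cases]
        totals[variant] = sum(scores) / len(scores)
    winner = max(totals, key=totals.get)
    return winner, totals
```

The "learn" half of the loop is then writing the winning variant (or a refined principle about why it won) back into learning.db.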
Badge gym example
# Quick test (5 sessions)
python -m learning.gyms.badge.gym --max-sessions 5

# Full run with HTML report
python -m learning.gyms.badge.gym --max-sessions 20 --report

Scores badge variants on: topic accuracy, stability across prompts, abstraction level, transition quality.

Fetchability gym example
# Bulk URL comparison
python learning/gyms/fetchability/fetchability_tool.py \
  --url-file /tmp/urls.txt --parallelism 8

Probes each URL with httpx_proxy, browser_proxy, and browser_unlocker. LLM judge verifies content quality. Picks winner method per URL.
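A sketch of the per-URL probe-and-pick logic. The method names come from the description above; the cheapest-first fallback ordering and the judge interface are illustrative assumptions, not the gym's actual policy:

```python
METHODS = ["httpx_proxy", "browser_proxy", "browser_unlocker"]

def pick_fetch_method(url, probe, judge, methods=METHODS):
    """Try each fetch method in order; return the first acceptable one.

    probe(url, method) -> page text, or None on failure
    judge(text)        -> True if content quality passes (LLM in practice)
    """
    for method in methods:
        text = probe(url, method)
        if text is not None and judge(text):
            return method
    return None   # no method produced acceptable content
```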

Auto-Advise Pipeline (detailed)

How it works:

  1. SessionEnd → learning_worker.py runs Stage 3: extract_follow_ups(condensed)
  2. Opus identifies 0-3 follow-up patterns from the condensed transcript
  3. Patterns stored as Principle entries directly in learning.db (with dedup by auto-advise/{slug} ID)
  4. materialize.py generates ~/.claude/principles/auto-advise.md
  5. principles_worker.py (unchanged) matches auto-advise principles like any other principle

Why direct DB write? The learn add --type principle path always triggers LLM re-classification, which is wasteful since Opus already classified the pattern. store_follow_ups() writes Principle entries via LearningStore.add_principle() directly.
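A sketch of the dedup-by-slug behavior described above; the slugging rule and the store callable are simplified assumptions about how store_follow_ups() and LearningStore.add_principle() fit together:

```python
import re

def slugify(title):
    """Lowercase, hyphen-separated ID fragment, e.g. 'test-after-implementation'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def store_follow_ups(patterns, existing_ids, add_principle):
    """Write follow-up patterns as principles, skipping slugs already stored.

    existing_ids:  set of principle IDs already in learning.db
    add_principle: persists one row (stands in for LearningStore.add_principle)
    """
    written = []
    for title in patterns:
        pid = f"auto-advise/{slugify(title)}"
        if pid in existing_ids:
            continue                  # dedup: pattern already captured
        add_principle(pid, title, file_path="auto-advise.md")
        existing_ids.add(pid)
        written.append(pid)
    return written
```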

13. Skills

Three Claude Code skills cover the learning loop:

| Skill | Trigger | What it does |
|---|---|---|
| /learn | Encountering gotchas, discovering patterns | Add, lookup, list, rename knowledge |
| /reflect | After completing a task or fixing a bug | Step back → extract principles → review against existing |
| /recall | Before starting work in a familiar domain | Retrieve relevant past learnings for current context |

14. Directory Map

learning/
├── cli.py                       # `learn` CLI — 13 commands
├── search_eval.py               # Retrieval quality evaluation (FTS vs semantic vs hybrid)
├── tasks.py                     # Invoke tasks for the learning module
│
├── schema/
│   ├── learning_store.py        # Core: LearningStore, Principle, LearningInstance, enums
│   ├── init.sql                 # DB schema (tables, indexes, triggers)
│   ├── materialize.py           # DB → ~/.claude/principles/*.md + learnings.md
│   ├── import_principles.py     # Bulk import from YAML/markdown
│   ├── link_instances.py        # Auto-link instances to principles
│   ├── app.py                   # Gradio UI for browsing the DB
│   └── rebuild.py               # Full DB rebuild
│
├── session_review/
│   ├── failure_mining.py        # Mine error→repair pairs
│   ├── pair_judge.py            # LLM judge prompts + helpers
│   ├── pair_judge_compare.py    # Multi-model scoring
│   ├── tool_error_analysis.py   # Error categorization + impact
│   ├── parallelism_analysis.py  # Missed parallelism detection
│   ├── sandbox_replay.py        # Docker-based principle A/B tests
│   ├── retroactive_study.py     # Principle adherence scanning
│   ├── principle_propose.py     # Propose new principles from evidence
│   └── principle_refine.py      # Refine existing principle text
│
├── gyms/
│   ├── badge/                   # Badge prompt variant testing
│   ├── fetchability/            # Proxy/method comparison per site
│   └── code_cleanup/            # (planned) Codebase hygiene
│
├── memory/                      # pgvector-based knowledge store (hybrid retrieval)
├── pond/                        # Visual concept generation experiments
│
├── data/
│   └── learning.db              # The single source of truth
│
├── CLAUDE.md                    # Instructions for Claude
├── HOWTO.md                     # Process improvement methodology
└── TODO.md                      # Roadmap

15. DB Schema

| Table | Rows | Purpose |
|---|---|---|
| learning_instances | 1,436 | Raw observations: content, type, project, source, tags |
| principles | 194 | Generalized rules with status (active/proposed/deprecated), full text, evidence counts |
| instance_principle_links | 428 | Evidence chain: which instances support which principles (with strength, link type) |
| principle_applications | 3,117 | When a principle was applied + outcome (followed, prevented_error, violated, etc.) |
| principle_embeddings | — | Vector embeddings for semantic search over principles |
| instance_embeddings | — | Vector embeddings for semantic search over instances |

Key relationships

learning_instances ──(instance_principle_links)──→ principles
                                                       │
                                           principle_applications
                                      (tracked by principles_worker)

Instance sources: 1,324 from session_reflection (automatic), 112 from manual (CLI).

16. Design Decisions

DB-native principles

learning.db is the single source of truth. The ~/.claude/principles/*.md files are materialized views — generated by materialize.py. Never edit .md files directly; use the CLI or DB, then re-materialize.

LLM classification on add

learn add uses Gemini Flash (~$0.0003/call) to classify observations and link them to existing principles. Flash agrees with frontier models 90%+ on structured classification. Use --raw to skip when classification is predetermined.

Dedup by abstraction

Before proposing a new principle, check if an existing one covers the case at a higher abstraction level. "Order tabs by user frequency" is an instance of "Importance Ordering & Attention Budget" — link the instance, don't create a near-duplicate principle.

Auto-advise: direct DB write

Follow-up patterns bypass learn add --type principle (which re-classifies via LLM) and write Principle entries directly. Opus already classified them during extraction — re-classification would be wasteful.

Session skipping heuristics

learning_worker.py skips sessions with <3 user messages or <2 minutes duration. Also checks whether a manual wrapup already extracted learnings for the same session.
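The heuristics reduce to a few lines. A sketch with thresholds taken from the text; the argument names are illustrative:

```python
def should_skip(n_user_messages, duration_minutes, wrapup_ran):
    """Return a skip reason, or None if the session is worth mining."""
    if n_user_messages < 3:
        return "too few user messages"
    if duration_minutes < 2:
        return "session too short"
    if wrapup_ran:
        return "wrapup already extracted learnings"
    return None
```

The returned reason is what you would expect to find in supervisor/logs/learning_worker.log when a session is skipped.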

17. Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| learn add hangs | LLM hot server not running | ops restart llm, or pass --raw to skip LLM |
| Principles not updating in sessions | Materialized files stale | python learning/schema/materialize.py |
| Semantic search returns nothing | Embeddings missing | learn embed, then learn health to verify |
| SessionEnd not extracting learnings | Session too short or too few messages | Check supervisor/logs/learning_worker.log for skip reason |
| Auto-advise patterns not appearing | Not yet materialized after extraction | python learning/schema/materialize.py --principles-only |
| Duplicate principles | Similar observations classified separately | learn link-parent to merge, or mark duplicate as deprecated |
| learn find -s slow | Large embedding table or missing index | learn health to check; consider learn embed --refresh |

Generated Feb 2026. Stats from learning/data/learning.db. Source: present/learning/system_guide.html.

See also: Learning System Presentation (the "why" for external audiences) • learning/TODO.md (roadmap) • learning/HOWTO.md (process improvement methodology)