How an AI assistant extracts principles from its own work — about code quality, UX design, data processing, and architecture — and uses them to produce better solutions over time.
In the agentic future, AI agents will be remarkably competent and intelligent out of the box. But there will always be frontiers of specialized capability — from protein folding to reasoning about the impact of news on semiconductor supply chains — that require two things generic models lack:
Generic LLMs (Large Language Models — AI models trained on vast text data that can generate, analyze, and reason about text; examples: Claude, GPT, Gemini) will be commoditized — everyone has access to the same frontier models. What remains scarce and valuable is curated, verified domain expertise that agents can leverage for specialized reasoning.
The agent that can analyze a 10-K like a forensic accountant, evaluate a founder like a seasoned VC, or assess supply chain risk like a procurement expert will outperform a generic agent with the same base model. The differentiator isn't reasoning capability — it's having better, fresher, more structured domain knowledge to reason about.
This report describes the learning system — one half of how we build that domain expertise layer. It handles the internal loop: extracting principles from the agent's own work, testing them, and feeding them back. Its companion, the skill acquisition system, handles the external loop: extracting domain knowledge from the web, verifying it through execution, and building a growing library of specialized capabilities. Together, they compound: the agent gets better at learning, and what it learns makes it better at its work.
AI coding assistants like Claude Code are remarkably capable. But each session starts with limited memory from prior sessions. The agent doesn't accumulate wisdom: hard-won insights about what makes code clean, what makes layouts intuitive, what makes data pipelines robust, what architectural choices pay off.
Quality: Session 14 builds a Gradio UI. The layout fights the framework — CSS hacks, MutationObservers, custom JavaScript. Session 22 hits the same pattern. Neither session learns the deeper lesson: multiple workarounds signal you're fighting the framework instead of using it correctly.
Efficiency: Session 38 runs a batch job on 300 items. Item #287 has a missing field that causes a crash. No checkpoint, no resume. Wasted: ~40 minutes. Session 39 repeats the pattern with a different dataset.
Architecture: Session 45 builds a new CLI tool from scratch. Session 52 builds another. Neither asks: can an existing system already do this with a small extension?
Analysis of a single user's (Tim's) Claude Code sessions over ~7 weeks found:
The issue isn't raw capability — it's the absence of accumulated judgment. A senior engineer doesn't just know more facts; they've internalized principles about when complexity is warranted, when to verify assumptions, when a workaround signals a wrong abstraction. The agent lacks a comparable mechanism for building that kind of wisdom.
What if the AI agent could review its own work — not just what went wrong, but what went right and why — abstract those insights into general principles about code quality, UX design, data handling, and architecture, test them, and carry them forward into every future session?
That's the learning system. It extracts wisdom across multiple dimensions:
| Domain | What it learns | Example principle |
|---|---|---|
| Code quality | Patterns that produce clean, maintainable code | "Fail loud, never fake" — no silent fallbacks |
| UX design | What makes interfaces intuitive and clear | "Content outweighs chrome" — answers, not scaffolding, should stand out |
| Data processing | How to build robust, resumable pipelines | "Validate before bulk" — sample 5 items before running 300 |
| Architecture | When to extend vs. build, when complexity is earned | "Extend, don't invent" — a small extension beats a new app |
| Observability | Making systems self-aware and debuggable | "The system must be able to observe itself" |
The implementation is a closed loop:
The key insight is that principles, not fixes, are the unit of learning. A specific fix ("add --force flag") helps once. A principle ("verify assumptions at system boundaries") prevents entire classes of errors across all future sessions.
Every Claude Code session produces a JSONL (JSON Lines — one JSON event per line) transcript: the raw record of everything the agent did — every tool call, every result, every error. The mining step parses these transcripts looking for sequences where a tool call fails and a subsequent call succeeds.
- Detect: scan transcripts for tool results with `is_error: true`. Categorize by type (import error, file not found, edit mismatch, syntax error, timeout).
- Store: verified failure→fix pairs land in `failures.db`. Currently: 1,000+ pairs from 30 days of sessions.
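The mining step can be sketched as follows. This is a minimal illustration, not the production miner: the event field names (`is_error`, `tool`) are assumptions, since real transcript schemas vary by version.

```python
import json

def mine_failure_fix_pairs(transcript_path, window=8):
    """Pair each tool error with up to `window` subsequent successful
    calls as candidate repairs (LLM judges pick the real fix later).
    Event field names like `is_error` are illustrative; real
    transcripts vary by version."""
    with open(transcript_path) as f:
        events = [json.loads(line) for line in f if line.strip()]
    pairs = []
    for i, event in enumerate(events):
        if event.get("is_error"):
            # Keep several candidates: the true fix is often not the
            # very next call, because diagnostic steps come first.
            candidates = [e for e in events[i + 1:i + 1 + window]
                          if not e.get("is_error")]
            if candidates:
                pairs.append({"failure": event, "candidates": candidates})
    return pairs
```

The 8-call window matters: pairing an error with only the immediately following success often captures a diagnostic step rather than the repair.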
Why 8 candidates instead of just the next success? After an error, AI agents typically investigate (Read, Grep, search) before fixing. Naively pairing "error → next success" was wrong 62% of the time — it paired the error with a diagnostic step, not the actual fix.
Given 8 candidate repairs, which one actually fixed the problem? This is a judgment call that requires understanding code context, so the system uses multiple LLMs as independent judges of repair quality:
| Model | Agreement with flagship | Cost (1K pairs) |
|---|---|---|
| Gemini 3 Flash | 92% | $1.72 |
| Claude Opus 4.6 | 92% | ~$15 |
| Grok 4.1 Thinking | 88% | ~$0.25 |
| Gemini 3 Pro (flagship) | — | $8.50 |
Majority vote across 2+ models sets the verdict. Default pair: Gemini Flash + Grok Thinking (~$2 total) achieves 90%+ agreement with the expensive flagship. Across 997 pairs, 66% had unanimous agreement across all 6 tested models.
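The majority-vote rule described above can be sketched in a few lines. The verdict labels (`fix`, `not_fix`, `uncertain`) are illustrative stand-ins for whatever the real pipeline records.

```python
from collections import Counter

def majority_verdict(votes):
    """Aggregate per-model verdicts on one candidate repair.
    `votes` maps model name -> 'fix' or 'not_fix'. At least two
    models must agree; exact ties come back 'uncertain'."""
    counts = Counter(votes.values())
    if not counts:
        return "uncertain"
    top, n = counts.most_common(1)[0]
    if n >= 2 and n > len(votes) - n:  # strict majority of 2+ models
        return top
    return "uncertain"
```

With the default pair (Flash + Grok), agreement yields a verdict and disagreement falls back to "uncertain", which could then be escalated to a third judge.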
This is where the magic happens. Given hundreds of verified failure→fix pairs, a three-stage LLM pipeline abstracts them into general principles:
A principle that sounds wise might actually hurt in practice. How do you know? You test it. The sandbox replay system runs Claude Code inside Docker containers at specific git commits, with and without each principle in its instructions. Each replay runs in a clean, disposable container, so tests can't affect real work, and results with and without the principle are compared on the same tasks:
# Run eval campaign: 20 prompts × 7 principles + baseline
python -m learning.session_review.sandbox_replay --eval default --parallel 6
The idea is simple but powerful: take a curated set of 20 coding tasks, run each task twice — once with the principle injected into the agent's instructions, once without (baseline). Measure everything: wall-clock time, tool calls, turns, and result quality (scored by LLM judges on a 0–100 scale).
This creates a controlled experiment for each principle. A principle that speeds things up but reduces quality is caught. A principle that helps on some tasks but hurts others gets a "guard clause" — a qualification like "apply EXCEPT when..." that limits it to the contexts where it actually helps, based on regression testing.
This principle says: "Don't reach for destructive tools when a scoped or reversible alternative exists. `shutil.rmtree()`, `rm -rf`, `DROP TABLE`, `git reset --hard` — these are irreversible and overbroad."
In the sandbox, the agent is given tasks that involve cleanup, deletion, and restructuring. With the principle injected, the agent reaches for `trash` instead of `rm`, uses a scoped `git restore --staged file` instead of `git reset --hard`, and wraps destructive database operations in transactions.
The replay measures: did the principle slow the agent down? (Minimal — choosing a safer tool takes the same time.) Did it prevent damage? (Yes — in one test case, the baseline agent deleted a directory it shouldn't have.) Quality score: higher with the principle. Time: no significant difference.
Verdict: principle pays for itself. No guard clause needed.
The campaign runs 20 prompts × 7 principles + baseline = 160 runs, each in an isolated Docker container. Multiple LLM judges score quality independently. Only effects larger than 25% are considered statistically meaningful (wall-clock variance is ~20%).
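The significance rule above (effects under 25% are treated as noise, since wall-clock variance alone is ~20%) might be sketched like this. It is a simplification: the real harness also tracks tool calls, turns, and judged quality.

```python
def assess_principle(baseline_secs, principle_secs, noise_floor=0.25):
    """Compare mean wall-clock time with vs. without a principle.
    Effects smaller than the ~20% run-to-run variance (noise floor
    set at 25%) are treated as no change."""
    base = sum(baseline_secs) / len(baseline_secs)
    treat = sum(principle_secs) / len(principle_secs)
    effect = (base - treat) / base  # positive: principle is faster
    if abs(effect) < noise_floor:
        return "no significant effect", effect
    return ("speedup" if effect > 0 else "slowdown"), effect
```

A 10% improvement is thus reported as no effect, while a 40% improvement clears the bar.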
The replay data accumulates in `sandbox_results.db` but hasn't been aggregated for this report. Everything else lives in a SQLite database (`learning.db`; a lightweight single-file database, no server needed) with a clear data model:
| Entity | What it represents | Count |
|---|---|---|
| Instance | A single observation from a session, experiment, or human | 1,126 |
| Principle | An abstracted, general-purpose insight | 98 |
| Link | Connects instances to principles (supports, contradicts, refines) | 187 |
| Application | Records when a principle was used and whether it helped | 1 |
Principles are the primary output. They're materialized as markdown files — regenerated from the database, which remains the source of truth — that get loaded into every future Claude Code session, so the agent can read simple text instead of querying a database:
## No Silent Failures
If something fails, make it visible. Never swallow exceptions,
return empty data on error, or let a process silently produce
no output. The cost of a noisy failure is one interruption.
The cost of a silent failure is hours of debugging the wrong thing.
<!-- evidence: 16 instances, 0 applications, as of 2026-02-14 -->
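A principle file like the one above could be rendered from a database row with a sketch like this; the function name and field names are illustrative, not the system's actual API.

```python
def render_principle(name, body, instances, applications, as_of):
    """Render one principle row from the database as the markdown
    file the agent loads at session start. The database stays the
    source of truth; these files are regenerated on change."""
    return "\n".join([
        f"## {name}",
        "",
        body.strip(),
        "",
        f"<!-- evidence: {instances} instances, "
        f"{applications} applications, as of {as_of} -->",
    ])
```

Keeping the evidence counts in an HTML comment means they travel with the file without cluttering what the agent reads.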
Not all learnings are equal. The system distinguishes five levels:
| Type | Scope | Example | Generality |
|---|---|---|---|
| Principle | 5+ applications | "Verify assumptions at boundaries" | Universal |
| Convention | How we do things | "Use python not python3" | Project |
| Pattern | Recurring solution | "Retry with exponential backoff" | Domain |
| Howto | Specific fix | "Shell escaping: use heredoc" | Situational |
| Observation | Raw insight | "Edit fails 9% of the time" | Data point |
The system's goal is to promote observations upward: raw data becomes patterns, patterns become principles. Each level up means wider applicability and more leverage.
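The promotion ladder could be expressed as a simple classifier. The thresholds here are assumptions loosely following the table above (only the "5+ applications" bar for principles is stated in the source).

```python
def classify_learning(applications, generality):
    """Place a learning on the observation -> principle ladder.
    Thresholds are illustrative: 5+ applications with universal
    generality earns principle status; lower counts map to the
    narrower levels."""
    if applications >= 5 and generality == "universal":
        return "principle"
    if applications >= 3:
        return "pattern"
    if applications >= 1:
        return "howto"
    return "observation"
```

Each promotion widens applicability: the same insight goes from a one-off data point to something injected into every session.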
This is the system's most evidence-backed principle. It emerged from mining sessions where failures were swallowed: exceptions caught and ignored, errors returning empty data, processes silently producing no output.
From 16 such instances, the system abstracted: "If something fails, make it visible." This principle now prevents the agent from writing code with silent exception handling, empty fallbacks, or unchecked return values.
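The contrast the principle draws can be shown concretely. Both functions below are hypothetical examples, assuming a JSON config file as the thing being loaded.

```python
import json

def load_config_silent(path):
    """Anti-pattern: swallow the error and return empty data.
    The caller proceeds on garbage and fails hours later."""
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return {}  # silent failure: nothing signals the load failed

def load_config_loud(path):
    """Principle applied: let the failure surface immediately,
    at the point where it is cheapest to diagnose."""
    with open(path) as f:  # FileNotFoundError propagates loudly
        return json.load(f)
```

The cost asymmetry from the principle text applies directly: the loud version interrupts once; the silent version produces an empty config that misconfigures everything downstream.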
Several sessions ran batch processing on hundreds of items, only to discover a systematic problem late in the run:
The agent processed 300 company profiles. After 2 hours, it discovered that the API had changed its response format. All 300 results were missing a critical field. No checkpoint, no resume — the entire run had to be redone.
If it had run 5 items first and checked the output, it would have caught the problem in 30 seconds instead of 2 hours.
The abstracted principle: "Before running 300+ items or 2+ hours of processing, run a representative sample (5–10 items) and verify the output has what you need."
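The principle translates into a small wrapper around any batch run. This is a sketch under stated assumptions: `process` and `required_field` are illustrative stand-ins for the real pipeline step and the field being checked.

```python
def run_batch(items, process, required_field, sample_size=5):
    """Validate-before-bulk: process a small sample first, verify
    the output has the field we need, and only then run the full
    batch. A failed sample costs seconds, not hours."""
    for item in items[:sample_size]:
        result = process(item)
        if required_field not in result:
            raise ValueError(
                f"sample output missing {required_field!r}; "
                f"aborting before the bulk run")
    # Sample items are reprocessed here for simplicity; a real
    # pipeline would reuse their results and checkpoint progress.
    return [process(item) for item in items]
```

In the API-format-change scenario above, the `ValueError` would have fired on item 1 of 5 instead of after two hours and 300 items.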
When building UIs with Gradio, the agent kept adding CSS hacks to fix layout issues — a MutationObserver here, a custom JavaScript override there, another !important rule. Seven separate sessions exhibited this pattern before the system identified the meta-lesson:
When you're adding CSS hacks, MutationObservers, or custom JS to fix layout issues — STOP. Multiple workarounds signal you're fighting the framework instead of using it correctly.
This principle now triggers early in sessions, before the agent starts down the workaround path, saving hours of accumulated hack-upon-hack debugging.
This principle didn't emerge from a bug — it emerged from observing what makes interfaces good. Across several UI-building sessions, the system noticed a recurring pattern: labels, headers, and scaffolding were dominating the visual hierarchy, while the actual content — the answers the user came for — receded into the background.
Before: `<b>Type:</b> article` — the label "Type" is bold and draws the eye, while the actual information ("article") is plain text.
After: `TYPE: <b>article</b>` — the label recedes (small, grey) and the content shines (bold, prominent). The eye lands on the information first.
The abstracted principle: "Content outweighs chrome. The answers, not the scaffolding, should stand out." This now guides every UI the agent builds — not avoiding a mistake, but making a better design choice from the start.
Four separate sessions built new tools from scratch when a small extension to an existing system would have sufficed. The system abstracted:
When you need a new capability, first ask: can an existing system already do this with a small extension? A new extension point on a working system beats a new app every time.
This principle produces less code and better architecture — fewer moving parts, less duplication, a more cohesive codebase.
A novel pattern that emerged from the system's own learning process: when searching for information across many similar entities (e.g., finding investor relations pages for 500 companies), structure the work so each search enriches a shared knowledge base. After discovering that 60% of S&P 500 companies use the same webcast provider, you can predict the pattern for new companies without searching.
This principle didn't come from a failure — it came from observing that stateless search (running the same strategy for every entity) was wasteful. The system generalized the pattern into a reusable principle with concrete implementation guidance.
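One way to sketch the pattern: a searcher that records every answer in a shared knowledge base and, once one answer dominates, predicts it instead of searching. The class name, the 0.6 threshold (mirroring the "60% use the same provider" observation), and the minimum-observations cutoff are all illustrative assumptions.

```python
class EnrichingSearcher:
    """Search that enriches a shared knowledge base as it goes.
    After enough observations, if one answer dominates past the
    threshold, predict it for new entities instead of searching."""

    def __init__(self, search_fn, threshold=0.6, min_obs=3):
        self.search_fn = search_fn
        self.threshold = threshold
        self.min_obs = min_obs
        self.observed = []  # answers seen so far

    def lookup(self, entity):
        if len(self.observed) >= self.min_obs:
            top = max(set(self.observed), key=self.observed.count)
            if self.observed.count(top) / len(self.observed) >= self.threshold:
                return top, "predicted"  # skip the search entirely
        answer = self.search_fn(entity)
        self.observed.append(answer)
        return answer, "searched"
```

The stateless alternative would run `search_fn` 500 times; this version stops searching once the dominant pattern is established.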
How do you know if a learning system is actually making the AI agent better? This is the hardest question, and the system addresses it at multiple levels:
| Metric | What it measures | How | Current |
|---|---|---|---|
| Instance count per principle | Evidence strength | Count of linked instances | Top: 16x, Mean: ~2x |
| Application tracking | Was the principle used? Did it help? | principle_applications table | 1 recorded (sparse) |
| Sandbox speedup | Does the principle make tasks faster? | Docker replay with/without principle | In progress |
| Error rate by category | Are specific error types decreasing? | Tool error analysis over time | Tracked daily |
| Coverage score | % of test failures a principle would prevent | LLM evaluation on held-out set | Per-principle |
| Metric | What it tells you | Target |
|---|---|---|
| Principles proposed vs. promoted | Pipeline selectivity | High rejection rate = quality bar |
| Multi-model agreement rate | Repair scoring reliability | 66% unanimous (current) |
| FTS5 (full-text search) recall | Can users find relevant knowledge? | Qualitative |
| Principle staleness | Are old principles still relevant? | No stale principles surfaced confidently |
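The FTS5 recall metric in the table refers to SQLite's built-in full-text search engine, which can be exercised directly. The schema below is illustrative, not the real `learning.db` layout; it assumes a Python build with FTS5 compiled in (standard in recent distributions).

```python
import sqlite3

# Illustrative principle table; the real schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE principles USING fts5(name, body)")
conn.executemany(
    "INSERT INTO principles VALUES (?, ?)",
    [("No Silent Failures",
      "never swallow exceptions or return empty data on error"),
     ("Validate Before Bulk",
      "sample a few items before running hundreds")])

# FTS5 MATCH applies an implicit AND: both terms must appear.
rows = conn.execute(
    "SELECT name FROM principles WHERE principles MATCH ?",
    ("swallow exceptions",)).fetchall()
```

Keyword queries like this take milliseconds even over thousands of principles, which is why retrieval recall, not speed, is the metric worth tracking.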
The system includes a rigorous cost-benefit analysis framework for fixes (documented in HOWTO.md):
prevention_cost = cost_per_call × total_calls
failure_cost = cost_per_failure × failure_count
ROI = failure_cost - prevention_cost
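The same arithmetic as a runnable sketch. Note the quantity is really a net benefit (failure cost avoided minus prevention cost) rather than a ratio; the example numbers are made up.

```python
def fix_roi(cost_per_call, total_calls, cost_per_failure, failure_count):
    """Net benefit of a preventive fix, per the formulas above.
    A negative value means prevention costs more than simply
    tolerating the failures."""
    prevention_cost = cost_per_call * total_calls
    failure_cost = cost_per_failure * failure_count
    return failure_cost - prevention_cost
```

A check that runs on every call but prevents a rare, cheap failure can easily come out negative, which is exactly the "fix everything" trap.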
This prevents the common trap of "fix everything" — some errors are so rare or cheap that preventing them costs more than tolerating them. The system categorizes fixes into three tiers:
The system defines intelligence levels for measuring progress:
| Level | Capability | Test |
|---|---|---|
| L0 | Store & retrieve | Exact match recall |
| L1 | Generalize | Cross-memory patterns discovered |
| L2 | Rate importance | Frequently-used knowledge surfaces first |
| L3 | Context-aware retrieval | Relevant knowledge, not just similar |
| L4 | Contradiction detection | Stale or conflicting knowledge flagged |
| L5 | Synthesize | Combines knowledge to produce novel insights |
Current system: L0–L1 achieved, working toward L3. L4–L5 represent the long-term vision.
RAG systems retrieve relevant documents to augment LLM prompts. The learning system shares this retrieval mechanism (FTS5 plus embeddings: vectors where similar meanings land close together, so "verify before acting" and "check assumptions first" embed similarly despite different words) but differs fundamentally: RAG retrieves existing documents; the learning system creates new knowledge by abstracting patterns from operational data. The knowledge doesn't exist in any document — it's synthesized from hundreds of error→fix sequences.
RLHF trains model weights using human preference signals. The learning system operates outside the model — it doesn't change model weights. Instead, it changes the context (instructions, principles) that the model receives at the start of each session. This is faster to iterate (no retraining), more interpretable (principles are human-readable text), and cheaper (SQLite + LLM calls vs. GPU training runs).
Reinforcement learning agents store and replay experiences from a buffer. The learning system's failure mining is analogous to experience replay, but with a crucial difference: it abstracts before replaying. Instead of replaying raw (state, action, reward) tuples, it extracts principles that generalize across episodes. One principle prevents dozens of future failures, whereas experience replay helps with the specific situations in the buffer.
The SRE tradition of learning from incidents is the closest analog. The learning system automates this process: instead of a human writing "we should have validated the API response," the system mines transcripts, identifies the pattern, proposes the principle, tests it, and injects it into future sessions. It's a continuous, automated postmortem running after every session.
The sandbox replay system (testing principles in Docker) is a form of curriculum learning — the agent is evaluated on a curated set of tasks that increase in difficulty. The "gyms" (badge quality, fetchability, knowledge extraction) extend this to continuous self-improvement loops on specific capabilities.
Several aspects of this system don't have close precedents:
| Principle | Evidence | Domain |
|---|---|---|
| No Silent Failures | 16 instances | Development |
| Verifying Early Beats Correcting Later | 11 instances | Data Quality |
| Errors Compound Downstream | 10 instances | Data Quality |
| Fail Loud, Never Fake | 10 instances | Development |
| Respect Abstraction Boundaries | 8 instances | Development |
| Workarounds Piling Up = Wrong Abstraction | 7 instances | Development |
| Validate Before Bulk, Not After | 6 instances | Batch Jobs |
| Minimize Blast Radius | 6 instances | Development |
| Self-Observation (system must observe itself) | 6 instances | Observability |
| Verify Action | 5 instances | Development |
Of 1,126 instances: 95% (1,068) come from automated session mining, 5% (58) from manual input. The system is overwhelmingly self-feeding — humans provide seed knowledge, the mining pipeline does the rest.
997 failure pairs scored across 6 models. Key findings:
The biggest gap is application tracking. The system knows what it learned but doesn't yet systematically measure whether those learnings are being used and helping. Closing this loop — tracking when each principle fires, whether the outcome was better — is the difference between a knowledge base and a learning system.
The highest-value learning isn't domain-specific ("use word boundaries for ticker matching"). It's recognizing that a problem in one domain has already been solved in another.
When designing principle retrieval, the system recognized the structural similarity to recommendation systems: one side is stable (principles), the other changes constantly (situations). This is the same two-tower architecture as Google Search (query + document towers), Spotify recommendations (user + item towers), and CLIP (image + text towers): two encoders project their inputs into a shared space where similarity can be measured, and adding a new item to one side makes it instantly searchable.
The match was structural, not superficial — same shape (two entities, asymmetric update frequency, learned relevance) despite completely different domains. The learning system should eventually make these cross-domain connections automatically.
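A toy sketch of the retrieval side of that architecture: principles are embedded once (the stable tower), each new situation is embedded at query time, and cosine similarity ranks the match. The two-element vectors below are stand-ins for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(situation_vec, principle_vecs, k=2):
    """Rank pre-embedded principles against a freshly embedded
    situation; return the names of the top k."""
    ranked = sorted(principle_vecs.items(),
                    key=lambda kv: cosine(situation_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

The asymmetry is the point: adding a principle means embedding one new vector, while situations never need to be stored at all.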
A new piece of information on a well-covered topic is boring to the system — low surprise, high explainability, rich connections. That's understanding.
The vision is a system that doesn't just store facts but understands its domain: it can explain why things work, predict what will go wrong, and synthesize novel solutions by combining knowledge from different areas. The intelligence levels (L0–L5) provide a roadmap:
Beyond session mining, the system includes "gyms" — structured generate→evaluate→learn loops for specific capabilities:
| Gym | What it improves | Status |
|---|---|---|
| Badge | Session topic summarization quality | Done |
| Fetchability | Which proxy/method works for which URLs | Done |
| KB Extraction | Knowledge extraction from web sources | Done |
| Sidekick | When to intervene vs. stay quiet | Designed |
| Code Cleanup | Codebase hygiene suggestions | Designed |
Each gym is a miniature learning system for a specific capability, using the same Raw + Active architecture: raw episodes are immutable, the active model is swappable, and improvement is measured by replaying the corpus through new versions.
Everything lives in `learning.db`, a SQLite file you can query directly: fast, reliable, and perfect for a single-user knowledge system. The `.md` files are materialized views of the database: the database is the source of truth, and the markdown files are regenerated from it whenever principles change, so the agent can read simple text files instead of querying a database.

Report generated 2026-02-14. Data from learning/data/learning.db. Source: present/learning/report.html.
System by Tim Chklovski. Built with Claude Code + rivus tooling.