How an AI assistant extracts principles from its own work — about code quality, UX design, data processing, and architecture — and uses them to produce better solutions over time.
In the agentic future, AI agents will be remarkably competent and intelligent out of the box. But there will always be frontiers of specialized capability — from protein folding to reasoning about the impact of news on semiconductor supply chains — that require two things generic models lack:
Generic LLMs (Large Language Models — AI models trained on vast text data that can generate, analyze, and reason about text; examples: Claude, GPT, Gemini) will be commoditized — everyone has access to the same frontier models. What remains scarce and valuable is curated, verified domain expertise that agents can leverage for specialized reasoning.
The agent that can analyze a 10-K like a forensic accountant, evaluate a founder like a seasoned VC, or assess supply chain risk like a procurement expert will outperform a generic agent with the same base model. The differentiator isn't reasoning capability — it's having better, fresher, more structured domain knowledge to reason about.
This report describes the learning system — one half of how we build that domain expertise layer. It handles the internal loop: extracting principles from the agent's own work, testing them, and feeding them back. Its companion, the skill acquisition system, handles the external loop: extracting domain knowledge from the web, verifying it through execution, and building a growing library of specialized capabilities. Together, they compound: the agent gets better at learning, and what it learns makes it better at its work.
AI coding assistants like Claude Code are remarkably capable. But each session starts with limited memory from prior sessions. The agent doesn't accumulate wisdom: hard-won insights about what makes code clean, what makes layouts intuitive, what makes data pipelines robust, what architectural choices pay off.
Quality: Session 14 builds a Gradio UI. The layout fights the framework — CSS hacks, MutationObservers, custom JavaScript. Session 22 hits the same pattern. Neither session learns the deeper lesson: multiple workarounds signal you're fighting the framework instead of using it correctly.
Efficiency: Session 38 runs a batch job on 300 items. Item #287 has a missing field that causes a crash. No checkpoint, no resume. Wasted: ~40 minutes. Session 39 repeats the pattern with a different dataset.
Architecture: Session 45 builds a new CLI tool from scratch. Session 52 builds another. Neither asks: can an existing system already do this with a small extension?
Analysis of a single user's (Tim's) Claude Code sessions over ~7 weeks found:
The issue isn't raw capability — it's the absence of accumulated judgment. A senior engineer doesn't just know more facts; they've internalized principles about when complexity is warranted, when to verify assumptions, when a workaround signals a wrong abstraction. The agent lacks a comparable mechanism for building that kind of wisdom.
What if the AI agent could review its own work — not just what went wrong, but what went right and why — abstract those insights into general principles about code quality, UX design, data handling, and architecture, test them, and carry them forward into every future session?
That's the learning system. It extracts wisdom across multiple dimensions:
| Domain | What it learns | Example principle |
|---|---|---|
| Code quality | Patterns that produce clean, maintainable code | "Fail loud, never fake" — no silent fallbacks |
| UX design | What makes interfaces intuitive and clear | "Content outweighs chrome" — answers, not scaffolding, should stand out |
| Data processing | How to build robust, resumable pipelines | "Validate before bulk" — sample 5 items before running 300 |
| Architecture | When to extend vs. build, when complexity is earned | "Extend, don't invent" — a small extension beats a new app |
| Observability | Making systems self-aware and debuggable | "The system must be able to observe itself" |
The implementation is a closed loop:
The key insight is that principles, not fixes, are the unit of learning. A specific fix ("add --force flag") helps once. A principle ("verify assumptions at system boundaries") prevents entire classes of errors across all future sessions.
Every Claude Code session produces a JSONL (JSON Lines — one JSON event per line) transcript: the raw record of everything the agent did — every tool call, every result, every error. The mining step parses these transcripts looking for sequences where a tool call fails and a subsequent call succeeds.
- Detect: scan transcripts for tool results with `is_error: true`. Categorize by type (import error, file not found, edit mismatch, syntax error, timeout).
- Store: verified failure→fix pairs land in `failures.db`. Currently: 1,000+ pairs from 30 days of sessions.
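The mining step can be sketched as follows. This is a minimal illustration, not the production miner: the event field names (`is_error`, `tool`) are assumptions, since real transcript schemas vary by version.

```python
import json

def mine_failure_fix_pairs(transcript_path, window=8):
    """Pair each tool error with up to `window` subsequent successful
    calls as candidate repairs (LLM judges pick the real fix later).
    Event field names like `is_error` are illustrative; real
    transcripts vary by version."""
    with open(transcript_path) as f:
        events = [json.loads(line) for line in f if line.strip()]
    pairs = []
    for i, event in enumerate(events):
        if event.get("is_error"):
            # Keep several candidates: the true fix is often not the
            # very next call, because diagnostic steps come first.
            candidates = [e for e in events[i + 1:i + 1 + window]
                          if not e.get("is_error")]
            if candidates:
                pairs.append({"failure": event, "candidates": candidates})
    return pairs
```

The 8-call window matters: pairing an error with only the immediately following success often captures a diagnostic step rather than the repair.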
Why 8 candidates instead of just the next success? After an error, AI agents typically investigate (Read, Grep, search) before fixing. Naively pairing "error → next success" was wrong 62% of the time — it paired the error with a diagnostic step, not the actual fix.
Given 8 candidate repairs, which one actually fixed the problem? This is a judgment call that requires understanding code context, so the system uses multiple LLMs as independent judges of repair quality:
| Model | Agreement with flagship | Cost (1K pairs) |
|---|---|---|
| Gemini 3 Flash | 92% | $1.72 |
| Claude Opus 4.6 | 92% | ~$15 |
| Grok 4.1 Thinking | 88% | ~$0.25 |
| Gemini 3 Pro (flagship) | — | $8.50 |
Majority vote across 2+ models sets the verdict. Default pair: Gemini Flash + Grok Thinking (~$2 total) achieves 90%+ agreement with the expensive flagship. Across 997 pairs, 66% had unanimous agreement across all 6 tested models.
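The majority-vote rule described above can be sketched in a few lines. The verdict labels (`fix`, `not_fix`, `uncertain`) are illustrative stand-ins for whatever the real pipeline records.

```python
from collections import Counter

def majority_verdict(votes):
    """Aggregate per-model verdicts on one candidate repair.
    `votes` maps model name -> 'fix' or 'not_fix'. At least two
    models must agree; exact ties come back 'uncertain'."""
    counts = Counter(votes.values())
    if not counts:
        return "uncertain"
    top, n = counts.most_common(1)[0]
    if n >= 2 and n > len(votes) - n:  # strict majority of 2+ models
        return top
    return "uncertain"
```

With the default pair (Flash + Grok), agreement yields a verdict and disagreement falls back to "uncertain", which could then be escalated to a third judge.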
This is where the magic happens. Given hundreds of verified failure→fix pairs, a three-stage LLM pipeline abstracts them into general principles:
A principle that sounds wise might actually hurt in practice. How do you know? You test it. The sandbox replay system runs Claude Code inside Docker containers at specific git commits, with and without each principle in its instructions. Each replay runs in a clean, disposable container, so tests can't affect real work, and results with and without the principle are compared on the same tasks:
# Run eval campaign: 20 prompts × 7 principles + baseline
python -m learning.session_review.sandbox_replay --eval default --parallel 6
The idea is simple but powerful: take a curated set of 20 coding tasks, run each task twice — once with the principle injected into the agent's instructions, once without (baseline). Measure everything: wall-clock time, tool calls, turns, and result quality (scored by LLM judges on a 0–100 scale).
This creates a controlled experiment for each principle. A principle that speeds things up but reduces quality is caught. A principle that helps on some tasks but hurts others gets a "guard clause" — a qualification like "apply EXCEPT when..." that limits it to the contexts where it actually helps, based on regression testing.
This principle says: "Don't reach for destructive tools when a scoped or reversible alternative exists. `shutil.rmtree()`, `rm -rf`, `DROP TABLE`, `git reset --hard` — these are irreversible and overbroad."
In the sandbox, the agent is given tasks that involve cleanup, deletion, and restructuring. With the principle injected, the agent reaches for `trash` instead of `rm`, uses a scoped `git restore --staged file` instead of `git reset --hard`, and wraps destructive database operations in transactions.
The replay measures: did the principle slow the agent down? (Minimal — choosing a safer tool takes the same time.) Did it prevent damage? (Yes — in one test case, the baseline agent deleted a directory it shouldn't have.) Quality score: higher with the principle. Time: no significant difference.
Verdict: principle pays for itself. No guard clause needed.
The campaign runs 20 prompts × 7 principles + baseline = 160 runs, each in an isolated Docker container. Multiple LLM judges score quality independently. Only effects larger than 25% are considered statistically meaningful (wall-clock variance is ~20%).
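The significance rule above (effects under 25% are treated as noise, since wall-clock variance alone is ~20%) might be sketched like this. It is a simplification: the real harness also tracks tool calls, turns, and judged quality.

```python
def assess_principle(baseline_secs, principle_secs, noise_floor=0.25):
    """Compare mean wall-clock time with vs. without a principle.
    Effects smaller than the ~20% run-to-run variance (noise floor
    set at 25%) are treated as no change."""
    base = sum(baseline_secs) / len(baseline_secs)
    treat = sum(principle_secs) / len(principle_secs)
    effect = (base - treat) / base  # positive: principle is faster
    if abs(effect) < noise_floor:
        return "no significant effect", effect
    return ("speedup" if effect > 0 else "slowdown"), effect
```

A 10% improvement is thus reported as no effect, while a 40% improvement clears the bar.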
The replay data accumulates in `sandbox_results.db` but hasn't been aggregated for this report. Everything else lives in a SQLite database (`learning.db`; a lightweight single-file database, no server needed) with a clear data model:
| Entity | What it represents | Count |
|---|---|---|
| Instance | A single observation from a session, experiment, or human | 1,126 |
| Principle | An abstracted, general-purpose insight | 98 |
| Link | Connects instances to principles (supports, contradicts, refines) | 187 |
| Application | Records when a principle was used and whether it helped | 1 |
Principles are the primary output. They're materialized as markdown files — regenerated from the database, which remains the source of truth — that get loaded into every future Claude Code session, so the agent can read simple text instead of querying a database:
## No Silent Failures
If something fails, make it visible. Never swallow exceptions,
return empty data on error, or let a process silently produce
no output. The cost of a noisy failure is one interruption.
The cost of a silent failure is hours of debugging the wrong thing.
<!-- evidence: 16 instances, 0 applications, as of 2026-02-14 -->
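A principle file like the one above could be rendered from a database row with a sketch like this; the function name and field names are illustrative, not the system's actual API.

```python
def render_principle(name, body, instances, applications, as_of):
    """Render one principle row from the database as the markdown
    file the agent loads at session start. The database stays the
    source of truth; these files are regenerated on change."""
    return "\n".join([
        f"## {name}",
        "",
        body.strip(),
        "",
        f"<!-- evidence: {instances} instances, "
        f"{applications} applications, as of {as_of} -->",
    ])
```

Keeping the evidence counts in an HTML comment means they travel with the file without cluttering what the agent reads.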
Not all learnings are equal. The system distinguishes five levels:
| Type | Scope | Example | Generality |
|---|---|---|---|
| Principle | 5+ applications | "Verify assumptions at boundaries" | Universal |
| Convention | How we do things | "Use python not python3" | Project |
| Pattern | Recurring solution | "Retry with exponential backoff" | Domain |
| Howto | Specific fix | "Shell escaping: use heredoc" | Situational |
| Observation | Raw insight | "Edit fails 9% of the time" | Data point |
The system's goal is to promote observations upward: raw data becomes patterns, patterns become principles. Each level up means wider applicability and more leverage.
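The promotion ladder could be expressed as a simple classifier. The thresholds here are assumptions loosely following the table above (only the "5+ applications" bar for principles is stated in the source).

```python
def classify_learning(applications, generality):
    """Place a learning on the observation -> principle ladder.
    Thresholds are illustrative: 5+ applications with universal
    generality earns principle status; lower counts map to the
    narrower levels."""
    if applications >= 5 and generality == "universal":
        return "principle"
    if applications >= 3:
        return "pattern"
    if applications >= 1:
        return "howto"
    return "observation"
```

Each promotion widens applicability: the same insight goes from a one-off data point to something injected into every session.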
This is the system's most evidence-backed principle. It emerged from mining sessions where failures were swallowed: exceptions caught and ignored, errors returning empty data, processes silently producing no output.
From 16 such instances, the system abstracted: "If something fails, make it visible." This principle now prevents the agent from writing code with silent exception handling, empty fallbacks, or unchecked return values.
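The contrast the principle draws can be shown concretely. Both functions below are hypothetical examples, assuming a JSON config file as the thing being loaded.

```python
import json

def load_config_silent(path):
    """Anti-pattern: swallow the error and return empty data.
    The caller proceeds on garbage and fails hours later."""
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return {}  # silent failure: nothing signals the load failed

def load_config_loud(path):
    """Principle applied: let the failure surface immediately,
    at the point where it is cheapest to diagnose."""
    with open(path) as f:  # FileNotFoundError propagates loudly
        return json.load(f)
```

The cost asymmetry from the principle text applies directly: the loud version interrupts once; the silent version produces an empty config that misconfigures everything downstream.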
Several sessions ran batch processing on hundreds of items, only to discover a systematic problem late in the run:
The agent processed 300 company profiles. After 2 hours, it discovered that the API had changed its response format. All 300 results were missing a critical field. No checkpoint, no resume — the entire run had to be redone.
If it had run 5 items first and checked the output, it would have caught the problem in 30 seconds instead of 2 hours.
The abstracted principle: "Before running 300+ items or 2+ hours of processing, run a representative sample (5–10 items) and verify the output has what you need."
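The principle translates into a small wrapper around any batch run. This is a sketch under stated assumptions: `process` and `required_field` are illustrative stand-ins for the real pipeline step and the field being checked.

```python
def run_batch(items, process, required_field, sample_size=5):
    """Validate-before-bulk: process a small sample first, verify
    the output has the field we need, and only then run the full
    batch. A failed sample costs seconds, not hours."""
    for item in items[:sample_size]:
        result = process(item)
        if required_field not in result:
            raise ValueError(
                f"sample output missing {required_field!r}; "
                f"aborting before the bulk run")
    # Sample items are reprocessed here for simplicity; a real
    # pipeline would reuse their results and checkpoint progress.
    return [process(item) for item in items]
```

In the API-format-change scenario above, the `ValueError` would have fired on item 1 of 5 instead of after two hours and 300 items.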
When building UIs with Gradio, the agent kept adding CSS hacks to fix layout issues — a MutationObserver here, a custom JavaScript override there, another !important rule. Seven separate sessions exhibited this pattern before the system identified the meta-lesson:
When you're adding CSS hacks, MutationObservers, or custom JS to fix layout issues — STOP. Multiple workarounds signal you're fighting the framework instead of using it correctly.
This principle now triggers early in sessions, before the agent starts down the workaround path, saving hours of accumulated hack-upon-hack debugging.
This principle didn't emerge from a bug — it emerged from observing what makes interfaces good. Across several UI-building sessions, the system noticed a recurring pattern: labels, headers, and scaffolding were dominating the visual hierarchy, while the actual content — the answers the user came for — receded into the background.
Before: `<b>Type:</b> article` — the label "Type" is bold and draws the eye, while the actual information ("article") is plain text.
After: `TYPE: <b>article</b>` — the label recedes (small, grey) and the content shines (bold, prominent). The eye lands on the information first.
The abstracted principle: "Content outweighs chrome. The answers, not the scaffolding, should stand out." This now guides every UI the agent builds — not avoiding a mistake, but making a better design choice from the start.
Four separate sessions built new tools from scratch when a small extension to an existing system would have sufficed. The system abstracted:
When you need a new capability, first ask: can an existing system already do this with a small extension? A new extension point on a working system beats a new app every time.
This principle produces less code and better architecture — fewer moving parts, less duplication, a more cohesive codebase.
A novel pattern that emerged from the system's own learning process: when searching for information across many similar entities (e.g., finding investor relations pages for 500 companies), structure the work so each search enriches a shared knowledge base. After discovering that 60% of S&P 500 companies use the same webcast provider, you can predict the pattern for new companies without searching.
This principle didn't come from a failure — it came from observing that stateless search (running the same strategy for every entity) was wasteful. The system generalized the pattern into a reusable principle with concrete implementation guidance.
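One way to sketch the pattern: a searcher that records every answer in a shared knowledge base and, once one answer dominates, predicts it instead of searching. The class name, the 0.6 threshold (mirroring the "60% use the same provider" observation), and the minimum-observations cutoff are all illustrative assumptions.

```python
class EnrichingSearcher:
    """Search that enriches a shared knowledge base as it goes.
    After enough observations, if one answer dominates past the
    threshold, predict it for new entities instead of searching."""

    def __init__(self, search_fn, threshold=0.6, min_obs=3):
        self.search_fn = search_fn
        self.threshold = threshold
        self.min_obs = min_obs
        self.observed = []  # answers seen so far

    def lookup(self, entity):
        if len(self.observed) >= self.min_obs:
            top = max(set(self.observed), key=self.observed.count)
            if self.observed.count(top) / len(self.observed) >= self.threshold:
                return top, "predicted"  # skip the search entirely
        answer = self.search_fn(entity)
        self.observed.append(answer)
        return answer, "searched"
```

The stateless alternative would run `search_fn` 500 times; this version stops searching once the dominant pattern is established.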
How do you know if a learning system is actually making the AI agent better? This is the hardest question, and the system addresses it at multiple levels:
| Metric | What it measures | How | Current |
|---|---|---|---|
| Instance count per principle | Evidence strength | Count of linked instances | Top: 16x, Mean: ~2x |
| Application tracking | Was the principle used? Did it help? | principle_applications table | 1 recorded (sparse) |
| Sandbox speedup | Does the principle make tasks faster? | Docker replay with/without principle | In progress |
| Error rate by category | Are specific error types decreasing? | Tool error analysis over time | Tracked daily |
| Coverage score | % of test failures a principle would prevent | LLM evaluation on held-out set | Per-principle |
| Metric | What it tells you | Target |
|---|---|---|
| Principles proposed vs. promoted | Pipeline selectivity | High rejection rate = quality bar |
| Multi-model agreement rate | Repair scoring reliability | 66% unanimous (current) |
| FTS5 (full-text search) recall | Can users find relevant knowledge? | Qualitative |
| Principle staleness | Are old principles still relevant? | No stale principles surfaced confidently |
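The FTS5 recall metric in the table refers to SQLite's built-in full-text search engine, which can be exercised directly. The schema below is illustrative, not the real `learning.db` layout; it assumes a Python build with FTS5 compiled in (standard in recent distributions).

```python
import sqlite3

# Illustrative principle table; the real schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE principles USING fts5(name, body)")
conn.executemany(
    "INSERT INTO principles VALUES (?, ?)",
    [("No Silent Failures",
      "never swallow exceptions or return empty data on error"),
     ("Validate Before Bulk",
      "sample a few items before running hundreds")])

# FTS5 MATCH applies an implicit AND: both terms must appear.
rows = conn.execute(
    "SELECT name FROM principles WHERE principles MATCH ?",
    ("swallow exceptions",)).fetchall()
```

Keyword queries like this take milliseconds even over thousands of principles, which is why retrieval recall, not speed, is the metric worth tracking.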
The system includes a rigorous cost-benefit analysis framework for fixes (documented in HOWTO.md):
prevention_cost = cost_per_call × total_calls
failure_cost = cost_per_failure × failure_count
ROI = failure_cost - prevention_cost
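The same arithmetic as a runnable sketch. Note the quantity is really a net benefit (failure cost avoided minus prevention cost) rather than a ratio; the example numbers are made up.

```python
def fix_roi(cost_per_call, total_calls, cost_per_failure, failure_count):
    """Net benefit of a preventive fix, per the formulas above.
    A negative value means prevention costs more than simply
    tolerating the failures."""
    prevention_cost = cost_per_call * total_calls
    failure_cost = cost_per_failure * failure_count
    return failure_cost - prevention_cost
```

A check that runs on every call but prevents a rare, cheap failure can easily come out negative, which is exactly the "fix everything" trap.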
This prevents the common trap of "fix everything" — some errors are so rare or cheap that preventing them costs more than tolerating them. The system categorizes fixes into three tiers:
The system defines intelligence levels for measuring progress:
| Level | Capability | Test |
|---|---|---|
| L0 | Store & retrieve | Exact match recall |
| L1 | Generalize | Cross-memory patterns discovered |
| L2 | Rate importance | Frequently-used knowledge surfaces first |
| L3 | Context-aware retrieval | Relevant knowledge, not just similar |
| L4 | Contradiction detection | Stale or conflicting knowledge flagged |
| L5 | Synthesize | Combines knowledge to produce novel insights |
Current system: L0–L1 achieved, working toward L3. L4–L5 represent the long-term vision.
RAG systems retrieve relevant documents to augment LLM prompts. The learning system shares this retrieval mechanism (FTS5 plus embeddings: vectors where similar meanings land close together, so "verify before acting" and "check assumptions first" embed similarly despite different words) but differs fundamentally: RAG retrieves existing documents; the learning system creates new knowledge by abstracting patterns from operational data. The knowledge doesn't exist in any document — it's synthesized from hundreds of error→fix sequences.
RLHF trains model weights using human preference signals. The learning system operates outside the model — it doesn't change model weights. Instead, it changes the context (instructions, principles) that the model receives at the start of each session. This is faster to iterate (no retraining), more interpretable (principles are human-readable text), and cheaper (SQLite + LLM calls vs. GPU training runs).
Reinforcement learning agents store and replay experiences from a buffer. The learning system's failure mining is analogous to experience replay, but with a crucial difference: it abstracts before replaying. Instead of replaying raw (state, action, reward) tuples, it extracts principles that generalize across episodes. One principle prevents dozens of future failures, whereas experience replay helps with the specific situations in the buffer.
The SRE tradition of learning from incidents is the closest analog. The learning system automates this process: instead of a human writing "we should have validated the API response," the system mines transcripts, identifies the pattern, proposes the principle, tests it, and injects it into future sessions. It's a continuous, automated postmortem running after every session.
The sandbox replay system (testing principles in Docker) is a form of curriculum learning — the agent is evaluated on a curated set of tasks that increase in difficulty. The "gyms" (badge quality, fetchability, knowledge extraction) extend this to continuous self-improvement loops on specific capabilities.
Several aspects of this system don't have close precedents:
| Principle | Evidence | Domain |
|---|---|---|
| No Silent Failures | 16 instances | Development |
| Verifying Early Beats Correcting Later | 11 instances | Data Quality |
| Errors Compound Downstream | 10 instances | Data Quality |
| Fail Loud, Never Fake | 10 instances | Development |
| Respect Abstraction Boundaries | 8 instances | Development |
| Workarounds Piling Up = Wrong Abstraction | 7 instances | Development |
| Validate Before Bulk, Not After | 6 instances | Batch Jobs |
| Minimize Blast Radius | 6 instances | Development |
| Self-Observation (system must observe itself) | 6 instances | Observability |
| Verify Action | 5 instances | Development |
Of 1,126 instances: 95% (1,068) come from automated session mining, 5% (58) from manual input. The system is overwhelmingly self-feeding — humans provide seed knowledge, the mining pipeline does the rest.
997 failure pairs scored across 6 models. Key findings:
The biggest gap is application tracking. The system knows what it learned but doesn't yet systematically measure whether those learnings are being used and helping. Closing this loop — tracking when each principle fires, whether the outcome was better — is the difference between a knowledge base and a learning system.
The highest-value learning isn't domain-specific ("use word boundaries for ticker matching"). It's recognizing that a problem in one domain has already been solved in another.
When designing principle retrieval, the system recognized the structural similarity to recommendation systems: one side is stable (principles), the other changes constantly (situations). This is the same two-tower architecture as Google Search (query + document towers), Spotify recommendations (user + item towers), and CLIP (image + text towers): two encoders project their inputs into a shared space where similarity can be measured, and adding a new item to one side makes it instantly searchable.
The match was structural, not superficial — same shape (two entities, asymmetric update frequency, learned relevance) despite completely different domains. The learning system should eventually make these cross-domain connections automatically.
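A toy sketch of the retrieval side of that architecture: principles are embedded once (the stable tower), each new situation is embedded at query time, and cosine similarity ranks the match. The two-element vectors below are stand-ins for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(situation_vec, principle_vecs, k=2):
    """Rank pre-embedded principles against a freshly embedded
    situation; return the names of the top k."""
    ranked = sorted(principle_vecs.items(),
                    key=lambda kv: cosine(situation_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

The asymmetry is the point: adding a principle means embedding one new vector, while situations never need to be stored at all.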
A new piece of information on a well-covered topic is boring to the system — low surprise, high explainability, rich connections. That's understanding.
The vision is a system that doesn't just store facts but understands its domain: it can explain why things work, predict what will go wrong, and synthesize novel solutions by combining knowledge from different areas. The intelligence levels (L0–L5) provide a roadmap:
Beyond session mining, the system includes "gyms" — structured generate→evaluate→learn loops for specific capabilities:
| Gym | What it improves | Status |
|---|---|---|
| Badge | Session topic summarization quality | Done |
| Fetchability | Which proxy/method works for which URLs | Done |
| KB Extraction | Knowledge extraction from web sources | Done |
| Sidekick | When to intervene vs. stay quiet | Designed |
| Code Cleanup | Codebase hygiene suggestions | Designed |
Each gym is a miniature learning system for a specific capability, using the same Raw + Active architecture: raw episodes are immutable, the active model is swappable, and improvement is measured by replaying the corpus through new versions.
Everything lives in `learning.db`, a SQLite file you can query directly: fast, reliable, and perfect for a single-user knowledge system. The `.md` files are materialized views of the database: the database is the source of truth, and the markdown files are regenerated from it whenever principles change, so the agent can read simple text files instead of querying a database.

Report generated 2026-02-14. Data from learning/data/learning.db. Source: present/learning/report.html.
System by Tim Chklovski. Built with Claude Code + rivus tooling.