How an AI coding agent retrieves the right lessons from experience and adapts them to the task at hand
A companion report describes how we extract principles from an AI coding agent's own work — mining session transcripts for error→fix pairs, abstracting them into general rules, and testing those rules in sandboxed replays. After 217 sessions, the system has distilled 156 principles backed by 1,336 specific instances.
But having principles isn't the same as using them well. Consider:
This principle says: Don't reach for destructive tools when a scoped or reversible alternative exists.
Excellent advice. But when the agent is about to run a batch delete on 300 stale cache files, what does it actually mean here? It might mean "use trash instead of rm." Or "delete one file first to verify the glob pattern." Or "wrap the operation in a transaction." The principle is the what; the agent still needs to figure out the how for this particular situation.
This is the gap between retrieval (finding the right principles) and adaptation (making them actionable for the current task). Both are hard, for different reasons:
1,336 instances outnumber 156 principles 10:1. Searching for "asking permission" returns 10 results about bash permission denied errors instead of the UX principle about when to prompt users before acting.
And a principle like "Importance Ordering & Attention Budget" says "vertical space is precious" — but when you're deciding tab order in a UI, you'd search "tab ordering" and miss it entirely. The principle and the situation are in different meaning-spaces.
"Validate inputs" is useless guidance. "Check that the API returns the transcript field before processing — it's sometimes null for private videos" is useful.
Adaptation means bridging the abstraction gap: taking a general rule and generating the specific implication for this task, this codebase, this API. Neither retrieval nor raw principle text can do this alone — it requires interpretation.
This report covers how we solve both: a retrieval layer that finds the right principles despite vocabulary mismatch and instance flooding, and an adaptation layer that transforms those principles into concrete, task-specific guidance.
Principles are general rules: "Fail Loud, Never Fake", "No Silent Failures", "Edit, Don't Replace — Content Is Sacred." High-value, distilled from experience, each backed by multiple instances.
Instances are specific observations: "Python '202212' == 202212 is False — no error, no warning, causing 100% silent lookup failure." Useful as evidence, but there are 10× more of them, and many are error patterns (bash errors, import failures, etc.).
Applications record when a principle was used and what happened — the data that closes the feedback loop.
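The three record types above can be sketched as a minimal data model (field names are illustrative, not the actual schema):

```python
from dataclasses import dataclass

@dataclass
class Principle:
    """A general rule distilled from many observed instances."""
    id: str                      # e.g. "development/no-silent-failures"
    name: str
    text: str
    instance_count: int = 0
    application_count: int = 0   # feeds value re-ranking at search time

@dataclass
class Instance:
    """A specific observation serving as evidence for a principle."""
    principle_id: str
    observation: str

@dataclass
class Application:
    """A record of a principle being used — closes the feedback loop."""
    principle_id: str
    session_id: str
    outcome: str                 # "success" | "partial" | "failure"
    notes: str = ""

p = Principle(id="development/no-silent-failures",
              name="No Silent Failures",
              text="If something fails, make it visible.")
```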
The first challenge: surface the right principles from a mixed corpus of 156 principles and 1,336 instances. Three search modes, each with different strengths:
| Mode | How it works | Good at | Bad at |
|---|---|---|---|
| FTS (full-text search — SQLite's FTS5 engine) | Keyword match, ranked by BM25 (scores by term frequency vs. corpus rarity; rare terms rank higher) | Exact terms like "blast radius" | Conceptual queries — "keep code simple" won't find "Less Is More" |
| Semantic | Text embedded as a 3,072-dimensional vector (Gemini text-embedding-004) that captures meaning, compared by cosine similarity (1.0 = identical meaning, 0.0 = unrelated) | Conceptual matches, paraphrases | Instances drown out principles |
| Hybrid | FTS + semantic merged (70/30 weights) | Best of both | More complex |
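A sketch of how the hybrid merge might work: each ranker's scores are min-max normalized (BM25 and cosine live on different scales) before a weighted sum. Assigning the larger weight to the semantic side is an assumption about which way the report's "70/30 weights" run.

```python
def hybrid_merge(fts_scores: dict, semantic_scores: dict,
                 w_semantic: float = 0.7, w_fts: float = 0.3) -> list:
    """Merge BM25 (FTS) and cosine (semantic) rankings into one score list."""
    def normalize(scores: dict) -> dict:
        # Min-max normalize to [0, 1] so the two scales are comparable.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    fts_n = normalize(fts_scores)
    sem_n = normalize(semantic_scores)
    docs = set(fts_n) | set(sem_n)          # union: a doc may appear in one list only
    merged = {d: w_semantic * sem_n.get(d, 0.0) + w_fts * fts_n.get(d, 0.0)
              for d in docs}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

A doc found by both rankers ranks above one found by only the weaker signal, which is the point of the merge.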
Five improvements over a raw-semantic baseline, each measured independently via an 84-query eval harness:
| # | Improvement | What it does | P@1 | MRR |
|---|---|---|---|---|
| 0 | Baseline | Raw semantic search, no adjustments | 0.738 | 0.813 |
| 1 | Principle boost | +0.06 cosine similarity for principles (compensates for the 10:1 instance ratio) | 0.857 | 0.908 |
| 2 | Threshold filter | Drop results scoring below 0.65 | — | — |
| 3 | Metadata filters | Scope by project or type | — | — |
| 4 | Hybrid merge | FTS + semantic combined | 0.845 | 0.907 |
| 5 | Value re-ranking | Boost from application_count — battle-tested principles rank higher | 0.905 | 0.931 |
The hardest queries improved the most. Cross-domain P@1 (precision at 1 — is the top result correct? — the most important retrieval metric) went from 0.60 → 0.90 and negation from 0.39 → 0.77. Value re-ranking — which uses application counts from the feedback loop — delivered the single largest improvement.
You describe a bad practice — can the system find the principle that forbids it?
$ learn find -s "catching exceptions silently and continuing"
[0.784] P development/no-silent-failures
No Silent Failures — If something fails, make it visible.
[0.763] P batch-jobs/errors-are-data-not-exceptions
Errors Are Data, Not Exceptions — At scale, errors will happen...
[0.757] P observability/error-logging
Error Logging — All errors must be captured where the system can see them.
The query describes an anti-pattern. The system finds principles that forbid it — even though they share no keywords with the query.
Retrieval gives you a list of principles. Adaptation gives you a checklist you can act on. Three mechanisms operate at different time scales:
All 156 principles are materialized as plain markdown files in ~/.claude/principles/ — pre-computed and regenerated from the database, which remains the source of truth; the markdown files are a readable projection for the agent. They're organized into 10 category files (dev, testing, batch-jobs, observability, etc.) loaded into every Claude Code session's context. At ~140 characters each, that's about 6K tokens. It fits.
Each principle includes its rationale, anti-pattern, and evidence count:
## No Silent Failures
If something fails, make it visible. Never swallow exceptions,
return empty data on error, or let a process silently produce
no output. The cost of a noisy failure is one interruption.
The cost of a silent failure is hours of debugging the wrong thing.
**Anti-pattern:** Bare except/pass, returning [] on error,
catching and logging but continuing to process corrupted data.
<!-- evidence: 16 instances, 121 applications, as of 2026-02-22 -->
This is the cheapest form of adaptation: the agent has the principles in context and can reason about them naturally. It works for principles the agent encounters frequently. But with 156 principles in the background, any individual one is easy to overlook when it matters most.
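The materialization step can be sketched as a pure render function producing the markdown shape shown above (the input field names are assumptions, not the real schema):

```python
def render_principle(p: dict, as_of: str) -> str:
    """Render one principle as the markdown block loaded into context.

    Mirrors the "No Silent Failures" example: heading, rule text,
    anti-pattern, and an evidence-count comment.
    """
    return (
        f"## {p['name']}\n\n"
        f"{p['text']}\n\n"
        f"**Anti-pattern:** {p['anti_pattern']}\n\n"
        f"<!-- evidence: {p['instance_count']} instances, "
        f"{p['application_count']} applications, as of {as_of} -->\n"
    )
```

A regeneration job would run this over every principle row and rewrite the 10 category files, keeping the database as the source of truth.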
This is where retrieval becomes adaptation. Before starting a non-trivial task, the agent (or a hook) runs the apply-learnings procedure:
It extracts a few search facets from the task, then runs `learn find -s "facet" --xml --limit 5` for each facet. The XML output includes principle text, rationale, anti-pattern, and similarity score — structured for LLM consumption.
The synthesis step is the adaptation. Here's what it looks like in practice:
<principle id="dev/fail-loud-never-fake"
score="0.84">
Fail Loud, Never Fake — Never
simplify a problem by substituting
mock data, placeholder implementations,
or lossy summaries.
</principle>
<principle id="batch-jobs/validate-before-bulk"
score="0.81">
Validate Before Bulk — Before
processing 300+ items, run a sample
of 5-10 and verify the output.
</principle>
<principle id="observability/error-logging"
score="0.79">
Error Logging — All errors must
be captured where the system can
see them.
</principle>
Abstract. Correct but not actionable for the specific task at hand.
## Pre-Flight Checklist
Task: Fetch YouTube transcripts
for 300 company earnings calls
Relevant principles:
- Validate Before Bulk — Fetch 5
transcripts first. Check: does the
API return the `transcript` field?
It's sometimes null for private
videos or those with auto-captions
disabled.
- Fail Loud, Never Fake — If a
transcript fetch returns empty text,
raise an error. Don't store an empty
string as if it were a real transcript.
- Error Logging — Log each failed
video_id to a separate errors file.
After the batch, report: "247/300
succeeded, 53 failed (see errors.jsonl)"
Watch out for:
- YouTube API rate limits kick in
around item 100. Use exponential
backoff (tenacity, not hand-rolled).
- The `transcript` field is null, not
absent, when captions are disabled.
Check `if transcript is None`, not
`if "transcript" not in response`.
Concrete. Every bullet is specific to this task, this API, this failure mode.
The key rule: be specific, not generic. "Validate inputs" is useless. "Check that the API returns the transcript field before processing — it's sometimes null for private videos" is useful. The specificity comes from combining the abstract principle with the agent's knowledge of the current task and codebase.
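The null-vs-absent distinction from the checklist is easy to demonstrate directly. A small sketch (the payload shape is hypothetical) that also follows the fail-loud rule:

```python
def get_transcript(response: dict) -> str:
    """Return the transcript or fail loud — never store an empty/None value."""
    transcript = response.get("transcript")
    if not transcript:  # catches both explicit null and missing key
        raise ValueError(f"no transcript for {response.get('video_id')}")
    return transcript

# The key EXISTS but its value is null — a membership check passes
# while the value is unusable:
response = {"video_id": "abc123", "transcript": None}
assert "transcript" in response          # membership check: looks fine
assert response["transcript"] is None    # value check: actually unusable
```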
The synthesis step is driven by explicit instructions. Here's the core of the procedure:
# From the apply-learnings skill:
1. Extract 2-4 search facets from the task
(e.g., "batch processing", "error handling at scale", "data quality")
2. Run parallel semantic searches:
$ learn find -s "facet" --xml --limit 5
3. Synthesize into a Pre-Flight Checklist:
For each retrieved principle, produce:
**[Principle Name]** — [1-sentence implication for THIS task]
RULES:
- Be specific, not generic. "Validate inputs" is useless.
"Check that the API returns the transcript field before
processing — it's sometimes null for private videos" is useful.
- Only include learnings that actually apply. Score > 0.7 is a
threshold, but use judgment.
- Keep it short. 3-7 bullet points total. Checklist, not lecture.
- Skip if nothing applies. Don't force-fit principles.
The prompt doesn't try to be clever. It tells the LLM to be specific, to skip irrelevant results, and to keep it short. The quality comes from combining good retrieval (the right principles are in the candidate set) with the LLM's ability to reason about what an abstract rule means in a concrete context.
A background hook runs on every non-trivial user prompt. It sends the prompt plus a compact catalog of all 156 principles to a fast LLM (Haiku), which returns 0–3 principle IDs that genuinely apply to this specific task:
# principles_worker.py (spawned as subprocess on each prompt)
SYSTEM_PROMPT = """
You detect which coding principles are relevant to a user's current task.
Given a user prompt and a list of principles, return the IDs of principles
that SHOULD guide the work. Be selective: typically 0-3 per prompt.
RULES:
- Only include principles that genuinely apply to THIS specific task
- Don't include generic principles just because they're always true
- A principle about testing is only relevant if the user is writing/running tests
"""
This serves two purposes: it builds the application-tracking data that feeds back into value-boosted re-ranking (principles detected more often in real work rank higher in future searches), and it provides a signal for when a principle is relevant even though nobody searched for it.
Currently, the detection results are recorded as telemetry — they populate the principle_applications table but aren't surfaced back into the agent's active context. The next step (see What's Next) is real-time injection: surfacing the detected principles directly into the session.
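The worker's post-processing step — extracting at most 3 principle IDs from the fast model's free-text reply and validating them against the catalog — might look like this (a sketch, not the actual worker code; the `category/slug` ID shape is taken from the examples above):

```python
import re

def parse_detected_ids(model_output: str, catalog: set, limit: int = 3) -> list:
    """Extract and validate principle IDs from the detector model's reply.

    Only IDs present in the known catalog survive; order of first
    appearance is preserved, duplicates dropped, capped at `limit`.
    """
    candidates = re.findall(r"[a-z][a-z0-9-]*/[a-z][a-z0-9-]*", model_output)
    seen, valid = set(), []
    for cid in candidates:
        if cid in catalog and cid not in seen:
            seen.add(cid)
            valid.append(cid)
    return valid[:limit]
```

Validating against the catalog matters because a small model will occasionally hallucinate a plausible-looking ID.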
| Layer | When | Adaptation depth | Cost |
|---|---|---|---|
| Ambient | Always in context | None — raw principle text | ~6K tokens, amortized |
| On-demand synthesis | Before non-trivial tasks | Full — task-specific checklist | ~2K in + ~500 out tokens, ~$0.01, ~3s latency |
| Auto-detection | Every prompt (background) | Classification only (relevant/not) | ~$0.001/prompt (Haiku) |
Ambient injection provides a foundation. On-demand synthesis kicks in for important tasks where generic principles need to be translated into specific actions. Auto-detection builds the data that makes both of the other layers better over time.
The system gets better by being used. Three mechanisms feed outcomes back into future retrieval and adaptation:
After the agent follows a principle, the outcome gets recorded:
$ learn apply dev/fail-loud-never-fake \
--outcome success \
--notes "caught the empty transcript issue before storing 300 bad records" \
--project rivus
Each application record captures: which principle, which session, what outcome (success/partial/failure), what error was prevented, estimated time saved. The principle_applications table now holds 1,732 records across both manual entries and automated detection.
Application counts feed directly back into search. The value boost formula:
boost = min(0.03, log(1 + application_count) × 0.005
+ log(1 + instance_count) × 0.003)
A principle like "Verify Action" with 121 applications gets +0.028 cosine score — enough to break ties when two principles have similar semantic scores. A brand-new principle with zero applications gets +0.000. Battle-tested principles rise; untested ones must earn their rank.
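The boost formula above, as a direct sketch (natural log is an assumption — the report doesn't specify the base):

```python
import math

def value_boost(application_count: int, instance_count: int) -> float:
    """Value re-ranking boost added to a result's cosine score.

    Logarithmic, so early applications count most; capped at 0.03 so
    popularity can break ties but never overturn a clear semantic win.
    """
    raw = (math.log(1 + application_count) * 0.005
           + math.log(1 + instance_count) * 0.003)
    return min(0.03, raw)
```

A never-applied principle gets exactly 0.0; heavily-used principles saturate at the 0.03 cap.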
This is the single most effective retrieval improvement: value re-ranking pushed overall P@1 from 0.845 to 0.905. The feedback loop makes retrieval better, which makes adaptation more likely to surface the right principles, which generates more application data, which improves retrieval further.
A separate process goes back through completed session transcripts and scores principle compliance per episode. For each significant episode (file edit, batch operation, API integration), an LLM judge with access to all principles asks: Was this principle followed? Violated? Could it have helped?
This produces outcome-aware application records even for sessions that didn't explicitly use the apply-learnings workflow. It's the mechanism that populated the bulk of the 1,732 application records — retroactively discovering which principles were implicitly relevant, whether the agent followed them, and what happened.
learning/search_eval.py contains 84 hand-crafted queries across 4 difficulty categories:
| Category | n | Baseline P@1 | Final P@1 | Delta |
|---|---|---|---|---|
| Exact (keyword match) | 16 | 0.938 | 1.000 | +0.062 |
| Conceptual (paraphrases) | 45 | 0.800 | 0.911 | +0.111 |
| Cross-domain | 10 | 0.600 | 0.900 | +0.300 |
| Negation (anti-patterns) | 13 | 0.385 | 0.769 | +0.384 |
| Overall | 84 | 0.738 | 0.905 | +0.167 |
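Both headline metrics are standard IR measures; a minimal sketch of how a harness computes them (not the actual `search_eval.py` code):

```python
def precision_at_1(results: list, expected: list) -> float:
    """Fraction of queries whose top-ranked doc is the expected one."""
    hits = sum(1 for ranked, gold in zip(results, expected)
               if ranked and ranked[0] == gold)
    return hits / len(expected)

def mean_reciprocal_rank(results: list, expected: list) -> float:
    """Average of 1/rank of the expected doc (0 when it's absent)."""
    total = 0.0
    for ranked, gold in zip(results, expected):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(expected)
```

MRR rewards near-misses (the right doc at rank 2 scores 0.5), which is why it sits above P@1 in every row of the improvement table.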
With P@1 at 0.905, roughly 1 in 10 queries returns the wrong top result. The failures cluster in the negation category (P@1 = 0.769 — nearly 1 in 4 negation queries misses). Two illustrative cases:
Expected: dev/trace-the-chain-to-an-action — "Every signal should lead to an action. A notification that nobody sees, a log that nobody reads, a metric that nobody checks — these are dead ends."
Got instead: observability/error-logging — "All errors must be captured where the system can see them."
Why it fails: The query mentions "logged" and "file" — semantically close to error logging. But the point of the query is that logging isn't enough. The system needs to understand that the query is about the insufficiency of logging, not about logging itself. This requires normative reasoning that embedding similarity can't perform.
Expected: dev/decompose-into-orthogonal-axes — "When a function grows flags, it's usually 2+ independent concerns tangled together. Separate them."
Got instead: Instances about CLI argument parsing and feature flag libraries — the embedding space clusters "flags" and "config options" near literal configuration management, not the architectural anti-pattern of accumulating complexity.
Why it fails: The abstraction gap. The query describes a symptom (adding flags); the principle addresses the root cause (tangled concerns). Bridging symptom to root cause requires interpretation, not similarity.
Both failures share the same pattern: the query describes a situation, and the correct principle operates at a different level of abstraction. This is exactly the gap an LLM re-ranker (see What's Next) would address — it can reason about "this query is about the insufficiency of X" rather than matching on X itself.
Retrieval is measurable with standard IR metrics. Adaptation quality is harder — it's a judgment about whether synthesized guidance is specific, correct, and actionable for the task at hand. We measure it indirectly through the feedback loop:
| Signal | What it measures | Current state |
|---|---|---|
| Application outcomes | Was the adapted principle actually useful? | 1,732 tracked applications |
| Success rate | Fraction of applications rated "success" | Per-principle, feeds into value boost |
| Retroactive compliance | Were principles followed even when not explicitly surfaced? | Episode-level scoring across past sessions |
| Prevented errors | Did the adapted guidance catch something before it happened? | Tracked in outcome_notes |
The honest gap: we don't yet have a controlled A/B measurement of "agent with adaptation" vs "agent without." The sandbox replay system (described in the companion report) can do this for individual principles, but hasn't been run on the full retrieve-adapt-apply pipeline end-to-end.
$ python -m learning.search_eval # run all modes, print table
$ python -m learning.search_eval --mode fts # FTS only
$ python -m learning.search_eval --verbose # per-query details
$ python -m learning.search_eval --report # HTML report to static
Report generated February 2026. Data from learning/data/learning.db. Eval harness: learning/search_eval.py (84 queries). Full detailed eval report: search_eval_report.html. Companion report: Accumulating Wisdom.