Learning from Experience, Applying the Lessons

How an AI coding agent retrieves the right lessons from experience and adapts them to the task at hand

Contents
  1. The Problem: Principles Don't Apply Themselves
  2. System Overview
  3. Retrieval: Finding the Right Principles
  4. Adaptation: From Abstract to Actionable
  5. The Feedback Loop
  6. Measuring Quality
  7. What's Next
  8. Glossary

1. The Problem: Principles Don't Apply Themselves

A companion report describes how we extract principles from an AI coding agent's own work — mining session transcripts for error→fix pairs, abstracting them into general rules, and testing those rules in sandboxed replays. After 217 sessions, the system has distilled 156 principles backed by 1,336 specific instances.

But having principles isn't the same as using them well. Consider:

"Minimize Blast Radius"

This principle says: Don't reach for destructive tools when a scoped or reversible alternative exists.

Excellent advice. But when the agent is about to run a batch delete on 300 stale cache files, what does it actually mean here? It might mean "use trash instead of rm." Or "delete one file first to verify the glob pattern." Or "wrap the operation in a transaction." The principle is the what; the agent still needs to figure out the how for this particular situation.

This is the gap between retrieval (finding the right principles) and adaptation (making them actionable for the current task). Both are hard, for different reasons:

Retrieval problem

1,336 instances outnumber 156 principles 10:1. Searching for "asking permission" returns 10 results about bash permission denied errors instead of the UX principle about when to prompt users before acting.

And a principle like "Importance Ordering & Attention Budget" says "vertical space is precious" — but when you're deciding tab order in a UI, you'd search "tab ordering" and miss it entirely. The principle and the situation are in different meaning-spaces.

Adaptation problem

"Validate inputs" is useless guidance. "Check that the API returns the transcript field before processing — it's sometimes null for private videos" is useful.

Adaptation means bridging the abstraction gap: taking a general rule and generating the specific implication for this task, this codebase, this API. Neither retrieval nor raw principle text can do this alone — it requires interpretation.

This report covers how we solve both: a retrieval layer that finds the right principles despite vocabulary mismatch and instance flooding, and an adaptation layer that transforms those principles into concrete, task-specific guidance.

The full loop:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   RETRIEVE   │─────▶│    ADAPT     │─────▶│    APPLY     │
│   find the   │      │   make it    │      │  agent uses  │
│  right ones  │      │   concrete   │      │  the advice  │
└──────┬───────┘      └──────────────┘      └──────┬───────┘
       ▲                                           │
       │              ┌──────────────┐             │
       └──────────────│   FEEDBACK   │◀────────────┘
                      │ did it help? │
                      └──────────────┘

2. System Overview

The Knowledge Base

156 principles · 1,336 instances · 1,732 applications · 84 eval queries

Principles are general rules: "Fail Loud, Never Fake", "No Silent Failures", "Edit, Don't Replace — Content Is Sacred." High-value, distilled from experience, each backed by multiple instances.

Instances are specific observations: "Python '202212' == 202212 is False — no error, no warning, causing 100% silent lookup failure." Useful as evidence, but there are 10× more of them, and many are error patterns (bash errors, import failures, etc.).

Applications record when a principle was used and what happened — the data that closes the feedback loop.

Three Layers

User query or task context
               │
               ▼
╔═════════════════════════════════╗
║ 1. RETRIEVAL                    ║
║ Hybrid search: FTS + semantic   ║
║ + principle boost + value rank  ║
╚══════════════╤══════════════════╝
               │ top-N principles + instances
               ▼
╔═════════════════════════════════╗
║ 2. ADAPTATION                   ║
║ LLM synthesizes principles      ║
║ into task-specific guidance     ║
╚══════════════╤══════════════════╝
               │ concrete checklist
               ▼
╔═════════════════════════════════╗
║ 3. FEEDBACK                     ║
║ Track which principles helped   ║
║ Feed outcomes back into search  ║
╚═════════════════════════════════╝

3. Retrieval: Finding the Right Principles

The first challenge: surface the right principles from a mixed corpus of 156 principles and 1,336 instances. Three search modes, each with different strengths:

Mode | How it works | Good at | Bad at
-----|--------------|---------|-------
FTS | SQLite's FTS5 engine: finds documents containing the exact words you typed, ranked by BM25 | Exact terms like "blast radius" | Conceptual queries — "keep code simple" won't find "Less Is More"
Semantic | Converts text into a 3,072-dimensional vector (Gemini embedding model) that captures meaning, compared by cosine similarity — "keep code simple" and "less is more" end up near each other in vector space | Conceptual matches, paraphrases | Instances drown out principles
Hybrid | FTS + semantic merged (70/30 weights; see the sketch after the table) | Best of both | More complex
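To make the hybrid row concrete, here is a minimal sketch of the merge step, assuming both result lists are already normalized to [0, 1]. The function, field names, and which side of the 70/30 split goes to semantic search are assumptions, not the actual implementation:

# Merge FTS (BM25) and semantic (cosine) results into one ranked list.
# Assumes scores are pre-normalized to [0, 1]; giving the 70 weight to
# semantic search is an assumption — the report only states a 70/30 split.
SEMANTIC_WEIGHT, FTS_WEIGHT = 0.7, 0.3

def hybrid_merge(fts_results: dict[str, float],
                 semantic_results: dict[str, float],
                 limit: int = 10) -> list[tuple[str, float]]:
    merged: dict[str, float] = {}
    for doc_id, score in fts_results.items():
        merged[doc_id] = merged.get(doc_id, 0.0) + FTS_WEIGHT * score
    for doc_id, score in semantic_results.items():
        merged[doc_id] = merged.get(doc_id, 0.0) + SEMANTIC_WEIGHT * score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:limit]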

Key improvements over baseline

Six improvements, each measured independently via an 84-query eval harness:

# | Improvement | What it does | P@1 | MRR
--|-------------|--------------|-----|----
0 | Baseline | Raw semantic search, no adjustments | 0.738 | 0.813
1 | Principle boost | +0.06 cosine similarity for principles (compensates for the 10:1 instance ratio); see the sketch after the table | 0.857 | 0.908
2 | Threshold filter | Drop results scoring below 0.65 | | 
3 | Metadata filters | Scope by project or type | | 
4 | Hybrid merge | FTS + semantic combined | 0.845 | 0.907
5 | Value re-ranking | Boost from application_count — battle-tested principles rank higher | 0.905 | 0.931
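Improvements 1 and 2 are simple post-processing on the raw scores. A minimal sketch, with the result fields as assumptions (the value boost from row 5 is covered in Section 5):

# Boost principles to offset the 10:1 instance ratio, then drop weak matches.
PRINCIPLE_BOOST = 0.06
SCORE_THRESHOLD = 0.65

def adjust(results: list[dict]) -> list[dict]:
    # Each result is assumed to look like
    # {"id": ..., "type": "principle" | "instance", "score": float}.
    adjusted = []
    for r in results:
        score = r["score"] + (PRINCIPLE_BOOST if r["type"] == "principle" else 0.0)
        if score >= SCORE_THRESHOLD:
            adjusted.append({**r, "score": score})
    return sorted(adjusted, key=lambda r: r["score"], reverse=True)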

The hardest queries improved the most. Cross-domain P@1 went from 0.60 → 0.90 and negation from 0.39 → 0.77. Value re-ranking — which uses application counts from the feedback loop — delivered the single largest improvement.

Retrieval is necessary but not sufficient. Even at P@1 = 0.905, the top result is an abstract principle. Telling the agent "No Silent Failures" when it's about to write a batch job is like telling a junior developer "be careful." The advice is correct but not actionable. The next layer — adaptation — is where abstract becomes concrete.

Example: anti-pattern retrieval

You describe a bad practice — can the system find the principle that forbids it?

Query: "catching exceptions silently and continuing"
$ learn find -s "catching exceptions silently and continuing"

  [0.784] P  development/no-silent-failures
          No Silent Failures — If something fails, make it visible.

  [0.763] P  batch-jobs/errors-are-data-not-exceptions
          Errors Are Data, Not Exceptions — At scale, errors will happen...

  [0.757] P  observability/error-logging
          Error Logging — All errors must be captured where the system can see them.

The query describes an anti-pattern. The system finds principles that forbid it — even though they share no keywords with the query.

4. Adaptation: From Abstract to Actionable

Retrieval gives you a list of principles. Adaptation gives you a checklist you can act on. Three mechanisms operate at different time scales:

4.1 Ambient Injection (always on)

All 156 principles are materialized as markdown files in ~/.claude/principles/ — 10 category files (dev, testing, batch-jobs, observability, etc.) loaded into every Claude Code session's context. The database remains the source of truth; the markdown files are a readable projection regenerated from it. At ~140 characters per principle, that's about 6K tokens. It fits.

Each principle includes its rationale, anti-pattern, and evidence count:

## No Silent Failures

If something fails, make it visible. Never swallow exceptions,
return empty data on error, or let a process silently produce
no output. The cost of a noisy failure is one interruption.
The cost of a silent failure is hours of debugging the wrong thing.

**Anti-pattern:** Bare except/pass, returning [] on error,
catching and logging but continuing to process corrupted data.

<!-- evidence: 16 instances, 121 applications, as of 2026-02-22 -->

This is the cheapest form of adaptation: the agent has the principles in context and can reason about them naturally. It works for principles the agent encounters frequently. But with 156 principles in the background, any individual one is easy to overlook when it matters most.
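The materialization step itself is mechanical. A minimal sketch of regenerating the category files from the database — table and column names here are assumptions, not the actual schema:

# Regenerate ~/.claude/principles/<category>.md from the database.
# The database stays the source of truth; these files are a readable projection.
import sqlite3
from collections import defaultdict
from pathlib import Path

def materialize(db_path: str,
                out_dir: Path = Path.home() / ".claude" / "principles") -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = sqlite3.connect(db_path).execute(
        "SELECT category, name, statement, anti_pattern FROM principles"  # assumed schema
    ).fetchall()
    by_category = defaultdict(list)
    for category, name, statement, anti_pattern in rows:
        by_category[category].append(
            f"## {name}\n\n{statement}\n\n**Anti-pattern:** {anti_pattern}\n"
        )
    for category, sections in by_category.items():
        (out_dir / f"{category}.md").write_text("\n".join(sections))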

4.2 On-Demand Synthesis (the key mechanism)

This is where retrieval becomes adaptation. Before starting a non-trivial task, the agent (or a hook) runs the apply-learnings procedure:

1. Extract search facets. From the task description, identify 2–4 dimensions that might have relevant principles. Building a batch pipeline? Facets: "batch processing", "error handling at scale", "data quality."

2. Parallel semantic search. Run learn find -s "facet" --xml --limit 5 for each facet. The XML output includes principle text, rationale, anti-pattern, and similarity score — structured for LLM consumption.

3. LLM synthesis. The agent reads the combined results and produces a Pre-Flight Checklist — each principle translated into a specific implication for this task.

The synthesis step is the adaptation. Here's what it looks like in practice:

Raw retrieval output
<principle id="dev/fail-loud-never-fake"
           score="0.84">
  Fail Loud, Never Fake — Never
  simplify a problem by substituting
  mock data, placeholder implementations,
  or lossy summaries.
</principle>

<principle id="batch-jobs/validate-before-bulk"
           score="0.81">
  Validate Before Bulk — Before
  processing 300+ items, run a sample
  of 5-10 and verify the output.
</principle>

<principle id="observability/error-logging"
           score="0.79">
  Error Logging — All errors must
  be captured where the system can
  see them.
</principle>

Abstract. Correct but not actionable for the specific task at hand.

After adaptation
## Pre-Flight Checklist

Task: Fetch YouTube transcripts
for 300 company earnings calls

Relevant principles:
- Validate Before Bulk — Fetch 5
  transcripts first. Check: does the
  API return the `transcript` field?
  It's sometimes null for private
  videos or those with auto-captions
  disabled.

- Fail Loud, Never Fake — If a
  transcript fetch returns empty text,
  raise an error. Don't store an empty
  string as if it were a real transcript.

- Error Logging — Log each failed
  video_id to a separate errors file.
  After the batch, report: "247/300
  succeeded, 53 failed (see errors.jsonl)"

Watch out for:
- YouTube API rate limits kick in
  around item 100. Use exponential
  backoff (tenacity, not hand-rolled).
- The `transcript` field is null, not
  absent, when captions are disabled.
  Check `if transcript is None`, not
  `if "transcript" not in response`.

Concrete. Every bullet is specific to this task, this API, this failure mode.

The key rule: be specific, not generic. "Validate inputs" is useless. "Check that the API returns the transcript field before processing — it's sometimes null for private videos" is useful. The specificity comes from combining the abstract principle with the agent's knowledge of the current task and codebase.

The synthesis step is driven by explicit instructions. Here's the core of the procedure:

# From the apply-learnings skill:

1. Extract 2-4 search facets from the task
   (e.g., "batch processing", "error handling at scale", "data quality")

2. Run parallel semantic searches:
   $ learn find -s "facet" --xml --limit 5

3. Synthesize into a Pre-Flight Checklist:

   For each retrieved principle, produce:
   **[Principle Name]** — [1-sentence implication for THIS task]

   RULES:
   - Be specific, not generic. "Validate inputs" is useless.
     "Check that the API returns the transcript field before
     processing — it's sometimes null for private videos" is useful.
   - Only include learnings that actually apply. Score > 0.7 is a
     threshold, but use judgment.
   - Keep it short. 3-7 bullet points total. Checklist, not lecture.
   - Skip if nothing applies. Don't force-fit principles.

The prompt doesn't try to be clever. It tells the LLM to be specific, to skip irrelevant results, and to keep it short. The quality comes from combining good retrieval (the right principles are in the candidate set) with the LLM's ability to reason about what an abstract rule means in a concrete context.

The LLM is the adapter. Retrieval finds the right principles; the LLM bridges the abstraction gap between a general rule and what it means for this task. Neither keyword search nor vector similarity can do this — it requires the kind of interpretation that recognizes "tab ordering" as a specific case of "importance ordering." The adaptation step is where the real value is created.
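For illustration, a sketch of how step 2 might be orchestrated — one learn find call per facet, run in parallel, with the XML concatenated for the synthesis prompt. The helper names are illustrative; the CLI flags are the ones shown above:

# Run one `learn find` per facet in parallel and collect the XML candidates.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def search_facet(facet: str) -> str:
    result = subprocess.run(
        ["learn", "find", "-s", facet, "--xml", "--limit", "5"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def gather_candidates(facets: list[str]) -> str:
    with ThreadPoolExecutor(max_workers=len(facets)) as pool:
        return "\n".join(pool.map(search_facet, facets))

# The combined XML plus the task description then goes to the LLM,
# which writes the Pre-Flight Checklist following the rules above.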

4.3 Automatic Detection (background, every prompt)

A background hook runs on every non-trivial user prompt. It sends the prompt plus a compact catalog of all 156 principles to a fast LLM (Haiku), which returns 0–3 principle IDs that genuinely apply to this specific task:

# principles_worker.py (spawned as subprocess on each prompt)
SYSTEM_PROMPT = """
You detect which coding principles are relevant to a user's current task.
Given a user prompt and a list of principles, return the IDs of principles
that SHOULD guide the work. Be selective: typically 0-3 per prompt.

RULES:
- Only include principles that genuinely apply to THIS specific task
- Don't include generic principles just because they're always true
- A principle about testing is only relevant if the user is writing/running tests
"""

This serves two purposes: it builds the application-tracking data that feeds back into value-boosted re-ranking (principles detected more often in real work rank higher in future searches), and it provides a signal for when a principle is relevant even though nobody searched for it.

Currently, the detection results are recorded as telemetry — they populate the principle_applications table but aren't surfaced back into the agent's active context. The next step (see What's Next) is real-time injection: surfacing the detected principles directly into the session.

How the three layers interact

Layer | When | Adaptation depth | Cost
------|------|------------------|-----
Ambient | Always in context | None — raw principle text | ~6K tokens, amortized
On-demand synthesis | Before non-trivial tasks | Full — task-specific checklist | ~2K in + ~500 out tokens, ~$0.01, ~3s latency
Auto-detection | Every prompt (background) | Classification only (relevant/not) | ~$0.001/prompt (Haiku)

Ambient injection provides a foundation. On-demand synthesis kicks in for important tasks where generic principles need to be translated into specific actions. Auto-detection builds the data that makes both of the other layers better over time.

5. The Feedback Loop

The system gets better by being used. Three mechanisms feed outcomes back into future retrieval and adaptation:

5.1 Application Tracking

After the agent follows a principle, the outcome gets recorded:

$ learn apply dev/fail-loud-never-fake \
    --outcome success \
    --notes "caught the empty transcript issue before storing 300 bad records" \
    --project rivus

Each application record captures: which principle, which session, what outcome (success/partial/failure), what error was prevented, estimated time saved. The principle_applications table now holds 1,732 records across both manual entries and automated detection.
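For reference, a plausible shape for one application record — field names here are assumptions inferred from the list above, not the actual schema:

from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class PrincipleApplication:
    principle_id: str                                  # e.g. "dev/fail-loud-never-fake"
    session_id: str
    outcome: Literal["success", "partial", "failure"]
    outcome_notes: Optional[str] = None                # e.g. what error was prevented
    estimated_minutes_saved: Optional[int] = None
    project: Optional[str] = None                      # e.g. "rivus"
    source: Literal["manual", "auto_detected", "retroactive"] = "manual"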

5.2 Value-Boosted Re-Ranking

Application counts feed directly back into search. The value boost formula:

boost = min(0.03, log(1 + application_count) × 0.005
                 + log(1 + instance_count) × 0.003)

A principle like "Verify Action" with 121 applications gets +0.028 cosine score — enough to break ties when two principles have similar semantic scores. A brand-new principle with zero applications gets +0.000. Battle-tested principles rise; untested ones must earn their rank.
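The same formula as runnable code; the log base is an assumption (natural log here):

import math

def value_boost(application_count: int, instance_count: int) -> float:
    # Tie-breaking boost added to a result's similarity score, capped at +0.03
    # so heavy usage can break ties but never override a clear semantic difference.
    return min(
        0.03,
        math.log(1 + application_count) * 0.005
        + math.log(1 + instance_count) * 0.003,
    )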

This is the single most effective retrieval improvement: value re-ranking pushed overall P@1 from 0.845 to 0.905. The feedback loop makes retrieval better, which makes adaptation more likely to surface the right principles, which generates more application data, which improves retrieval further.

5.3 Retroactive Study

A separate process goes back through completed session transcripts and scores principle compliance per episode. For each significant episode (file edit, batch operation, API integration), an LLM judge with access to all principles asks: Was this principle followed? Violated? Could it have helped?

This produces outcome-aware application records even for sessions that didn't explicitly use the apply-learnings workflow. It's the mechanism that populated the bulk of the 1,732 application records — retroactively discovering which principles were implicitly relevant, whether the agent followed them, and what happened.
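A sketch of what one retroactive pass might look like — llm_judge is a hypothetical helper standing in for the LLM call; the verdict vocabulary mirrors the questions above:

# Score one completed session transcript against the principle catalog.
def score_session(episodes: list[str], principles: list[dict]) -> list[dict]:
    records = []
    for episode in episodes:                             # file edits, batch ops, API integrations
        for judgment in llm_judge(episode, principles):  # hypothetical LLM judge call
            records.append({
                "episode": episode,
                "principle_id": judgment["id"],
                "verdict": judgment["verdict"],          # followed / violated / could_have_helped
            })
    return records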

The compounding effect: More sessions → more retroactive analysis → better application counts → better value-boosted re-ranking → better retrieval → better adaptation → better outcomes → back to the beginning. Each cycle through the loop makes every subsequent cycle more effective.
Honest gap: we haven't measured the compounding over time. The system has data from 217 sessions, but we haven't tracked retrieval quality at different points in that history (e.g., P@1 at session 50, 100, 150, 200). It's possible the feedback loop saturated early — most of the value-boost data might have accumulated in the first 100 sessions. Or it might still be climbing. Either way, a learning curve chart would be the strongest possible evidence for or against the compounding claim. This is a near-term measurement priority.

6. Measuring Quality

Retrieval metrics

learning/search_eval.py contains 84 hand-crafted queries across 4 difficulty categories:

Category | n | Baseline P@1 | Final P@1 | Delta
---------|---|--------------|-----------|------
Exact (keyword match) | 16 | 0.938 | 1.000 | +0.062
Conceptual (paraphrases) | 45 | 0.800 | 0.911 | +0.111
Cross-domain | 10 | 0.600 | 0.900 | +0.300
Negation (anti-patterns) | 13 | 0.385 | 0.769 | +0.384
Overall | 84 | 0.738 | 0.905 | +0.167
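For a sense of the harness's shape, a minimal sketch of a query entry and the P@1/MRR computation — field names are assumptions; learning/search_eval.py is the real thing:

# One hand-crafted eval query: the text to search and the expected top hit.
EVAL_QUERIES = [
    {"query": "catching exceptions silently and continuing",
     "expected": "development/no-silent-failures",
     "category": "negation"},
    # ...83 more
]

def evaluate(search_fn) -> tuple[float, float]:
    # search_fn maps a query string to a ranked list of document IDs.
    hits_at_1, reciprocal_ranks = 0, []
    for q in EVAL_QUERIES:
        results = search_fn(q["query"])
        rank = results.index(q["expected"]) + 1 if q["expected"] in results else None
        hits_at_1 += 1 if rank == 1 else 0
        reciprocal_ranks.append(1 / rank if rank else 0.0)
    n = len(EVAL_QUERIES)
    return hits_at_1 / n, sum(reciprocal_ranks) / n    # (P@1, MRR)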

Where retrieval still fails

With P@1 at 0.905, roughly 1 in 10 queries returns the wrong top result. The failures cluster in the negation category (P@1 = 0.769 — nearly 1 in 4). Two illustrative cases:

Failure: "error was logged to a file but nobody reads that file"

Expected: dev/trace-the-chain-to-an-action — "Every signal should lead to an action. A notification that nobody sees, a log that nobody reads, a metric that nobody checks — these are dead ends."

Got instead: observability/error-logging — "All errors must be captured where the system can see them."

Why it fails: The query mentions "logged" and "file" — semantically close to error logging. But the point of the query is that logging isn't enough. The system needs to understand that the query is about the insufficiency of logging, not about logging itself. This requires normative reasoning that embedding similarity can't perform.

Failure: "adding more flags and config options to an already complex function"

Expected: dev/decompose-into-orthogonal-axes — "When a function grows flags, it's usually 2+ independent concerns tangled together. Separate them."

Got instead: Instances about CLI argument parsing and feature flag libraries — the embedding space clusters "flags" and "config options" near literal configuration management, not the architectural anti-pattern of accumulating complexity.

Why it fails: The abstraction gap. The query describes a symptom (adding flags); the principle addresses the root cause (tangled concerns). Bridging symptom to root cause requires interpretation, not similarity.

Both failures share the same pattern: the query describes a situation, and the correct principle operates at a different level of abstraction. This is exactly the gap an LLM re-ranker (see What's Next) would address — it can reason about "this query is about the insufficiency of X" rather than matching on X itself.

Adaptation quality

Retrieval is measurable with standard IR metrics. Adaptation quality is harder — it's a judgment about whether synthesized guidance is specific, correct, and actionable for the task at hand. We measure it indirectly through the feedback loop:

Signal | What it measures | Current state
-------|------------------|--------------
Application outcomes | Was the adapted principle actually useful? | 1,732 tracked applications
Success rate | Fraction of applications rated "success" | Per-principle, feeds into value boost
Retroactive compliance | Were principles followed even when not explicitly surfaced? | Episode-level scoring across past sessions
Prevented errors | Did the adapted guidance catch something before it happened? | Tracked in outcome_notes

The honest gap: we don't yet have a controlled A/B measurement of "agent with adaptation" vs "agent without." The sandbox replay system (described in the companion report) can do this for individual principles, but hasn't been run on the full retrieve-adapt-apply pipeline end-to-end.

Running the eval

$ python -m learning.search_eval                  # run all modes, print table
$ python -m learning.search_eval --mode fts       # FTS only
$ python -m learning.search_eval --verbose        # per-query details
$ python -m learning.search_eval --report         # HTML report to static

7. What's Next

Near-term

- Real-time injection: surface auto-detected principles (Section 4.3) into the agent's active context instead of recording them only as telemetry.
- Learning-curve measurement: compute retrieval quality at earlier points in the session history (e.g., P@1 at session 50, 100, 150, 200) to test the compounding claim (Section 5).

Medium-term

- An LLM re-ranker for queries where the correct principle sits at a different level of abstraction than the query (Section 6).
- A controlled A/B measurement of the full retrieve-adapt-apply pipeline using the sandbox replay system (Section 6).

8. Glossary

FTS (Full-Text Search)
SQLite's FTS5 engine — indexes text and finds documents containing specific words. Fast (1ms) but literal: it matches keywords, not meaning. Uses BM25 ranking.
BM25
Best Matching 25 — a ranking formula used in keyword search. Gives higher scores to documents where your search terms appear frequently but are rare across the whole corpus.
Semantic Search
Search by meaning rather than keywords. Text is converted to numerical vectors (embeddings) and compared by direction (cosine similarity). "keep code simple" finds "less is more" even though they share no words.
Embedding
A representation of text as a list of numbers (a vector). Texts with similar meaning have similar vectors. Our embeddings are 3,072-dimensional, generated by Google's Gemini text-embedding-004 model.
Cosine Similarity
Measures the angle between two vectors. 1.0 = identical direction (same meaning), 0.0 = perpendicular (unrelated). Used to rank semantic search results.
P@1 (Precision at 1)
Is the very first search result relevant? 1.0 means the top result is always correct. The most important retrieval metric.
R@5 (Recall at 5)
Of all the results that should have been found, what fraction appeared in the top 5?
MRR (Mean Reciprocal Rank)
Averages 1/(rank of first relevant result) across all queries. MRR = 1.0 means the right answer is always #1.
Materialize
Pre-compute and store in a different format. Here: regenerate markdown files from the database so the agent can read plain text. The database remains the source of truth.

Report generated February 2026. Data from learning/data/learning.db. Eval harness: learning/search_eval.py (84 queries). Full detailed eval report: search_eval_report.html. Companion report: Accumulating Wisdom.
