How an AI coding agent retrieves the right lessons from experience and adapts them to the task at hand
A companion report describes how we extract principles from an AI coding agent's own work — mining session transcripts for error→fix pairs, abstracting them into general rules, and testing those rules in sandboxed replays. After 217 sessions, the system has distilled 156 principles backed by 1,336 specific instances.
But having principles isn't the same as using them well. Consider:
This principle says: Don't reach for destructive tools when a scoped or reversible alternative exists.
Excellent advice. But when the agent is about to run a batch delete on 300 stale cache files, what does it actually mean here? It might mean "use trash instead of rm." Or "delete one file first to verify the glob pattern." Or "wrap the operation in a transaction." The principle is the what; the agent still needs to figure out the how for this particular situation.
This is the gap between retrieval (finding the right principles) and adaptation (making them actionable for the current task). Both are hard, for different reasons:
1,336 instances outnumber 156 principles 10:1. Searching for "asking permission" returns 10 results about bash permission denied errors instead of the UX principle about when to prompt users before acting.
And a principle like "Importance Ordering & Attention Budget" says "vertical space is precious" — but when you're deciding tab order in a UI, you'd search "tab ordering" and miss it entirely. The principle and the situation are in different meaning-spaces.
"Validate inputs" is useless guidance. "Check that the API returns the transcript field before processing — it's sometimes null for private videos" is useful.
Adaptation means bridging the abstraction gap: taking a general rule and generating the specific implication for this task, this codebase, this API. Neither retrieval nor raw principle text can do this alone — it requires interpretation.
This report covers how we solve both: a retrieval layer that finds the right principles despite vocabulary mismatch and instance flooding, and an adaptation layer that transforms those principles into concrete, task-specific guidance.
Principles are general rules: "Fail Loud, Never Fake", "No Silent Failures", "Edit, Don't Replace — Content Is Sacred." High-value, distilled from experience, each backed by multiple instances.
Instances are specific observations: "Python '202212' == 202212 is False — no error, no warning, causing 100% silent lookup failure." Useful as evidence, but there are 10× more of them, and many are error patterns (bash errors, import failures, etc.).
Applications record when a principle was used and what happened — the data that closes the feedback loop.
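The three record types above can be sketched as a minimal data model (field names are illustrative, not the actual schema):

```python
from dataclasses import dataclass

@dataclass
class Principle:
    """A general rule distilled from many observed instances."""
    id: str                      # e.g. "development/no-silent-failures"
    name: str
    text: str
    instance_count: int = 0
    application_count: int = 0   # feeds value re-ranking at search time

@dataclass
class Instance:
    """A specific observation serving as evidence for a principle."""
    principle_id: str
    observation: str

@dataclass
class Application:
    """A record of a principle being used — closes the feedback loop."""
    principle_id: str
    session_id: str
    outcome: str                 # "success" | "partial" | "failure"
    notes: str = ""

p = Principle(id="development/no-silent-failures",
              name="No Silent Failures",
              text="If something fails, make it visible.")
```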
The first challenge: surface the right principles from a mixed corpus of 156 principles and 1,336 instances. Three search modes, each with different strengths:
| Mode | How it works | Good at | Bad at |
|---|---|---|---|
| FTS (full-text search — SQLite's FTS5 engine) | Keyword match, ranked by BM25 (scores by term frequency vs. corpus rarity; rare terms rank higher) | Exact terms like "blast radius" | Conceptual queries — "keep code simple" won't find "Less Is More" |
| Semantic | Text embedded as a 3,072-dimensional vector (Gemini text-embedding-004) that captures meaning, compared by cosine similarity (1.0 = identical meaning, 0.0 = unrelated) | Conceptual matches, paraphrases | Instances drown out principles |
| Hybrid | FTS + semantic merged (70/30 weights) | Best of both | More complex |
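A sketch of how the hybrid merge might work: each ranker's scores are min-max normalized (BM25 and cosine live on different scales) before a weighted sum. Assigning the larger weight to the semantic side is an assumption about which way the report's "70/30 weights" run.

```python
def hybrid_merge(fts_scores: dict, semantic_scores: dict,
                 w_semantic: float = 0.7, w_fts: float = 0.3) -> list:
    """Merge BM25 (FTS) and cosine (semantic) rankings into one score list."""
    def normalize(scores: dict) -> dict:
        # Min-max normalize to [0, 1] so the two scales are comparable.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    fts_n = normalize(fts_scores)
    sem_n = normalize(semantic_scores)
    docs = set(fts_n) | set(sem_n)          # union: a doc may appear in one list only
    merged = {d: w_semantic * sem_n.get(d, 0.0) + w_fts * fts_n.get(d, 0.0)
              for d in docs}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

A doc found by both rankers ranks above one found by only the weaker signal, which is the point of the merge.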
Five improvements over a raw-semantic baseline, each measured independently via an 84-query eval harness:
| # | Improvement | What it does | P@1 | MRR |
|---|---|---|---|---|
| 0 | Baseline | Raw semantic search, no adjustments | 0.738 | 0.813 |
| 1 | Principle boost | +0.06 cosine similarity for principles (compensates for the 10:1 instance ratio) | 0.857 | 0.908 |
| 2 | Threshold filter | Drop results scoring below 0.65 | — | — |
| 3 | Metadata filters | Scope by project or type | — | — |
| 4 | Hybrid merge | FTS + semantic combined | 0.845 | 0.907 |
| 5 | Value re-ranking | Boost from application_count — battle-tested principles rank higher | 0.905 | 0.931 |
The hardest queries improved the most. Cross-domain P@1 (precision at 1 — is the top result correct? — the most important retrieval metric) went from 0.60 → 0.90 and negation from 0.39 → 0.77. Value re-ranking — which uses application counts from the feedback loop — delivered the single largest improvement.
You describe a bad practice — can the system find the principle that forbids it?
$ learn find -s "catching exceptions silently and continuing"
[0.784] P development/no-silent-failures
No Silent Failures — If something fails, make it visible.
[0.763] P batch-jobs/errors-are-data-not-exceptions
Errors Are Data, Not Exceptions — At scale, errors will happen...
[0.757] P observability/error-logging
Error Logging — All errors must be captured where the system can see them.
The query describes an anti-pattern. The system finds principles that forbid it — even though they share no keywords with the query.
Retrieval gives you a list of principles. Adaptation gives you a checklist you can act on. Three mechanisms operate at different time scales:
All 156 principles are materialized as plain markdown files in ~/.claude/principles/ — pre-computed and regenerated from the database, which remains the source of truth; the markdown files are a readable projection for the agent. They're organized into 10 category files (dev, testing, batch-jobs, observability, etc.) loaded into every Claude Code session's context. At ~140 characters each, that's about 6K tokens. It fits.
Each principle includes its rationale, anti-pattern, and evidence count:
## No Silent Failures
If something fails, make it visible. Never swallow exceptions,
return empty data on error, or let a process silently produce
no output. The cost of a noisy failure is one interruption.
The cost of a silent failure is hours of debugging the wrong thing.
**Anti-pattern:** Bare except/pass, returning [] on error,
catching and logging but continuing to process corrupted data.
<!-- evidence: 16 instances, 121 applications, as of 2026-02-22 -->
This is the cheapest form of adaptation: the agent has the principles in context and can reason about them naturally. It works for principles the agent encounters frequently. But with 156 principles in the background, any individual one is easy to overlook when it matters most.
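The materialization step can be sketched as a pure render function producing the markdown shape shown above (the input field names are assumptions, not the real schema):

```python
def render_principle(p: dict, as_of: str) -> str:
    """Render one principle as the markdown block loaded into context.

    Mirrors the "No Silent Failures" example: heading, rule text,
    anti-pattern, and an evidence-count comment.
    """
    return (
        f"## {p['name']}\n\n"
        f"{p['text']}\n\n"
        f"**Anti-pattern:** {p['anti_pattern']}\n\n"
        f"<!-- evidence: {p['instance_count']} instances, "
        f"{p['application_count']} applications, as of {as_of} -->\n"
    )
```

A regeneration job would run this over every principle row and rewrite the 10 category files, keeping the database as the source of truth.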
This is where retrieval becomes adaptation. Before starting a non-trivial task, the agent (or a hook) runs the apply-learnings procedure:
It extracts a few search facets from the task, then runs `learn find -s "facet" --xml --limit 5` for each facet. The XML output includes principle text, rationale, anti-pattern, and similarity score — structured for LLM consumption.
The synthesis step is the adaptation. Here's what it looks like in practice:
<principle id="dev/fail-loud-never-fake"
score="0.84">
Fail Loud, Never Fake — Never
simplify a problem by substituting
mock data, placeholder implementations,
or lossy summaries.
</principle>
<principle id="batch-jobs/validate-before-bulk"
score="0.81">
Validate Before Bulk — Before
processing 300+ items, run a sample
of 5-10 and verify the output.
</principle>
<principle id="observability/error-logging"
score="0.79">
Error Logging — All errors must
be captured where the system can
see them.
</principle>
Abstract. Correct but not actionable for the specific task at hand.
## Pre-Flight Checklist
Task: Fetch YouTube transcripts
for 300 company earnings calls
Relevant principles:
- Validate Before Bulk — Fetch 5
transcripts first. Check: does the
API return the `transcript` field?
It's sometimes null for private
videos or those with auto-captions
disabled.
- Fail Loud, Never Fake — If a
transcript fetch returns empty text,
raise an error. Don't store an empty
string as if it were a real transcript.
- Error Logging — Log each failed
video_id to a separate errors file.
After the batch, report: "247/300
succeeded, 53 failed (see errors.jsonl)"
Watch out for:
- YouTube API rate limits kick in
around item 100. Use exponential
backoff (tenacity, not hand-rolled).
- The `transcript` field is null, not
absent, when captions are disabled.
Check `if transcript is None`, not
`if "transcript" not in response`.
Concrete. Every bullet is specific to this task, this API, this failure mode.
The key rule: be specific, not generic. "Validate inputs" is useless. "Check that the API returns the transcript field before processing — it's sometimes null for private videos" is useful. The specificity comes from combining the abstract principle with the agent's knowledge of the current task and codebase.
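The null-vs-absent distinction from the checklist is easy to demonstrate directly. A small sketch (the payload shape is hypothetical) that also follows the fail-loud rule:

```python
def get_transcript(response: dict) -> str:
    """Return the transcript or fail loud — never store an empty/None value."""
    transcript = response.get("transcript")
    if not transcript:  # catches both explicit null and missing key
        raise ValueError(f"no transcript for {response.get('video_id')}")
    return transcript

# The key EXISTS but its value is null — a membership check passes
# while the value is unusable:
response = {"video_id": "abc123", "transcript": None}
assert "transcript" in response          # membership check: looks fine
assert response["transcript"] is None    # value check: actually unusable
```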
The synthesis step is driven by explicit instructions. Here's the core of the procedure:
# From the apply-learnings skill:
1. Extract 2-4 search facets from the task
(e.g., "batch processing", "error handling at scale", "data quality")
2. Run parallel semantic searches:
$ learn find -s "facet" --xml --limit 5
3. Synthesize into a Pre-Flight Checklist:
For each retrieved principle, produce:
**[Principle Name]** — [1-sentence implication for THIS task]
RULES:
- Be specific, not generic. "Validate inputs" is useless.
"Check that the API returns the transcript field before
processing — it's sometimes null for private videos" is useful.
- Only include learnings that actually apply. Score > 0.7 is a
threshold, but use judgment.
- Keep it short. 3-7 bullet points total. Checklist, not lecture.
- Skip if nothing applies. Don't force-fit principles.
The prompt doesn't try to be clever. It tells the LLM to be specific, to skip irrelevant results, and to keep it short. The quality comes from combining good retrieval (the right principles are in the candidate set) with the LLM's ability to reason about what an abstract rule means in a concrete context.
A background hook runs on every non-trivial user prompt. It sends the prompt plus a compact catalog of all 156 principles to a fast LLM (Haiku), which returns 0–3 principle IDs that genuinely apply to this specific task:
# principles_worker.py (spawned as subprocess on each prompt)
SYSTEM_PROMPT = """
You detect which coding principles are relevant to a user's current task.
Given a user prompt and a list of principles, return the IDs of principles
that SHOULD guide the work. Be selective: typically 0-3 per prompt.
RULES:
- Only include principles that genuinely apply to THIS specific task
- Don't include generic principles just because they're always true
- A principle about testing is only relevant if the user is writing/running tests
"""
This serves two purposes: it builds the application-tracking data that feeds back into value-boosted re-ranking (principles detected more often in real work rank higher in future searches), and it provides a signal for when a principle is relevant even though nobody searched for it.
Currently, the detection results are recorded as telemetry — they populate the principle_applications table but aren't surfaced back into the agent's active context. The next step (see What's Next) is real-time injection: surfacing the detected principles directly into the session.
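The worker's post-processing step — extracting at most 3 principle IDs from the fast model's free-text reply and validating them against the catalog — might look like this (a sketch, not the actual worker code; the `category/slug` ID shape is taken from the examples above):

```python
import re

def parse_detected_ids(model_output: str, catalog: set, limit: int = 3) -> list:
    """Extract and validate principle IDs from the detector model's reply.

    Only IDs present in the known catalog survive; order of first
    appearance is preserved, duplicates dropped, capped at `limit`.
    """
    candidates = re.findall(r"[a-z][a-z0-9-]*/[a-z][a-z0-9-]*", model_output)
    seen, valid = set(), []
    for cid in candidates:
        if cid in catalog and cid not in seen:
            seen.add(cid)
            valid.append(cid)
    return valid[:limit]
```

Validating against the catalog matters because a small model will occasionally hallucinate a plausible-looking ID.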
| Layer | When | Adaptation depth | Cost |
|---|---|---|---|
| Ambient | Always in context | None — raw principle text | ~6K tokens, amortized |
| On-demand synthesis | Before non-trivial tasks | Full — task-specific checklist | ~2K in + ~500 out tokens, ~$0.01, ~3s latency |
| Auto-detection | Every prompt (background) | Classification only (relevant/not) | ~$0.001/prompt (Haiku) |
Ambient injection provides a foundation. On-demand synthesis kicks in for important tasks where generic principles need to be translated into specific actions. Auto-detection builds the data that makes both of the other layers better over time.
The system gets better by being used. Three mechanisms feed outcomes back into future retrieval and adaptation:
After the agent follows a principle, the outcome gets recorded:
$ learn apply dev/fail-loud-never-fake \
--outcome success \
--notes "caught the empty transcript issue before storing 300 bad records" \
--project rivus
Each application record captures: which principle, which session, what outcome (success/partial/failure), what error was prevented, estimated time saved. The principle_applications table now holds 1,732 records across both manual entries and automated detection.
Application counts feed directly back into search. The value boost formula:
boost = min(0.03, log(1 + application_count) × 0.005
+ log(1 + instance_count) × 0.003)
A principle like "Verify Action" with 121 applications gets +0.028 cosine score — enough to break ties when two principles have similar semantic scores. A brand-new principle with zero applications gets +0.000. Battle-tested principles rise; untested ones must earn their rank.
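The boost formula above, as a direct sketch (natural log is an assumption — the report doesn't specify the base):

```python
import math

def value_boost(application_count: int, instance_count: int) -> float:
    """Value re-ranking boost added to a result's cosine score.

    Logarithmic, so early applications count most; capped at 0.03 so
    popularity can break ties but never overturn a clear semantic win.
    """
    raw = (math.log(1 + application_count) * 0.005
           + math.log(1 + instance_count) * 0.003)
    return min(0.03, raw)
```

A never-applied principle gets exactly 0.0; heavily-used principles saturate at the 0.03 cap.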
This is the single most effective retrieval improvement: value re-ranking pushed overall P@1 from 0.845 to 0.905. The feedback loop makes retrieval better, which makes adaptation more likely to surface the right principles, which generates more application data, which improves retrieval further.
A separate process goes back through completed session transcripts and scores principle compliance per episode. For each significant episode (file edit, batch operation, API integration), an LLM judge with access to all principles asks: Was this principle followed? Violated? Could it have helped?
This produces outcome-aware application records even for sessions that didn't explicitly use the apply-learnings workflow. It's the mechanism that populated the bulk of the 1,732 application records — retroactively discovering which principles were implicitly relevant, whether the agent followed them, and what happened.
learning/search_eval.py contains 84 hand-crafted queries across 4 difficulty categories:
| Category | n | Baseline P@1 | Final P@1 | Delta |
|---|---|---|---|---|
| Exact (keyword match) | 16 | 0.938 | 1.000 | +0.062 |
| Conceptual (paraphrases) | 45 | 0.800 | 0.911 | +0.111 |
| Cross-domain | 10 | 0.600 | 0.900 | +0.300 |
| Negation (anti-patterns) | 13 | 0.385 | 0.769 | +0.384 |
| Overall | 84 | 0.738 | 0.905 | +0.167 |
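Both headline metrics are standard IR measures; a minimal sketch of how a harness computes them (not the actual `search_eval.py` code):

```python
def precision_at_1(results: list, expected: list) -> float:
    """Fraction of queries whose top-ranked doc is the expected one."""
    hits = sum(1 for ranked, gold in zip(results, expected)
               if ranked and ranked[0] == gold)
    return hits / len(expected)

def mean_reciprocal_rank(results: list, expected: list) -> float:
    """Average of 1/rank of the expected doc (0 when it's absent)."""
    total = 0.0
    for ranked, gold in zip(results, expected):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(expected)
```

MRR rewards near-misses (the right doc at rank 2 scores 0.5), which is why it sits above P@1 in every row of the improvement table.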
With P@1 at 0.905, roughly 1 in 10 queries returns the wrong top result. The failures cluster in the negation category (P@1 = 0.769 — nearly 1 in 4 negation queries misses). Two illustrative cases:
Expected: dev/trace-the-chain-to-an-action — "Every signal should lead to an action. A notification that nobody sees, a log that nobody reads, a metric that nobody checks — these are dead ends."
Got instead: observability/error-logging — "All errors must be captured where the system can see them."
Why it fails: The query mentions "logged" and "file" — semantically close to error logging. But the point of the query is that logging isn't enough. The system needs to understand that the query is about the insufficiency of logging, not about logging itself. This requires normative reasoning that embedding similarity can't perform.
Expected: dev/decompose-into-orthogonal-axes — "When a function grows flags, it's usually 2+ independent concerns tangled together. Separate them."
Got instead: Instances about CLI argument parsing and feature flag libraries — the embedding space clusters "flags" and "config options" near literal configuration management, not the architectural anti-pattern of accumulating complexity.
Why it fails: The abstraction gap. The query describes a symptom (adding flags); the principle addresses the root cause (tangled concerns). Bridging symptom to root cause requires interpretation, not similarity.
Both failures share the same pattern: the query describes a situation, and the correct principle operates at a different level of abstraction. This is exactly the gap an LLM re-ranker (see What's Next) would address — it can reason about "this query is about the insufficiency of X" rather than matching on X itself.
Retrieval is measurable with standard IR metrics. Adaptation quality is harder — it's a judgment about whether synthesized guidance is specific, correct, and actionable for the task at hand. We measure it indirectly through the feedback loop:
| Signal | What it measures | Current state |
|---|---|---|
| Application outcomes | Was the adapted principle actually useful? | 1,732 tracked applications |
| Success rate | Fraction of applications rated "success" | Per-principle, feeds into value boost |
| Retroactive compliance | Were principles followed even when not explicitly surfaced? | Episode-level scoring across past sessions |
| Prevented errors | Did the adapted guidance catch something before it happened? | Tracked in outcome_notes |
The honest gap: we don't yet have a controlled A/B measurement of "agent with adaptation" vs "agent without." The sandbox replay system (described in the companion report) can do this for individual principles, but hasn't been run on the full retrieve-adapt-apply pipeline end-to-end.
$ python -m learning.search_eval # run all modes, print table
$ python -m learning.search_eval --mode fts # FTS only
$ python -m learning.search_eval --verbose # per-query details
$ python -m learning.search_eval --report # HTML report to static
Report generated February 2026. Data from learning/data/learning.db. Eval harness: learning/search_eval.py (84 queries). Full detailed eval report: search_eval_report.html. Companion report: Accumulating Wisdom.