
Learning System

The system learns from its own mistakes: session review → principle extraction → better future sessions.

The Loop

1. Work — Claude Code sessions produce transcripts (JSONL files with every tool call, error, and correction).

2. Review — Multi-model judges (Gemini, Grok, Claude) analyze transcripts. Which tool call fixed which error? What patterns recur?

3. Extract — Error→repair pairs become principles. "When X happens, do Y." Scored by majority vote across judges.

4. Materialize — Principles written to ~/.claude/principles/*.md. Every future session inherits them automatically.

5. Measure — Sandbox eval: replay sessions in Docker, measure if principles actually improve outcomes.
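The first four steps can be sketched in a few functions. Everything here is illustrative: the transcript event schema (the `type` and `detail` fields), the judge output format, and the principle file naming are assumptions for the sketch, not the system's real interfaces.

```python
import json
from collections import Counter
from pathlib import Path

def extract_pairs(transcript_path):
    """Scan a JSONL transcript for a tool error followed by the call
    that fixed it. Field names are invented for this sketch."""
    pairs, last_error = [], None
    for line in Path(transcript_path).read_text().splitlines():
        event = json.loads(line)
        if event.get("type") == "tool_error":
            last_error = event
        elif event.get("type") == "tool_ok" and last_error:
            pairs.append((last_error["detail"], event["detail"]))
            last_error = None
    return pairs

def majority_vote(judgements):
    """Keep principles endorsed by a majority of judges.
    Each judgement is the list of principles one judge extracted."""
    votes = Counter(p for judge in judgements for p in judge)
    quorum = len(judgements) // 2 + 1
    return [p for p, n in votes.items() if n >= quorum]

def materialize(principles, out_dir="~/.claude/principles"):
    """Write each surviving principle to its own markdown file,
    where future sessions can pick it up."""
    out = Path(out_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(principles):
        (out / f"principle-{i:03d}.md").write_text(f"# Principle\n\n{text}\n")
```

With three judges, a principle needs two votes to survive, which is what "scored by majority vote" amounts to in this sketch.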

Detailed learning loop diagram coming soon

In the meantime: kb.localhost/learning

Scale

664+ sessions reviewed and analyzed.

25K+ instances linked to principles, auto-classified by an LLM.

Vector embeddings for semantic retrieval — find relevant principles for any new situation.
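A minimal sketch of that retrieval, assuming the embeddings are already computed. The hand-made 2-d vectors below stand in for a real embedding model's output; only the ranking logic is the point.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=3):
    """Return the k principles whose embeddings best match the query.

    `index` maps principle text -> embedding vector. In the real system
    both sides would come from the same embedding model.
    """
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```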

Gyms

Badge gym — test prompt variants by replaying real sessions, score quality, pick best.

Fetchability gym — probe URLs with different methods in parallel, build site × method matrix.
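The fetchability gym's fan-out can be sketched with a thread pool. The method names and the `probe` callable are placeholders; a real probe would issue HTTP requests with the given method and report success.

```python
from concurrent.futures import ThreadPoolExecutor

def build_matrix(sites, methods, probe, max_workers=8):
    """Probe every (site, method) pair in parallel and return a
    site x method matrix of booleans.

    `probe(site, method) -> bool` is supplied by the caller; here it
    can be any function, which keeps the sketch network-free.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {(s, m): pool.submit(probe, s, m)
                   for s in sites for m in methods}
    # The `with` block waits for all probes before we collect results.
    return {s: {m: futures[(s, m)].result() for m in methods}
            for s in sites}
```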

Principles flow back into sessions via CLAUDE.md and the principles directory — closing the loop.

Learning from the Delta

The most direct learning signal comes from doing more work on the same problem. Run a quick session, then invest 3× the effort on the same question. Compare the results. The delta — what the deeper pass found that the quick pass missed — reveals systematic blind spots.

This creates a natural teaching loop: the expensive pass produces a target that the cheap pass should have reached. Over time, we can even use this for distillation — teaching a fast model (Haiku) to approximate what a slow model (Opus) would have caught, so everyday sessions get closer to deep-effort quality without the cost.
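The delta itself reduces to a set difference over normalized findings. A sketch, assuming findings have already been extracted from the quick and deep transcripts as strings:

```python
def finding_delta(quick, deep):
    """Compare what a quick pass and a deep pass found on the same problem.

    `quick` and `deep` are sets of normalized finding strings; in the
    real pipeline they would be extracted from session transcripts.
    """
    return {
        "missed_by_quick": deep - quick,   # the teaching signal
        "only_in_quick": quick - deep,     # possible noise or lucky hits
    }
```

The `missed_by_quick` set is the training target: what a cheaper model should learn to reach on the first pass.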

Four Layers of Accumulated Knowledge

Every LLM session produces valuable artifacts — decisions, discoveries, patterns, code. Most systems throw them away. We persist them in four storage layers, each with different retrieval characteristics:

Code — Direct file edits, refactors, new modules. The most concrete layer — it changes what runs. But the least searchable for intent.

Semantic net — Vector embeddings of session content, recaps, topics. Enables "find the session where we figured out X" via meaning, not keywords.

Learnings & principles — Structured observations with provenance: which session, what error, what fix. Auto-deduped, auto-materialized to files every future session inherits.

Skills — Codified workflows and checklists. The highest-leverage layer — a single skill encodes a process that would otherwise be re-discovered every time.

The system's value comes from all four layers working together: code handles what, semantic net handles when/where, learnings handle why/how, and skills handle the playbook.
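As one concrete illustration of the learnings layer, a provenance record with an auto-dedup pass might look like the following. The field names and the dedup key are invented for the sketch, not the system's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Learning:
    """One structured observation with provenance:
    which session, what error, what fix."""
    session_id: str
    error: str
    fix: str
    principle: str

    def key(self):
        # Dedup key: the same error with the same fix is the same learning,
        # regardless of which session it came from.
        return (self.error.strip().lower(), self.fix.strip().lower())

def dedupe(learnings):
    """Keep the first learning seen for each (error, fix) pair."""
    seen, kept = set(), []
    for learning in learnings:
        if learning.key() not in seen:
            seen.add(learning.key())
            kept.append(learning)
    return kept
```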

What Makes This Unusual

Most AI systems forget between sessions. This one remembers. Not just conversation history — it extracts why something worked or failed, encodes the pattern, and applies it automatically next time. The improvement is cumulative and measurable.