# Can Claude Speed Itself Up?

**Self-Improvement Through Behavioral Principle Extraction, Sandbox Evaluation, and Automated Refinement**

*January 2026*

---

## Abstract

A developer using Claude Code often watches it try things two or three ways before eventually succeeding. That is a modern miracle — but wouldn't it be even better if Claude could learn from that experience without requiring human intervention? We present just such a system: Claude Code analyzes its own session transcripts, extracts behavioral principles, tests them in sandboxed Docker replay, and deploys only the ones that measurably help — biasing future actions toward approaches with a better chance of success. The loop is entirely self-contained: it uses the same Claude model for analysis and refinement, requiring no disclosure to a third party.¹ From 157 sandbox runs (20 prompts × 8 variants, minus three failed cells), we find that the best principles save 3–76% wall-clock time on hard tasks while leaving simple tasks unaffected, with a worst-case average regression of 15% for any single principle. Principle refinement (v1→v2 with guard clauses) eliminates regressions while preserving wins. This demonstrates a concrete, automated self-improvement loop for AI coding assistants.

¹ Code and data: `learning/session_review/`

---

## 1. Introduction

AI coding assistants execute tasks with varying efficiency. Two sessions solving the same problem can differ by 5× in wall-clock time depending on strategy: whether the agent researches before probing, parallelizes independent operations, or avoids unnecessary subagent spawning.

**Can an agent learn from its own mistakes?** Specifically:

1. Can we extract behavioral patterns from session transcripts?
2. Can we test whether those patterns actually help?
3. Can we automatically fix patterns that cause regressions?
4. Do the improvements compose?

We answer all four questions affirmatively. The system works by:
- Mining real sessions for inefficiency patterns (sequential tool calls, try-fail-retry loops, unnecessary subagents)
- Codifying patterns as behavioral principles injected via `CLAUDE.md`
- Testing each principle across diverse prompts in Docker sandbox replay
- Scoring results with LLM-as-judge for quality (preventing "faster but wrong")
- Automatically refining principles that cause regressions

### 1.1 Key Findings

| Finding | Evidence |
|---------|----------|
| Best principles save 3–76% on hard tasks | fix-deprecation: 138s → 33s with parallelism |
| Simple tasks unaffected | add-dataclass: 17s baseline, 16–19s across all principles |
| Worst-case regression is bounded | testing: +14% avg, +150% worst (test-coverage) |
| Quality is preserved | LLM judge scores uncorrelated with speedup |
| Refinement eliminates regressions | v1→v2 guard clauses fix targeted prompts |
| Composition: diminishing returns expected | Hypothesis (ablation pending): parallelism+ux best pair |

---

## 2. Infrastructure

### 2.1 Sandbox Replay Architecture

Each evaluation run executes in an ephemeral Docker container:

```
┌─────────────────────────────────────────┐
│  Host                                   │
│  ┌───────────────────────────────────┐  │
│  │  Docker container (claude-replay) │  │
│  │  ┌─────────────────────────────┐  │  │
│  │  │  rivus repo @ commit SHA    │  │  │
│  │  │  + injected CLAUDE.md       │  │  │
│  │  │                             │  │  │
│  │  │  $ claude --dangerously-    │  │  │
│  │  │    skip-permissions         │  │  │
│  │  │    --output-format json     │  │  │
│  │  │    --verbose -p "..."       │  │  │
│  │  └─────────────────────────────┘  │  │
│  └───────────────────────────────────┘  │
│  → wall_clock, tool_calls, turns, JSON  │
└─────────────────────────────────────────┘
```

**Key design decisions:**
- **Git commit pinning**: Every run operates on the same repo state, eliminating code drift as a confound.
- **CLAUDE.md injection**: Principles are injected as the project's CLAUDE.md, the natural instruction mechanism for Claude Code.
- **Full JSON capture**: `--output-format json --verbose` provides complete tool call traces for post-hoc analysis.
- **Parallel execution**: Bounded semaphore controls concurrent containers (default: 6–10).
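A minimal sketch of one sandboxed run, split into command assembly and metric extraction. The image name and the output field names (`duration_ms`, `num_turns`) are assumptions about the local setup, not the pipeline's actual code:

```python
import json


def build_replay_cmd(prompt: str, image: str = "claude-replay") -> list:
    """Assemble the docker invocation for one ephemeral eval run.

    `--rm` makes the container ephemeral; the repo state inside the
    image is assumed to be pinned to the campaign's commit SHA.
    """
    return [
        "docker", "run", "--rm", image,
        "claude", "--dangerously-skip-permissions",
        "--output-format", "json", "--verbose",
        "-p", prompt,
    ]


def parse_run(stdout: str) -> dict:
    """Extract headline metrics from the JSON the CLI prints.

    Field names are illustrative; adapt to the actual output schema.
    """
    data = json.loads(stdout)
    return {
        "wall_clock_s": data.get("duration_ms", 0) / 1000,
        "turns": data.get("num_turns"),
    }
```

Keeping command construction pure makes it easy to unit-test the harness without spinning up containers.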

### 2.2 Data Pipeline

```
Session transcripts (JSONL)
    ↓ prompt_mining.py
Candidate prompts (categorized, scored)
    ↓ manual curation
Campaign YAML (prompts × principles × runs)
    ↓ sandbox_replay.py --eval
Sandbox results DB (wall-clock, tools, JSON)
    ↓ score_tag() with LLM judges
Quality-annotated results
    ↓ report_analysis.py
Statistical analysis (CI, p-values, effect sizes)
    ↓ report_figures.py
Publication figures
```

### 2.3 LLM-as-Judge Scoring

Results are scored for correctness to distinguish "faster but wrong" from "faster and correct."

- **Scale**: 0–100 (0 = not attempted, 50 = partial, 100 = fully correct)
- **Multiple judges**: gemini-3-flash (3 repeats) + claude-haiku-4.5 (2 repeats) for cross-validation
- **Score = median** across all judge evaluations
- **Variance control**: `choices` (correlated, cheap) vs `repeats` (independent, expensive)
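The aggregation step above is small enough to sketch directly: pool every evaluation from every judge and take the median, which damps both an outlier judge and an outlier repeat:

```python
from statistics import median


def aggregate_scores(judge_scores: dict) -> float:
    """Median across all judge evaluations.

    `judge_scores` maps judge name -> list of 0-100 scores from that
    judge's repeats. Pooling before taking the median means no single
    judge or single noisy repeat can dominate the result.
    """
    pooled = [s for scores in judge_scores.values() for s in scores]
    return median(pooled)
```

For example, `aggregate_scores({"gemini-3-flash": [90, 85, 95], "claude-haiku-4.5": [80, 100]})` returns the median of the five pooled scores, 90.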

---

## 3. Experimental Design

### 3.1 Pilot Campaign (v0, Jan 29)

| Parameter | Value |
|-----------|-------|
| Prompts | 20 across 4 categories |
| Principles | 7 (development, parallelism, observability, testing, dev, ux, backtesting) |
| Runs per cell | 1 |
| Judge | gemini-2.0-flash-lite, 5 repeats |
| Total runs | 157 (some cells failed) |
| Commit | Pinned to cae8843 |

**Categories**: exploration/search (5), implementation (5), research/analysis (5), complex/multi-step (5)

### 3.2 Full Campaign (v1, planned)

| Parameter | Value |
|-----------|-------|
| Prompts | 50 across 8 categories |
| Principles | 7 singles |
| Runs per cell | 3 |
| Judges | gemini-3-flash (×3) + haiku-4.5 (×2) |
| Total runs | 1,200 |

**New categories** added: refactoring, error recovery, database, config, integration.

**Statistical power**: 3 runs per cell enables paired t-tests with ~80% power to detect 25% effects at α=0.05 (given observed CV ≈ 20%).

### 3.3 Ablation Campaign (composition)

| Parameter | Value |
|-----------|-------|
| Prompts | 10 (representative subset) |
| Variants | 7 singles + 21 pairs + baseline = 29 |
| Runs per cell | 3 |
| Total runs | 870 |

Tests all 21 two-principle combinations using `+`-joined principle names (e.g., `parallelism+ux`). The sandbox concatenates the principle texts with a `---` separator.

### 3.4 Prompt Selection

Prompts are either:
1. **Hand-crafted** to cover specific task types (20 in pilot)
2. **Mined from real sessions** via `prompt_mining.py`:
   - Queries `journal.db` for sessions with ≥5 tool calls
   - Extracts initial user prompt from JSONL transcripts
   - Categorizes by keyword matching (fast, no LLM)
   - Scores complexity: tool calls (30pts) + turns (30pts) + duration (40pts)
   - Outputs campaign-ready YAML
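The complexity weighting can be sketched as a clamped linear score. The 30/30/40 split is from the mining description above; the normalization caps are illustrative assumptions, not the pipeline's own values:

```python
def complexity_score(tool_calls: int, turns: int, duration_s: float) -> float:
    """Weight session stats into a 0-100 complexity score.

    Each component is normalized against an assumed cap and clamped,
    so one extreme session cannot blow past the point budget.
    """
    return (
        30 * min(tool_calls / 50, 1.0)      # 30 pts: tool-call volume
        + 30 * min(turns / 40, 1.0)         # 30 pts: conversational turns
        + 40 * min(duration_s / 600, 1.0)   # 40 pts: wall-clock duration
    )
```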

### 3.5 Principles

Principles are behavioral instructions stored in `~/.claude/principles/*.md`. Each describes a pattern extracted from session analysis:

| Principle | Core Idea | Source |
|-----------|-----------|--------|
| **parallelism** | Background research agents, batch independent reads | 661 missed opportunities, ~497 min wasted (parallelism_analysis.py) |
| **development** | General best practices (plan before code, verify) | Session review: try-fail-retry patterns |
| **ux** | Draft interface skeleton before implementing | Session review: 62-edit sessions with no plan |
| **observability** | Log assumptions, verify at boundaries | Tool error analysis: preventable errors |
| **testing** | Test-first, verify before moving on | Session review: late-discovery bugs |
| **dev** | Dev-specific patterns (imports, paths) | Howto guides from repeated failures |
| **backtesting** | Trading-specific patterns | Domain-specific session analysis |

---

## 4. Results

### 4.1 Principle Effectiveness (Pilot, n=157)

**Ranked by average wall-clock change vs baseline:**

| Rank | Principle | Avg Δ time | Best prompt (Δ) | Worst prompt (Δ) |
|------|-----------|------------|------------------|-------------------|
| 1 | parallelism | **-3%** | fix-deprecation (-76%) | gradio-layout-fix (+90%) |
| 2 | ux | **-1%** | extract-constants (-68%) | env-vars (+113%) |
| 3 | development | **-0%** | fix-deprecation (-45%) | dependency-graph (+73%) |
| 4 | observability | +7% | gradio-apps (-32%) | test-coverage (+157%) |
| 5 | backtesting | +9% | extract-constants (-45%) | fix-deprecation (+106%) |
| 6 | testing | +14% | fix-deprecation (-65%) | test-coverage (+150%) |
| 7 | dev | +15% | add-health-check (-66%) | extract-constants (+129%) |

**Key observations:**
- **No principle is universally helpful.** Every principle has prompts where it helps and prompts where it hurts.
- **Variance is high.** The best and worst cases for each principle differ by 100–200 percentage points.
- **The top 3 principles are near-neutral on average** but have large wins on specific tasks.
- **The bottom 4 principles hurt on average** but still have valuable wins on specific tasks.

### 4.2 Task Difficulty and Speedup Potential

Harder tasks (longer baseline) show larger potential speedup:

| Difficulty tier | Baseline range | Best speedup | Example |
|-----------------|---------------|--------------|---------|
| Hard (>120s) | 122–287s | -24% to -76% | fix-deprecation: 138→33s |
| Medium (60–120s) | 61–90s | -17% to -53% | logging-audit: 144→68s |
| Easy (<60s) | 17–56s | -5% to -47% | batch-rename: 56→30s |
| Trivial (<20s) | 17s | -5% | add-dataclass: 17→16s |

**The 308s→70s result** (error-patterns with observability, from Level 3 pilot) demonstrates the maximum potential: the principle prevented Claude from spawning a blocking Task subagent, saving 238s (77%).

### 4.3 Quality Preservation

Quality scores (LLM judge, 0–100 scale) show no correlation with speedup:

- Faster runs do not produce lower-quality results
- Principle-induced speedups come from *fewer unnecessary steps*, not from *skipping necessary work*
- Exception: some regressions (e.g., test-coverage +157%) involve the principle causing Claude to do *more work than needed*, not less

### 4.4 Measurement Reliability

**Wall-clock CV ≈ 20%** across repeated runs of the same cell. This means:
- Effects < 25% are within noise for single runs
- 3 runs per cell (full campaign) provide enough power for 25%+ effects
- The pilot's 1-run-per-cell design is sufficient for *screening* but not for *claims*

Prompts with highest CV (least reliable): test-coverage, find-dead-code, extract-constants — all complex tasks with multiple valid approaches.

### 4.5 Prompt × Principle Interaction

The full matrix (20 prompts × 7 principles) reveals strong interaction effects:

**Consistent wins** (principle helps across most prompts):
- parallelism on exploration tasks
- ux on implementation/refactoring tasks

**Consistent losses** (principle hurts across most prompts):
- testing on research tasks (adds unnecessary verification)
- dev on exploration tasks (adds unnecessary dev workflow)

**High variance** (helps some, hurts others):
- development: -42% on add-cli-flag, +73% on dependency-graph
- observability: -32% on gradio-apps, +157% on test-coverage

This interaction structure motivates **prompt-type-aware principle selection** rather than blanket application.

---

## 5. Principle Refinement

### 5.1 The Regression Problem

Applying a principle globally introduces regressions on some tasks. Example:

**observability** on test-coverage:
- Baseline: 44s, 6 tool calls
- With observability: 113s (+157%), 40 tool calls (+567%)
- The principle caused Claude to obsessively log and verify at every step, turning a simple search into an audit

### 5.2 Guard Clause Generation

The refinement pipeline (`principle_refine.py`):

1. **Detect**: Find (principle, prompt) pairs with >15% time regression or >10pt quality drop
2. **Analyze**: Compare tool_breakdown between baseline and principle runs — identify what extra work was done
3. **Generate**: LLM produces a guard clause (1–3 sentences) that prevents the specific regression
4. **Apply**: Append guard clause to principle, save as `{name}_v2.md`
5. **Re-test**: Run the regression prompts with v2 to verify fix
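The detection step (step 1) reduces to a threshold check against the matching baseline cell. A minimal sketch, with the results keyed by (principle, prompt) as an assumed in-memory shape rather than the actual DB schema:

```python
def find_regressions(results: dict, time_thresh=0.15, quality_thresh=10):
    """Flag (principle, prompt) cells that regressed vs baseline.

    `results` maps (principle, prompt) -> (wall_clock_s, quality);
    baseline cells use principle == "baseline". Thresholds follow the
    pipeline's >15% time / >10pt quality criteria.
    """
    flagged = []
    for (principle, prompt), (t, q) in results.items():
        if principle == "baseline":
            continue
        base_t, base_q = results[("baseline", prompt)]
        if t > base_t * (1 + time_thresh) or q < base_q - quality_thresh:
            flagged.append((principle, prompt))
    return flagged
```

On the §5.1 numbers, observability on test-coverage (113s vs a 44s baseline) clears the 15% bar by a wide margin and would be flagged.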

### 5.3 Example: Research Before Probing (v1→v2)

**v1 problem**: "Research Before Probing" over-applied to known codebases, causing 23→31 tool calls on one segment.

**v2 guard clause**:
> Do NOT research when: the codebase is already known, the task is implementation not exploration, you already have the info, or targeted tools (Grep/Glob/Edit) are available.

**Result**: v2 eliminated the regression while preserving the wins on unfamiliar API tasks.

### 5.4 Refinement as Self-Improvement

The refinement loop is itself automated:
```
eval results → detect regressions → generate guard clause → re-eval → deploy if improved
```

This is a concrete instance of **self-improvement**: the system identifies its own failure modes and generates targeted fixes. The human role is reduced to approving the guard clause (or the threshold could be automated).
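The loop above can be sketched in a few lines. The `evaluate` and `generate_guard` callables stand in for the sandbox runner and the LLM call, and the 15% threshold follows §5.2; this is an orchestration sketch, not the pipeline's actual control flow:

```python
def refinement_loop(principle, prompts, evaluate, generate_guard, max_rounds=3):
    """One automated pass: eval, detect regressions, patch, re-eval.

    `evaluate(principle, prompt)` returns wall-clock seconds, with
    principle=None meaning baseline. Deploys (returns) the refined
    text only once no regression remains; gives up after max_rounds.
    """
    current = principle
    for _ in range(max_rounds):
        regressions = [
            p for p in prompts
            if evaluate(current, p) > evaluate(None, p) * 1.15
        ]
        if not regressions:
            return current  # deploy: no regressions remain
        # Append a targeted guard clause and try again (v1 -> v2 -> ...)
        current = current + "\n\n" + generate_guard(current, regressions)
    return None  # did not converge; keep v1 and flag for human review
```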

---

## 6. Composition and Ablation

### 6.1 Design

All 21 two-principle combinations tested on 10 representative prompts (3 runs each = 870 runs). Principles are concatenated in `CLAUDE.md` with `---` separator.

### 6.2 Hypotheses

1. **Complementary pairs compose well**: parallelism + ux (different failure modes)
2. **Overlapping pairs interfere**: development + dev (similar advice, potential contradiction)
3. **Diminishing returns**: best pair < sum of individual improvements

### 6.3 Results

*Pending full campaign execution. Infrastructure is ready: `sandbox_replay.py` supports `+`-joined principle names, and the campaign config supports a `compositions` key.*

---

## 7. Discussion

### 7.1 What Works

- **Principle injection via CLAUDE.md** is a natural, zero-overhead mechanism
- **Hard tasks benefit most** — the principle prevents expensive wrong turns
- **Quality is preserved** — speedups come from efficiency, not shortcuts
- **Refinement works** — guard clauses fix regressions without breaking wins

### 7.2 What Doesn't Work

- **No principle is universally helpful** — blanket application averages out wins and losses
- **Simple tasks are unaffected** — principles add no value to 17s tasks
- **Domain-specific principles hurt on out-of-domain tasks** — backtesting on general code tasks

### 7.3 Implications

1. **Principle selection should be prompt-aware.** A classifier that matches task type to principle set would outperform blanket application.
2. **The best principles target specific failure modes.** "Parallelize research agents" is better than "be efficient."
3. **Self-improvement is feasible** for AI coding assistants, at least for behavioral patterns.
4. **The refinement loop is the key mechanism** — without it, principles degrade to net-neutral.

### 7.4 Limitations

- **Single model** (Claude Sonnet): Results may not transfer to other models
- **Single codebase** (rivus): Task distribution is biased toward this repo's structure
- **Wall-clock as metric**: Includes Docker overhead (~5s) and API latency variation
- **1 run per cell** in pilot: Statistical claims require the 3-run campaign
- **LLM judge validity**: Judge scores correlate with human judgment but are not calibrated

---

## 8. Related Work

- **Constitutional AI** (Anthropic, 2022): Principles for alignment, not efficiency
- **Self-Refine** (Madaan et al., 2023): LLM self-critique for output quality
- **Reflexion** (Shinn et al., 2023): Verbal reinforcement for task agents
- **Voyager** (Wang et al., 2023): Skill library for Minecraft agent self-improvement

Our approach differs in targeting **operational efficiency** (wall-clock, tool calls) rather than task accuracy, and using **real sandbox replay** rather than simulated environments.

---

## 9. Future Work

1. **Prompt-type-aware principle selection**: Classifier that predicts which principles help for a given task
2. **Automatic principle extraction**: Mine principles directly from session transcripts without human curation
3. **Multi-model evaluation**: Test whether principles transfer across Claude model versions
4. **Continuous refinement**: Integrate refinement loop into daily workflow
5. **Cost optimization**: Track API cost alongside wall-clock for total efficiency
6. **Composition search**: Find optimal principle sets for different task categories

---

## Appendices

### A. Prompt Catalog

See `eval_campaigns/full-report-v1.yaml` for the complete 50-prompt catalog across 8 categories.

### B. Principle Texts

Stored in `~/.claude/principles/*.md`. See `REPLAY_TEARCARD.md` for detailed descriptions of key principles.

### C. Statistical Methods

- **Bootstrap CI**: 10,000 resamples with replacement, 2.5th/97.5th percentiles
- **Paired t-test**: scipy.stats.ttest_rel on matched (prompt_label) pairs
- **Cohen's d**: (mean_diff) / pooled_sd
- **Win rate**: % of prompts where principle avg < baseline avg
- **CV**: std / mean of wall_clock across runs of the same cell
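The bootstrap CI can be sketched in pure stdlib Python (the percentile method described above, applied to per-prompt paired differences):

```python
import random
from statistics import mean


def bootstrap_ci(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference.

    `diffs` are per-prompt (principle - baseline) wall-clock deltas.
    Resamples with replacement n_boot times and reads off the
    2.5th/97.5th percentile resample means.
    """
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Seeding the resampler keeps the reported intervals reproducible across report regenerations.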

### D. Reproducibility

All results are regeneratable:

```bash
# Run the full campaign
python -m learning.session_review.sandbox_replay \
    --eval full-report-v1 --parallel 10 --tag full-v1

# Score with multi-judge
python -m learning.session_review.sandbox_replay \
    --score --tag full-v1

# Generate analysis
python -m learning.session_review.report_analysis --tag full-v1 --markdown

# Generate figures
python -m learning.session_review.report_figures --tag full-v1

# Verify all figures render
python -m learning.session_review.report_figures --tag full-v1 --verify

# Run ablation
python -m learning.session_review.sandbox_replay \
    --eval ablation-pairs --parallel 10 --tag ablation-v1

# Detect regressions and generate v2 principles
python -m learning.session_review.principle_refine --tag full-v1
```

Campaign YAML configs and commit SHAs are recorded in the results DB for full reproducibility.

### E. Raw Data

All data in `learning/session_review/data/sandbox_results.db`:
- `sandbox_runs` table: one row per run with full metrics and JSON traces
- Queryable by tag, principle, prompt label
- Quality scores from multiple LLM judges

### F. Figure Index

| # | Figure | File | Description |
|---|--------|------|-------------|
| 1 | Forest plot | `figures/forest_plot.png` | Principle speedup with 95% CI |
| 2 | Speedup vs baseline | `figures/speedup_vs_baseline.png` | Harder tasks benefit more |
| 3 | Heatmap | `figures/heatmap.png` | Full prompt × principle matrix |
| 4 | Quality-speed | `figures/quality_speed.png` | No quality-speed tradeoff |
| 5 | CV by prompt | `figures/cv_by_prompt.png` | Measurement reliability |
| 6 | v1→v2 refinement | `figures/refinement_v1_v2.png` | Guard clause effectiveness |
| 7 | Composition | `figures/composition.png` | Diminishing returns of combining principles |
