# Prompt Caching

Reduce costs and latency by caching repeated content across API calls.

## Quick Start

```python
from lib.llm import stream_llm, call_llm

# Cache a document, vary the prompt — 2nd+ call gets ~90% cheaper input
async for chunk in stream_llm(
    model="sonnet",
    prompt="Summarize the key themes",
    cached_content="<document>\n...\n</document>",  # cached across calls
    system="Be concise",                             # can vary freely
):
    print(chunk, end="")

# Or non-streaming
result = await call_llm("sonnet", "What are the risks?",
    cached_content=doc_text, system="Focus on financial risks")
```

## How It Works

For **Anthropic models**: `cached_content` creates a `cache_control` breakpoint in the system
message. The API hashes everything up to that breakpoint — subsequent calls with identical
content reuse the cached prefix.

For **other providers** (OpenAI, Gemini, xAI): `cached_content` is concatenated with `system`
as plain text. These providers cache repeated prefixes automatically — no explicit markup needed.
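For Anthropic, the shape of the resulting system message is worth seeing. Below is a minimal sketch of how `cached_content` could be turned into system blocks — the `cache_control` block format is the Anthropic Messages API's; the helper name and the wrapper's exact internals are assumptions:

```python
def build_system_blocks(cached_content: str, system: str) -> list[dict]:
    """Place cached_content first with a cache_control breakpoint,
    then the freely-varying system text after it (sketch, not lib.llm internals)."""
    return [
        {
            "type": "text",
            "text": cached_content,
            # breakpoint: everything up to and including this block is hashed
            "cache_control": {"type": "ephemeral"},
        },
        # after the breakpoint — varying this does not bust the cache
        {"type": "text", "text": system},
    ]
```

Because `system` sits after the breakpoint, it can change per call while the document prefix stays cached.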

## Token Minimums

| Model                                         | Min Tokens |
|-----------------------------------------------|------------|
| Claude Opus 4.6, Claude Opus 4.5              | **4,096**  |
| Claude Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4 | **1,024**  |
| Claude Haiku 4.5                              | **4,096**  |

If cached content is below the minimum, the request works normally — no error, just no caching.

## Cache Hierarchy

`tools → system → messages`

**Changing an earlier level busts the cache for every later level:**

| Structure | Cache preserved? |
|-----------|-----------------|
| `cached_content` stable, vary `system` | ✓ |
| `cached_content` stable, vary `user` | ✓ |
| Vary `cached_content` | ✗ busted! |
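The practical consequence: the cached prefix must be byte-identical across calls. The API's actual keying is internal, but prefix hashing illustrates why even a one-character difference misses:

```python
import hashlib

def cache_key(tools: str, cached_content: str) -> str:
    """Illustrative only: hash the tools + cached-content prefix.
    Later levels (user messages) are not part of this key."""
    return hashlib.sha256((tools + "\x00" + cached_content).encode()).hexdigest()

k1 = cache_key("", "<document>...</document>")
k2 = cache_key("", "<document>...</document>")
k3 = cache_key("", "<document>... </document>")  # one extra space

assert k1 == k2  # identical prefix -> cache hit
assert k1 != k3  # any byte difference -> cache miss
```

So build `cached_content` deterministically (no timestamps, no re-serialized JSON with shuffled keys) if you want hits.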

## TTL (Time to Live)

- **Default `"5m"`**: 5 minutes (refreshed on each use). Write cost: 1.25x base.
- **Extended `"1h"`**: 1 hour. Write cost: 2x base. Use for long-running workflows.

```python
# 1-hour cache for agentic workflows
await call_llm("sonnet", prompt, cached_content=doc, cache_ttl="1h")
```

## Cost Savings

| Scenario | Input Cost |
|----------|-----------|
| No caching | Full price |
| Cache write (5m) | 1.25x price |
| Cache write (1h) | 2.0x price |
| Cache read | **0.1x price** (90% off) |

With 10 variants sharing a 5,000-token doc:
- Without cache: 50,000 tokens at full price
- With cache: 5,000 write (1.25x) + 45,000 read (0.1x) ≈ 10,750 token-equivalents — roughly 78% savings
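The arithmetic above, worked out (costs expressed in token-equivalents at the base input rate):

```python
doc_tokens, variants = 5_000, 10

no_cache = doc_tokens * variants                                  # 50,000
with_cache = (doc_tokens * 1.25                                   # one cache write
              + doc_tokens * (variants - 1) * 0.1)                # nine cache reads
savings = 1 - with_cache / no_cache                               # ~0.78
```

Savings grow with the number of variants, since the 1.25x write premium is paid only once.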

## Latency (TTFT Reduction)

| Cached Prefix Size | TTFT Reduction |
|--------------------|----------------|
| 100K tokens        | ~79%           |
| 10K tokens         | ~31%           |

## Use Cases

### Vario: Multiple strategies on same document

```python
# All variants share cached doc — automatic via stream_variant()
for variant in ["bio", "physics", "simple"]:
    async for chunk in stream_variant(prompt, variant, document=doc):
        ...  # cache hit on doc for 2nd+ variant
```

### Brain: Multiple questions on same document

```python
for question in questions:
    result = await call_llm("sonnet", question,
        cached_content=doc, system="Answer from the document only")
```

### Via HTTP server

```bash
# Warm server with caching
http POST localhost:8120/call_llm \
    model=sonnet prompt="Summarize" \
    cached_content="<doc>...</doc>" cache_ttl=1h

# Subscription-backed (flat-rate billing)
http POST localhost:8120/call_llm \
    model=haiku prompt="Hello" subscription:=true
```

## Checking Cache Usage

Cache metrics are available in the API response's `usage` object:
- `cache_read_input_tokens` — tokens served from cache (0.1x price)
- `cache_creation_input_tokens` — tokens written to cache (1.25x or 2x, per TTL)
- `input_tokens` — uncached tokens
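A small helper can turn those three fields into a hit-rate readout. The field names are the API's; treating `usage` as a plain dict is an assumption about what `call_llm` exposes:

```python
def summarize_usage(usage: dict) -> str:
    """Report cache hit rate from a response's usage fields (sketch)."""
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    hit_rate = read / max(read + write + fresh, 1)  # fraction of input served from cache
    return f"read={read} write={write} fresh={fresh} hit_rate={hit_rate:.0%}"
```

A healthy steady state for document Q&A looks like a large `read`, zero `write`, and a small `fresh` count for the varying prompt.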

## Concurrency Note

A cache entry only becomes available after the first response **begins** streaming.
For parallel requests sharing the same cached content, send one request first and wait
for its first token before firing the rest — this ensures cache hits.
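The warm-up pattern can be sketched with `asyncio` — the orchestration below is an assumption, not part of `lib.llm` (in real use you would await the first chunk of a `stream_llm` stream rather than the whole first call):

```python
import asyncio

async def fan_out(questions, doc, call):
    """Send one request first to write the cache, then fire the rest in parallel."""
    first, *rest = questions
    first_result = await call(first, doc)  # this request writes the cache entry
    others = await asyncio.gather(*(call(q, doc) for q in rest))  # these read it
    return [first_result, *others]
```

Firing all requests simultaneously instead means each one misses the cache and pays the full (or write-premium) input price.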
