# LLM Calling Overhead Analysis

**What this measures:** Startup costs, import times, subprocess overhead.
**For API latency (TTFT, tokens/sec):** Run `python -m lib.llm.benchmarks`

Benchmarked 2026-02-05 on haiku model, simple prompt ("Say hi").

## Cold vs Warm Calls

| Call Type | Latency | What's Included |
|-----------|---------|-----------------|
| **Warm (in-process)** | 600-1050ms (avg ~830ms) | Just API call |
| **Hot server** | ~650-900ms | Warm + IPC |
| **Cold (new Python)** | ~1.0s | Python + import + API |
| **Subprocess + SDK** | ~2.1s | Spawn + cold call |
| **Subprocess + litellm** | ~2.5s | Spawn + litellm import + API |
| **claude --print** | ~5.0s | Node.js + full CLI framework |

**Hot server saves ~1.2-1.4s vs subprocess + SDK** (~60% faster).

## Cold Start Breakdown

```
Python interpreter:       ~0.1s
Import anthropic:         ~0.27s
Create client:            ~0.04s
API call (haiku, warm):   ~0.6-0.8s
─────────────────────────────────
Total cold call:          ~1.0s

Additional overheads:
  litellm import:         +0.5s
  subprocess spawn:       +0.3s
  claude CLI:             +4.0s
```
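The breakdown above can be reproduced with a small timing harness. A sketch using only the stdlib: interpreter spawn is timed via `subprocess`, and cold import cost is timed inside a fresh interpreter so it isn't masked by the parent's module cache. `json` is used as a placeholder module here; substitute `anthropic` or `litellm` to reproduce the numbers above (those packages are assumed, not bundled).

```python
import subprocess
import sys
import time


def interpreter_startup() -> float:
    """Time a bare `python -c pass` -- the interpreter-spawn floor."""
    t0 = time.perf_counter()
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - t0


def cold_import(module: str) -> float:
    """Time importing `module` in a fresh interpreter (excludes startup)."""
    code = (
        "import time; t0 = time.perf_counter(); "
        f"import {module}; "
        "print(time.perf_counter() - t0)"
    )
    out = subprocess.run(
        [sys.executable, "-c", code],
        check=True, capture_output=True, text=True,
    )
    return float(out.stdout.strip())


if __name__ == "__main__":
    print(f"interpreter startup: {interpreter_startup():.3f}s")
    # Swap in "anthropic" or "litellm" to measure the imports from the table.
    print(f"import json:         {cold_import('json'):.3f}s")
```

Note that the first spawn on a cold filesystem cache can be slower than steady state; run a few iterations and take the minimum for a stable floor.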

## Recommendations

| Use Case | Recommended | Latency |
|----------|-------------|---------|
| In-process (already warm) | Direct SDK | ~830ms avg |
| High-volume, latency-sensitive | Hot server `/call_llm` | ~850ms avg |
| Background non-blocking | Hot server `/call_llm` | ~850ms avg |
| Feature-rich (aliases, search) | `lib.llm.call_llm()` | ~2.3s |
| Flat-rate subscription | Hot server with `subscription: true` | ~5s (CLI overhead) |
| Interactive CLI | `claude --print` | ~5.0s |
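For the hot-server rows above, the client side is just an HTTP POST. The exact request schema of `/call_llm` isn't documented in this file, so the field names below (`prompt`, `model`, `subscription`, and the `text` response key) are assumptions; only the port (8120) and the `subscription` flag come from the architecture section.

```python
import json
import urllib.request

HOT_SERVER = "http://127.0.0.1:8120"  # port from the architecture diagram


def build_request(prompt: str, model: str = "haiku",
                  subscription: bool = False) -> urllib.request.Request:
    """Assemble a POST /call_llm request; payload field names are assumed."""
    payload = {"prompt": prompt, "model": model, "subscription": subscription}
    return urllib.request.Request(
        f"{HOT_SERVER}/call_llm",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_llm_hot(prompt: str, **kw) -> str:
    """Send the request to a running hot server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, **kw), timeout=30) as resp:
        return json.loads(resp.read())["text"]  # response key is an assumption
```

In practice `lib/llm/hot.py` presumably wraps exactly this; the sketch is only to show that the warm path adds nothing beyond one local HTTP round trip.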

## Architecture

See `lib/llm/ARCHITECTURE.md` for the full picture.

```
                    hot server (port 8120)
                   ┌──────────────────────┐
                   │                      │
  POST /call_llm ──┤  subscription=false  ├──→ litellm (warm connections)
                   │                      │
                   │  subscription=true   ├──→ CLI pool (warm subprocesses)
                   │                      │
                   └──────────────────────┘
```
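The routing in the diagram reduces to a single boolean dispatch on the `subscription` flag. A minimal sketch (the backend functions are illustrative stand-ins, not the server's actual internals):

```python
from typing import Callable, Dict


# Hypothetical stand-ins for the server's two warm pools.
def litellm_backend(prompt: str) -> str:
    return f"[litellm] {prompt}"


def cli_pool_backend(prompt: str) -> str:
    return f"[cli-pool] {prompt}"


BACKENDS: Dict[bool, Callable[[str], str]] = {
    False: litellm_backend,   # subscription=false -> warm API connections
    True: cli_pool_backend,   # subscription=true  -> warm CLI subprocesses
}


def route(prompt: str, subscription: bool = False) -> str:
    """Dispatch a /call_llm request to the pool the diagram shows."""
    return BACKENDS[subscription](prompt)
```

Keeping both pools behind one endpoint means callers pick pay-per-token vs flat-rate with a single flag instead of a different client.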

## Files

| File | Purpose |
|------|---------|
| `lib/llm/hot.py` | Sync client: calls hot server |
| `lib/llm/server.py` | Hot server: manages both API and CLI warm pools |
| `lib/llm/cli.py` | CLI subprocess pool (internal to server) |
| `lib/llm/stream.py` | Core async interface: `call_llm()`, `stream_llm()` |
| `lib/llm/benchmarks/latency_benchmark.py` | API latency (TTFT, tokens/sec) |

## Related

- `python -m lib.llm.benchmarks` — API latency benchmarks
- `python -m lib.llm.benchmarks --models haiku --iterations 5` — specific model
- `lib/llm/benchmarks/TODO.md` — future benchmark features
