LLM Latency Benchmarks

Call-overhead analysis for rivus projects

Latency Comparison

Haiku model, simple prompt ("Say hi"), measured 2026-02-05

Warm (in-process)      830ms avg
Hot worker (future)    ~850ms
Cold (new Python)      ~1000ms
Subprocess + SDK       ~2100ms
Subprocess + litellm   ~2300ms
claude --print         ~5000ms
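
The warm and "Subprocess + SDK" rows can be reproduced with a small timing harness. The sketch below is illustrative only: it assumes the anthropic package is installed, ANTHROPIC_API_KEY is set, and the model alias is a placeholder for whichever Haiku model you use.

# Timing sketch: warm in-process call vs. one-shot subprocess call.
import subprocess
import sys
import time

import anthropic

MODEL = "claude-3-5-haiku-latest"  # assumed alias; pin as needed
ONE_SHOT = (
    "import anthropic; "
    f"anthropic.Anthropic().messages.create(model='{MODEL}', max_tokens=16, "
    "messages=[{'role': 'user', 'content': 'Say hi'}])"
)

# Warm: client created once, so only the API round trip is timed.
client = anthropic.Anthropic()
start = time.perf_counter()
client.messages.create(
    model=MODEL,
    max_tokens=16,
    messages=[{"role": "user", "content": "Say hi"}],
)
print(f"warm in-process:  {(time.perf_counter() - start) * 1000:.0f}ms")

# Subprocess + SDK: also pays interpreter start, import, and client init.
start = time.perf_counter()
subprocess.run([sys.executable, "-c", ONE_SHOT], check=True)
print(f"subprocess + SDK: {(time.perf_counter() - start) * 1000:.0f}ms")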

Cold Start Breakdown

Startup Costs

Python interpreter ~100ms
Import anthropic ~270ms
Create client ~40ms
API call (warm) ~600-1000ms

Additional Overheads

litellm import +500ms
Subprocess spawn +300ms
claude CLI startup +4000ms
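
The import and client-creation costs can be checked with in-process timers; interpreter startup has to be measured externally (e.g. time python -c "pass"), and python -X importtime -c "import anthropic" gives a per-module import breakdown. A minimal sketch, assuming anthropic is installed and ANTHROPIC_API_KEY is set:

# Rough in-process breakdown of cold-start cost (numbers vary by machine).
import time

start = time.perf_counter()
import anthropic  # ~270ms cold in the measurement above
after_import = time.perf_counter()

client = anthropic.Anthropic()  # ~40ms; assumes ANTHROPIC_API_KEY is set
after_client = time.perf_counter()

print(f"import anthropic: {(after_import - start) * 1000:.0f}ms")
print(f"create client:    {(after_client - after_import) * 1000:.0f}ms")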

Warm Call Distribution

5 consecutive calls with a pre-initialized client:

Min 589ms
Average 829ms
Max 1048ms
A hot worker saves ~1.2-1.4s per call versus the subprocess approach (~60% faster).
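
The distribution above can be collected with a short loop over a client created once up front. A minimal sketch, assuming the anthropic package, ANTHROPIC_API_KEY in the environment, and placeholder model alias and max_tokens values:

# Time 5 consecutive warm calls against a pre-initialized client.
import statistics
import time

import anthropic

client = anthropic.Anthropic()
samples_ms = []
for _ in range(5):
    start = time.perf_counter()
    client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias
        max_tokens=16,
        messages=[{"role": "user", "content": "Say hi"}],
    )
    samples_ms.append((time.perf_counter() - start) * 1000)

print(f"min {min(samples_ms):.0f}ms  "
      f"avg {statistics.mean(samples_ms):.0f}ms  "
      f"max {max(samples_ms):.0f}ms")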

Recommendations

Use Case                         Recommended                  Latency
In-process (already warm)        anthropic.Anthropic()        ~830ms
High-volume fast calls           Hot worker (future)          ~850ms
Latency-critical new call        lib.llm.fast.fast_haiku()    ~1.0s
Background non-blocking          Subprocess + fast module     ~2.1s
Feature-rich (aliases, search)   lib.llm.call_llm()           ~2.3s
Interactive CLI                  claude --print               ~5.0s

Hot Worker Design

┌─────────────────────────────────────┐
│  llm-worker (always running)        │
│  - anthropic client pre-initialized │
│  - Listens on unix socket or HTTP   │
│  - ~850ms response time (warm)      │
└─────────────────────────────────────┘
         ▲
         │ POST /complete {"prompt": "..."}
         │
    Any caller (badge_worker, hooks, etc.)
        

Options:
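
One possible shape for the HTTP variant is a small stdlib server that creates the Anthropic client once at startup. The sketch below is illustrative only; the port, endpoint, payload shape, and model alias are assumptions, not the project's implementation.

# Hot-worker sketch: stdlib HTTP server with a pre-initialized Anthropic client.
# Assumes anthropic is installed and ANTHROPIC_API_KEY is set.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import anthropic

client = anthropic.Anthropic()  # created once, so requests pay only API latency

class CompleteHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/complete":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        message = client.messages.create(
            model="claude-3-5-haiku-latest",  # assumed alias
            max_tokens=256,
            messages=[{"role": "user", "content": payload.get("prompt", "")}],
        )
        body = json.dumps({"text": message.content[0].text}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8765), CompleteHandler).serve_forever()

A caller then pays only the local network hop plus the API round trip, e.g. curl -s localhost:8765/complete -d '{"prompt": "Say hi"}'.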

Files

File                    Purpose
lib/llm/fast.py         Direct Anthropic SDK (no litellm)
lib/llm/stream.py       litellm-based streaming
lib/llm/benchmarks/     API latency benchmarks (TTFT, tokens/sec)

Run Benchmarks

# API latency (TTFT, throughput)
python -m lib.llm.benchmarks
python -m lib.llm.benchmarks --models haiku --iterations 5

# Analyze history
python -m lib.llm.benchmarks.analyze_history
        

Generated: 2026-02-05 | lib/llm/benchmarks