LLM Latency Benchmarks

Call-overhead analysis for rivus projects

Latency Comparison

Haiku model, simple prompt ("Say hi"), measured 2026-02-05

Warm (in-process)      830ms avg
Hot worker (future)    ~850ms
Cold (new Python)      ~1000ms
Subprocess + SDK       ~2100ms
Subprocess + litellm   ~2300ms
claude --print         ~5000ms
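
The warm and "Subprocess + SDK" rows can be reproduced with a small timing harness. The sketch below is illustrative only: it assumes the anthropic package is installed, ANTHROPIC_API_KEY is set, and the model alias is a placeholder for whichever Haiku model you use.

# Timing sketch: warm in-process call vs. one-shot subprocess call.
import subprocess
import sys
import time

import anthropic

MODEL = "claude-3-5-haiku-latest"  # assumed alias; pin as needed
ONE_SHOT = (
    "import anthropic; "
    f"anthropic.Anthropic().messages.create(model='{MODEL}', max_tokens=16, "
    "messages=[{'role': 'user', 'content': 'Say hi'}])"
)

# Warm: client created once, so only the API round trip is timed.
client = anthropic.Anthropic()
start = time.perf_counter()
client.messages.create(
    model=MODEL,
    max_tokens=16,
    messages=[{"role": "user", "content": "Say hi"}],
)
print(f"warm in-process:  {(time.perf_counter() - start) * 1000:.0f}ms")

# Subprocess + SDK: also pays interpreter start, import, and client init.
start = time.perf_counter()
subprocess.run([sys.executable, "-c", ONE_SHOT], check=True)
print(f"subprocess + SDK: {(time.perf_counter() - start) * 1000:.0f}ms")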

Cold Start Breakdown

Startup Costs

Python interpreter ~100ms
Import anthropic ~270ms
Create client ~40ms
API call (warm) ~600-1000ms

Additional Overheads

litellm import +500ms
Subprocess spawn +300ms
claude CLI startup +4000ms
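
The import and client-creation costs can be checked with in-process timers; interpreter startup has to be measured externally (e.g. time python -c "pass"), and python -X importtime -c "import anthropic" gives a per-module import breakdown. A minimal sketch, assuming anthropic is installed and ANTHROPIC_API_KEY is set:

# Rough in-process breakdown of cold-start cost (numbers vary by machine).
import time

start = time.perf_counter()
import anthropic  # ~270ms cold in the measurement above
after_import = time.perf_counter()

client = anthropic.Anthropic()  # ~40ms; assumes ANTHROPIC_API_KEY is set
after_client = time.perf_counter()

print(f"import anthropic: {(after_import - start) * 1000:.0f}ms")
print(f"create client:    {(after_client - after_import) * 1000:.0f}ms")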

Warm Call Distribution

5 consecutive calls with a pre-initialized client:

Min 589ms
Average 829ms
Max 1048ms
A hot worker saves ~1.2-1.4s per call versus the subprocess approach (~60% faster).
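
The distribution above can be collected with a short loop over a client created once up front. A minimal sketch, assuming the anthropic package, ANTHROPIC_API_KEY in the environment, and placeholder model alias and max_tokens values:

# Time 5 consecutive warm calls against a pre-initialized client.
import statistics
import time

import anthropic

client = anthropic.Anthropic()
samples_ms = []
for _ in range(5):
    start = time.perf_counter()
    client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias
        max_tokens=16,
        messages=[{"role": "user", "content": "Say hi"}],
    )
    samples_ms.append((time.perf_counter() - start) * 1000)

print(f"min {min(samples_ms):.0f}ms  "
      f"avg {statistics.mean(samples_ms):.0f}ms  "
      f"max {max(samples_ms):.0f}ms")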

Recommendations

Use Case                         Recommended                  Latency
In-process (already warm)        anthropic.Anthropic()        ~830ms
High-volume fast calls           Hot worker (future)          ~850ms
Latency-critical new call        lib.llm.fast.fast_haiku()    ~1.0s
Background non-blocking          Subprocess + fast module     ~2.1s
Feature-rich (aliases, search)   lib.llm.call_llm()           ~2.3s
Interactive CLI                  claude --print               ~5.0s

Hot Worker Design

┌─────────────────────────────────────┐
│  llm-worker (always running)        │
│  - anthropic client pre-initialized │
│  - Listens on unix socket or HTTP   │
│  - ~850ms response time (warm)      │
└─────────────────────────────────────┘
         ▲
         │ POST /complete {"prompt": "..."}
         │
    Any caller (badge_worker, hooks, etc.)
        

Options:
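
One possible shape for the HTTP variant is a small stdlib server that creates the Anthropic client once at startup. The sketch below is illustrative only; the port, endpoint, payload shape, and model alias are assumptions, not the project's implementation.

# Hot-worker sketch: stdlib HTTP server with a pre-initialized Anthropic client.
# Assumes anthropic is installed and ANTHROPIC_API_KEY is set.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import anthropic

client = anthropic.Anthropic()  # created once, so requests pay only API latency

class CompleteHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/complete":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        message = client.messages.create(
            model="claude-3-5-haiku-latest",  # assumed alias
            max_tokens=256,
            messages=[{"role": "user", "content": payload.get("prompt", "")}],
        )
        body = json.dumps({"text": message.content[0].text}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8765), CompleteHandler).serve_forever()

A caller then pays only the local network hop plus the API round trip, e.g. curl -s localhost:8765/complete -d '{"prompt": "Say hi"}'.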

Files

File                    Purpose
lib/llm/fast.py         Direct Anthropic SDK (no litellm)
lib/llm/stream.py       litellm-based streaming
lib/llm/benchmarks/     API latency benchmarks (TTFT, tokens/sec)

Run Benchmarks

# API latency (TTFT, throughput)
python -m lib.llm.benchmarks
python -m lib.llm.benchmarks --models haiku --iterations 5

# Analyze history
python -m lib.llm.benchmarks.analyze_history
        

Generated: 2026-02-05 | lib/llm/benchmarks