Calling overhead analysis for rivus projects
Haiku model, simple prompt ("Say hi"), measured 2026-02-05
5 consecutive calls with a pre-initialized client:
| Use Case | Recommended | Latency |
|---|---|---|
| In-process (already warm) | anthropic.Anthropic() | ~830ms |
| High-volume fast calls | Hot worker (future) | ~850ms |
| Latency-critical new call | lib.llm.fast.fast_haiku() | ~1.0s |
| Background non-blocking | Subprocess + fast module | ~2.1s |
| Feature-rich (aliases, search) | lib.llm.call_llm() | ~2.3s |
| Interactive CLI | claude --print | ~5.0s |
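
The warm in-process figure can be reproduced with a small timing loop along the following lines; the model alias, prompt, and `max_tokens` value are illustrative assumptions, not the measured configuration.

```python
# Minimal sketch of the warm-client measurement: create one anthropic.Anthropic()
# client up front, then time several consecutive Haiku calls against it.
# Model name, prompt, and max_tokens are assumptions for illustration.
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def timed_call(prompt: str) -> float:
    """Issue one Haiku completion and return wall-clock latency in seconds."""
    start = time.perf_counter()
    client.messages.create(
        model="claude-3-haiku-20240307",  # assumed Haiku alias
        max_tokens=16,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start


latencies = [timed_call("Say hi") for _ in range(5)]
print(f"warm client: best of {len(latencies)} = {min(latencies) * 1000:.0f}ms")
```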
```
┌─────────────────────────────────────┐
│ llm-worker (always running)         │
│ - anthropic client pre-initialized  │
│ - Listens on unix socket or HTTP    │
│ - ~850ms response time (warm)       │
└─────────────────────────────────────┘
                  ▲
                  │ POST /complete {"prompt": "..."}
                  │
Any caller (badge_worker, hooks, etc.)
```
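
The hot worker is still a future option; a minimal sketch of the idea, using only the standard library plus the Anthropic SDK, could look like this. The `/complete` route comes from the diagram above; the port, model alias, and response shape are assumptions.

```python
# Sketch of the proposed llm-worker: one long-lived process holds a warm
# anthropic client and answers POST /complete over HTTP. Port, model alias,
# and response shape below are assumptions, not a settled design.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import anthropic

CLIENT = anthropic.Anthropic()  # initialized once, reused for every request


class CompleteHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/complete":
            self.send_error(404)
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        prompt = json.loads(body)["prompt"]
        message = CLIENT.messages.create(
            model="claude-3-haiku-20240307",  # assumed Haiku alias
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}],
        )
        payload = json.dumps({"text": message.content[0].text}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8765), CompleteHandler).serve_forever()
```

A caller would then POST `{"prompt": "..."}` to the worker and pay only the ~850ms warm-call latency instead of interpreter and client startup.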
Options:
- /api/llm endpoint

| File | Purpose |
|---|---|
| lib/llm/fast.py | Direct Anthropic SDK (no litellm) |
| lib/llm/stream.py | litellm-based streaming |
| lib/llm/benchmarks/ | API latency benchmarks (TTFT, tokens/sec) |
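
For context on the stream.py entry, litellm streaming follows the OpenAI-style chunk interface shown below; this is an illustration of the litellm API under an assumed Haiku model alias, not the module's actual code.

```python
# Illustration of litellm-based streaming (the pattern lib/llm/stream.py builds on).
# The model alias is an assumption; litellm accepts provider-prefixed names too.
from litellm import completion

response = completion(
    model="claude-3-haiku-20240307",  # assumed alias understood by litellm
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,
)
for chunk in response:  # OpenAI-style streaming chunks
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```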
```bash
# API latency (TTFT, throughput)
python -m lib.llm.benchmarks
python -m lib.llm.benchmarks --models haiku --iterations 5

# Analyze history
python -m lib.llm.benchmarks.analyze_history
```