# lib/llm Architecture

## Overview

Three core components plus a secondary CLI subprocess pool, each serving a distinct need:

```
┌────────────────────────────────────────────────────────┐
│              hot server (port 8120)                    │
│   ┌──────────────────┐  ┌───────────────────────────┐  │
│   │  litellm pool    │  │  Direct OAuth APIs        │  │
│   │  (API billing)   │  │  (subscription billing)   │  │
│   └────────┬─────────┘  └────────────┬──────────────┘  │
│            │                         │                 │
│   POST /call_llm ────subscription────┤                 │
│   POST /stream_llm   flag picks     │                  │
│   GET  /health        which pool    │                  │
└────────────┬───────────────────────────────────────────┘
             │
    ┌────────┴─────────┐
    │   hot.py         │  Sync callers (hooks, badges)
    │   call_llm_sync()│  HTTP → server → stream.py
    │   (+SDK fallback)│
    └──────────────────┘

    ┌──────────────────┐
    │   stream.py      │  Async callers (brain, vario)
    │   call_llm()     │  Direct in-process, no server
    │   stream_llm()   │  subscription flag → OAuth API
    └──────────────────┘
```

## Components

### 1. `stream.py` — Core Async Interface

The source of truth for all LLM calls. Provides `call_llm()` and `stream_llm()`.

```python
from lib.llm import call_llm, stream_llm

# Per-token API billing (default)
result = await call_llm("haiku", "Hello")

# Flat-rate subscription billing (routes to claude_oauth.py etc.)
result = await call_llm("haiku", "Hello", subscription=True)

# Streaming
async for chunk in stream_llm("opus", "Analyze this", subscription=True):
    print(chunk, end="")
```

**When to use:** Async code in long-running processes (brain, vario, any Gradio app).
These processes already have litellm imported and connections warm.

### 2. `server.py` — Hot Server (port 8120)

FastAPI server that provides a warm HTTP interface to `stream.py`.

The `subscription` field on requests picks the billing route:
- **False (default):** LiteLLM with API keys (per-token billing).
- **True:** Direct API calls with OAuth Bearer tokens (subscription billing).

**When to use:** Any caller that needs an HTTP API, especially sync callers
that can't `await` directly. The server stays warm, eliminating cold start.
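For HTTP clients, a request can be built directly against the `/call_llm` endpoint. A minimal sketch using only the stdlib; the payload field names (`model`, `prompt`, `subscription`) are assumed to mirror `call_llm()`'s arguments, and the `{"result": ...}` response shape is likewise an assumption:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str, subscription: bool = False) -> dict:
    # Field names mirror call_llm()'s arguments; exact names are assumptions.
    return {"model": model, "prompt": prompt, "subscription": subscription}

def call_llm_http(model: str, prompt: str, subscription: bool = False) -> str:
    req = urllib.request.Request(
        "http://localhost:8120/call_llm",
        data=json.dumps(build_payload(model, prompt, subscription)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Response shape ({"result": ...}) is also an assumption.
        return json.loads(resp.read())["result"]
```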

### 3. `hot.py` — Sync Client

Thin sync wrapper that hits the hot server. Falls back to the direct Anthropic SDK if the server is down.

```python
from lib.llm.hot import call_llm_sync

# Per-token billing
result = call_llm_sync("Parse this", model="haiku")

# Subscription-backed
result = call_llm_sync("Parse this", subscription=True)
```

**When to use:** Sync contexts (hooks, badge workers, subprocess scripts)
where you can't `await`. Requires hot server running (`inv llm.server`).
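Before dispatching sync work, a caller can probe the `GET /health` endpoint shown in the diagram. A minimal sketch (the endpoint path comes from the diagram; the 200-means-healthy convention is an assumption):

```python
import urllib.request

def server_up(base: str = "http://localhost:8120") -> bool:
    """Probe the hot server's GET /health endpoint (see diagram above)."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=1) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, etc.
        return False
```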

### 4. `cli.py` — CLI Subprocess Pool (Secondary)

Manages pre-spawned CLI processes (claude, codex, gemini) for subscription billing.
Originally the primary subscription route, now a secondary path used for specific
CLI features or as a fallback if the direct OAuth API is unavailable.
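The pre-spawn idea can be illustrated with a toy pool (a sketch only; `cli.py`'s real interface and the CLI flags it passes to claude/codex/gemini are not shown here):

```python
import subprocess

class CLIPool:
    """Toy pre-spawn pool: keep one process warm, respawn after each use."""

    def __init__(self, cmd: list[str]):
        self.cmd = cmd
        self.proc = self._spawn()  # spawned ahead of time, before any call

    def _spawn(self) -> subprocess.Popen:
        return subprocess.Popen(
            self.cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def call(self, prompt: str) -> str:
        # One-shot exchange: send prompt, read full output, re-warm the pool.
        out, _ = self.proc.communicate(prompt)
        self.proc = self._spawn()
        return out
```

With `CLIPool(["cat"])` the call echoes its input; a real pool would wrap `claude`, `codex`, or `gemini` invocations instead.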

## Billing Routes

| Flag | Route | Billing | Latency |
|------|-------|---------|---------|
| `subscription=False` (default) | LiteLLM API | Per-token | ~0.8s warm |
| `subscription=True` | OAuth Direct API | Flat-rate | ~0.8s warm |
| (via cli.py) | CLI subprocess | Flat-rate | ~5s (CLI overhead) |

## Decision Tree

```
Need to call an LLM?
├── In async code (brain, vario, server)?
│   └── Use call_llm() / stream_llm() directly
│       ├── Want per-token billing? → default
│       └── Want flat-rate? → subscription=True
├── In sync code (hook, subprocess)?
│   └── Use call_llm_sync() from hot.py
│       ├── Hot server running? → HTTP call (~200ms overhead)
│       └── Server down? → Direct Anthropic SDK fallback
└── Building an HTTP client?
    └── POST to localhost:8120/call_llm
```

## Files

| File | Status | Purpose |
|------|--------|---------|
| `stream.py` | **Core** | Async `call_llm()` / `stream_llm()` |
| `server.py` | **Core** | Hot server, manages both warm pools |
| `hot.py` | **Core** | Sync client (calls hot server) |
| `cli.py` | **Internal** | CLI subprocess pool |
| `models.py` | **Core** | Model aliases, resolution |

## Migration from Old API

| Old | New |
|-----|-----|
| `from lib.llm.fast import fast_haiku` | `from lib.llm.hot import call_llm_sync` |
| `POST /fast_call` | `POST /call_llm` |

Per-token API billing remains the default (`subscription=False`); pass `subscription=True` for flat-rate subscription billing. The default fast model is `grok-fast`; routing is automatic.
