# lib/llm - Centralized LLM Utilities

Shared LLM configuration, model definitions, and multi-route billing for rivus projects.

## Overview

`lib/llm` provides a unified interface for calling LLMs across multiple providers (Anthropic, OpenAI, Gemini, xAI) while supporting two distinct billing paths:

1.  **API Billing (default):** Per-token billing via LiteLLM using your own API keys.
2.  **Subscription Billing (`subscription=True`):** Flat-rate billing via OAuth tokens (Claude Max, Google One AI Pro, ChatGPT Plus/Pro).

## Quick Start

```python
from lib.llm import call_llm, stream_llm

# 1. Simple call (API billing)
result = await call_llm("haiku", "Hello!")

# 2. Subscription call (flat-rate)
result = await call_llm("haiku", "Hello!", subscription=True)

# 3. Streaming with web search
async for chunk in stream_llm("gemini", "Latest news?", native_web_search=True):
    print(chunk, end="")
```

---

## 1. Model Resolution & Aliases

Use short, UI-friendly aliases that resolve to full LiteLLM model IDs.

```python
from lib.llm import resolve_model, short_model_id

resolve_model("gpt")     # -> "openai/gpt-5.4"
resolve_model("sonnet")  # -> "anthropic/claude-sonnet-4-6"
resolve_model("gemini")  # -> "gemini/gemini-3.1-pro-preview"

# Get display-friendly ID
short_model_id("anthropic/claude-sonnet-4-5-20250929") # -> "sonnet-4.5"
```

### Model Categories
- `MODERN_MODELS`: Frontier models (GPT-5.4, Sonnet 4.6, Gemini 3.1 Pro, Grok 4.20).
- `FAST_CHEAP`: Lightweight models for quick tasks (Haiku 4.5, GPT-5 mini).
- `CODE_BEST`: Best models for coding (Codex, Grok Code, Opus 4.6).
- `MAXTHINK`: Models with extended reasoning (Opus 4.6 128K output, GPT-5.4-Pro 1M context, Grok 4.20 reasoning).

### Pricing

Pricing is handled automatically by `litellm.model_cost` — no hardcoded tables. To check current pricing:

```python
import litellm
info = litellm.model_cost["xai/grok-4.20-beta-0309-reasoning"]
print(f"${info['input_cost_per_token'] * 1e6:.2f}/M in, ${info['output_cost_per_token'] * 1e6:.2f}/M out")
```

---

## 2. Multi-Route Billing

### API Billing (Per-Token)
Uses `litellm` with `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, etc. Best for production apps and high-concurrency tasks.

### Subscription Billing (Flat-Rate)
Uses OAuth tokens from your local machine to bill against your personal/team subscriptions.
- **Anthropic:** Uses `claude login` credentials (macOS Keychain).
- **OpenAI:** Uses `codex login` credentials (`~/.codex/auth.json`).

```python
# Force subscription billing
await call_llm("opus", "Deep analysis", subscription=True)
```

**Note:** If subscription auth is missing or rate-limited, it automatically falls back to API billing unless `fallback_to_api=False`.

---

## 3. Advanced Features

### Native Web Search
Enable provider-native grounding. The LLM performs searches internally.
- **Supported:** Anthropic, Gemini, OpenAI (GPT-5), xAI (Grok).

```python
await call_llm("sonnet", "What happened today?", native_web_search=True)
```

### Prompt Caching (Anthropic)
Drastically reduce cost and latency for large contexts (documents, codebases).

```python
# Subsequent calls with the same cached_content are ~90% cheaper
await call_llm("opus", "Summarize this", cached_content=huge_document_text)
```

### Extended Reasoning (Thinking)
Control "thinking" effort for reasoning models.

```python
async for chunk in stream_llm("gpt-pro", "Hard math problem", reasoning_effort="high"):
    # Thinking blocks are yielded as <thinking>...</thinking>
    print(chunk, end="")
```

### Disk Caching
Persistent response caching for deterministic calls (`temperature=0`).

```python
# Instant hit on second call
await call_llm("haiku", "Static query", cache=True)
```

---

## 4. Hot Server (Latency Optimization)

For latency-sensitive sync contexts (shell hooks, badges), use the **Hot Server** to avoid Python/LiteLLM startup overhead.

1. **Start Server:** `inv llm.server` (runs on port 8120)
2. **Call Sync:**
```python
from lib.llm.hot import call_llm_sync
result = call_llm_sync("Quick task", model="haiku")
```

---

## 5. Providers Not Yet Integrated

### Cerebras (fast inference)
Extreme speed (~2,000+ tok/s output), decent quality. No API key yet.

| Model             | Cost (in/out per M) | Quality (ELO)           |
|-------------------|---------------------|-------------------------|
| Llama 3.3 70B     | $0.60 / $0.60       | ~1335 (GPT-4o Aug '24)  |
| Qwen 3 235B       | $0.60 / $1.20       | Preview                 |
| Llama 3.1 8B      | $0.10 / $0.10       | Budget                  |

Best pick: 70B — matches old 405B quality at 1/10th cost. 405B was dropped.
Use case: bulk/fast tasks where latency matters more than frontier quality.
Sign up: cerebras.ai

### Image Gen Models to Add
- **`grok-imagine-image`** — $0.02/image (replaces $0.07 grok-2-image as default)
- **`gpt-image-1.5`** — low/high quality tiers ($0.009/$0.133)
- ~~`gpt-image-1-mini`~~ — DONE (2026-02-26). Alias `gpt-image-mini`, $0.011/img med.

---

## 6. Model Registry & Health

### Check Model Availability

Compare our model registry against live provider APIs. Detects new models, stale entries, and outdated aliases.

```bash
python -m lib.llm.check_models              # all providers (OpenAI, Gemini, Anthropic, xAI)
python -m lib.llm.check_models -p openai     # single provider
python -m lib.llm.check_models -v            # verbose — show all models by family
python -m lib.llm.check_models -j            # JSON output
```

### Benchmarks

Track performance and reliability across models and billing routes.

```bash
python -m lib.llm.benchmarks --models haiku sonnet --iterations 5
http GET localhost:8120/health
```

Metrics tracked in `lib/llm/benchmarks/results/`:
- **TTFT** (Time to First Token)
- **TPS** (Tokens per Second)
- **Success Rate**

---

## 7. Gemini Consumer App (App-Only Features)

Some Gemini features are **only available in the consumer web app** and have no official API. For these, use `gemini-webapi` (reverse-engineered web endpoints, cookie auth):

| Feature              | Official API (`lib/llm`)       | Consumer app (`gemini-webapi`)  |
|----------------------|--------------------------------|---------------------------------|
| Text/chat            | `call_llm("gemini", ...)`     | `client.generate_content(...)` |
| Image gen            | `lib/llm/image_gen` (Imagen)  | Via natural language prompt     |
| Music (instrumental) | Lyria RealTime API (`v1alpha`) | N/A                             |
| Music (vocals/lyrics)| **Not available**              | Lyria 3 (when supported)        |
| Gems (custom personas)| **Not available**             | `client.create_gem(...)`        |
| Canvas               | **Not available**              | Not yet reverse-engineered      |

**Use `lib/llm` for everything that has an official API.** Only reach for `gemini-webapi` when you need app-exclusive features. The reverse-engineered endpoints can break without notice.

### Cookie Authentication

`gemini-webapi` authenticates using session cookies from your Chrome browser. `browser-cookie3` reads them from Chrome's sqlite cookie DB on disk.

```python
from gemini_webapi import GeminiClient
import browser_cookie3

# Read cookies from Chrome Profile 5 (Tim — has Gemini Pro)
cj = browser_cookie3.chrome(
    domain_name='.google.com',
    cookie_file="~/Library/Application Support/Google/Chrome/Profile 5/Cookies"
)
psid = next(c.value for c in cj if c.name == '__Secure-1PSID')
psidts = next(c.value for c in cj if c.name == '__Secure-1PSIDTS')

client = GeminiClient(psid, psidts)
await client.init()
response = await client.generate_content("Generate an image of a mountain sunset")
```

### Chrome Profiles

Cookies are per-profile. Find the right one:

```bash
for d in ~/Library/Application\ Support/Google/Chrome/Profile*/; do
    name=$(python -c "import json; print(json.load(open('${d}Preferences')).get('profile',{}).get('name','?'))" 2>/dev/null)
    echo "$(basename "$d"): $name"
done
```

| Profile   | Name      | Cookie DB Path                                                          |
|-----------|-----------|-------------------------------------------------------------------------|
| Default   | (default) | `~/Library/Application Support/Google/Chrome/Default/Cookies`           |
| Profile 5 | Tim      | `~/Library/Application Support/Google/Chrome/Profile 5/Cookies`         |

### Key Google Cookies

| Cookie               | Purpose                          |
|----------------------|----------------------------------|
| `__Secure-1PSID`     | Primary auth session ID          |
| `__Secure-1PSIDTS`   | Session timestamp (refresh)      |
| `__Secure-1PSIDCC`   | Session check cookie             |
| `NID`                | Preferences/settings             |

### Injecting Cookies into Playwright

For browser automation on authenticated Google sites (when direct login is blocked):

```python
import browser_cookie3

cj = browser_cookie3.chrome(
    domain_name='.google.com',
    cookie_file="~/Library/Application Support/Google/Chrome/Profile 5/Cookies"
)
cookies_for_pw = [
    {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
    for c in cj if c.domain and c.value
]
await context.add_cookies(cookies_for_pw)
```

### Gotchas

- **Chrome must be closed** or the cookie DB may be locked
- **Cookies expire** — `__Secure-1PSIDTS` rotates; `gemini-webapi` handles refresh automatically
- **Profile matters** — different profiles have different Google accounts
- **macOS Keychain prompt** — first run may trigger "allow access to Chrome Safe Storage"

### Maintenance

- **`gemini-webapi` breaks when Google changes internal endpoints** — update regularly: `pip install -U gemini-webapi`
- **`browser-cookie3`** — stable, rarely needs updates
- If `gemini-webapi` stops working, check [GitHub issues](https://github.com/HanaokaYuzu/Gemini-API/issues) — usually fixed within days
