# Infrastructure

## Comment Markers

Four conventions for marking important comments:

### `# keep` — Don't delete this comment

Protects comments from AI editors and refactoring tools. "This exists for a reason — leave it."

```python
# keep — explains the retry count, not obvious from code
RETRY_COUNT = 3

BATCH_SIZE = 50  # empirically tuned for rate limit  #keep
```

### `# why` — Chesterton's fence

Explains non-obvious decisions. Prevents "cleanup" refactors that introduce bugs by removing code that looks wrong but is intentional.

```python
# why — ALB idle timeout is 60s, 37s leaves headroom for slow responses
TIMEOUT = 37

# why — append+join is 3x faster than string concat for >100 items
results = []

sleep(0.2)  #why drain pending iTerm2 escape sequences after tmux exit
```

### `# sync` — This list must stay in sync with a source

Marks consumer-side lists that must match a source of truth. Scannable by autodo for drift detection.

```
# sync — <what this lists> syncs with <source of truth>
```

In markdown use HTML comments: `<!-- sync — ... syncs with ... -->`.

### `# ssot` — This is the source of truth

Marks the canonical definition. When you change this, update the listed consumers. Consumers can also be discovered by grepping for `# sync` markers that reference this file.

```python
# ssot — port assignments; consumers: servers.md, CLAUDE.md CLI table, registry.py
```

### Summary

| Tag | Standalone | Inline suffix | Purpose |
|-----|-----------|---------------|---------|
| `# keep — ...` | yes | `#keep` | Don't delete this comment |
| `# why — ...` | yes | `#why short reason` | Chesterton's fence |
| `# sync — X syncs with Y` | yes | no | Consumer → source drift detection |
| `# ssot — X; consumers: A, B` | yes | no | Source of truth marker |

#### Format by language

| Language   | Syntax                                                   |
|------------|----------------------------------------------------------|
| Makefile   | `# sync — Go binaries + scripts syncs with tools/bin/`   |
| Python     | `# sync — handler list syncs with jobs/handlers/*.py`    |
| Markdown   | `<!-- sync — CLI tools table syncs with ~/.local/bin/ -->`|
| YAML       | `# sync — ingress rules sync with Caddyfile`             |

#### Known Drift-Prone Locations

<!-- sync — drift-prone locations syncs with actual codebase audit results -->

| File                               | What it lists               | Syncs with                              |
|------------------------------------|-----------------------------|-----------------------------------------|
| `CLAUDE.md` Structure table        | Top-level directories       | `ls -d */` in repo root                 |
| `CLAUDE.md` lib/ Packages table    | lib subpackages             | `ls lib/*/`                             |
| `CLAUDE.md` CLI Tools table        | CLI tools in ~/.local/bin   | `tools/cli/Makefile` + `tools/bin/`     |
| `tools/cli/Makefile`               | Go binaries + script links  | `tools/cli/cmd/*/` + `tools/bin/`       |
| `~/.myconf/conventions/servers.md` | Server ports + URLs         | `infra/Caddyfile`                       |
| `infra/README.md` launchd table    | launchd services            | `~/Library/LaunchAgents/local.rivus.*`  |
| `infra/README.md` backup table     | Backed-up DB files          | `infra/backup.sh` file list             |
| `~/.config/rivus/project_emojis.json` | Project emoji map        | Actual project directories; consumers: helm, hooks, doctor, chronicle |
| `~/.claude/howto/skill-triggers.md`| Skill trigger list          | `~/.claude/skills/*/skill.md`           |
| Top-level `tasks.py`               | Invoke task namespaces      | `*/tasks.py` files                      |

#### Autodo Integration

An autodo scanner check can grep for `# sync` markers, parse the "syncs with" source, and flag stale lists. Not yet implemented — tracked as a future autodo check.
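
Such a check could start as a grep-and-parse pass. A hypothetical sketch, not an existing autodo API; the regex accepts both the `#` and HTML-comment styles:

```python
import re
from pathlib import Path

# Matches "# sync — X syncs with Y" and "<!-- sync — X syncs with Y -->"
SYNC_RE = re.compile(
    r"(?:#|<!--)\s*sync\s*—\s*(?P<what>.+?)\s+syncs?\s+with\s+(?P<source>.+?)\s*(?:-->|$)"
)

def find_sync_markers(root: Path) -> list[dict]:
    """Collect every sync marker under root with its claimed source of truth."""
    markers = []
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".md", ".yml", ".yaml", ""}:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            m = SYNC_RE.search(line)
            if m:
                markers.append({"file": str(path), "line": lineno,
                                "what": m.group("what"), "source": m.group("source")})
    return markers
```

The second half of the check (resolving each `source` and diffing it against the listed items) is the hard part and stays unimplemented here too.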

## Cloudflare Tunnels

Two tunnels expose `*.jott.ninja` subdomains, one per machine:

| Tunnel | ID | Machine | DNS pattern |
|--------|------|---------|-------------|
| `rivus` | `cfe3904d-...` | Main Mac (laptop) | `X.jott.ninja` |
| `rivus-mini` | `70a478dd-...` | Mac Mini | `mini-X.jott.ninja` + `X.mini.jott.ninja` |

Both tunnels serve the same services on the same ports — the mini is a mirror of the main machine. DNS records for the mini use **two naming conventions** (both work, both maintained):
- `mini-X.jott.ninja` — points to mini tunnel (`70a478dd`)
- `X.mini.jott.ninja` — points to rivus tunnel (`cfe3904d`), routed via Access policy

**Config**: `infra/cloudflared.yml` (git-tracked, canonical for the main tunnel)

### Management (`ops cloudflared`)

```bash
ops cloudflared status                # tunnel status + connections
ops cloudflared start                 # start (launchd)
ops cloudflared stop                  # stop
ops cloudflared restart               # restart
ops cloudflared logs [-n 100]         # tail error logs
ops cloudflared sync [--fix]          # check Caddy ↔ tunnel config drift
ops cloudflared dns                   # check all DNS records resolve
ops cloudflared dns --add-missing     # create missing DNS records
```

**Plist**: `~/Library/LaunchAgents/com.cloudflare.cloudflared.plist`
**Logs**: `~/Library/Logs/com.cloudflare.cloudflared.{out,err}.log`

### Ingress

Each `*.jott.ninja` subdomain maps to a local port. See `cloudflared.yml` for full list.

### SSL Constraint: Single-Level Subdomains Only

Cloudflare's universal SSL certificate covers `*.jott.ninja` but **not** `*.*.jott.ninja`. This means:
- `vario-api.jott.ninja` — works (single-level wildcard match)
- `vario.api.jott.ninja` — **does NOT work** (two-level subdomain, no SSL coverage)

When exposing an internal `foo.api.localhost` service externally, use `foo-api.jott.ninja` (hyphenated), not `foo.api.jott.ninja`.
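
The mapping is mechanical enough to pin down in a hypothetical helper:

```python
def external_host(internal_host: str, zone: str = "jott.ninja") -> str:
    """Map a multi-level *.localhost name to a single-level *.jott.ninja name.

    Cloudflare's universal cert only covers one wildcard level, so dots in
    the service part become hyphens: foo.api.localhost -> foo-api.jott.ninja.
    """
    service = internal_host.removesuffix(".localhost")
    return service.replace(".", "-") + "." + zone
```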

### Adding a New Subdomain

1. Add ingress rule to `infra/cloudflared.yml`
2. `ops cloudflared restart` to pick up config
3. `ops cloudflared dns --add-missing` to create DNS CNAME
4. For mini mirror: also add `mini-X` and `X.mini` DNS records
5. `ops cloudflared sync` to verify everything matches Caddy

## Cloudflare Access (Zero Trust)

All `*.jott.ninja` routes are protected by Cloudflare Access with Google as identity provider.

**Team name**: `tchklovski`
**Team domain**: `tchklovski.cloudflareaccess.com`
**Dashboard**: https://one.dash.cloudflare.com/

### Identity Provider

- **Google OAuth** (Web application)
- **GCP project**: `quantjoy`
- **OAuth redirect URI**: `https://tchklovski.cloudflareaccess.com/cdn-cgi/access/callback`
- **Credentials page**: https://console.cloud.google.com/apis/credentials?project=quantjoy

### Access Application

- **Name**: `rivus`
- **Domain**: `*.jott.ninja`
- **Type**: Self-hosted
- **Login method**: Google

### Logout

```
https://tchklovski.cloudflareaccess.com/cdn-cgi/access/logout
```

Also available in the hub dashboard (`hub.jott.ninja` or `hub.localhost`) under the "Cloudflare Access" accordion.

### Adding Users

To allow friends access: **Access Controls → Applications → rivus → Policies** → add their email.

The Google OAuth consent screen is unverified (100 unique user limit) — fine for personal use.

### Bypass Policies (API Endpoints)

Some subdomains serve APIs that authenticate via their own Bearer tokens, not CF Access. For these, a **CF Access bypass policy** skips the OAuth login page so API clients can reach the backend directly.

| Subdomain | App ID | Auth | Bypass Reason |
|-----------|--------|------|---------------|
| `vario-api.jott.ninja` | `17cff93b-451a-40c8-bda5-655517a1857c` | Bearer token (`sk-vario-*`) | OpenAI-compatible API — clients send `Authorization: Bearer sk-vario-...` |

**Pattern**: UI endpoints use CF Access (Google OAuth). API endpoints use CF Access bypass + application-level auth (Bearer tokens, API keys). This avoids requiring OAuth for programmatic clients while keeping UIs behind login.

### WAF Skip Rules (API Subdomains)

Cloudflare's WAF bot/AI protection can block legitimate API traffic (curl, SDK clients, automated scripts). For API subdomains, a custom WAF firewall rule skips bot protection.

**Rule**: Skip Bot Fight Mode + AI Scrapers for `vario-api.jott.ninja`
**Applies to**: Hostname equals `vario-api.jott.ninja`
**Action**: Skip — Bot Fight Mode, AI Scrapers and Crawlers

**Security note**: With WAF skip, bot protection is disabled for this subdomain. The application's own auth (Bearer tokens with budget tracking) is the primary defense. See `SECURITY_AUDIT.md` for risk analysis.

## Google Service Account (Sheets / Drive)

Service account for programmatic Google Sheets access (read, write, create in Shared Drive).

**Service account**: `rivus-assistant@quantjoy.iam.gserviceaccount.com`
**GCP project**: `quantjoy`
**Key file**: `~/.config/rivus/google-service-account.json`
**APIs enabled**: Google Sheets API, Google Drive API
**Credentials page**: https://console.cloud.google.com/iam-admin/serviceaccounts?project=quantjoy

### Shared Drive

Since April 2025, Google service accounts get **0 bytes** of personal Drive storage, so all file creation goes to a Shared Drive.

- **Shared Drive**: `Rivus` on `predmachine.com` Google Workspace
- **Drive ID**: `0ABTdY5WPP5u2Uk9PVA`
- **Service account role**: Manager (required for create + delete)
- **URL**: https://drive.google.com/drive/u/1/folders/0ABTdY5WPP5u2Uk9PVA

### Permissions

| Action         | Requires            |
|----------------|---------------------|
| Read sheets    | Sheet shared with service account email as Viewer+ |
| Write sheets   | Sheet shared as Editor                              |
| Create sheets  | Manager on Shared Drive (creates in Shared Drive)   |
| Delete sheets  | Manager on Shared Drive + Drive API scope           |

### Usage

```python
from lib.gsheets import get_client, create_sheet, open_sheet, write_rows, read_rows, delete_sheet

# Create in Shared Drive
sh = create_sheet(title="My Sheet")

# Open existing
ws = open_sheet("SHEET_ID", "Tab Name")

# Write with dedup
write_rows(ws, columns=["name", "score"], rows=[...], dedup_column="name")

# Read back
data = read_rows(ws)
```

### Security Notes

- Service account has **no IAM roles** — only Sheets/Drive API access
- Cannot access any Google Drive files unless explicitly shared with its email
- Key file is gitignored (`~/.config/rivus/` is not in repo)
- Shared Drive membership is the only access vector — revoke by removing from Drive

## Cloudflare Pages (Always-On Static Content)

Publishes `.share`-marked directories to Cloudflare Pages so reports/HTML are available even when the laptop is offline.

**Project**: `rivus-static`
**URL**: `https://static.pages.jott.ninja/` (custom domain) or `https://rivus-static.pages.dev/` (fallback)
**Auth**: Protected by `*.jott.ninja` Cloudflare Access policy (same Google OAuth as tunnel)
**Account ID**: `73f2470901d87739a6288ae5c3b527fb`

### Publish

```bash
inv static.publish                     # deploy to Cloudflare Pages
inv static.publish --dry-run           # stage to /tmp/rivus-pages/ only
inv static.publish --include-large     # include dirs over 50MB (e.g., jobs/data/companies)
inv static.publish --exclude=PATH      # skip specific relative paths
```

### How It Works

1. `static/share.py` — shared `.share` scanning logic (used by both server and publisher)
2. `static/publish.py` — stages allowed files to `/tmp/rivus-pages/`, generates index pages, deploys via `npx wrangler pages deploy`
3. Skips database-backed `.share` dirs (can't serve statically)
4. Skips dirs over 50MB by default (use `--include-large` to override)
5. Generates `index.html` for root and every directory (CF Pages has no auto-listing)
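
Step 5 could look roughly like this. The function name and HTML shape are illustrative, not the actual `static/publish.py` API:

```python
from pathlib import Path

def write_indexes(staging_root: Path) -> int:
    """Write a plain index.html into every directory (CF Pages has no auto-listing)."""
    count = 0
    dirs = [staging_root, *[p for p in staging_root.rglob("*") if p.is_dir()]]
    for directory in dirs:
        entries = sorted(p.name + ("/" if p.is_dir() else "")
                         for p in directory.iterdir() if p.name != "index.html")
        links = "\n".join(f'<li><a href="{name}">{name}</a></li>' for name in entries)
        rel = directory.relative_to(staging_root).as_posix()
        title = "/" if rel == "." else rel
        (directory / "index.html").write_text(f"<h1>{title}</h1>\n<ul>\n{links}\n</ul>\n")
        count += 1
    return count
```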

### DNS

CNAME record in `jott.ninja` zone:

| Name           | Type  | Target                   | Proxy   |
|----------------|-------|--------------------------|---------|
| `static.pages` | CNAME | `rivus-static.pages.dev` | Proxied |

### API Token

Uses `CLOUDFLARE_API_TOKEN` from `~/.cloudflare.key` (scoped token: `billing_and_pol_edit`). Required permissions:

- Cloudflare Pages: Edit
- Account Settings: Read
- User: Memberships: Read
- Zone WAF: Write
- Firewall Services: Write

A **global API key** is stored at `~/.cloudflare-global.key` for emergency use (full account access — do not use for automation).

### One-Time Setup (already done)

1. `npx wrangler login` or set `CLOUDFLARE_API_TOKEN`
2. Created Pages project: `rivus-static` (production branch: `main`)
3. Added custom domain `static.pages.jott.ninja` via Pages API
4. Added CNAME DNS record `static.pages` → `rivus-static.pages.dev`

## Data Backup

Large data files (DBs, parquet, archives) are **not in git** — they're gitignored and backed up in parallel to three targets: SanDisk SSD, Cloudflare R2, and NFS.

**Script**: `infra/backup.sh`
**Launchd**: `local.rivus.backup` (hourly, every 300s)
**Log**: `~/.local/log/rivus-backup.log`
**Manual**: `ops backup` or `bash infra/backup.sh`

### Backup Targets

| Target              | Destination                                      | Notes                                        |
|---------------------|--------------------------------------------------|----------------------------------------------|
| **SanDisk 4TB**     | `/Volumes/Sandisk-4TB/offload/rivus-data/`       | Local SSD. Skipped silently if detached      |
| **SanDisk sessions**| `/Volumes/Sandisk-4TB/offload/rivus-sessions/`   | rsync of `~/.claude/projects/` (no --delete) |
| **Cloudflare R2**   | `rivus-backup` bucket                            | Cloud. Files >500MB skipped. Change-detected |
| **NFS (trader)**    | `trader:/nfs/rivus-backup/`                      | rsync -az. Skipped if trader unreachable     |

All three run concurrently via background subshells.

### What's Backed Up

Sorted smallest → largest. The script auto-discovers untracked `.db` files >1MB and warns.

| Source                                           | Size  | Description                                |
|--------------------------------------------------|-------|--------------------------------------------|
| `tools/semisupply/data/semisupply.db`            | ~180K | Semiconductor supply chain data            |
| `finance/lib/data/returns_cache.db`              | ~1M   | Cached financial returns                   |
| `intel/people/data/face_cache/cache.db`          | ~2M   | Face recognition cache (actively updated)  |
| `vario/strategies/data/experiments.db`           | ~3M   | Vario strategy experiments                 |
| `projects/healthygamer/data/healthygamer.db`     | ~5M   | HealthyGamer project data                  |
| `doctor/data/doctor.db`                          | ~8M   | Doctor issues, triage, scan history        |
| `learning/session_review/data/failures.db`       | ~10M  | Session review failure analysis            |
| `helm/data/hub.db`                               | ~10M  | Helm session hub                           |
| `helm/data/watch.db`                             | ~13M  | Session watch/dashboard data               |
| `finance/vic_analysis/data/embeddings.db`        | ~13M  | VIC analysis embeddings                    |
| `projects/supplychain/data/supplychain.db`       | ~13M  | Supply chain entity graph                  |
| `intel/companies/data/companies.db`              | ~21M  | Company intelligence profiles              |
| `intel/people/data/people.db`                    | ~23M  | People dossiers and intel                  |
| `finance/lib/prices/data/crosslist.db`           | ~45M  | Price service cross-listing                |
| `lib/ingest/data/content.db`                     | ~46M  | Content acquisition cache                  |
| `learning/session_review/data/sandbox_results.db`| ~52M  | Sandbox evaluation results                 |
| `learning/data/learning.db`                      | ~63M  | Knowledge DB (learnings + principles)      |
| `helm/data/corpus.db`                            | ~124M | Session corpus (chat history, searchable)  |
| `jobs/data/jobs.db`                              | ~243M | Job queue, results, diagnostics            |
| `jobs/data/vic_ideas/vic_ideas.db`               | ~470M | VIC idea analysis (R2-skipped >500M)       |
| `intel/people/data/raw/crunchbase_2024.parquet`  | ~640M | Crunchbase raw data (R2-skipped)           |
| `finance/lib/prices/data/daily_prices.db`        | ~1.0G | Historical daily prices (R2-skipped)       |
| `jobs/data/vic_ideas/vic_pages.db`               | ~1.7G | VIC page snapshots (R2-skipped)            |
| `learning/data/archives/*.xz`                    | varies| Compressed learning archives               |
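
The `.db` auto-discovery can be sketched as a size-filtered walk minus the known list. Illustrative only, not `backup.sh` itself:

```python
from pathlib import Path

MIN_SIZE = 1 * 1024 * 1024  # warn on .db files over 1 MB missing from the backup list

def undiscovered_dbs(repo_root: Path, known: set[str]) -> list[str]:
    """Find .db files above the size threshold that aren't in the backup list."""
    found = []
    for path in repo_root.rglob("*.db"):
        rel = path.relative_to(repo_root).as_posix()
        if path.stat().st_size > MIN_SIZE and rel not in known:
            found.append(rel)
    return sorted(found)
```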

### Behavior

- **Hourly** via launchd (300s interval)
- **SanDisk**: Incremental copy (newer-than check). Warns if no backup in 24h
- **SanDisk sessions**: rsync of `~/.claude/projects/` — incremental, no `--delete` (preserves removed sessions)
- **R2**: Change-detected — md5 for files <50MB, size for larger. Daily VACUUM before upload
- **NFS**: rsync with compression (`-az`). Progress shown per file `[N/24]`
- **VACUUM**: Daily (not hourly) — prevents unnecessary R2 re-uploads. Files >500MB skipped
- **Error notifications**: osascript macOS notification on any backup failure
- **Health check**: "All backup targets healthy" notification every 3 days on success
- **DB discovery**: osascript notification when new untracked `.db` files >1MB are found
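
The R2 change-detection rule (md5 under 50 MB, size alone above) in sketch form; the remote-fingerprint lookup is a stand-in for whatever state the real script keeps:

```python
import hashlib
from pathlib import Path

MD5_LIMIT = 50 * 1024 * 1024  # below this, compare by md5; at or above, by size

def local_fingerprint(path: Path) -> str:
    """Cheap change marker: md5 for small files, byte count for large ones."""
    size = path.stat().st_size
    if size >= MD5_LIMIT:
        return f"size:{size}"
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return f"md5:{h.hexdigest()}"

def needs_upload(path: Path, remote_fingerprint) -> bool:
    """Upload when the file is new or its fingerprint changed since the last run."""
    return remote_fingerprint != local_fingerprint(path)
```

Size-only comparison on big files trades accuracy for speed: a same-size edit to a >50 MB DB would be missed until its size changes.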

### Git LFS

**LFS is not used.** All large file types are gitignored:

```
*.db *.parquet *.xz *.mp3 *.mp4 *.m4a *.wav
```

`.gitattributes` is empty. If you need to track a new large file type, add it to `.gitignore` and to `infra/backup.sh`.

### Manual Operations

```bash
ops backup                   # Run backup now (all 3 targets)
bash infra/backup.sh         # Same, directly

cat ~/.local/log/rivus-backup.log     # Check log
launchctl print gui/501/local.rivus.backup  # Check launchd status
```

## sqlite-vec (Local Vector Search)

Persistent vector storage for semantic search, using `sqlite-vec` extension.
No server process, no exclusive file locks — SQLite WAL mode allows concurrent readers.

**Library**: `lib/vectors/` — thin wrapper around sqlite-vec `vec0` virtual tables
**Package**: `sqlite-vec` (v0.1.6+), installed via pip
**Storage**: SQLite WAL-mode database, one DB file per consumer

### Vector Databases

| Path | Consumer | Collection | Content |
|------|----------|------------|---------|
| `~/.coord/watch_vectors/` | `watch/` | `sessions` | Session topic trees + badges (for `/jump`) |
| `~/.coord/learning_vectors/` | `learning/` | `learnings`, `principles` | Knowledge DB items (for `learn find -s`) |

### Usage

```python
from lib.vectors import VectorStore

with VectorStore("~/.coord/watch_vectors") as vs:
    vs.ensure_collection("sessions", dim=1536)
    vs.upsert("sessions", id="abc123", vector=[...], payload={"project": "rivus"})
    results = vs.search("sessions", query_vector=[...], limit=5)
    # [{id: "abc123", score: 0.87, payload: {...}}, ...]
```

### Embedding Model

**OpenAI `text-embedding-3-small`** — 1536 dims, $0.02/M tokens
- Configured in `lib/llm/embed.py` as `DEFAULT_EMBED_MODEL`
- API key: `OPENAI_API_KEY` in `~/.config/rivus/env`
- Cost at our scale (~1000 docs): effectively $0
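
Sanity-checking that cost claim, with a hypothetical ~500 tokens per doc:

```python
PRICE_PER_M_TOKENS = 0.02  # text-embedding-3-small, USD per million tokens

def embed_cost(n_docs: int, avg_tokens_per_doc: int) -> float:
    """Total embedding cost in USD."""
    return n_docs * avg_tokens_per_doc / 1_000_000 * PRICE_PER_M_TOKENS

# ~1000 docs at ~500 tokens each: 500K tokens, about a cent
print(round(embed_cost(1000, 500), 4))
```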

### Backup

Vector DBs are **not** in the SanDisk backup (they're reconstructable from source data). If lost, re-embed:
- Sessions: watch API re-embeds on next prompt to each session
- Learnings: `learn embed` re-embeds all items

### Tests

```bash
pytest lib/vectors/tests/ -v    # 12 tests, ~0.7s
```

## Browser Cookie Extraction

For reading Chrome cookies programmatically (e.g., authenticated access to Google services, Gemini consumer app), see `lib/llm/README.md` § "Gemini Consumer App" — covers `browser-cookie3`, Chrome profiles, cookie injection into Playwright, and `gemini-webapi`.

## Trading Data (NVMe / Moneygun)

Live signal detection and trade execution data from the moneygun trading system. Stored on external NVMe drive.

**Location**: `/Users/tchklovski/all-code/nvme/paper/`

### Directory Structure

| Directory | Contents | Format |
|-----------|----------|--------|
| `o/` | **Observations** — raw event stream from signal sources | JSONL |
| `tlogs/` | **Trade logs** — execution records with timestamps | JSONL |

### Event Schema (`o/twitter.jsonl`)

```json
{
  "unified_id": "TW-1504886502840745984",
  "symbols": [],
  "datetime_utc": "2022-03-18T18:24:36+00:00",
  "channel": "twitter",
  "recommend": "dont_know",
  "author": "IBDinvestors",
  "url": "https://twitter.com/...",
  "text": "...",
  "other": {"date_eastern": "Fri 2022-03-18 14:24:36 EDT"}
}
```

### Trade Log Schema (`tlogs/recent-trades.jsonl`)

```json
{
  "trader": "PAPER",
  "eid": "TW-...",
  "sid": "TW-...-BABA",
  "dt_utc": "2022-03-18T18:56:45+00:00",
  "sym": "BABA",
  "usd": 200,
  "action": "BUY",
  "av_pr": 109.55,
  "filled": 1,
  "exit_utc": "2022-03-18T15:45:00-04:00",
  "otype": "MIDPRICE",
  "stage": ">>"
}
```

Key fields: `eid` links back to event in `o/`, `dt_utc` is when the trading decision was executed (discovery timestamp), `exit_utc` is planned exit time.
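
Joining trades back to their source events is a one-pass dict lookup on `eid`/`unified_id`. A sketch using only the schema fields shown above; `signal_to_fill_s` is an invented name:

```python
import json
from datetime import datetime

def join_trades_to_events(event_lines, trade_lines):
    """Attach the originating event to each trade and compute signal-to-fill latency."""
    events = {}
    for line in event_lines:
        ev = json.loads(line)
        events[ev["unified_id"]] = ev
    joined = []
    for line in trade_lines:
        tr = json.loads(line)
        ev = events.get(tr["eid"])
        if ev is None:
            continue  # trade without a recorded source event
        lag_s = (datetime.fromisoformat(tr["dt_utc"])
                 - datetime.fromisoformat(ev["datetime_utc"])).total_seconds()
        joined.append({**tr, "event": ev, "signal_to_fill_s": lag_s})
    return joined
```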

### Current Data

| File | Records | Date Range | Channels |
|------|---------|------------|----------|
| `o/twitter.jsonl` | 17,967 | 2022-03-18 to 2022-03-22 | twitter |
| `tlogs/recent-trades.jsonl` | 1,826 | same period | — |

### Related Repos on NVMe

| Path | Description |
|------|-------------|
| `nvme/moneygun/` | Moneygun codebase (symlink to `~/all-code/moneygun/`) |
| `nvme/cachedir/`, `nvme/bigcachedir/` | Finnhub diskcache (daily candles, profiles) |
| `nvme/seekingalpha/` | SeekingAlpha scraped data |
| `nvme/sharadar/` | Sharadar fundamental data |
| `nvme/twitter/`, `nvme/twit/` | Raw Twitter data archives |

### Relevance to VIC Analysis

The `o/` observation schema is the canonical format for live event detection. VIC idea discovery should produce events in this format, with:
- `channel: "vic"` (not twitter)
- `datetime_utc`: when we first discovered/scraped the idea (not VIC's `posted_at`)
- Discovery lag = `datetime_utc` - `posted_at` — the time between publication and our detection

Currently VIC analysis uses `posted_at` (publication date) as entry price date. In a live system, entry would be at discovery time, after discovery lag. See `finance/vic_analysis/returns.py` for how entry dates drive alpha calculation.
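
A sketch of that lag calculation; the `posted_at` field name is assumed from the VIC scrape:

```python
from datetime import datetime, timezone

def discovery_lag_days(posted_at: str, datetime_utc: str) -> float:
    """Days between VIC publication and our detection of the idea."""
    posted = datetime.fromisoformat(posted_at)
    detected = datetime.fromisoformat(datetime_utc)
    if posted.tzinfo is None:  # treat naive publication stamps as UTC
        posted = posted.replace(tzinfo=timezone.utc)
    return (detected - posted).total_seconds() / 86400
```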

## Taming macOS Background Daemons

macOS runs CPU-hungry background daemons that can interfere with development work. Common offenders:

| Daemon | What it does | Typical CPU |
|--------|-------------|-------------|
| `mediaanalysisd` | Photos ML analysis (faces, objects, scenes) | 80-100% |
| `photolibraryd` | Photos library indexing | 30-60% |
| `mds_stores` | Spotlight indexing | 20-50% |
| `bird` | iCloud sync | 10-40% |

### What Works

```bash
# Check what's hogging CPU (sort on the %CPU column; macOS ps lacks GNU --sort)
ps aux | sort -nrk 3 | head -20

# BEST: Freeze the process entirely (0% CPU, stays in memory)
sudo kill -STOP $(pgrep mediaanalysisd)

# Unfreeze when you're done working
sudo kill -CONT $(pgrep mediaanalysisd)

# Alternative: Kill it (will respawn eventually, buys ~minutes)
sudo kill $(pgrep mediaanalysisd)
```

### What Doesn't Work

| Approach | Why it fails |
|----------|-------------|
| `renice 20` | Only helps under CPU contention — if cores are idle, daemon still takes 100% |
| `launchctl bootout` | Blocked by SIP for Apple daemons |
| `launchctl disable` | Same — SIP protects Apple launch agents |
| `cputhrottle` | No longer in Homebrew |

### Notes

- **`kill -STOP`** is the most effective option — completely freezes the process at 0% CPU while SIP blocks all launchctl approaches.
- **Spotlight** can be excluded per-directory: System Settings → Siri & Spotlight → Spotlight Privacy → add directories.
- These daemons fire up after photo imports, OS updates, or long idle periods. They'll finish eventually, but "eventually" can be hours.

## Chrome Tab Cleanup

Chrome accumulates duplicate tabs (watch, jobs, learning, vario) and YouTube tabs that eat RAM. With 100+ tabs, Chrome can reach 10+ GB.

**Script**: `infra/chrome-cleanup.sh`

```bash
bash infra/chrome-cleanup.sh             # close duplicates + YouTube, show before/after
bash infra/chrome-cleanup.sh --dry-run   # just show counts, don't close anything
```

Keeps one tab per group (watch/jobs/learning/vario), closes all YouTube tabs. Typical savings: 2-4 GB.
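
The keep/close policy itself is simple set logic over tab URLs. A pure-Python sketch of the policy only; the real script also has to drive Chrome:

```python
GROUPS = ("watch", "jobs", "learning", "vario")

def tabs_to_close(urls: list[str]) -> list[str]:
    """Return URLs to close: all YouTube tabs, plus group duplicates past the first."""
    seen_groups = set()
    close = []
    for url in urls:
        if "youtube.com" in url:
            close.append(url)
            continue
        group = next((g for g in GROUPS if g in url), None)
        if group is not None:
            if group in seen_groups:
                close.append(url)  # duplicate; the first tab in this group survives
            else:
                seen_groups.add(group)
    return close
```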

## SSL Certificates (Conda Env)

The conda `rivus` env bundles its own CA cert store (`ssl/cacert.pem`), which can go stale and break HTTPS for httpx, aiohttp, etc. Fix: point Python at the macOS system cert store, which Apple keeps current.

**Setting**: `SSL_CERT_FILE=/etc/ssl/cert.pem`

Set in two places so all paths are covered:

| Where                                                        | Covers                          |
|--------------------------------------------------------------|---------------------------------|
| `~/.config/rivus/env`                                        | launchd services (via wrapper)  |
| `envs/rivus/etc/conda/activate.d/ssl_fix.sh`                | interactive shells (`conda activate`) |

**Symptom**: `[SSL: CERTIFICATE_VERIFY_FAILED] unable to get local issuer certificate` on any HTTPS request.

**Why not fix the bundle?** Replacing `cacert.pem` with system certs works temporarily, but conda can overwrite it on `mamba update`. The env var takes precedence and uses macOS's auto-updating cert store — no maintenance needed.
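
To confirm which CA bundle a given Python will actually use, the `ssl` module reports both the env-var name OpenSSL honors and the effective file:

```python
import ssl

paths = ssl.get_default_verify_paths()
# cafile already reflects the SSL_CERT_FILE override when that file exists
print("env var name:", paths.openssl_cafile_env)   # typically 'SSL_CERT_FILE'
print("effective CA file:", paths.cafile or paths.openssl_cafile)
```

Run this inside the conda env before and after the fix to verify the override took.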

## Caddy (Local Reverse Proxy)

Routes `*.localhost` → local ports. Canonical file: `infra/Caddyfile`, symlinked to `/opt/homebrew/etc/Caddyfile`.

```bash
brew services restart caddy    # reload after Caddyfile changes
```

## launchd Services

Always-on rivus services via macOS launchd. Auto-restart on crash, survive reboots.

**Plists**: `~/Library/LaunchAgents/local.rivus.*.plist`
**Logs**: `~/Library/Logs/rivus/<name>.{stdout,stderr}.log`
**Manage**: `inv svc.status`, `inv svc.start`, `inv svc.stop`, `inv svc.restart`

### Current Services

| Label | What | Port |
|-------|------|------|
| `local.rivus.vario-ui` | Vario app (Extract + Studio + Strategies) | 7960 |
| `local.rivus.llm-server` | Hot LLM server (lib/llm wrapper) | 8120 |
| `local.rivus.jobs-runner` | Jobs pipeline runner | — |
| `local.rivus.jobs-dashboard` | Jobs NiceGUI dashboard | 7890 |
| `local.rivus.moltbook` | Moltbook agent | — |
| `local.rivus.helm-server` | Session intelligence API (hooks, hist, badge) | 8130 |
| `local.rivus.backup-sandisk` | Hourly data backup to SanDisk 4TB | — |

### Python Environment (`~/.rivus-env/`)

Services use a **dedicated virtualenv** at `~/.rivus-env/`, not the conda `rivus` env. This is the canonical Python for all launchd services and manual service restarts:

```bash
~/.rivus-env/bin/python static/server.py   # correct
conda activate rivus && python ...          # wrong for services
```

The wrapper script adds `~/.rivus-env/bin` to PATH automatically.

### Wrapper

All plists use `bin/launchd-wrapper.sh` as the program. It:
1. Sources `~/.config/rivus/env` (API keys launchd doesn't have)
2. Sets PATH to `~/.rivus-env/bin` + homebrew
3. `cd`s to rivus root
4. `exec "$@"` — runs the actual command
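
Step 1's env-file sourcing, sketched in Python (assuming plain `KEY=value` lines, optionally `export`-prefixed, with no shell expansions):

```python
import os
from pathlib import Path

def load_env_file(path: Path, environ=os.environ):
    """Source a KEY=value env file: skip comments/blanks, strip optional `export`."""
    for raw in path.read_text().splitlines():
        line = raw.strip().removeprefix("export ")
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        environ[key.strip()] = value.strip().strip('"').strip("'")
    return environ
```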

### Environment File (`~/.config/rivus/env`)

API keys and credentials for launchd services. **Not git-tracked** (contains secrets).

| Key | Used by |
|-----|---------|
| `ANTHROPIC_API_KEY` | lib/llm (Anthropic models) |
| `OPENAI_API_KEY` | lib/llm (OpenAI models) |
| `GEMINI_API_KEY` | lib/llm (Gemini models) |
| `GOOGLE_API_KEY` | lib/llm (same as GEMINI for some SDKs) |
| `GROQ_API_KEY` | lib/llm (Groq), transcription |
| `XAI_API_KEY` | lib/llm (xAI/Grok models) |
| `SERPER_API_KEY` | Web search (brain, rivu) |
| `FINNHUB_API_KEY` | Earnings data (jobs) |
| `BRIGHTDATA_*` | Proxy/scraping (browser, jobs) |

### Vario API Keys (`~/.config/rivus/vario-api-keys.txt`)

Per-user API keys for vario's external API (`vario-api.jott.ninja`). **Not git-tracked** (outside repo). Keys are stored in vario's SQLite DB (`vario/api/vario_api.db`) — the text file is a readable reference copy.

Auth: `x-api-key: sk-vario-...` header or `Authorization: Bearer sk-vario-...`. Both work on both `/v1/chat/completions` (OpenAI) and `/v1/messages` (Anthropic) endpoints.

Manage keys via CLI: `vario api keys-create --user NAME --budget 50`, `vario api keys-list`.

**Stale keys** are the #1 cause of launchd auth failures: keys get updated in your shell profile but not in `~/.config/rivus/env`. Quick check:

```bash
# Compare shell vs env file
diff <(printenv | grep -E '^(ANTHROPIC|OPENAI|GEMINI|GROQ|XAI)_API_KEY' | sort) \
     <(grep -E '^(ANTHROPIC|OPENAI|GEMINI|GROQ|XAI)_API_KEY' ~/.config/rivus/env | sort)
```

### Steal/Restore Pattern

For services that run via launchd but you sometimes want to run interactively (dev, debugging):

1. `inv <project>.server` checks if launchd has the service loaded
2. If yes: stops launchd (`bootout`), waits for port release
3. Runs uvicorn/nicegui interactively (you see logs, Ctrl-C to stop)
4. On exit (`finally`): re-bootstraps the launchd service

See `tasks.py` `llm_server()` for the reference implementation.
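
In outline (the `launchctl` subcommands match the steps above; the function shape and injected runner are illustrative, not `tasks.py`'s actual code):

```python
import subprocess

def run_interactively(label: str, plist: str, serve, run=subprocess.run):
    """Steal a service from launchd, run it in the foreground, restore on exit."""
    loaded = run(["launchctl", "print", f"gui/501/{label}"],
                 capture_output=True).returncode == 0
    if loaded:
        run(["launchctl", "bootout", f"gui/501/{label}"])  # free the port
    try:
        serve()  # blocks; Ctrl-C to stop
    finally:
        if loaded:
            run(["launchctl", "bootstrap", "gui/501", plist])  # hand back to launchd
```

The `finally` is the important part: the service gets re-bootstrapped even if the interactive run crashes.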

### Adding a New launchd Service

1. Add entry to `bin/gen-plists.sh` (format: `name|command args`)
2. Run `make install-plists` to regenerate all plists
3. Add label to `LAUNCHD_SERVICES` in `tasks.py`
4. Add any needed API keys to `~/.config/rivus/env`
5. Load: `launchctl bootstrap gui/501 ~/Library/LaunchAgents/local.rivus.<name>.plist`
6. Verify: `inv svc.status`

> `make install` also symlinks iterm2d into iTerm2 AutoLaunch.

### KeepAlive Convention

All long-running services use the **dict form** of KeepAlive, not a simple boolean:

```xml
<key>KeepAlive</key>
<dict>
    <key>SuccessfulExit</key>
    <false/>
    <key>Crashed</key>
    <true/>
</dict>
<key>ThrottleInterval</key>
<integer>30</integer>
```

| Scenario | Behavior |
|----------|----------|
| Clean exit (exit 0) | **Stays down** — `ops stop` / `SIGTERM` works without launchd fighting you |
| Crash (SIGSEGV, SIGILL, SIGBUS) | Restarts after ThrottleInterval |
| Non-zero exit (bug, bad config) | Restarts after ThrottleInterval |

**Why not `<true/>`?** Plain `<true/>` restarts on *every* exit, including clean `SIGTERM` stops. This forces you to `launchctl bootout` instead of just killing the process. The dict form lets a clean exit stay down while still auto-recovering from crashes.

**One-shot jobs** (backup, jobs-runner) use `KeepAlive: <false/>` — they run on schedule or demand, not continuously.

**ThrottleInterval: 30** — if a service crash-loops (bad config, missing dep), 30s gives time to read the error log before the next restart. The 10s default hammers the logs.

### Troubleshooting

```bash
# Check if loaded
launchctl print gui/501/local.rivus.<name>

# Check logs
tail -50 ~/Library/Logs/rivus/<name>.stderr.log

# Restart (keeps loaded, kills+restarts process)
launchctl kickstart -k gui/501/local.rivus.<name>

# Full reload (unload + load)
launchctl bootout gui/501/local.rivus.<name>
launchctl bootstrap gui/501 ~/Library/LaunchAgents/local.rivus.<name>.plist
```
