Rivus

A system that ingests the world, reasons about it across models,
and learns from its own mistakes to get better over time.

Key Notes & Questions
  1. Vision: We improve LLM performance in domains, not just on single queries. The unit of value is a vertical, not a prompt.
  2. Shadow mode: Deploy alongside current workflows to observe — can we do better, or offer automation where there is none?
  3. Franchise model: Can we package this for partners addressing specific verticals or geographies?
  4. Demo idea: Show with vs. without the knowledge base — populate the KB, then solve a problem, demonstrating the compounding benefit of learned principles on process quality.
  5. Capability example: Find VC portfolio companies that may want to use this — the system can research, filter, and score prospects autonomously.

Vision

Where to point AI: The best use of rapidly improving AI capabilities is to apply them to two things:

  1. The codebase itself — letting AI improve the tools, pipelines, and infrastructure it runs on.
  2. Acquiring and refining skills — the rigorous, measurable capabilities (not just prompts) involved in building products: research, evaluation, synthesis, domain reasoning.

Both uses create self-improving feedback loops. Better tools produce better work; better skills produce better tools. Businesses that focus AI here compound their advantage — each cycle makes the next one faster and more capable.

The main skill: The central capability we are building is how AI should autonomously work a project — driving tasks end-to-end while knowing when and how to enlist human assistance along the way.

This inverts the typical AI copilot model. Instead of a human driving with AI assistance, the AI drives the project — scoping work, executing plans, verifying results, filing follow-ups — and escalates to the human for judgment calls, design decisions, and quality gates. The human becomes the reviewer and strategist, not the typist. Getting this right is the meta-skill that makes all other capabilities compound.

Core thesis: By thinking 2–3× as much — generating alternatives, evaluating from more angles, building up background knowledge — you can almost always improve on a first-pass decision. The opportunity is to do more work systematically and then learn from comparing the richer result to what the single pass missed. Rivus makes that extra work automatic, not effortful.
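The core thesis above can be sketched as a best-of-n loop: widen the first pass, then judge the alternatives. This is an illustrative stand-in, not the production code; `llm` and `judge` are hypothetical placeholders.

```python
import random

def llm(prompt: str, variant: int) -> str:
    """Stand-in for a real model call; returns one candidate answer."""
    return f"answer-{variant} to: {prompt}"

def judge(prompt: str, candidate: str) -> float:
    """Stand-in for an LLM judge; returns a quality score in [0, 1]."""
    rng = random.Random(candidate)  # deterministic per candidate, for the sketch
    return rng.random()

def best_of_n(prompt: str, n: int = 3) -> str:
    # Generate alternatives instead of trusting a single pass,
    # then evaluate from another angle and keep the best.
    candidates = [llm(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: judge(prompt, c))
```

The 2–3× extra work is the `n` generations plus the judging pass; the payoff is whatever the single pass would have missed.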

Rivus is self-improving AI. Every session teaches the next; mistakes become principles; principles compound. More done, less human effort — and the gap only grows.

Rivus builds skilled domain experts. The recipe: start by accumulating deep domain experience — from the web, from users, from subject-matter experts — especially around decision-making in a given domain. Then reason with that accumulated knowledge to sharpen the system’s judgment on each new decision. Deliverables: browsable portals, queryable MCP servers, change notifications, and bulk data export.

Rivus builds causal models. Predicts actions and choices from small-to-medium domain data (100–100K facts). Not just correlations — testable, explainable models of what drives decisions.

Why it produces high-quality output

Abundant input

Broad data collection across many sources and formats

Self-healing processing

Pipelines detect failures and retry with different strategies

Superior observability

Every tool call, error, and correction is visible and searchable

Self-tuning

Mistakes become principles; principles improve every future session


How It Works

Data flows through three stages: collect, process, output. Multi-model reasoning is pervasive — see How We Reason below.

INGEST — pull in the world
  • Jobs — self-healing, semaphored, error triage
  • Browser — navigate, screenshot, self-tuning
  • Newsflow — live monitoring, periodic refresh
  • Extract — parse HTML, PDF, YouTube
  • Transcribe — speaker, tone, text

PROCESS — enrich & understand
  • Analyze — concepts, faces, entities
  • Score — classify, rank
  • Enrich — cross-reference, link
  • jobs pipeline: analyze → score → enrich

OUTPUT & ACTION — act on what was learned
  • Intel — company & people dossiers
  • Finance — analysis & backtests
  • SemanticNet — structured knowledge
  • Present — reports, demos & portals

PERVASIVE — across every stage
  • Run → Inspect → Improve — multi-model reasoning, iterative strategies, learning from errors & suggestions

SELF-MANAGEMENT
  • Autonomy — overnight TODOs, decisions
  • Supervisor — sidekick hooks, session grid
  • Doctor — watch, auto-fix, chronicle
  • Ops + Lib — CLI, servers, shared libs

INSTITUTIONAL KNOWLEDGE — every session inherits what the system has learned
  • Skills — 40+ encoded expert workflows
  • Conventions — CLAUDE.md at every level, living rules
  • Principles — extracted from 600+ sessions

How We Reason

What makes this different from vanilla Claude Code. Multi-model reasoning is pervasive — every stage uses it, not just one.

Vario
  • Run across all frontier models, multiple variants per model
  • Generate → Evaluate → Iterate until quality converges
  • Claude, GPT, Gemini, Grok — breadth + depth

Reasoning Strategies
  • 19 strategies — debate, reflexion, tree search, best-of-n, ensemble
  • 10 composable stages — generate, score, critique, refine, verify, vote
  • 9 analytical lenses — game theory, economics, evolution, ecology

Learning System
  • Session review → principle extraction
  • 25K+ instances, vectorized for retrieval
  • Mistakes → principles → better sessions

Measure & Evaluate
  • Real use — production prompts scored by LLM judges
  • Benchmarks — reasoning (MMLU-Pro, HLE) and text crafting (creative writing, BrowseComp)
  • Sandbox replay, strategy tournaments, A/B scoring

Vanilla Claude Code: single model, no memory between sessions, no self-evaluation. Rivus: multi-model consensus, learned principles, strategy selection, benchmark-driven improvement.

Example Outputs

Concrete deliverables you can read, share, and act on.

🔍

Company & People Dossiers

Structured YAML profiles + prose dossiers with TFTF scoring, bull/bear investment memos, competitive landscape analysis. Cross-referenced with SEC filings and patents.

💰

Financial Analysis

Earnings call × price alignment at ~250ms resolution. Backtests, screening, bottleneck analysis. Which claim caused which price move?

🏭

Supply Chain Graphs

500+ semiconductor companies with supplier/customer/competitor edges. Wave-based discovery from anchor companies outward.

📊

Reports & Portals

HTML reports, interactive portals, research writeups. Published to static content server for sharing.

🧠

Learned Principles

25K+ instances distilled into actionable principles. Materialized to ~/.claude/principles/ — every future session inherits what was learned.

Skills & Workflows

40+ encoded expert workflows. /commit, /debug, /present-project — invoke with a slash command, get a structured multi-step process.


Amplification

How one developer’s hour becomes ten. The system multiplies human effort through three mechanisms.

🎓

Learning Loop

Every session is reviewed. Mistakes become principles. Principles feed future sessions. The system gets measurably better over time.

664 sessions reviewed → 25K+ instances → principles materialized

Learning deep dive →

🔱

Multi-Model Reasoning

The “+10 IQ points” engine. Instead of trusting a single generation, Vario does more work: generates alternatives broadly, evaluates from multiple angles, builds up background (rubrics, precedent, first principles) before committing to an answer. The gap between 1× and 3× effort is often worth closing.

19 strategies · 4–8 models · iterative convergence

Vario deep dive →

🤖

Autonomous Operation

Jobs run 24/7. Supervisor watches sessions. Doctor auto-fixes failures. Work happens while the developer sleeps.

17+ pipeline handlers · overnight TODOs · idle-aware scheduling

The compound effect: autonomous pipelines discover data around the clock. Multi-model reasoning produces higher quality analysis per prompt. Learning from mistakes means each session is faster and more accurate than the last. Skills encode expert workflows so common patterns take seconds instead of minutes.

Result: one developer managing work that would otherwise require a team — with quality that improves automatically.


Components

Each module, what it does, and its key sub-components. Sorted by size.

🔱

vario

24,131 LOC · 101 files

Unified LLM workbench. Extracts content from URLs, runs parallel prompts across 4–8 models, evaluates with strategies and judges, and iteratively refines via generate-evaluate-iterate loops.

Extract · Studio · Engine · Reasoning Strategies · Gradio UI
📚

lib

20,633 LOC · 116 files

Shared library layer used by every module. Async LLM calls with model aliasing, image generation across 4 providers, vector search, semantic storage, billing monitoring, notifications, and proxy management.

lib/llm · lib/vectors · lib/semnet · lib/billing · lib/notify · lib/brightdata · lib/ytdl
🎓

learning

18,274 LOC · 39 files

Self-improvement system. Reviews coding sessions for patterns, extracts principles from experience, embeds knowledge into vector DB for fast retrieval. The system literally learns from its own work.

Session Review · Pattern Discovery · Principles · Embeddings · Pond
👷

jobs

16,668 LOC · 52 files

Self-healing pipeline engine with LLM error triage, semaphored concurrency, and version-aware staleness. 20+ autonomous pipelines across 5 domains:

  • YouTube — 6 channels (a16z, DML, HG, Dwarkesh, Lex, PLTR)
  • Earnings — large-cap backfill, transcripts, IR
  • Company Research — VIC ideas, enrichment, scoring
  • Supply Chain — 500+ semis, anchor → expand graph
  • Newsflow — live monitoring, curated URLs
🔍

intel

13,544 LOC · 36 files

Entity intelligence pipeline. Discovers companies via web search and SEC filings, fetches data at 3 cost tiers, enriches from free APIs (patents, GitHub, news), and synthesizes dossiers with LLM analysis.

Companies · People · TFTF Framework · Discover · Fetch · Analyze
🏥

doctor

11,961 LOC · 36 files

Project health monitoring with auto-fix. Watches file changes, runs tests, tracks status. Chronicle sub-module analyzes coding sessions with D3 topic graphs and timeline visualizations.

Watch · Auto-Fix · Chronicle · Collaboration · Topic Graph
🌐

browser

10,777 LOC · 45 files

Playwright-based browser automation. Headless browsing with proxy escalation (direct → stealth → Bright Data → full browser). Content ingestion for HTML, PDF, and YouTube transcripts.

Agent · Server · Ingest · Proxy Escalation · Cache
🤖

supervisor

10,318 LOC · 51 files

Autonomous work orchestrator. Manages long-running operations, coordinates sidekick agents, runs periodic tasks. Bridges learning outputs into actionable knowledge for autonomous sessions.

Autonomous · Sidekick · Event Loop · Periodic · Benchmarks
💰

finance

9,867 LOC · 45 files

Market analysis toolkit. Earnings call processing, backtesting framework, corporate ownership tracking, and bottleneck analysis. Integrates with Finnhub for real-time market data.

Earnings · Backtest · Ownership · Bottleneck Analysis
🏭

tools

7,638 LOC · 31 files

Specialized production utilities. Supply chain graph analysis (companies, relationships, bottlenecks), Japan market scrapers (EDINET filings, Kabutan stocks), and media processing.

Supply Chain · EDINET · Kabutan · Media
⚙️

ops

4,713 LOC · 16 files

Operations CLI and server management. Session management, iTerm2 control, resource monitoring, developer tools. Single point of control for all services via ops command.

CLI · Watch · Resmon · Devtools
🧪

explorations

4,466 LOC · 24 files

Experiments and prototypes. LiteLLM testing, Grok search, problem-solving strategies, iTerm2 automation gym. Ideas that prove out graduate into full modules.

LiteLLM CLI · Grok Search · Problem Solving · Gym

Component Deep Dives

What each module actually does, what’s working, and what’s next.

🔱

vario

19 strategies from 10 composable stages, 9 analytical lenses — automated problem-solving at scale
24,131 LOC · 101 files
Extract · Studio · Engine · Reasoning Strategies · Gradio UI

What works now

  • Studio — run same prompt across 4–8 models in parallel, generate → evaluate → iterate until quality converges
  • Reasoning Strategies — 19 encoded strategies (chain-of-thought, self-critique, ensemble, lens variations) with SQL-backed move tracking
  • Extract — fetch any URL, parse HTML/PDF/YouTube, extract structured facts

How it fits together

  • Extract pulls in content → Studio generates across models → evaluates and iterates to convergence
  • Reasoning Strategies benchmark: compare which approach works best for which problem type
  • Unified Gradio UI at vario.localhost with Extract, Studio, and Prompts tabs
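The generate → evaluate → iterate loop above might look like this in miniature. It is a sketch with hypothetical `generate` and `score` stand-ins, not the actual Studio engine:

```python
def generate(prompt: str, feedback: str = "") -> str:
    """Stand-in for a multi-model generation pass."""
    return (prompt + " " + feedback).strip()

def score(draft: str) -> float:
    """Stand-in for an LLM judge; here, longer drafts score higher."""
    return min(1.0, len(draft) / 40)

def generate_evaluate_iterate(prompt: str, threshold: float = 0.9,
                              max_rounds: int = 5) -> str:
    draft = generate(prompt)
    for _ in range(max_rounds):
        if score(draft) >= threshold:        # quality has converged
            break
        feedback = "expand weakest section"  # critique feeds the next round
        draft = generate(draft, feedback)
    return draft
```

The real loop swaps in parallel model calls for `generate` and judge ensembles for `score`; the convergence structure is the same.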

Coming next

Multi-source analysis: brain search "query" → fetch top 5 results in parallel → synthesize across sources. Auto-search fallback when input isn’t a URL.

📚

lib

Shared foundation that every module imports — eliminates duplication across the system
20,633 LOC · 116 files
lib/llm · lib/vectors · lib/semnet · lib/billing · lib/notify · lib/brightdata · lib/ytdl

What works now

  • lib/llm — async calls to 6+ providers with model aliasing, streaming, pricing, web search
  • lib/vectors — Qdrant local vector search for semantic retrieval (learnings, sessions)
  • lib/semnet — 3-level semantic storage (doc summaries, chunks, claims) with SQLite + Qdrant, domain adapters
  • lib/billing — real-time API cost monitoring with Gradio dashboard
  • lib/notify — tiered Pushover + local notifications (info/warning/critical)
  • lib/coord — multi-session conflict avoidance (file claims, activity tracking)
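A toy sketch of the model-aliasing idea in lib/llm — short names fan out to concurrent provider calls. The alias table and `call` signature here are assumptions, not the real API:

```python
import asyncio

# Hypothetical alias table: short names resolve to provider model IDs.
ALIASES = {
    "sonnet": "anthropic/claude-sonnet",
    "flash":  "google/gemini-flash",
    "grok":   "xai/grok",
}

async def call(model: str, prompt: str) -> str:
    """Stand-in for an async provider call; resolves aliases first."""
    resolved = ALIASES.get(model, model)
    await asyncio.sleep(0)  # where the real HTTP request would await
    return f"[{resolved}] {prompt}"

async def fan_out(prompt: str, models: list[str]) -> list[str]:
    # One prompt, several providers, concurrently; results keep input order.
    return await asyncio.gather(*(call(m, prompt) for m in models))
```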

Also includes

  • lib/brightdata — proxy zones, YouTube datasets, Web Unlocker
  • lib/ytdl — yt-dlp wrappers with auth and proxy management
  • lib/discovery_ops — shared discovery pipeline infra (cache, search, BD client) reused by 5+ projects
  • lib/config_validation — self-documenting YAML errors + LLM-powered “freehand config”

Coming next

Memory system (lib/memory): PostgreSQL + pgvector for self-organizing knowledge store with hybrid retrieval and applicability scoring.

🎓

learning

The system literally learns from its own mistakes — session review → principles → sandbox testing
18,274 LOC · 39 files
Session Review · Pattern Discovery · Principles · Embeddings · Pond

What works now

  • Session review — parse Claude/Gemini transcripts, extract error→repair pairs (664+ analyzed)
  • Principles DB — 25K+ instances linked to principles, auto-classified by LLM, materialized to ~/.claude/principles/*.md
  • Failure mining — multi-model judges (Gemini, Grok, Claude) score which tool call fixed each error, majority vote consensus
  • Sandbox eval — Docker-based replay: run Claude against specific commits, measure wall-clock time, tool calls, result quality
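A minimal sketch of the materialization step: instances grouped by principle, rendered to the kind of markdown a file under ~/.claude/principles/ might contain. The `Instance` record and its field names are hypothetical simplifications:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Instance:
    error: str       # what went wrong in a session
    repair: str      # the tool call or edit that fixed it
    principle: str   # principle label, assumed assigned upstream by an LLM

def materialize(instances: list[Instance]) -> str:
    """Group instances by principle and render a markdown summary."""
    counts = Counter(i.principle for i in instances)
    lines = ["# Learned principles", ""]
    for principle, n in counts.most_common():  # most-supported first
        lines.append(f"- {principle} ({n} instances)")
    return "\n".join(lines)
```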

Gyms (self-improvement)

  • Badge gym — test prompt variants by replaying real sessions, score quality, pick best
  • Fetchability gym — probe URLs with httpx/proxy/unlocker in parallel, build site × method matrix
  • Principles flow back into sessions via CLAUDE.md + ~/.claude/principles/

Coming next

Sidekick gym: test which interventions (auto-badge, convention warnings, boilerplate detection) are actually helpful vs noisy. Close the evaluation loop.

👷

jobs

20+ self-healing pipelines with LLM error triage, semaphored concurrency, and version-aware staleness
16,668 LOC · 52 files
YouTube (6 channels) · Earnings · Research · Company Analysis · Supply Chain · Newsflow · Dashboard · Diagnostics

What works now

  • Stage-aware tracking — items flow through fetch → extract → score independently, with per-stage timing and concurrency limits
  • Error intelligence — every exception auto-classified by LLM as transient/item-specific/systemic/code-bug, drives retry/skip/pause
  • Version-aware staleness — code changes → hash changes → items marked stale → one-click reprocess
  • Gradio dashboard at jobs.localhost with live stats, error drill-down, job control
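The triage-to-action mapping can be sketched like so. The four classes are from the pipeline above; the keyword classifier is a stand-in for the real LLM call:

```python
# The four triage classes, mapped to pipeline actions.
ACTIONS = {
    "transient":     "retry",   # e.g. timeouts: try again
    "item-specific": "skip",    # bad input: move on to the next item
    "systemic":      "pause",   # e.g. auth expired: stop the pipeline
    "code-bug":      "pause",   # needs a code fix before resuming
}

def classify(error: str) -> str:
    """Keyword stand-in for the LLM triage classifier."""
    if "timeout" in error or "503" in error:
        return "transient"
    if "parse" in error:
        return "item-specific"
    return "systemic"

def triage(error: str) -> str:
    return ACTIONS[classify(error)]
```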

Pipeline examples

  • Supply chain — 2 jobs build a graph of 500+ semiconductor companies with supplier/customer/competitor edges
  • VIC research — idea discovery → content processing → enrichment → scoring
  • Multi-job workflows — one job’s output feeds the next via tracker_query discovery
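Version-aware staleness reduces to a hash comparison: each item records the hash of the code that processed it, and a code change flips every item to stale. A sketch, with function names assumed:

```python
import hashlib

def code_version(source: str) -> str:
    """Hash the stage's source; any code change changes the version."""
    return hashlib.sha256(source.encode()).hexdigest()[:12]

def is_stale(item_version: str, current_source: str) -> bool:
    # The item was processed under an older version of this stage's code.
    return item_version != code_version(current_source)
```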

Coming next

Cascade reprocessing: fix a parser → extract stage re-runs → downstream stages auto-mark stale. Validation circuit breaker when semantic failure rate exceeds threshold.

🔍

intel

From a company name to a full investment dossier with TFTF scoring — automated end to end
13,544 LOC · 36 files
Companies · People · TFTF Framework · Discover · Fetch · Analyze

What works now

  • Companies pipeline — Serper search → SEC EDGAR → Bright Data scraping (3 cost tiers) → free API enrichment (patents, GitHub, news) → LLM synthesis
  • People pipeline — discover via search → fetch profiles → enrich from SEC forms → cluster & analyze for VC theses
  • TFTF framework — Technology, Financials, Team, Fit scoring with bull/bear investment memos

Outputs

  • Structured YAML profiles + prose dossiers (Markdown)
  • Competitive landscape analysis
  • Cross-referenced with SEC filings and patent data
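A toy rendering of a TFTF composite score. The equal weights here are illustrative only, not the framework's actual weighting:

```python
from dataclasses import dataclass

@dataclass
class TFTF:
    technology: float  # each component scored 0-10
    financials: float
    team: float
    fit: float

# Illustrative equal weights; the real framework may weight differently.
WEIGHTS = {"technology": 0.25, "financials": 0.25, "team": 0.25, "fit": 0.25}

def composite(s: TFTF) -> float:
    """Weighted sum of the four component scores."""
    return sum(getattr(s, name) * w for name, w in WEIGHTS.items())
```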

Coming next

Consolidate with jobs-based company analysis (currently siloed — separate data dirs, separate prompts). Unified watchlist and shared prompt templates.

🏥

doctor

Watches every file change, runs tests, auto-fixes failures using Claude Code — while avoiding conflicts with human sessions
11,961 LOC · 36 files
Watch · Auto-Fix · Chronicle · Collaboration · Topic Graph

What works now

  • Watch + auto-fix — FSEvents file watcher, auto-runs tests, spawns Claude to fix failures
  • Chronicle — analyzes coding sessions with D3 topic graphs, timeline, accomplishment extraction
  • Collaboration — publishes status to shared YAML so user sessions know doctor is active

How it works

  • Doctor claims files before editing via lib/coord — no conflicts with user sessions
  • Session intelligence API (port 8130) powers /hist, /jump, badges

Coming next

Connect error intelligence from jobs (currently separate) so operational failures inform project health. Cross-project resource coordination.

🌐

browser

5-level escalation ladder: free → stealth → proxy → unlocker → full Playwright — pay only when needed
10,777 LOC · 45 files
Agent · Server · Ingest · Proxy Escalation · Cache

What works now

  • Direct + stealth — free, ~200–500ms, handles most sites
  • Bright Data proxy — residential IPs, ~500–800ms, bypasses geo-blocks
  • Web Unlocker + Playwright — for JS-heavy sites, CAPTCHA, login walls
  • Refusal detection — paywalls, CAPTCHAs, login walls identified automatically
  • HTTP server — single Playwright instance on :8100, control via curl

Content handling

  • HTML, PDF, YouTube transcripts all supported
  • 10-min TTL cache with fetch_mode metadata
  • Auto-escalation on failure via --escalate
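The escalation ladder reduces to try-in-cost-order. A sketch with caller-supplied fetch rungs (the actual `--escalate` internals are not shown; `fetch_with_escalation` is a hypothetical name):

```python
def fetch_with_escalation(url: str, rungs) -> tuple[str, str]:
    """Try each (name, fetch) rung in cost order; return the first success.

    `rungs` is ordered cheapest-first: direct, stealth, proxy, unlocker,
    full browser. A rung signals failure by raising.
    """
    last_error = None
    for name, fetch in rungs:
        try:
            return name, fetch(url)       # success: pay no more than needed
        except Exception as e:            # refusal, CAPTCHA, timeout...
            last_error = e                # escalate to the next, pricier rung
    raise RuntimeError(f"all rungs failed: {last_error}")
```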

Coming next

Cookie-based authentication: extract from real Chrome profile for sites requiring login (Google, Gmail, Gemini consumer).

🤖

supervisor

Watches every Claude session in real time — auto-badges, event tracking, live session grid
10,318 LOC · 51 files
Autonomous · Sidekick · Event Loop · Periodic · Benchmarks

What works now

  • Sidekick hooks — SessionStart/PostToolUse events → auto-badge generation, event recording, resource tracking
  • Watch UI — live session grid with activity, timeline, topic graph tabs
  • Passive error observation — tails session JSONL files for errors → classifies → writes doctor-compatible logs
  • Idle detection — atomic timestamp tracking across all sessions for autonomous work scheduling

Session intelligence

  • Watch API (port 8130) serves /hist, /jump, /recap, badge data
  • Vector search across all session transcripts

Coming next

Principle violation detection: LLM-powered checks against learned principles during active sessions. Shadow worker phase for safe verification in separate worktrees.

💰

finance

Tick-level price alignment with earnings transcripts — which claim caused which price move?
9,867 LOC · 45 files
Earnings · Backtest · Ownership · Bottleneck Analysis

What works now

  • Earnings backtest — NBBO + trades + transcript aligned at ~250ms resolution
  • Finnhub screening — 1-min candles for big-move discovery and calendar events
  • IB integration — Stockloader service with persistent TWS/Gateway connection, 60-req/10min pacing
  • VIC returns calculator — cross-listing DB + symbol resolution + price library
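Nearest-past alignment of transcript claims to trades can be sketched with a binary search over tick timestamps — a simplification of the real NBBO alignment:

```python
import bisect

def align(claims: list[tuple[float, str]],
          ticks: list[tuple[float, float]]) -> list[tuple[str, float]]:
    """For each claim (timestamp, text), find the last trade price at or
    before it (as-of join). `ticks` must be sorted by timestamp."""
    times = [t for t, _ in ticks]
    out = []
    for ts, text in claims:
        i = bisect.bisect_right(times, ts) - 1  # index of nearest past tick
        if i >= 0:
            out.append((text, ticks[i][1]))
    return out
```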

Data sources

  • Interactive Brokers (tick data)
  • Finnhub (fundamentals, candles, filings)
  • SEC EDGAR (ownership, filings)
  • Redis time series for real-time prices

Coming next

Bottleneck analysis framework: map supply chain constraints (electricity, transformers, water, labor, permits) → identify winners/losers upstream and downstream.

🛠️

tools

Graduated prototypes: supply chain graph of 500+ semiconductor companies with relationship edges
7,638 LOC · 31 files
Supply Chain · EDINET · Kabutan · Media

What works now

  • Supply chain graph — SQLite with supplier/customer/competitor edges, wave-based discovery (anchors → expand frontier)
  • Entity resolution — ticker → Finnhub → GLEIF → PermID matching
  • EDINET — Japanese financial filings scraper
  • Kabutan — Japan stock data collection
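Wave-based discovery is breadth-first expansion from anchor companies, one frontier at a time. A sketch assuming a caller-supplied `neighbors` callable (in the real graph, supplier/customer/competitor edges):

```python
def wave_discovery(anchors: list[str], neighbors, max_waves: int = 2) -> set[str]:
    """Expand outward from anchors; `neighbors(company)` returns related
    companies. Each wave is one hop further from the anchors."""
    seen = set(anchors)
    frontier = list(anchors)
    for _ in range(max_waves):
        next_frontier = []
        for company in frontier:
            for related in neighbors(company):
                if related not in seen:
                    seen.add(related)
                    next_frontier.append(related)
        frontier = next_frontier  # the newly discovered rim becomes the frontier
    return seen
```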

Origin

  • Started in explorations/, graduated to production when proven
  • Supply chain data also fed by jobs pipeline (2 dedicated handlers)

Coming next

Consolidate supply chain data from jobs pipeline into unified graph. Expand beyond semiconductors to other industries.


Codebase Size Map

Visual representation of relative module sizes. Area proportional to lines of code.

🔱 vario
24K LOC
📚 lib
21K LOC
🎓 learning
18K LOC
👷 jobs
17K LOC
🔍 intel
14K LOC
🏥 doctor
12K LOC
🌐 browser
11K
🤖 supervisor
10K LOC
💰 finance
10K LOC
🏭 tools
8K LOC
⚙️ ops 4K
🧪 explorations 4K
📊 present 3K
🎯 projects 2K

Evolution Timeline

From first commit to today. Two months, two phases of rapid growth.

January 2026

Foundation — Core systems established

542
Commits
~25
Dirs Created
  • vario Unified LLM workbench: Extract, Studio, Reasoning Strategies
  • browser Playwright automation with proxy escalation
  • lib/llm Async LLM framework with model aliasing
  • learning Session review and pattern discovery
  • explorations Experiments: LiteLLM, Grok search, iTerm2 gym
  • benchmarks BrowseComp, HLE, terminal benchmarks
  • tools Supply chain analysis, EDINET, Kabutan scrapers
  • infra Caddy reverse proxy, Cloudflare tunnel, launchd

February 2026

Expansion — 16 new directories, system deepens

695
Commits
+16
New Dirs
  • intel Company & people dossier pipelines with TFTF scoring
  • jobs 17-handler pipeline orchestration with Gradio dashboard
  • doctor Chronicle session analysis, D3 topic graphs, auto-fix
  • supervisor Autonomous orchestrator, sidekick agents, event loop
  • finance Earnings analysis, backtesting, Finnhub integration
  • ops Unified CLI for servers, sessions, iTerm2 control
  • lib/vectors Qdrant local vector search for semantic retrieval
  • lib/notify Unified Pushover + local notification system
  • projects Long-running goals: VC intel, skill acquisition
  • present AI ops pitch, search quality reports

Where Software Is Going

The developer becomes a simultaneous chess player.

A chess grandmaster in a simul walks from board to board — each position is different, each opponent plays their own game, but the grandmaster sees patterns across all of them and makes strong moves in seconds. The boards don’t wait for each other. The grandmaster’s strength isn’t just depth on one board — it’s breadth across many, with enough depth on each to win.

That’s what this system is for. A single developer managing parallel research pipelines, autonomous data jobs, live monitoring, self-improving code agents, and investment analysis — all at once. Each “board” runs on its own, escalates when stuck, and learns from its mistakes. The developer walks the room, makes the calls that matter, and moves on.

The bottleneck shifts from doing the work to directing the work. Rivus is the room full of boards.


Technology Stack

LLM providers, external services, infrastructure, and storage powering the system.

LLM Providers

  • Anthropic (Claude)
  • OpenAI (GPT, Embeddings)
  • Google (Gemini, Imagen)
  • xAI (Grok)
  • Groq (fast inference)
  • MiniMax

External APIs

  • Bright Data (proxies)
  • Serper (search)
  • Finnhub (markets)
  • SEC EDGAR (filings)
  • PatentsView
  • Pushover (notifications)

Infrastructure

  • Caddy (reverse proxy)
  • Cloudflare Tunnel
  • Cloudflare Pages
  • launchd (8 services)
  • tmux + iTerm2
  • Playwright

Storage

  • SQLite (multiple DBs)
  • Qdrant (vector search)
  • Redis (time series)
  • Parquet (datasets)

UI & Visualization

  • Gradio 6 (apps)
  • D3.js (graphs)
  • SVG (diagrams)
  • HTML reports

Python Stack

  • asyncio / httpx
  • Click (CLIs)
  • Pandas (data)
  • Invoke (tasks)
  • Loguru (logging)
  • Pydantic (models)

Appendix: Module Map

Every module placed in its pipeline stage, with LOC counts and data flow arrows. Cross-cutting infrastructure at the bottom.

INGEST — pull in the world (~28K LOC)
  • browser (11K LOC) — Playwright automation, proxy escalation, cache; direct → stealth → proxy → unlocker → full browser
  • Extract — HTML, PDF, YouTube → structured content (browser → jobs, vario)
  • jobs (17K LOC) — pipeline orchestration, 17+ handlers; discover → fetch → extract → score → enrich; Gradio dashboard, error intelligence, version-aware

REASON & LEARN — make sense of it (~42K LOC)
  • vario (24K LOC) — Studio + Engine: multi-model generation × 4–8 models → evaluate → iterate; breadth (parallel models) + depth (converge on quality); Reasoning Strategies: 19 strategies, 10 stages, 9 lenses
  • learning (18K LOC) — session review → pattern discovery → principles; 25K+ instances, vector embeddings, sandbox eval; principles → CLAUDE.md → future sessions

OUTPUT & ACTION — act on what was learned (~35K LOC)
  • intel (14K LOC) — company & people dossiers; Serper → SEC → Bright Data → free APIs → LLM synthesis; TFTF scoring, bull/bear memos, competitive landscape
  • finance (10K LOC) — earnings × price alignment, backtesting; Finnhub, IB tick data, Redis time series
  • tools (8K LOC) — supply chain graph (500+ companies), EDINET, Kabutan
  • present (3K LOC)

Data flow: enriched data feeds output; principles improve future reasoning.

CROSS-CUTTING — used by all stages
  • lib (21K LOC) — LLM, vectors, semnet, notify
  • doctor (12K LOC) — health watch, auto-fix, chronicle
  • supervisor (10K LOC) — autonomous work, sidekick, hooks
  • ops (5K LOC) — CLI, servers, iTerm2
  • explorations (4K LOC) — prototypes → graduate up

Total: 159K lines of Python across 618 files — Ingest ~28K · Reason & Learn ~42K · Output & Action ~35K