> **Note (2026-03-24):** intel/learning/ideas/semnet UIs consolidated into `kb.localhost` (port 7840). Old standalone URLs (intel.localhost, learning.localhost, ideas.localhost, semnet.localhost) are retired.

# Company Funding Signal Intelligence — Design

**Date**: 2026-03-09
**Status**: Approved design (revised after multi-model review — Gemini, Grok, Opus)
**Location**: `intel/funding/` — new submodule of Intel

## Problem

Intel has 9.5K companies with static profiles. No temporal dimension — we don't know who's raising, who's in trouble, what's hot. Funding events are scattered across SEC filings, newsletters, social signals, and job postings. Nobody aggregates + scores these automatically.

## Why

Company strength research is core to Intel (#1 priority). This adds a *temporal/predictive* layer. Two consumers:

- **VC pitch**: "Your portfolio company X just filed a quiet Form D — looks like a bridge round" (demonstrates insight depth beyond what VCs themselves track)
- **Investing**: Funding momentum as signal for public market adjacencies (TSMC benefits when AI startups raise for GPU clusters)

## Approach

**Hybrid: newsletter ingestion + predictive signals**, phased:

1. High-signal, low-effort sources first (Form D + newsletters)
2. Predictive signals incrementally (job postings, GitHub, executive moves)
3. Cross-reference: Form D filed + no press release + hiring freeze → likely down round

## Data Sources — Tiered by Effort and Signal Quality

### Phase 1 (days — high signal, low effort)

| Source             | Signal                                  | Method                    | Frequency |
|--------------------|-----------------------------------------|---------------------------|-----------|
| SEC Form D (EDGAR) | Company filed → raising privately       | EDGAR feed (XML), free    | Daily     |
| Axios Pro Rata     | Deal announcements, fund news           | Email → parse or RSS      | Daily     |
| Term Sheet         | Deal flow digest                        | Email → parse             | Daily     |
| TechCrunch         | Round announcements                     | RSS feed (summaries only) | Daily     |
| Crunchbase News    | Rounds + trend pieces                   | RSS                       | Daily     |
| WARN notices       | Mass layoffs (employers of 100+ only)   | State WARN feeds, free    | Daily     |

**Note on TechCrunch**: RSS summaries are sufficient for entity extraction + round detection. Full article scraping at hourly frequency will get rate-limited/blocked within a week. Scrape full text only when RSS summary is insufficient, via headless browser + residential proxy.
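
As a rough illustration of round detection from an RSS summary, the sketch below pulls an amount and a round type out of a headline-style string. The regex patterns, the `extract_round` name, and the `needs_full_text` flag are assumptions made for this example, not part of the design; real summaries ("$25 million", "growth round") will need a wider vocabulary.

```python
import re
from typing import Optional

# Hypothetical patterns: only "$<number>M" / "$<number>B" and a few round names.
AMOUNT_RE = re.compile(r"\$(\d+(?:\.\d+)?)\s*([MB])\b", re.IGNORECASE)
ROUND_RE = re.compile(r"\b(pre-seed|seed|series\s+[a-f]|bridge)\b", re.IGNORECASE)
MULTIPLIER = {"m": 1_000_000, "b": 1_000_000_000}


def extract_round(summary: str) -> dict:
    """Pull a dollar amount and round type out of an RSS summary, if present."""
    amount_usd: Optional[float] = None
    round_type: Optional[str] = None

    amount = AMOUNT_RE.search(summary)
    if amount:
        amount_usd = float(amount.group(1)) * MULTIPLIER[amount.group(2).lower()]

    round_match = ROUND_RE.search(summary)
    if round_match:
        # Normalize "Series B" -> "series_b", "pre-seed" -> "pre_seed"
        round_type = re.sub(r"[\s-]+", "_", round_match.group(1).lower())

    return {
        "amount_usd": amount_usd,
        "round_type": round_type,
        # Nothing found: candidate for the full-text fallback described above.
        "needs_full_text": amount_usd is None and round_type is None,
    }
```

Only summaries that come back with `needs_full_text` set would go to the headless-browser path, which keeps full-article fetches rare.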

**Note on Form D**: Form D is a *lagging* indicator. It is due within 15 days of the first sale, so at least part of the money is already in the bank. That makes it excellent for confirming rounds and for the "quiet Form D" insight, but not predictive of active fundraising. Amendments (Form D/A) are common and must be distinguished from initial filings. "Total Amount Sold" is cumulative across the offering, not per-round: a $10M filing followed by a $25M amendment means $25M total, not $35M.
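
The cumulative-amount rule is easy to get wrong in code, so here is a minimal sketch of how an ingester might turn a filing plus its amendments into per-filing increments. `FormDFiling` and `offering_progress` are hypothetical names, and the sketch assumes the filings passed in have already been grouped into a single offering.

```python
from dataclasses import dataclass


@dataclass
class FormDFiling:
    accession: str         # hypothetical identifier for the filing
    filed_on: str          # YYYY-MM-DD
    is_amendment: bool     # True for Form D/A
    total_sold_usd: float  # cumulative "Total Amount Sold" as reported


def offering_progress(filings: list[FormDFiling]) -> list[dict]:
    """Return per-filing increments for one offering, oldest first."""
    ordered = sorted(filings, key=lambda f: f.filed_on)
    rows = []
    previous_total = 0.0
    for filing in ordered:
        rows.append({
            "filed_on": filing.filed_on,
            "is_amendment": filing.is_amendment,
            "total_sold_usd": filing.total_sold_usd,
            # New money since the previous filing, never the sum of totals.
            "new_money_usd": filing.total_sold_usd - previous_total,
        })
        previous_total = filing.total_sold_usd
    return rows
```

For the $10M filing and $25M amendment above, this yields increments of $10M and $15M, and the last row's `total_sold_usd` is the $25M offering total.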

### Phase 2 (weeks — predictive signals)

| Source              | Signal                                   | Method                  | Frequency |
|---------------------|------------------------------------------|-------------------------|-----------|
| ATS job boards      | Hiring surge = raised, freeze = trouble  | Greenhouse/Lever/Ashby JSON feeds, free | Weekly |
| Job posting content | "VP Corp Dev" = M&A, "Head of IR" = IPO | LLM pass over titles    | Weekly    |
| Executive departures| CFO/CRO leaving late-stage = red flag    | News mentions + LinkedIn | Weekly   |
| GitHub activity     | Dev tools: active contributors, PR volume| GitHub API, free        | Weekly    |
| HN Show/Launch      | Pre/post-raise signal                    | HN API, free            | Daily     |
| X/Twitter           | Rumors break first ("$CO raising $100M") | API v2 semantic search  | Hourly    |

**ATS over LinkedIn**: ~90% of tracked companies use Greenhouse, Lever, or Ashby. Their job boards are publicly accessible JSON/XML feeds: free, legal, refreshable as often as hourly (the plan polls weekly), and cleaner than LinkedIn scraping. Track open, closed, and stale requisitions.
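
One way to track open/closed/stale requisitions is to diff consecutive snapshots of a company's job board. The sketch below assumes each snapshot has already been reduced to `{job_id: first_seen_date}`; the function name and the 90-day staleness cutoff are illustrative assumptions, not decided values.

```python
from datetime import date


def diff_requisitions(previous: dict[str, date], current: dict[str, date],
                      today: date, stale_after_days: int = 90) -> dict[str, list[str]]:
    """Compare two snapshots of {job_id: first_seen_date} from one job board."""
    opened = sorted(set(current) - set(previous))   # new since last poll
    closed = sorted(set(previous) - set(current))   # gone since last poll
    stale = sorted(
        job_id for job_id, first_seen in current.items()
        if (today - first_seen).days >= stale_after_days  # assumed cutoff
    )
    return {"opened": opened, "closed": closed, "stale": stale}
```

The counts of opened and closed IDs per poll are what would be appended to `company_signals_log`; fetching and parsing each vendor's feed is left out here.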

**GitHub signal**: Commit velocity alone is a weak proxy (one script can generate 1K commits). Use active contributors, PR volume, and issue resolution time instead.

### Phase 3 (future — deeper signals)

| Source              | Signal                                   | Method                  | Frequency |
|---------------------|------------------------------------------|-------------------------|-----------|
| Glassdoor/Blind     | Employee morale, runway chatter          | Scrape                  | Monthly   |
| Web traffic         | Growth vs flatline                       | SimilarWeb free tier    | Monthly   |
| Patent filings      | IPO/acquisition prep                     | USPTO API               | Monthly   |
| The Information     | Insider scoops                           | Paid, $400/yr           | Daily     |
| UCC filings         | Venture debt = runway signal             | State-level, free       | Monthly   |
| App store rankings  | Consumer/mobile: download trends         | Sensor Tower free tier  | Monthly   |
| Domain/DNS changes  | New domains = expansion, lapsed = trouble| RDAP/WHOIS, free        | Weekly    |
| State biz filings   | New registrations, lapsed standing       | State APIs, free        | Monthly   |
| Conference speakers | Appearing/disappearing from events       | Scrape speaker lists    | Monthly   |

## Data Model

Seven tables in `intel/companies/data/companies.db`:

```sql
-- Confirmed/reconciled funding rounds (assembled from multiple mentions)
CREATE TABLE funding_rounds (
    id               INTEGER PRIMARY KEY,
    company_id       INTEGER REFERENCES companies(id),
    round_type       TEXT,         -- seed, pre_seed, series_a, ..., bridge, debt, grant
    amount_usd       REAL,
    date             TEXT,         -- YYYY-MM-DD (best estimate from sources)
    valuation_usd    REAL,         -- post-money, if known
    confidence       TEXT,         -- confirmed, reported, rumored, inferred
    status           TEXT,         -- active, closed, amended
    created_at       TEXT DEFAULT (datetime('now')),
    updated_at       TEXT DEFAULT (datetime('now'))
);
-- No UNIQUE constraint — dedup handled by round reconciliation stage

-- Investors per round (replaces co_investors JSON blob)
CREATE TABLE round_investors (
    round_id     INTEGER REFERENCES funding_rounds(id),
    investor_id  INTEGER REFERENCES companies(id),
    role         TEXT,            -- lead, co_lead, participant
    PRIMARY KEY (round_id, investor_id)
);

-- Raw news/newsletter mentions with sentiment (ingest landing zone)
CREATE TABLE funding_mentions (
    id               INTEGER PRIMARY KEY,
    company_id       INTEGER REFERENCES companies(id),  -- NULL if unresolved
    source           TEXT,         -- axios, term_sheet, techcrunch, hn, x_twitter, etc.
    title            TEXT,
    snippet          TEXT,
    url              TEXT,
    mention_type     TEXT,         -- round_announced, rumor, layoff, pivot, launch, form_d
    sentiment        TEXT,         -- positive, negative, neutral
    round_id         INTEGER REFERENCES funding_rounds(id),  -- set by reconciliation
    published_at     TEXT,
    ingested_at      TEXT DEFAULT (datetime('now')),
    UNIQUE(source, url)
);

-- Unresolved mentions — company discovery queue
CREATE TABLE unresolved_mentions (
    id               INTEGER PRIMARY KEY,
    raw_company_name TEXT,
    source           TEXT,
    url              TEXT,
    snippet          TEXT,
    resolution_attempts INTEGER DEFAULT 0,
    created_at       TEXT DEFAULT (datetime('now')),
    resolved_at      TEXT,         -- NULL until resolved
    resolved_company_id INTEGER REFERENCES companies(id)
);

-- Signal time series (append-only — source of truth for all signals)
CREATE TABLE company_signals_log (
    id               INTEGER PRIMARY KEY,
    company_id       INTEGER REFERENCES companies(id),
    signal_type      TEXT,         -- hiring_velocity, github_contributors, exec_departure, form_d, etc.
    value            REAL,
    measured_at      TEXT,         -- when this measurement was taken
    source           TEXT
);

-- Aggregated signals per company (snapshot table used as a materialized view, rebuilt from company_signals_log)
CREATE TABLE company_signals (
    company_id       INTEGER PRIMARY KEY REFERENCES companies(id),
    hiring_velocity  REAL,         -- jobs posted per week, trailing 4 weeks
    hiring_delta     REAL,         -- % change vs prior 4 weeks
    exec_departures  INTEGER,      -- C-suite departures in last 90 days
    github_contributors INTEGER,   -- active contributors, trailing 30 days
    github_delta     REAL,         -- % change vs prior 30 days
    web_traffic_rank INTEGER,      -- SimilarWeb global rank
    last_form_d      TEXT,         -- most recent Form D filing date
    last_press_round TEXT,         -- most recent announced round date
    signal_score     REAL,         -- composite: raise_likelihood 0-100
    distress_score   REAL,         -- composite: trouble_likelihood 0-100
    updated_at       TEXT DEFAULT (datetime('now'))
);

-- Alerts — push notifications for state changes
CREATE TABLE funding_alerts (
    id               INTEGER PRIMARY KEY,
    company_id       INTEGER REFERENCES companies(id),
    alert_type       TEXT,         -- score_threshold, new_form_d, sentiment_shift, new_round
    severity         TEXT,         -- high, medium, low
    message          TEXT,
    triggered_at     TEXT DEFAULT (datetime('now')),
    acknowledged     INTEGER DEFAULT 0
);
```

Key decisions:
- **`round_investors` junction table** replaces `co_investors` JSON — enables "What is [investor] doing?" queries, investor activity tracking, sector heat detection
- **`company_signals_log`** is the append-only source of truth for all signals; `company_signals` is a materialized snapshot rebuilt from it. Time series enables sparklines, delta computation, and backtesting
- **`funding_mentions`** is the ingest landing zone; mentions link to `funding_rounds` via `round_id` after reconciliation
- **`unresolved_mentions`** queue surfaces companies not in the 9.5K registry — turns the pipeline into a company *discovery* mechanism
- **`funding_alerts`** enables push notifications (the VC pitch use case requires push, not just pull)
- **`confidence` on rounds** distinguishes Form D confirmed amounts from TechCrunch rumor pieces
- Signal scores are composites with temporal decay, computed by scorer — not LLM-generated
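
To make the log-versus-snapshot split concrete, here is a sketch that rebuilds `hiring_velocity` and `hiring_delta` from `company_signals_log` using an in-memory SQLite database. The schema is trimmed to the columns needed, the sample rows are made up, and `rebuild_hiring` is a hypothetical helper; a real rebuild would fill every snapshot column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_signals_log (
    id INTEGER PRIMARY KEY, company_id INTEGER, signal_type TEXT,
    value REAL, measured_at TEXT, source TEXT
);
CREATE TABLE company_signals (
    company_id INTEGER PRIMARY KEY, hiring_velocity REAL, hiring_delta REAL
);
""")

# Made-up data for company 1: four weeks at 2 jobs/week, then four at 4 jobs/week.
weeks = ["2026-01-12", "2026-01-19", "2026-01-26", "2026-02-02",
         "2026-02-09", "2026-02-16", "2026-02-23", "2026-03-02"]
values = [2, 2, 2, 2, 4, 4, 4, 4]
conn.executemany(
    "INSERT INTO company_signals_log (company_id, signal_type, value, measured_at, source) "
    "VALUES (1, 'hiring_velocity', ?, ?, 'ats')",
    list(zip(values, weeks)),
)


def rebuild_hiring(conn: sqlite3.Connection, as_of: str) -> None:
    """Recompute the hiring columns from the append-only log (full rebuild)."""
    conn.execute("DELETE FROM company_signals")
    conn.execute("""
        INSERT INTO company_signals (company_id, hiring_velocity, hiring_delta)
        SELECT company_id,
               AVG(CASE WHEN measured_at > date(:as_of, '-28 days') THEN value END),
               100.0 * (AVG(CASE WHEN measured_at > date(:as_of, '-28 days') THEN value END)
                        - AVG(CASE WHEN measured_at <= date(:as_of, '-28 days') THEN value END))
                     / AVG(CASE WHEN measured_at <= date(:as_of, '-28 days') THEN value END)
        FROM company_signals_log
        WHERE signal_type = 'hiring_velocity'
          AND measured_at > date(:as_of, '-56 days')
          AND measured_at <= :as_of
        GROUP BY company_id
    """, {"as_of": as_of})


rebuild_hiring(conn, "2026-03-08")
row = conn.execute("SELECT hiring_velocity, hiring_delta FROM company_signals").fetchone()
```

Because the snapshot is derived, it can be dropped and rebuilt for any `as_of` date, which is what makes backtesting possible. With no prior-window data the delta comes out as NULL rather than a misleading number.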

## Pipeline Architecture

```
                    ┌─────────────┐
                    │  Scheduler  │  (jobs framework — daily/weekly crons)
                    └──────┬──────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
    ┌───────────┐   ┌───────────┐   ┌───────────┐
    │  Ingest   │   │  Ingest   │   │  Ingest   │
    │ Form D    │   │Newsletters│   │ Signals   │
    │ (EDGAR)   │   │(RSS/email)│   │ (ATS/GH)  │
    └─────┬─────┘   └─────┬─────┘   └─────┬─────┘
          │               │               │
          ▼               ▼               ▼
    ┌─────────────────────────────────────────┐
    │           Entity Resolution             │
    │  name + URL/domain → companies.id       │
    │  unresolved → unresolved_mentions queue │
    └────────────────────┬────────────────────┘
                         │
                         ▼
    ┌─────────────────────────────────────────┐
    │        Round Reconciliation             │
    │  cluster mentions ±30d same company     │
    │  merge into canonical funding_round     │
    │  assign confidence tier                 │
    └────────────────────┬────────────────────┘
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
        funding_rounds  funding_mentions  company_signals_log
              │                           │
              ▼                           ▼
    ┌─────────────────────┐    ┌──────────────────┐
    │  Materialize View   │    │  Signal Scorer   │
    │  company_signals    │    │  raise_likelihood│
    │  (from signals_log) │    │  distress_score  │
    └─────────────────────┘    │  + decay + stage │
                               └────────┬─────────┘
             ┌──────────────────────────┘
             ▼
    ┌──────────────────────┐
    │   Alert Engine       │
    │  score thresholds    │
    │  state changes       │
    │  → funding_alerts    │
    │  → push (Slack/email)│
    └────────┬─────────────┘
             ▼
    ┌─────────────────────┐
    │   Intel UI / CLI    │
    │  radar + alerts     │
    │  company timeline   │
    │  investor activity  │
    └─────────────────────┘
```

### Components

| Component            | Location                         | What it does                                                                  |
|----------------------|----------------------------------|-------------------------------------------------------------------------------|
| EDGAR ingester       | `intel/funding/edgar.py`         | Poll EDGAR RSS for Form D filings, parse the Form D XML, distinguish initial vs amendment |
| Newsletter ingester  | `intel/funding/newsletters.py`   | Parse RSS feeds (TC, Crunchbase, HN), extract company + round + sentiment     |
| Signal collector     | `intel/funding/signals.py`       | ATS job feeds, GitHub API, web traffic → append to company_signals_log        |
| Entity resolver      | `intel/companies/resolve.py`     | Name + URL/domain → companies.id. Shared with supply chain. Unresolved → queue |
| Round reconciler     | `intel/funding/reconcile.py`     | Cluster mentions ±30d, merge into canonical rounds, set confidence tiers      |
| Signal scorer        | `intel/funding/scorer.py`        | Composite scoring with decay, stage-awareness, pattern rules                  |
| Alert engine         | `intel/funding/alerts.py`        | Detect state changes, emit alerts, push notifications                         |
| Source health monitor| `intel/funding/health.py`        | Track last successful ingest, error rate, staleness per source                |
| Jobs integration     | `jobs/handlers/funding_intel.py` | Job handler: daily newsletter ingest, weekly signal refresh, daily EDGAR      |

### Entity Resolution (shared capability)

The hardest part. "Anthropic raises $2B" in a TechCrunch headline → company_id 4523. Must handle both modes:

- **Batch mode**: Processing a corpus — 500 articles, 50 VC portfolio pages, supply chain graph. Thousands of mentions in one run.
- **Streaming mode**: Daily ingest — 20-30 mentions from RSS. Per-mention is fine.

Resolution cascade:
1. **URL/domain match** — extract URLs/domains from mention context, match against `companies.domain` (high precision, limited recall)
2. **Exact match** — company name in `companies.name` (~35-40% hit rate against real newsletter text — lower than you'd think because of product names, abbreviations, "the SF-based fintech")
3. **Fuzzy match** — Levenshtein/trigram against `companies.name` + `company_identifiers` (~20-25%)
4. **LLM batch** — unmatched mentions batched 50-100 at a time to flash, **with surrounding sentence/paragraph context** (not just bare names): "Given these company mentions with context, return the canonical name" (~15%). Context is essential for disambiguation ("Mercury" alone is ambiguous; "Mercury, the banking startup" is not)
5. **Cache resolved** — all matches cached in `lookup_cache` so we never re-resolve
6. **Unresolved queue** — anything that fails all steps goes to `unresolved_mentions` for human review or future matching. These are often the most interesting — new companies not yet tracked

**Measurement harness**: Log every resolution with method used (url/exact/fuzzy/llm), confidence, and company_id. Spot-check weekly. This is how you know if hit rates are improving.
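
The first three cascade steps, plus the method label the harness would log, can be sketched as below. The in-memory `COMPANIES` dict stands in for the `companies` table, `difflib` stands in for Levenshtein/trigram matching, and the 0.85 threshold is an arbitrary assumption. The LLM batch, cache, and queue steps are omitted.

```python
from difflib import SequenceMatcher
from typing import Optional
from urllib.parse import urlparse

# Toy registry standing in for the companies table: id -> (name, domain).
COMPANIES = {
    4523: ("Anthropic", "anthropic.com"),
    101: ("Stripe", "stripe.com"),
}


def resolve(name: str, url: Optional[str] = None,
            fuzzy_threshold: float = 0.85) -> tuple[Optional[int], str]:
    """Return (company_id, method); method is what the harness would log."""
    # 1. URL/domain match
    if url:
        host = urlparse(url).netloc.lower().removeprefix("www.")
        for cid, (_, domain) in COMPANIES.items():
            if host == domain:
                return cid, "url"
    # 2. Exact (case-insensitive) name match
    wanted = name.strip().lower()
    for cid, (cname, _) in COMPANIES.items():
        if cname.lower() == wanted:
            return cid, "exact"
    # 3. Fuzzy match; difflib's ratio is a stand-in for Levenshtein/trigram
    best_id, best_score = None, 0.0
    for cid, (cname, _) in COMPANIES.items():
        score = SequenceMatcher(None, wanted, cname.lower()).ratio()
        if score > best_score:
            best_id, best_score = cid, score
    if best_score >= fuzzy_threshold:
        return best_id, "fuzzy"
    # Steps 4-6 (LLM batch, cache, queue) are out of scope for this sketch
    return None, "unresolved"
```

Counting the returned method labels per week gives the hit rate for each cascade step, which is the number the spot-check is meant to track.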

**Realistic estimates**: Budget 5-7 days for entity resolver with measurement harness, not 2-3. The cascade is correct architecture but real-world text is messier than you expect: "ChatGPT" not "OpenAI", "Waymo" not "Alphabet subsidiary", "payments giant Stripe" not "Stripe, Inc."

Lives in `intel/companies/resolve.py` — shared by funding pipeline AND supply chain graph building. Build this first; it unblocks both.

### Round Reconciliation

The critical step between raw mentions and confirmed rounds. Same Series B reported by TechCrunch (March 5), Axios (March 6), and EDGAR Form D (March 12) — three sources, one round. Round types also get reported inconsistently ("Series B" vs "growth round" vs "late-stage round").

Algorithm:
1. Group mentions by company within ±30 day windows
2. Within each window, cluster by overlapping amounts/investors/round type
3. Merge cluster into single `funding_round` with best available data from each source
4. Set confidence tier: confirmed (Form D + press), reported (press only), rumored (single source/vague)
5. Handle syndication dedup — same press release echoed across 3+ feeds shouldn't inflate mention counts

Lives in `intel/funding/reconcile.py`. This is a whole component — not a simple UNIQUE constraint.
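
A minimal sketch of steps 1 and 4, using the three-source example above with a made-up company, year, and amount. It anchors each window on the cluster's first mention (a simplification of ±30 days), skips the amount/investor clustering in step 2, and treats a lone Form D as `reported`, which is an assumption: the tiers above do not cover that case.

```python
from datetime import date

# Made-up mentions of one Series B, mirroring the example above.
mentions = [
    {"company_id": 1, "source": "techcrunch", "date": date(2026, 3, 5), "amount_usd": 50e6},
    {"company_id": 1, "source": "axios", "date": date(2026, 3, 6), "amount_usd": 50e6},
    {"company_id": 1, "source": "form_d", "date": date(2026, 3, 12), "amount_usd": 50e6},
]


def cluster_mentions(mentions: list[dict], window_days: int = 30) -> list[list[dict]]:
    """Group mentions per company; open a new cluster once the window is exceeded."""
    clusters: list[list[dict]] = []
    for m in sorted(mentions, key=lambda m: (m["company_id"], m["date"])):
        last = clusters[-1] if clusters else None
        if (last and last[0]["company_id"] == m["company_id"]
                and (m["date"] - last[0]["date"]).days <= window_days):
            last.append(m)
        else:
            clusters.append([m])
    return clusters


def confidence_tier(cluster: list[dict]) -> str:
    """Map the set of distinct sources in a cluster to a confidence tier."""
    sources = {m["source"] for m in cluster}  # a set, so repeats of one feed count once
    press = sources - {"form_d"}
    if "form_d" in sources and press:
        return "confirmed"   # Form D + press
    if "form_d" in sources or len(press) >= 2:
        return "reported"    # lone Form D handled here by assumption
    return "rumored"         # single press source
```

The three sample mentions collapse into one cluster tagged `confirmed`. Syndicated copies under different feed names would still need content-level dedup before this step.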

## Scoring Heuristics

Rules-based first, ML later. Three layers: base signals, temporal decay, and compound patterns.

### Base Signals

| Signal                              | Raise likelihood (+) | Distress (+)          |
|-------------------------------------|----------------------|-----------------------|
| Form D filed, no press release      | +20 (stealth raise)  | —                     |
| Form D filed + press release        | +30 (announced)      | —                     |
| Hiring surge (>2x baseline, 60+ days) | +25                | —                     |
| Hiring freeze (>50% drop, 60+ days) | —                    | +30                   |
| C-suite departure                   | —                    | +25 per exec          |
| Positive news mentions (3+ in 30d, deduplicated) | +15     | —                     |
| Negative mentions (layoff, pivot)   | —                    | +20                   |
| GitHub contributor surge (dev tools)| +10                  | —                     |
| GitHub gone quiet (dev tools)       | —                    | +10                   |

**Change from original**: Form D alone no longer contributes to distress. The Form D is evidence of *raising*, not distress. Only the *combination* of Form D + other distress signals (hiring freeze, exec departures) indicates trouble.

**Hiring freeze caveat**: Requires 60+ day sustained window to avoid false positives from seasonal slowdowns (December) or post-batch-hiring completion.

### Temporal Decay

All signals decay exponentially: `score *= exp(-days_since_signal / 30)`. A Form D from yesterday keeps about 97% of its weight; one filed 25 days ago keeps about 43%. Decay is applied continuously within trailing windows, not just at window boundaries.
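
The formula translates directly into a one-line helper. `decayed` and `TAU_DAYS` are names chosen for this sketch; the 30-day constant comes from the formula above and implies a half-life of 30·ln 2, roughly 21 days.

```python
import math

TAU_DAYS = 30  # time constant from the decay formula


def decayed(points: float, days_since_signal: float) -> float:
    """Apply exponential decay to a signal's base points."""
    return points * math.exp(-days_since_signal / TAU_DAYS)


# Half-life implied by the time constant: points halve every ~20.8 days.
HALF_LIFE_DAYS = TAU_DAYS * math.log(2)
```

Applied to the +20 "Form D filed, no press release" signal, `decayed(20, 1)` is a little over 19 points and `decayed(20, 25)` a little under 9.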

### Compound Pattern Rules

Where the real insight lives — these are what impress VCs:

| Pattern                                      | Result                                     |
|----------------------------------------------|--------------------------------------------|
| Form D + hiring surge + positive press       | **Confirmed raise** — score 90+, high confidence |
| Form D + no press + hiring freeze            | **Distress raise** — raise 60, distress 70 |
| Hiring freeze + exec departure + negative press | **Pre-layoff** — distress 85+           |
| Form D + hiring surge (no other signals)     | **Synergy bonus** +10 (super-additive)     |
| WARN notice filed                            | **Confirmed distress** — distress 90+      |
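
The pattern table can be expressed as an ordered rule list, sketched below. The signal labels (`no_press`, `positive_press`, and so on) and the precedence order, with WARN checked first, are assumptions for illustration; the table itself does not specify which pattern wins when several match.

```python
from typing import Optional

# (required signals, pattern label), checked in order. Order is an assumption.
PATTERNS = [
    ({"warn_notice"}, "confirmed_distress"),
    ({"form_d", "hiring_surge", "positive_press"}, "confirmed_raise"),
    ({"form_d", "no_press", "hiring_freeze"}, "distress_raise"),
    ({"hiring_freeze", "exec_departure", "negative_press"}, "pre_layoff"),
]


def match_pattern(signals: set[str]) -> Optional[str]:
    """Return the first pattern whose required signals are all present."""
    for required, label in PATTERNS:
        if required <= signals:  # subset test: every required signal is active
            return label
    # Synergy bonus applies only when these two are the sole active signals.
    if signals == {"form_d", "hiring_surge"}:
        return "synergy_bonus"
    return None
```

Keeping the rules as data rather than nested conditionals makes it cheap to add or reorder patterns once backtesting shows which ones hold up.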

### Stage-Aware Weights

Signals mean different things at different stages. Use `round_type` history to classify:

| Signal          | Early stage (<Series B)           | Late stage (≥Series B)           |
|-----------------|-----------------------------------|----------------------------------|
| Hiring freeze   | May be normal (founder heads-down) | Red flag                        |
| Exec departure  | Less meaningful (small team)       | Significant (2x weight)         |
| GitHub activity | Strong signal (dev tools)          | Less relevant                   |
| Form D amount   | Any amount notable                 | Only large amounts ($50M+) notable |
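
As a sketch of how stage could feed into a weight, the code below classifies a company from its `round_type` history and applies the 2x late-stage multiplier to the +25 per-exec distress signal. The `LATE_STAGE` set and the early-stage weight of 1.0 are assumptions; the table only says early-stage departures are "less meaningful".

```python
# Round types treated as Series B or later. Values beyond series_b are assumed.
LATE_STAGE = {"series_b", "series_c", "series_d", "series_e", "series_f"}


def classify_stage(round_types: list[str]) -> str:
    """Late stage if any recorded round is Series B or later."""
    return "late" if LATE_STAGE & set(round_types) else "early"


def exec_departure_points(departures: int, stage: str, base: float = 25.0) -> float:
    """Distress points for C-suite departures, doubled for late-stage companies."""
    weight = 2.0 if stage == "late" else 1.0  # early-stage weight is an assumption
    return departures * base * weight
```

A company with no round history falls into `early` here, which may or may not be the right default once real data arrives.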

### Normalization

- Hiring deltas normalized by company size/employee count (a 5-person startup hiring 5 is +100%; a 10K-person company hiring 100 is +1%)
- Mention counts deduplicated across syndicated sources before counting
- Scores 0-100, capped. Both scores can be nonzero simultaneously (emergency round = high on both)
- **Base rates**: After first full pipeline run, compute score distribution. If 80% of companies score 40-60, thresholds need adjustment. Consider percentile ranks
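
Two of these normalizations are small enough to sketch directly: hiring relative to headcount, and percentile ranks as an alternative to raw thresholds. Function names are hypothetical, and the percentile definition used (share of companies scoring at or below a value) is one of several reasonable choices.

```python
from bisect import bisect_right


def hiring_rate(new_postings: int, headcount: int) -> float:
    """New postings as a fraction of current headcount."""
    return new_postings / headcount if headcount > 0 else 0.0


def percentile_ranks(scores: dict[int, float]) -> dict[int, float]:
    """Map company_id -> percent of companies scoring at or below it."""
    ordered = sorted(scores.values())
    n = len(ordered)
    return {
        cid: 100.0 * bisect_right(ordered, value) / n
        for cid, value in scores.items()
    }
```

If most raw scores bunch between 40 and 60, the percentile view still spreads companies across 0-100, so a "top 10%" alert keeps working without retuning thresholds.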

### Future: ML

After 6+ months of data, train XGBoost on backtested data (features: signal deltas, mention patterns; labels: actual rounds from Crunchbase/EDGAR). Expected +20-30% accuracy over rules-based.

## UI Integration

Two views in `kb.localhost/intel`:

1. **Radar** — sortable table: company, raise_likelihood, distress_score, latest signals, sparkline of mentions over time. Filter by sector/tag/universe.
2. **Company detail** — timeline of funding_rounds + funding_mentions on existing company page. Signal scores shown as gauges.

## CLI

```bash
intel funding radar --min-raise 50          # who's likely raising
intel funding distress --min-score 40       # who's in trouble
intel funding ingest --source edgar         # manual ingest trigger
intel funding resolve --batch mentions.json # bulk entity resolution
```

## Phasing

| Phase | What                                              | Effort    | Unlocks                                    |
|-------|---------------------------------------------------|-----------|--------------------------------------------|
| 1a    | Entity resolver with measurement harness          | 5-7 days  | Unblocks funding + supply chain            |
| 1b    | Schema + EDGAR ingester + TechCrunch RSS          | 2-3 days  | First funding events flowing               |
| 1c    | Round reconciliation + dedup                      | 2-3 days  | Clean funding_rounds from noisy mentions   |
| 1d    | Signal scorer (with decay + patterns) + radar CLI | 2-3 days  | "Who's raising?" answerable                |
| 1e    | Alert engine (CLI-based initially)                | 1-2 days  | Push notifications for state changes       |
| 2a    | ATS job feeds (Greenhouse/Lever/Ashby) + GitHub   | 1 week    | Hiring + dev activity signals              |
| 2b    | Job posting content analysis (title-level LLM)    | 2-3 days  | M&A prep, IPO prep, restructuring signals  |
| 2c    | Investor tracking (round_investors queries)       | 2-3 days  | "What is Sequoia doing?" answerable        |
| 3     | UI integration (radar + timeline + investor view) | 1 week    | Visual demo for VC pitch                   |
| 4     | Backtesting + base rate calibration               | 3-5 days  | Calibrated weights, score distributions    |

**Effort note**: Phase 2 is likely 1.5-2x the original estimate. Entity resolver at 5-7 days (not 2-3) is the corrected estimate — real-world text is messier than anticipated.

## Resolved Questions

1. **Newsletter parsing** — RSS first. Email-only newsletters (StrictlyVC) via Zapier forwarding to a webhook, not tempmail (fragile).
2. **Form D latency** — EDGAR RSS daily is fine for MVP. Full API ($0.01/query) only if sub-day latency becomes critical.
3. **Job data source** — ATS feeds (Greenhouse/Lever/Ashby) over LinkedIn scraping. Free, legal, hourly, cleaner data.
4. **Overlap with news_mentions** — Keep `funding_mentions` separate (different schema, confidence model, reconciliation needs).

## Open Questions

1. **Distress signal sensitivity** — High distress scores are reputationally dangerous if wrong. Should high-distress items require human review (`status = 'pending_review'`) before surfacing in UI?
2. **Historical backfill** — `hiring_delta` and `github_delta` need 8+ weeks of baseline data. Backfill from Crunchbase/EDGAR for last 2 years? How much effort?
3. **Portfolio mapping prerequisite** — The VC pitch use case ("Your portfolio co X filed a Form D") requires knowing which companies are in which VC's portfolio. This is a separate data problem (Crunchbase + VC website scraping). Already partially built in VC intel work — need to connect.
4. **Source health degradation** — If TechCrunch goes down, should scores degrade gracefully (ignore missing signal) or flag staleness? Leaning: degrade gracefully + staleness warning in UI.

## Multi-Model Review Notes

Design reviewed by Gemini, Grok-reasoning, and Opus via vario ng model_debate. Key convergent findings incorporated:

- **Must-fix** (all 3 agreed): naive UNIQUE constraint, missing signal history, no temporal decay, round reconciliation needed as an explicit stage, entity resolution lacking URL/domain matching and too optimistic on hit rates, missing confidence tiers
- **High-value additions**: alerting/push layer, round_investors junction table, ATS over LinkedIn, job posting content analysis, unresolved mentions as discovery, stage-aware scoring, compound pattern rules, WARN notices
- **Scoring verdict**: reasonable starting point but needs decay, normalization by company size, interaction bonuses, base rate calibration
- **Effort correction**: Phase 2 estimates are optimistic by 1.5-2x; the entity resolver needs 5-7 days, not 2-3
