# Newsflow Buildout — Topic-Driven News Intelligence

**Date**: 2026-02-21
**Priority**: High — core capability gap
**Status**: Planning

## Problem

Newsflow has solid plumbing (Serper discovery, fetch/extract/score pipeline, jobs dashboard) but no way to:
1. **Define a topic from a description** — currently requires hand-written search queries
2. **Browse a topic feed** — results are buried in the jobs dashboard item tables
3. **Get a synthesis** — no summary/digest of what's new on a topic
4. **Share a feed** — no external-facing URL for a topic's news stream

Adding "the emerging future of software" today required manually writing 5 search queries. A user should be able to say "track this topic" and get a living, browsable news feed.

## Current State

| Component | Status | Gap |
|-----------|--------|-----|
| Discovery (Serper search/news/videos) | ✅ Built | Queries are manual |
| Fetch + extract + score pipeline | ✅ Built | Score stage is pass-through |
| Storage (SQLite tracker + raw files) | ✅ Built | No read-friendly layer |
| Jobs dashboard (item tables) | ✅ Built | Not a news reader |
| `add-url` CLI | ✅ Built | One-off, not topic-driven |
| Auto-diagnosis on errors | ✅ Built | — |
| RSS/Atom feed polling | 🔲 Missing | Free, structured, fastest source |
| Topic → auto-queries | 🔲 Missing | |
| Browsable topic feed UI | 🔲 Missing | |
| Daily/weekly digest | 🔲 Missing | |
| Relevance scoring (LLM) | 🔲 Missing | Score stage is stub |
| Deduplication (semantic) | 🔲 Missing | URL-only dedup today |

## Design

### Phase 1: Topic Intelligence (query generation + scoring)

**Goal**: Define a topic with a description, get smart queries and relevance scoring.

#### 1a. Topic → Query Generation

```yaml
# New topic format in jobs.yaml:
- name: future_of_software
  description: "The emerging future of software — AI coding agents, vibe coding, whether software engineers will be replaced, new development paradigms"
  # queries auto-generated from description on first discovery run
  auto_queries: true
  frequency: daily
  backfill_months: 6
```

**Implementation**:
- `generate_topic_queries(name, description) → list[dict]` in `jobs/lib/discovery.py`
- Uses LLM (haiku — cheap, fast) to generate 3-7 search queries from description
- Queries optimized for Serper news/search (short, keyword-focused, no boolean operators)
- Caches generated queries in `jobs/data/newsflow_topics/{name}/queries.yaml`
- Re-generates monthly or on description change
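The cache-and-regenerate logic above can be sketched as follows. This is a minimal sketch, not the real implementation: the LLM call is passed in as a `generate` callable (hypothetical signature), and JSON stands in for the YAML cache file to keep the sketch dependency-free.

```python
import hashlib
import json
import time
from pathlib import Path

REGEN_AFTER_S = 30 * 24 * 3600  # re-generate monthly

def cached_queries(name, description, generate,
                   root=Path("jobs/data/newsflow_topics")):
    """Return cached queries, regenerating when the description changes
    or the cache is older than a month. `generate(name, description)`
    is the LLM call (stubbed here)."""
    cache = root / name / "queries.json"  # plan uses queries.yaml; JSON keeps this stdlib-only
    desc_hash = hashlib.sha256(description.encode()).hexdigest()[:12]
    if cache.exists():
        data = json.loads(cache.read_text())
        fresh = time.time() - cache.stat().st_mtime < REGEN_AFTER_S
        if data.get("description_hash") == desc_hash and fresh:
            return data["queries"]
    queries = generate(name, description)
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps({"description_hash": desc_hash,
                                 "queries": queries}))
    return queries
```

Hashing the description means a topic edit invalidates the cache immediately, without waiting for the monthly refresh.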

**Cost**: ~$0.001 per topic generation (haiku). Negligible.

#### 1a′. Topic Probing — Auto-Configure from Volume

Before committing to a backfill strategy, **probe the topic** to understand its density and auto-set config. Run at topic creation time or on first discovery.

```python
async def probe_topic(name, queries) -> dict:
    """Probe Serper to characterize a topic's volume and depth.

    Returns:
        density: "sparse" | "moderate" | "saturated"
        estimated_total: rough article count across all time
        oldest_relevant: approximate date of earliest meaningful result
        recommended_config: {backfill_months, backfill_limit, backfill_queries, frequency}
    """
```

**How it works**:
1. Run each query against Serper with no date filter, `num_results=10` — count total results reported
2. Run the same queries with `after:2025-01-01` and `after:2024-01-01` — compare volume by era
3. Classify density:
   - **Sparse** (<500 total results): niche company/person. Config: `backfill_months: 36`, no limit, same queries. Want *everything*.
   - **Moderate** (500-5,000): established topic, manageable. Config: `backfill_months: 12`, `backfill_limit: 200`.
   - **Saturated** (5,000+): mainstream topic. Config: `backfill_months: 6`, `backfill_limit: 100`, generate broader `backfill_queries` via LLM.
4. Detect temporal shape: is volume growing (trending topic), steady (evergreen), or spiking (news event)?
5. Store probe results in `jobs/data/newsflow_topics/{name}/probe.yaml` — re-probe quarterly

**Cost**: 4-6 Serper calls per topic (~$0.01). Run once.
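The density classification in step 3 reduces to a small threshold function. A sketch, with tiers and config values taken from the list above (the returned keys are illustrative, not a fixed schema):

```python
def classify_density(total_results: int) -> dict:
    """Map a Serper total-result count to a density tier and
    recommended backfill config (thresholds per the plan above)."""
    if total_results < 500:
        # niche company/person: fetch everything
        return {"density": "sparse", "backfill_months": 36,
                "backfill_limit": None}
    if total_results < 5000:
        # established topic, manageable volume
        return {"density": "moderate", "backfill_months": 12,
                "backfill_limit": 200}
    # mainstream topic: cap and broaden
    return {"density": "saturated", "backfill_months": 6,
            "backfill_limit": 100, "broaden_backfill_queries": True}
```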

**Example probes**:

| Topic | Serper total | Density | Config |
|-------|-------------|---------|--------|
| "MECCO Group" | ~120 | sparse | 36mo, no limit, same queries |
| "Roivant Sciences" | ~2,400 | moderate | 12mo, limit 200 |
| "AI replacing software engineers" | ~50,000 | saturated | 6mo, limit 100, broader backfill queries |

**Integration**: `probe_topic()` runs automatically when a topic is added with `auto_configure: true` (or always on first discovery). Writes recommended config back to a cache file. Discovery reads cached config, falling back to topic's YAML config if no probe exists.

#### 1a″. RSS/Atom Feed Monitoring — The Cheapest High-Signal Source

RSS and Atom feeds are often the best answer for news monitoring: free, structured, low-latency, and zero API budget. Many high-value sources publish feeds that give you headlines + summaries within minutes of publication — no scraping, no rate limits, no legal gray areas.

**Why RSS before Serper for known sources**:
- **Free** — zero cost per poll, unlimited frequency
- **Structured** — title, date, summary, URL already parsed (no LLM extraction needed)
- **Fast** — poll every 15 min if you want, no rate limits
- **Reliable** — feed format rarely changes vs web scraping breaking on redesigns
- **Legal** — explicitly published for consumption

**Implementation**: Feed fetching goes through `lib/ingest` (it already handles URL fetch + content extraction). The only newsflow-specific logic is "what's new since last check" dedup — that stays in `jobs/lib/discovery.py` where URL tracking already lives. No new library needed.

**Topic config with feeds**:
```yaml
- name: ai_funding
  description: "AI startup funding rounds, VC activity"
  feeds:
    - https://techcrunch.com/feed/           # broad tech, filter by relevance
    - https://news.crunchbase.com/feed/      # funding-focused
    - https://feeds.feedburner.com/nvca      # VC industry
    - https://hnrss.org/show                 # HN Show launches
  auto_queries: true   # Serper supplements feeds for sources without RSS
  frequency: daily
```

**Feed discovery**: When adding a topic, auto-discover RSS feeds:
1. For known sources (TechCrunch, Crunchbase, HN, ArXiv, SEC), use curated feed URLs
2. For company-specific topics, check `{company_url}/feed`, `/rss`, `/atom.xml`
3. LLM can suggest feeds for a topic description ("what RSS feeds cover AI funding?")
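Step 2 is just URL construction plus a fetch-and-validate loop. A sketch of the candidate generation, using the three common locations named above (a caller would fetch each via `lib/ingest` and keep the first that parses as a feed):

```python
from urllib.parse import urljoin

# the common feed locations checked for company-specific topics
FEED_PATHS = ["feed", "rss", "atom.xml"]

def candidate_feed_urls(site_url: str) -> list[str]:
    """Candidate feed URLs for a company site, in check order."""
    base = site_url.rstrip("/") + "/"
    return [urljoin(base, p) for p in FEED_PATHS]
```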

**Integration with existing pipeline**:
- Feed items enter the same pipeline as Serper results → relevance scoring → storage
- Feed polling as a discovery strategy alongside `serper_search` in `jobs/lib/discovery.py`
- Serper fills gaps where feeds don't exist or for ad-hoc/long-tail queries

**Key feeds by domain**:

| Domain          | Feed URL pattern                     | Signal quality |
|-----------------|--------------------------------------|----------------|
| TechCrunch      | `techcrunch.com/feed/`               | High (funding, launches) |
| Crunchbase News | `news.crunchbase.com/feed/`          | High (rounds, trends) |
| Hacker News     | `hnrss.org/newest?q=KEYWORD`         | Medium (filtered) |
| ArXiv           | `export.arxiv.org/api/query?...`     | High (research) |
| SEC EDGAR       | `efts.sec.gov/LATEST/search-index?...` | High (filings) |
| YouTube channels| Via `yt-dlp --flat-playlist` or API  | Medium (investor talks) |
| Substack        | `{author}.substack.com/feed`         | Varies by author |
| Company blogs   | `{company}/feed` or `/blog/rss`      | High for company-specific |

**Cost**: Zero. This is the single highest-ROI addition to newsflow.

#### 1b. Relevance Scoring

Replace the pass-through score stage with LLM relevance scoring:

```python
async def score_article(title, snippet, topic_description) -> dict:
    """Score article relevance to topic. Returns score 0-100 + reason."""
```

- Uses haiku/flash-lite (cheapest model that can judge relevance)
- Scores: 0-30 (noise), 30-60 (tangential), 60-80 (relevant), 80-100 (core)
- Articles scoring <30 get their `_priority` lowered (still stored, just sorted down)
- Score stored in results table, shown in dashboard

**Cost**: ~$0.002 per article. At 100 articles/day across all topics = $0.20/day.
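The score bands above map to buckets with a small helper; the dashboard badge and the `_priority` deprioritization can both key off it. One assumption: boundary values (30, 60, 80) go to the higher bucket.

```python
def score_bucket(score: int) -> str:
    """Label a 0-100 relevance score per the bands above.
    Boundary values go to the higher bucket (an assumption)."""
    if score < 30:
        return "noise"       # deprioritized, still stored
    if score < 60:
        return "tangential"
    if score < 80:
        return "relevant"
    return "core"
```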

#### 1c. Curated Seed Collection (replacing keyword backfill)

Keyword backfill is wrong for broad topics — "AI coding" returns thousands of articles, mostly noise. Instead: **ask the LLM to identify the defining works**, then fetch those specifically.

```python
async def seed_topic(name, description) -> list[dict]:
    """Generate a curated reading list for a topic.

    Returns ~50-100 URLs: the most important articles, papers, talks,
    and blog posts that define the current state of this topic.
    """
```

- LLM (sonnet) with web search grounding generates a ranked reading list
- Organized by subtopic/theme (e.g., "foundational arguments", "counter-arguments", "real-world examples")
- Each entry: URL, title, why it matters (1 sentence), approximate date
- Seeded as work items with high priority so they're fetched first
- Run once per topic at creation time, optionally refreshed quarterly
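Turning the curated list into high-priority work items is mechanical. A sketch, where the field names (`_priority`, `stage`, `note`) only loosely mirror the pipeline's real work-item shape, which lives in the jobs code:

```python
def seed_work_items(seeds: list[dict], topic: str) -> list[dict]:
    """Convert a ranked reading list into prioritized work items so
    curated seeds are fetched first, in rank order."""
    return [
        {
            "url": s["url"],
            "topic": topic,
            "stage": "fetch",
            "_priority": 100 - rank,  # rank 0 fetched first
            "note": s.get("why", ""),  # the one-sentence "why it matters"
        }
        for rank, s in enumerate(seeds)
    ]
```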

**Why this is better than keyword backfill**:
- 50 curated articles > 5,000 keyword matches
- Captures seminal pieces that may not rank for current keywords
- Organized by theme, not by search query
- Quality floor: every article was selected for a reason

**Example for "future of software"**:
- Andrej Karpathy's "vibe coding" tweet/thread
- "Is AI Coming for Software Engineers?" (Atlantic, Bloomberg, etc.)
- Cognition Labs' Devin announcement + reactions
- Cursor/Windsurf/Claude Code adoption stories
- Stack Overflow developer survey on AI usage
- Counter-arguments: "Why AI won't replace programmers"

### Phase 2: Browsable Feed UI

**Goal**: A reader-friendly UI for browsing topic feeds.

#### 2a. Feed Page (Gradio tab or standalone app)

**Option A: Gradio tab in jobs dashboard**
- New "Feed" tab alongside Backfill/Live
- Left sidebar: topic list with unread counts
- Main area: article cards (title, source, date, relevance score, snippet)
- Click → full extracted text in panel
- Filter by: date range, relevance threshold, source type

**Option B: Standalone Gradio app**
- `pulse.localhost` — separate app (port 7870)
- Same layout, but independent of the jobs dashboard
- Can be shared externally via cloudflared

**Recommendation**: Option B — separate app, cleaner separation of concerns.

#### 2b. Article Cards

Each article card shows:
- **Title** (linked to original URL)
- **Source domain** + publication date
- **Relevance score** badge (color-coded: green/yellow/grey)
- **Snippet** (first 2-3 sentences of extracted text)
- **Topic** tag
- **Read/unread** state

### Phase 3: Digest & Synthesis

**Goal**: Daily/weekly summaries of what's new on each topic.

#### 3a. Daily Digest Generation

- Cron-triggered (or runner guard): at end of day, collect all articles scored >60 for each topic
- LLM synthesis: "Summarize today's key developments on [topic]"
- Output: markdown digest stored in `jobs/data/newsflow_topics/{name}/digests/YYYY-MM-DD.md`
- Pushover notification with digest summary
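The collection step can be sketched as a single query plus a path helper. The table and column names here (`results`, `topic`, `score`, `discovered_at`) are assumptions; the real tracker schema may differ.

```python
import sqlite3
from pathlib import Path

def collect_digest_articles(conn: sqlite3.Connection,
                            topic: str, day: str) -> list[tuple]:
    """Pull one day's articles scored >60 for a topic, best first.
    Schema names are illustrative, not the actual tracker schema."""
    return conn.execute(
        "SELECT title, url, score FROM results "
        "WHERE topic = ? AND score > 60 AND date(discovered_at) = ? "
        "ORDER BY score DESC",
        (topic, day),
    ).fetchall()

def digest_path(topic: str, day: str) -> Path:
    """Digest location per the layout above."""
    return Path(f"jobs/data/newsflow_topics/{topic}/digests/{day}.md")
```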

#### 3b. Weekly Roundup

- Aggregate daily digests into weekly themes
- Identify: what's new this week, what's trending, what's fading
- Compare with previous weeks: "AI coding agents discussion volume up 40%"

#### 3c. Flow Auto-Calibration

The system should continuously understand **how much is coming in** per topic and adapt. A topic that was quiet last month might be trending now (news event, product launch), or vice versa.

**Mechanism**: On each discovery run, record items-found-per-topic. Weekly, query the last 7 days of discovery data per topic and auto-adjust:

```python
async def calibrate_topic(name, conn) -> dict:
    """Assess recent inflow and recommend config adjustments.

    Queries last 7 days of discovered items for this topic.
    Returns recommended changes to frequency, scoring threshold, alerts.
    """
```

**Calibration rules**:

| 7-day volume | Classification | Auto-action |
|-------------|----------------|-------------|
| 0-5 | Quiet | Switch to weekly if daily. No digest needed. |
| 5-20 | Normal | Keep current frequency. Daily digest if scored. |
| 20-50 | Active | Ensure daily frequency. Raise score threshold to reduce noise. |
| 50-100 | Busy | Daily frequency + tighter relevance filter (score >50 instead of >30). |
| 100+ | Firehose | Alert operator. Auto-raise score threshold to >70. Consider splitting into sub-topics. |
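The rule table above encodes directly as a threshold function. A sketch, assuming boundary values go to the higher tier and using illustrative keys (the real calibration output would be whatever `calibration.yaml` stores):

```python
def calibrate(volume_7d: int) -> dict:
    """Map 7-day discovered-item volume to the tiers above."""
    if volume_7d < 5:
        return {"class": "quiet", "frequency": "weekly", "digest": None}
    if volume_7d < 20:
        return {"class": "normal", "frequency": "current", "digest": "daily"}
    if volume_7d < 50:
        return {"class": "active", "frequency": "daily",
                "raise_threshold": True}
    if volume_7d < 100:
        return {"class": "busy", "frequency": "daily",
                "score_threshold": 50}
    # firehose: alert the operator, surface only the best
    return {"class": "firehose", "alert": True, "score_threshold": 70,
            "consider_splitting": True}
```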

**What gets adjusted**:
- **Discovery frequency**: daily ↔ weekly (no point polling daily for a topic that produces 2 articles/week)
- **Score threshold for digests**: quiet topics include everything; busy topics only surface the best
- **Alert on volume change**: "future_of_software: volume up 3× this week (was 15/wk, now 48/wk)" — could mean a news event worth attention
- **Digest cadence**: quiet topics get weekly digests, active topics get daily

**Storage**: `jobs/data/newsflow_topics/{name}/calibration.yaml` — updated weekly by a runner guard or scheduled task. Includes history for trend detection.

**Why this matters**: Without calibration, every topic gets the same treatment. A daily poll on "MECCO Group" wastes Serper calls (nothing new). A weekly poll on "AI coding agents" during a hype cycle misses 80% of content. The system should tune itself to the actual flow rate of each topic.

### Phase 4: Advanced

- **Semantic deduplication** — embeddings + clustering to collapse duplicate stories
- **Cross-topic connections** — "This semiconductor supply chain article is also relevant to your future_of_software topic"
- **Alert on novelty** — "First mention of [concept] in this topic — new signal?"
- **GDELT integration** — high-volume global event detection (from design/newsflow_scaling.md)

## File Changes

### Phase 1

| File | Change |
|------|--------|
| `jobs/lib/discovery.py` | Add `generate_topic_queries()`, modify `serper_search` to auto-generate |
| `jobs/handlers/company_research.py` | Replace stub score stage with LLM relevance scoring |
| `jobs/jobs.yaml` | Add `description` + `auto_queries` fields to topic format |
| `jobs/data/newsflow_topics/` | New directory for cached queries + digests |

### Phase 2

| File | Change |
|------|--------|
| `jobs/newsflow_app.py` | New Gradio app for browsable feed |
| `jobs/newsflow_feed.py` | Feed data layer: query articles, render cards |
| `infra/Caddyfile` | Add `pulse.localhost` reverse proxy |

### Phase 3

| File | Change |
|------|--------|
| `jobs/newsflow_digest.py` | Digest generation (daily + weekly) |
| `jobs/jobs.yaml` | Add digest guard or scheduled task |

## Verification

1. Add topic with description only → queries auto-generated
2. Discovery picks up articles → relevance scored
3. Feed UI shows articles sorted by relevance and date
4. Daily digest generated with meaningful summary
5. Pushover notification with digest highlights
