# Newsflow Scaling Design

## Current State

| Aspect           | Current                                      | Bottleneck                                   |
|------------------|----------------------------------------------|----------------------------------------------|
| Topics monitored | 21 (companies, investors, themes)            | Linear growth in Serper API calls            |
| Queue            | 6,063 items (95% pending, 4.5% done)         | Processing slower than discovery             |
| Fetch workers    | 10 concurrent, 600/hour max                  | Proxy/paywall handling is slow (2-10s/item)  |
| Discovery        | Serper API (~60 req/min)                     | Rate limited, one pass/topic/day             |
| Storage          | File-per-item under `companies/{co}/`        | Git LFS pressure at scale                    |

## Goals

1. **Scale query volume** — Monitor 100+ topics efficiently
2. **Group related topics** — Batch queries, share results across overlapping interests
3. **Verification / double-sourcing** — Corroborate important findings from multiple sources
4. **Reduce cost** — Smart filtering before expensive extraction

---

## News APIs Comparison

### Tier 1: Enterprise (Best Features, Higher Cost)

| API | Free Tier | Paid | Sources | Key Features |
|-----|-----------|------|---------|--------------|
| **NewsAPI.ai** | 2,000 tokens | $90+/mo | Global | Full-text, AI enrichment |
| **Perigon** | 15-day trial | Enterprise | 150,000+ | Clustering, sentiment, real-time push |
| **NewsCatcher** | 15k/mo (RapidAPI) | $399/mo+ | 70,000+ | Built-in dedup, clustering, NER |
| **AYLIEN** | 14-day trial | Enterprise | 90,000+ | 26 NLP tags, 90-day history |

### Tier 2: Mid-Range (Good Balance)

| API | Free Tier | Paid | Sources | Key Features |
|-----|-----------|------|---------|--------------|
| **NewsData.io** | 200 credits/day | $349/mo | 87,000+ | Sentiment, 7yr history, 206 countries |
| **NewsAPI.org** | Dev use only | Contact | 150,000+ | Simple, mature |
| **Mediastack** | 500 calls/mo | $25-$250/mo | 7,500+ | Easy JSON API |

### Tier 3: Free/Open

| API | Cost | Key Features | Notes |
|-----|------|--------------|-------|
| **GDELT** | Free | 15-min updates, 100+ languages, anomaly detection | Requires processing work |
| **GNews** | Freemium | Google News wrapper | Limited features |

### Financial News Specialists

| API | Free | Paid | Notes |
|-----|------|------|-------|
| **Finlight** | Limited | ~$30/mo | Financial focus, sentiment |
| **Tiingo** | Yes | $30/mo | News + market data |
| **FMP** | Yes | Varies | 40k+ stocks, fundamentals |

**Recommendation**: Start with **GDELT** (free, excellent for volume/anomaly detection) + **NewsData.io** ($349/mo, good balance) for production.

---

## Python Libraries

### Aggregation

| Library | Purpose |
|---------|---------|
| `feedparser` | RSS/Atom parsing |
| `newspaper3k` | Article extraction from URLs |
| `gdeltPyR` | GDELT 1.0/2.0 → Pandas |
| `gdelt-doc-api` | GDELT DOC 2.0 search |

### Deduplication & Clustering

| Tool | Approach | Best For |
|------|----------|----------|
| `dedupe` | ML fuzzy matching | Entity resolution |
| `SemHash` | Semantic embeddings + ANN | Fast dedup for LLM data |
| `BERTopic` | Embeddings + UMAP + HDBSCAN | Topic modeling, story clustering |
| `sentence-transformers` + `HDBSCAN` | Custom pipeline | Flexible clustering |

**Recommended pipeline**:
1. Generate embeddings with `sentence-transformers`
2. Reduce dimensions with `UMAP`
3. Cluster with `HDBSCAN`
4. Extract keywords with `KeyBERT`
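
The dedup idea behind steps 1-3 can be illustrated without the heavy dependencies. The sketch below uses bag-of-words cosine similarity as a stand-in for real sentence-transformer embeddings; `tokenize`, `dedupe_titles`, and the 0.8 threshold are illustrative choices, not part of any library API:

```python
# Dependency-free stand-in for embedding-based dedup: bag-of-words
# cosine similarity instead of sentence-transformers. Swap in real
# embeddings + UMAP + HDBSCAN for production quality.
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_titles(titles: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first title of each near-duplicate cluster."""
    kept: list[tuple[str, Counter]] = []
    for title in titles:
        vec = tokenize(title)
        if all(cosine(vec, kv) < threshold for _, kv in kept):
            kept.append((title, vec))
    return [t for t, _ in kept]

titles = [
    "Micron raises DRAM prices on AI demand",
    "Micron raises DRAM prices amid AI demand",
    "SK Hynix reports record quarterly profit",
]
print(dedupe_titles(titles))  # first two collapse into one
```

The same greedy keep-first loop applies unchanged once `cosine` operates on dense embedding vectors instead of token counts.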

### Event Detection (Volume Spikes)

| Tool | Type | Use |
|------|------|-----|
| `PyOD` | Python | Multivariate outlier detection |
| `ADTK` | Python | Time series anomaly detection |
| `Isolation Forest` | sklearn | Tree-based anomaly isolation |
| GDELT Timeseries | Built-in | 60-min trending-entity detection |

---

## Architecture Options

### Option A: Multi-API Fan-Out

```
Topic Groups (batched)
    │
    ├──► Serper (current)
    ├──► GDELT (free, high volume)
    ├──► NewsData.io (backup, sentiment)
    │
    ▼
Dedup Layer (SemHash/BERTopic)
    │
    ▼
Importance Filter (cheap LLM: haiku)
    │
    ├─► High priority → immediate processing
    └─► Low priority → batch queue
```

**Pros**: Redundancy, better coverage, built-in verification
**Cons**: API cost, dedup complexity

### Option B: GDELT-First + Targeted Supplements

```
GDELT (bulk discovery, free)
    │
    ├──► Volume spike detection
    ├──► Entity extraction
    │
    ▼
Filter: matches topic list?
    │
    ├─► Match → fetch full article via brain/
    └─► Spike detected → trigger Serper for deeper search
```

**Pros**: Low cost, scales to 1000s of topics, anomaly detection free
**Cons**: 15-min delay, requires more processing code

### Option C: Tiered Topic Monitoring

```
Tier 1 (daily, market-moving): Serper, high concurrency
Tier 2 (weekly, strategic): GDELT + NewsData.io
Tier 3 (monthly, background): GDELT only

All tiers → shared dedup → shared extraction pipeline
```

**Pros**: Cost-efficient, prioritizes what matters
**Cons**: May miss breaking news on lower tiers

---

## Verification / Double-Sourcing Design

### Core Idea: Source-Count Corroboration

For financial news, verification is simple: **did multiple independent outlets report it?**

| Source Count | Confidence | Action |
|--------------|------------|--------|
| 3+ (Reuters, Bloomberg, WSJ, etc.) | High | Auto-accept |
| 2 independent sources | Medium | Accept, note sources |
| 1 source only | Low | Flag for review |
| Company IR confirms | Verified | Highest confidence |

### Source Tiers

```yaml
tier_1:  # High credibility, editorial standards
  - reuters.com
  - bloomberg.com
  - wsj.com
  - ft.com
  - sec.gov  # Official filings

tier_2:  # Good credibility, some editorializing
  - cnbc.com
  - seekingalpha.com
  - thestreet.com
  - barrons.com

tier_3:  # Lower credibility, verify if sole source
  - benzinga.com
  - investing.com
  - random blogs
```

### Verification Pipeline

```
New article discovered
    │
    ▼
Count existing sources for same story (dedup cluster)
    │
    ├─► 2+ tier_1 sources → confidence: high
    ├─► 1 tier_1 + tier_2 → confidence: medium
    ├─► 1 source only → trigger verification search
    │       │
    │       ▼
    │   Search GDELT + NewsData for corroboration (48h window)
    │       │
    │       ├─► Found → add sources, upgrade confidence
    │       └─► Not found → flag: "single_source", keep but deprioritize
    │
    ▼
Store: article + source_count + confidence + source_tiers
```

### Implementation

```python
# In handlers/company_research.py

from urllib.parse import urlparse

SOURCE_TIERS = {
    "reuters.com": 1, "bloomberg.com": 1, "wsj.com": 1, "ft.com": 1, "sec.gov": 1,
    "cnbc.com": 2, "seekingalpha.com": 2, "barrons.com": 2,
}

def domain(url: str) -> str:
    """Extract the host from a URL, dropping any leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host.removeprefix("www.")

def score_confidence(sources: list[str]) -> tuple[str, int]:
    """Score confidence from source count and tiers (unknown domains = tier 3)."""
    tiers = [SOURCE_TIERS.get(domain(s), 3) for s in sources]
    tier1_count = sum(1 for t in tiers if t == 1)

    if tier1_count >= 2:
        return "high", 3
    elif len(sources) >= 2:  # any two independent sources
        return "medium", 2
    else:
        return "low", 1

async def stage_verify(item: dict, tracker, job) -> dict:
    """Verify single-source articles via multi-source search.

    Assumes module-level helpers: conn, get_result, search_gdelt,
    search_newsdata, and dedupe_sources.
    """

    extract_result = get_result(conn, job.id, item["key"], "extract")
    existing_sources = extract_result.get("sources", [item["url"]])

    confidence, score = score_confidence(existing_sources)

    if confidence != "low":
        return {"confidence": confidence, "sources": existing_sources, "verified": True}

    # Single source — search for corroboration
    title = extract_result.get("title", "")
    company = item.get("topic", "")

    query = f'"{company}" {title[:50]}'
    hits = await search_gdelt(query, hours=48)
    hits += await search_newsdata(query, hours=48)

    all_sources = dedupe_sources(existing_sources + [h["url"] for h in hits])
    confidence, score = score_confidence(all_sources)

    return {
        "confidence": confidence,
        "sources": all_sources,
        "corroboration_found": len(hits) > 0,
        "verified": confidence != "low"
    }
```

---

## Topic Grouping Strategy

### Current: 21 Independent Topics

Each topic = separate Serper query = linear API cost growth.

### Proposed: Hierarchical Topic Groups

```yaml
topic_groups:
  semiconductors:
    queries:
      - "semiconductor supply chain"
      - "DRAM NAND pricing"
    companies: [mu, sk_hynix, samsung_semi]
    investors: []

  ai_infrastructure:
    queries:
      - "AI chips datacenter"
      - "GPU demand"
    companies: [nvidia, amd, coreweave]
    investors: []

  financial_leaders:
    queries: []  # search by name
    companies: []
    investors: [druckenmiller, ron_baron, david_tepper]
```

**Benefits**:
- Shared results across related companies (SK Hynix article mentions Micron → both get it)
- Compound queries: `"DRAM" AND ("Micron" OR "SK Hynix" OR "Samsung")` → 1 API call, 3 topics
- Group-level anomaly detection: spike in "semiconductors" group triggers deeper search
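
The compound-query idea can be sketched directly. Everything here is hypothetical: `COMPANY_NAMES` maps the config slugs above to display names, and the group dict mirrors the `topic_groups` YAML shape:

```python
# Hypothetical sketch of a compound-query builder for one topic group.
# COMPANY_NAMES maps config slugs to searchable display names (assumed).
COMPANY_NAMES = {"mu": "Micron", "sk_hynix": "SK Hynix", "samsung_semi": "Samsung"}

def build_compound_query(group: dict) -> str:
    """Combine group keywords with an OR of company names: 1 API call, N topics."""
    names = " OR ".join(f'"{COMPANY_NAMES.get(c, c)}"' for c in group.get("companies", []))
    keywords = " OR ".join(f'"{q}"' for q in group.get("queries", []))
    if names and keywords:
        return f"({keywords}) AND ({names})"
    return keywords or names

group = {"queries": ["DRAM"], "companies": ["mu", "sk_hynix", "samsung_semi"]}
print(build_compound_query(group))
# ("DRAM") AND ("Micron" OR "SK Hynix" OR "Samsung")
```

Quoting and boolean syntax vary by search API, so the formatting would need adapting per backend (Serper vs. GDELT).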

### Implementation

```python
# In lib/discovery.py

class GroupedNewsDiscovery(BaseDiscovery):
    """Discover news for topic groups with shared queries.

    Assumes helpers build_compound_query, match_to_topics, and
    dedupe_by_url are defined alongside.
    """

    async def discover(self) -> list[WorkItem]:
        items = []
        for group in self.config["topic_groups"]:
            # Compound query for group
            query = build_compound_query(group)
            results = await self.search_api.search(query)

            # Fan out to individual topics
            for result in results:
                matching_topics = match_to_topics(result, group)
                for topic in matching_topics:
                    items.append(WorkItem(
                        key=f"{topic}:{result['url_hash']}",
                        data={**result, "topic": topic, "group": group["name"]}
                    ))

        return dedupe_by_url(items)
```

---

## Storage Scaling

### Current Problem

6,063 items × 3-5 files = 20K+ files → Git LFS strain, slow operations.

### Solution: Per-Topic SQLite

```
jobs/data/newsflow/
├── newsflow.db           # Main tracker (work items, results)
├── articles.db           # Raw content (LFS tracked)
│   └── articles (url_hash, topic, raw_html, fetched_at)
├── topics/
│   └── {topic}/
│       └── digests/      # Monthly summaries (small, git-tracked)
└── index.jsonl           # Quick lookup (optional)
```

```sql
-- articles.db schema
CREATE TABLE articles (
    url_hash TEXT PRIMARY KEY,
    url TEXT NOT NULL,
    topics TEXT,  -- JSON array of topics this article matches
    raw_html TEXT,
    raw_json TEXT,  -- For API responses
    fetched_at TEXT,
    extracted_text TEXT,
    metadata TEXT  -- JSON: title, author, published_at, etc.
);

CREATE INDEX idx_topics ON articles(topics);
CREATE INDEX idx_fetched ON articles(fetched_at);
```
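
A minimal sqlite3 sketch of the schema above, using an in-memory database and an upsert keyed on `url_hash` so re-discovered articles merge their topic lists rather than duplicate rows (`upsert_article` is an illustrative helper, not existing code):

```python
# Sketch: articles.db schema with upsert on url_hash (in-memory for demo).
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE articles (
        url_hash TEXT PRIMARY KEY,
        url TEXT NOT NULL,
        topics TEXT,          -- JSON array of matching topics
        raw_html TEXT,
        raw_json TEXT,
        fetched_at TEXT,
        extracted_text TEXT,
        metadata TEXT         -- JSON: title, author, published_at, etc.
    )
""")

def upsert_article(conn, url_hash, url, topics, **fields):
    """Insert an article, or refresh topics/metadata if the hash exists."""
    conn.execute(
        """INSERT INTO articles (url_hash, url, topics, metadata)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(url_hash) DO UPDATE SET
               topics = excluded.topics, metadata = excluded.metadata""",
        (url_hash, url, json.dumps(topics), json.dumps(fields)),
    )

upsert_article(conn, "abc123", "https://example.com/a", ["mu"], title="DRAM news")
upsert_article(conn, "abc123", "https://example.com/a", ["mu", "sk_hynix"])
row = conn.execute("SELECT topics FROM articles WHERE url_hash = 'abc123'").fetchone()
print(row[0])  # ["mu", "sk_hynix"]
```

Note that a plain index on the JSON `topics` column only helps exact-string lookups; topic-membership queries would scan, which is acceptable at this scale.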

---

## Implementation Phases

### Phase 1: Add GDELT Discovery (1-2 days)

- [ ] Add `gdelt-doc-api` to dependencies
- [ ] Create `GDELTDiscovery` strategy in `lib/discovery.py`
- [ ] Configure as secondary source for existing topics
- [ ] Test with 2-3 topics, measure coverage vs Serper
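
As a starting point for `GDELTDiscovery`, the request URL for the public GDELT DOC 2.0 API can be built with the stdlib; this sketch constructs the URL only (no network call), using the documented `query`/`mode`/`format`/`timespan`/`maxrecords` parameters:

```python
# Sketch: build a GDELT DOC 2.0 article-search URL for one topic.
from urllib.parse import urlencode

GDELT_DOC = "https://api.gdeltproject.org/api/v2/doc/doc"

def gdelt_query_url(query: str, timespan: str = "24h", max_records: int = 75) -> str:
    params = {
        "query": query,
        "mode": "ArtList",     # article list (vs. timeline modes)
        "format": "json",
        "timespan": timespan,
        "maxrecords": max_records,
    }
    return f"{GDELT_DOC}?{urlencode(params)}"

print(gdelt_query_url('"Micron" DRAM'))
```

The `gdelt-doc-api` library wraps this same endpoint with a `Filters` object, so either approach fits the Phase 1 checklist.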

### Phase 2: Deduplication Layer (2-3 days)

- [ ] Add `sentence-transformers`, `hdbscan` dependencies
- [ ] Create `lib/dedup.py` with embedding-based dedup
- [ ] Integrate into `multi_source` discovery strategy
- [ ] Store canonical URL + duplicates list

### Phase 3: Topic Grouping (2-3 days)

- [ ] Refactor `jobs.yaml` newsflow config to use topic groups
- [ ] Implement compound query builder
- [ ] Add cross-topic result sharing
- [ ] Update dashboard to show group-level stats

### Phase 4: Verification Pipeline (2-3 days)

- [ ] Add `verify` stage to newsflow pipeline
- [ ] Implement source-tier scoring (Reuters/Bloomberg = tier 1, etc.)
- [ ] Add GDELT + NewsData.io corroboration search for single-source items
- [ ] Store confidence scores, surface in dashboard

### Phase 5: Event Detection (2-3 days)

- [ ] Add volume tracking per topic/group
- [ ] Implement z-score spike detection
- [ ] Auto-trigger deeper search on spikes
- [ ] Alert mechanism (log, dashboard highlight)
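
The z-score spike detection above reduces to a few lines of stdlib code; `is_spike` and the 3.0 threshold are illustrative, comparing today's per-topic article count against a trailing window:

```python
# Sketch: z-score spike detection over a trailing window of daily counts.
import statistics

def is_spike(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag a spike when today's volume is z_threshold stdevs above the mean."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return today > mean  # flat history: any increase counts
    return (today - mean) / stdev >= z_threshold

counts = [4, 6, 5, 5, 4, 6, 5]  # trailing week of daily counts for a topic
print(is_spike(counts, 21))  # True
print(is_spike(counts, 6))   # False
```

A spike would then auto-trigger the deeper Serper search described in Option B.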

---

## Cost Estimates

| Component | Current | After Scaling |
|-----------|---------|---------------|
| Serper | ~$50/mo (est) | ~$30/mo (reduced via grouping) |
| GDELT | $0 | $0 |
| NewsData.io | $0 | $349/mo (verification tier) |
| Embeddings | $0 (local) | $0 (local sentence-transformers) |
| LLM (scoring) | ~$10/mo | ~$20/mo (more items) |
| **Total** | ~$60/mo | ~$400/mo |

**ROI**: 5x more topics monitored, multi-source verification, and broader coverage.

---

## Key Decisions Needed

1. **Primary scaling approach**: Option A (multi-API) vs B (GDELT-first) vs C (tiered)?
2. **Source tier list**: Which outlets count as tier 1 (auto-trust)?
3. **Topic grouping granularity**: How many groups? How much overlap?
4. **Budget**: Is $400/mo acceptable for 100+ topics with verification?

---

## References

- [NewsAPI.ai 2025 Comparison](https://newsapi.ai/blog/best-news-api-comparison-2025/)
- [GDELT Project](https://www.gdeltproject.org/)
- [NewsCatcher Deduplication Guide](https://www.newscatcherapi.com/docs/v3/documentation/guides-and-concepts/articles-deduplication)
- [BERTopic Documentation](https://maartengr.github.io/BERTopic/)
