Deep Research From Scratch Is Not Deep Enough

The case for pre-accumulated domain data as the foundation of useful AI.

1. The problem with "deep research"

Every frontier lab now offers a "deep research" mode: give the model a question and it searches the web, reads dozens of pages, and synthesizes an answer. It feels impressive. But for serious domain work, it falls short in predictable ways: every query starts from zero, the only sources are whatever the open web happens to surface, and the synthesis can be no better than that data.

Core thesis

The bottleneck isn't the model's reasoning — it's the data the model has access to. Give a frontier LLM deep, curated, domain-specific data and its output goes from "interesting summary" to "actionable intelligence."

2. The data advantage

We aim to have the best data in each vertical we enter. Not just web scraping — a layered data strategy that deepens over time:

Phase 1 — Now

  • Web at scale
  • Autonomous pipelines crawl, extract, structure
  • Self-healing ingestion (proxy escalation, error triage; sketched after this list)
  • Multi-source cross-referencing
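
To make "self-healing" concrete, here is a minimal sketch of the pattern: escalate through progressively more expensive proxy tiers and triage errors into retryable versus fatal. The tier URLs and the fetch helper are illustrative assumptions, not our production pipeline:

```python
import time
import requests

# Hypothetical proxy tiers, cheapest first; None means a direct connection.
PROXY_TIERS = [None, "http://datacenter-proxy:8080", "http://residential-proxy:8080"]

RETRYABLE = {429, 500, 502, 503, 504}  # transient statuses worth retrying

def fetch(url: str, attempts_per_tier: int = 2) -> str:
    """Fetch a page, escalating to a costlier proxy tier when a cheaper one fails."""
    last_error: object = None
    for proxy in PROXY_TIERS:
        proxies = {"http": proxy, "https": proxy} if proxy else None
        for attempt in range(attempts_per_tier):
            try:
                resp = requests.get(url, proxies=proxies, timeout=15)
            except requests.RequestException as err:
                last_error = err
                time.sleep(2 ** attempt)  # network error: back off, retry same tier
                continue
            if resp.status_code == 200:
                return resp.text
            last_error = resp.status_code
            if resp.status_code not in RETRYABLE:
                break  # likely blocked: escalate to the next proxy tier
            time.sleep(2 ** attempt)  # transient server error: back off, retry
    raise RuntimeError(f"all proxy tiers exhausted for {url}: {last_error}")
```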

Phase 2 — With funding

  • Licensed databases
  • Industry reports, proprietary datasets
  • Government and regulatory filings
  • Subscription data feeds

Phase 3 — With scale

  • Expert networks
  • Domain practitioners contributing knowledge
  • Verified facts, insider context
  • Human-in-the-loop curation

Each phase compounds on the previous. Licensed data fills gaps the web can't reach. Expert knowledge adds the context that neither web nor databases capture. The LLM reasons over all three layers simultaneously.
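
One way to picture the layering: tag every fact with its provenance, so the model's context distinguishes web-scraped claims from licensed records and expert-verified ones. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    entity: str       # e.g. an organization or a funder
    claim: str        # the assertion itself
    source: str       # "web" | "licensed" | "expert"
    confidence: float

def build_context(facts: list[Fact]) -> str:
    """Assemble a provenance-tagged context block for the LLM prompt,
    putting expert-verified facts first."""
    rank = {"expert": 0, "licensed": 1, "web": 2}
    ordered = sorted(facts, key=lambda f: (rank[f.source], -f.confidence))
    return "\n".join(f"[{f.source}] {f.entity}: {f.claim}" for f in ordered)
```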

3. Small vertical, outsized impact

The key insight: you don't need to boil the ocean. Pick a vertical small enough to achieve data completeness, but valuable enough that completeness matters.

In a well-defined vertical, the system can know every relevant organization, every funder, every open RFP, every peer, every competing initiative. That's not a research assistant — that's an unfair advantage.

Why verticals win

A horizontal AI tool helps a little with everything. A vertical AI tool — with complete domain data — is transformatively better at the things that matter most to its users. The value per user is 10x higher, and willingness to pay scales with that value.

4. Example: educational nonprofits

Live partnership — Technovation

We partnered with Technovation, a global nonprofit that teaches girls to build technology to solve community problems. (Disclosure: Tara Chklovski, the founder and CEO, is my wife.)

The vertical: educational nonprofits — organizations that deliver learning programs, seek grants, report to funders, and compete for limited philanthropic dollars.

What becomes possible with deep vertical data:

For the nonprofit

  • Every relevant funder — foundations, government programs, corporate giving — mapped, scored, matched
  • Every open RFP — monitored in real time, with fit scoring against the org's mission and capabilities (a toy scorer follows this list)
  • Peer landscape — who else does similar work, where are the gaps, what's the competitive positioning
  • Grant writing — drafts that know the funder's priorities, the org's track record, and the vertical's language
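
To show what fit scoring means mechanically, here is a deliberately tiny stand-in: a bag-of-words overlap score where a production system would use embedding similarity over the funder database. All texts below are invented:

```python
# Toy RFP fit scorer: Jaccard term overlap stands in for embedding similarity.
def fit_score(rfp_text: str, org_profile: str) -> float:
    """Score how well an RFP matches an org's profile, in [0, 1]."""
    rfp_terms = set(rfp_text.lower().split())
    org_terms = set(org_profile.lower().split())
    if not rfp_terms or not org_terms:
        return 0.0
    return len(rfp_terms & org_terms) / len(rfp_terms | org_terms)

rfps = [
    "Grant program supporting girls building technology skills",
    "Funding opportunity for rural broadband infrastructure",
]
profile = "teaches girls to build technology to solve community problems"
ranked = sorted(rfps, key=lambda r: fit_score(r, profile), reverse=True)
# ranked[0] is the girls-in-technology RFP: it shares more terms with the profile
```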

For funders

  • All educational nonprofits — comprehensive landscape of who's doing what, where, at what scale
  • Due diligence — financial health, leadership stability, outcome data, regulatory status — automated
  • Gap analysis — where funding is concentrated, which regions or populations are underserved (sketched after this list)
  • Portfolio monitoring — track grantee progress, surface risks, flag opportunities for follow-on
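
Gap analysis is, at bottom, aggregation over a grants table that already exists because the data layer was built first. A sketch over made-up records:

```python
from collections import defaultdict

# Illustrative records only; a real table would come from the data layer.
grants = [
    {"region": "North America", "amount": 1_200_000},
    {"region": "Sub-Saharan Africa", "amount": 150_000},
    {"region": "South Asia", "amount": 90_000},
    {"region": "North America", "amount": 800_000},
]

totals: dict[str, int] = defaultdict(int)
for g in grants:
    totals[g["region"]] += g["amount"]

# Regions sorted by total funding, least-funded first: candidate gaps.
for region, amount in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{region:>20}: ${amount:,}")
```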

None of this is possible with generic deep research. It requires pre-accumulated data: every org cataloged, every funder mapped, every RFP tracked. The AI becomes useful precisely because the data is already there when the question is asked.

5. The pattern generalizes

Educational nonprofits are one vertical. The pattern works anywhere the data can be bounded and the users are underserved by generic tools.

The playbook is the same each time: accumulate the data, structure it, let the LLM reason over the complete picture. The vertical specificity is what makes the output worth paying for.

Summary

Deep research from scratch hits a ceiling because the data isn't there. We build the data layer first — web now, licensed data with funding, expert networks with scale — so the LLM has the complete picture before the user even asks. Small verticals, deep data, outsized value.