Improve Your Agentic AI Trading With a Great Data Pipeline
The model isn't the edge in agentic AI trading. The data pipeline is. If the agent is making decisions from stale prices, missing sentiment, and no whale data, it doesn't matter how good Codex 5.5 or Opus 4.7 is. Today I'm walking through the five-source pipeline I run behind my Polymarket agent, how the sources get fused into one unstructured master file, and the actual bets the agent picked from that data.
Why the data pipeline is the actual moat
Everyone wants to talk about the model. Which Codex, which Claude, which reasoning effort. That's the wrong end of the problem. LLMs by themselves know nothing about the real world right now. They need fresh, multi-source data to anchor any decision they make on a live market. The model is a calculator. The pipeline is what gives it numbers worth calculating on.
This is the same insight I leaned on in the heartbeat split-agent setup: cheap structured data goes through small/fast models, decisions go through the strong model. But before any of that, you need the data to exist in the first place. That's what this pipeline is.
The five sources
For Polymarket-style prediction-market agents, the sources I run are:
- Kalshi — competitor data. WebSocket (free) or API to pull the same-shape markets from a different venue. If Kalshi prices a binary at 0.62 and Polymarket prices it at 0.55, that gap is a signal on its own.
- Reddit — sentiment via Surfagent browser automation. Logged-in scraping of relevant subreddits for top posts and recent news. The browser route matters here because Reddit's official API has gotten increasingly hostile.
- Polymarket whales — large-bet wallets on-chain. Not 100% real-time but very close, and the API is free. Whale flow on a specific market is one of the highest-signal indicators I've found, especially on niche or thin markets.
- X / Twitter — also via Surfagent. Latest + top search on the keyword. Same browser-automation route as Reddit, same logged-in advantage.
- Google / Chrome — general news search via Surfagent. Catches breaking news that hasn't hit the markets yet but is about to.
All five are sources you can swap out depending on the market. If you're trading Hyperliquid instead of Polymarket, drop the Polymarket whales source and add an on-chain Hyperliquid order-flow source. If you're trading sports, lean harder on news + odds-comparison sites. The shape stays the same.
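The Kalshi-versus-Polymarket comparison from the source list comes down to a small calculation. Here is a minimal sketch in Python, assuming both venues quote the same binary outcome as a price between 0 and 1. The function name and the 0.05 threshold are hypothetical and not part of either venue's API.

```python
def cross_venue_gap(kalshi_price: float, polymarket_price: float,
                    min_gap: float = 0.05) -> dict:
    """Compare the same binary outcome quoted on two venues.

    Prices are treated as implied probabilities in [0, 1].
    min_gap is an illustrative threshold, not a recommended value.
    """
    for price in (kalshi_price, polymarket_price):
        if not 0.0 <= price <= 1.0:
            raise ValueError(f"price out of range: {price}")
    gap = round(kalshi_price - polymarket_price, 4)
    if gap > 0:
        cheaper = "polymarket"
    elif gap < 0:
        cheaper = "kalshi"
    else:
        cheaper = None
    return {"gap": gap, "flag": abs(gap) >= min_gap, "cheaper_venue": cheaper}


# The example from the source list: Kalshi at 0.62, Polymarket at 0.55.
signal = cross_venue_gap(0.62, 0.55)
print(signal)
```

A gap like this only means something when both contracts resolve on the same condition and date. The sketch ignores fees and spread.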
The master unstructured file
Every source dumps into one place: master_unstructured.txt. That's the file the decision agent actually reads.
The naming is deliberate. It's not a structured database. It's not normalized JSON. It's just raw text appended from every pipeline run, with light section headers so the model can find what it needs. The reason: LLMs are good at unstructured text. They're great at finding signal in messy prose. Forcing the pipeline to produce a perfectly structured schema upfront wastes engineering time on something the model doesn't need.
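As a rough illustration of "raw text appended with light section headers", here is a sketch of what the append step could look like. The header format, the function names, and the UTC timestamp are assumptions for illustration; the source only specifies the file name and the append-with-headers idea.

```python
from datetime import datetime, timezone
from pathlib import Path

MASTER = Path("master_unstructured.txt")


def append_section(source: str, keyword: str, raw_text: str,
                   path: Path = MASTER) -> str:
    """Append one source's raw output under a light section header.

    The header layout is a made-up convention, not the author's format.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    header = f"=== {source.upper()} | keyword: {keyword} | {stamp} ==="
    with path.open("a", encoding="utf-8") as f:
        f.write(f"\n{header}\n{raw_text.strip()}\n")
    return header


def list_sections(path: Path = MASTER) -> list[str]:
    """Return the section headers currently in the file, in file order."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [ln for ln in lines if ln.startswith("=== ") and ln.endswith(" ===")]
```

Calling `append_section("reddit", "Bitcoin", scraped_text)` after each collector run keeps the file append-only. The model can then scan for the header of the source it cares about, with no schema to maintain.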
The orchestration file is a small data.md that describes each pipeline component — the Kalshi pipeline markdown, the Surfagent pipeline markdown, the whale pipeline markdown. When I tell Codex "execute the full data pipeline for keyword X," it reads data.md, runs each component in sequence, and appends results to master_unstructured.txt.
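In my setup the agent does this sequencing itself from data.md, but the same run-in-order-and-append loop can be sketched as plain Python. Everything here is a hypothetical stand-in: the collectors are stub functions rather than the real Kalshi, Surfagent, or whale components, and the header format is an assumption.

```python
from pathlib import Path
from typing import Callable

Collector = Callable[[str], str]  # takes a keyword, returns raw text


def run_pipeline(keyword: str, collectors: dict[str, Collector],
                 master: Path) -> list[str]:
    """Run each collector in order and append its output to the master file.

    Returns the names of collectors that raised, so one broken source
    does not stop the rest of the run.
    """
    failed = []
    with master.open("a", encoding="utf-8") as f:
        for name, collect in collectors.items():
            try:
                text = collect(keyword)
            except Exception as exc:  # record the failure and keep going
                failed.append(name)
                text = f"[collector failed: {exc}]"
            f.write(f"\n=== {name} | keyword: {keyword} ===\n{text.strip()}\n")
    return failed


# Hypothetical stub collectors, used only to show the call shape.
def fake_kalshi(keyword: str) -> str:
    return f"Kalshi markets mentioning {keyword}: (stub output)"


def broken_source(keyword: str) -> str:
    raise RuntimeError("login expired")
```

Writing the failure into the file, rather than silently skipping the source, means the decision agent can see that a source is missing from this run and not mistake silence for a neutral signal.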
A live run on Bitcoin
The demo in the video: kick off the full pipeline with the keyword Bitcoin, then have the agent find a trade.
What the pipeline does in real time:
- Google News search for Bitcoin — Surfagent scrolls and the LLM extracts sentiment
- X search for Bitcoin (latest, then top) — same flow
- Reddit — scrolls through relevant subreddits, captures top posts
- Polymarket whale collector, category crypto — pulls recent large wallet activity on crypto markets
- Kalshi collector — Bitcoin-related markets and pricing on Kalshi for comparison
Pipeline summary from the run: roughly 60 observations with mixed sentiment; recent negative context around the sub-$60K range and liquidation headlines; heavy whale activity in a short window on both Up and Down markets, so not a clean one-sided signal; and Kalshi pricing most upside-threshold markets with No as the favorite.
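The "not a clean one-sided signal" reading of the whale data can be made explicit with a simple imbalance score. This is a sketch under assumptions: each whale bet is reduced to a side and a dollar size, the bets below are made-up illustration data rather than output from the run, and the 0.5 cutoff is arbitrary.

```python
def flow_imbalance(bets: list[tuple[str, float]]) -> float:
    """Return (up - down) / (up + down), a value in [-1, 1].

    Returns 0.0 when there is no flow at all.
    """
    up = sum(size for side, size in bets if side == "up")
    down = sum(size for side, size in bets if side == "down")
    total = up + down
    return 0.0 if total == 0 else (up - down) / total


def describe(imbalance: float, cutoff: float = 0.5) -> str:
    """Label the flow; the cutoff is an illustrative choice."""
    if imbalance >= cutoff:
        return "one-sided up"
    if imbalance <= -cutoff:
        return "one-sided down"
    return "two-sided"


# Made-up example bets: (side, size in USD).
sample = [("up", 12_000.0), ("down", 9_000.0), ("up", 4_000.0), ("down", 10_000.0)]
score = flow_imbalance(sample)
print(round(score, 3), describe(score))
```

A score near zero with large total volume is the situation described in the summary: a lot of whale money, but split across both directions.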
I then ran a /goal on the agent: "based on the master_unstructured file and the markets on Polymarket, look for a good price with expected value, do calculations, think hard." The agent pulled Polymarket markets via API, cross-referenced against the gathered data, and surfaced one trade:
Will Bitcoin reach $200K by December 31? Yes at 0.002 — ~97x upside if it hits.
Probably won't print, but the expected-value math at that price was non-terrible given the data. I put $10 on it for the demo and moved on. The pipeline did its job — surfaced a candidate from fused multi-source data, not from the model hallucinating about price action.
The Formula 1 example — why pipeline beats vibes
Earlier the same day, before the Bitcoin run, I ran the same pipeline on sports and Formula 1. Output: Kimi Antonelli to win, priced at 0.56 on Polymarket. The pipeline surfaced it as the best available expected-value trade given the collected data: historical pole-position win rates, recent qualifying form, and sentiment.
I bought 44 Yes shares for about $25 (44 × 0.56 ≈ $24.64). Fifteen minutes after the start, the position was up 28%. By the time I checked again later, it was up 60%. The model didn't "know" F1. The pipeline gave it the priors it needed to compute the expected value correctly.
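The arithmetic behind that position can be worked through in a few lines. This sketch assumes a Yes share pays $1 if the outcome happens and $0 otherwise, and it ignores fees. The 0.65 probability is a hypothetical estimate for illustration, not a number the pipeline produced.

```python
def position_cost(shares: float, price: float) -> float:
    """Dollars paid to buy `shares` Yes shares at `price` each."""
    return shares * price


def expected_profit(shares: float, price: float, est_prob: float) -> float:
    """Expected profit if each share pays $1 with probability est_prob."""
    return shares * (est_prob - price)


def implied_price(entry_price: float, gain: float) -> float:
    """Share price at which the entry is up by `gain` (0.28 means +28%)."""
    return entry_price * (1 + gain)


shares, price = 44, 0.56
cost = position_cost(shares, price)
print(round(cost, 2))                                  # 24.64 paid
print(round(shares - cost, 2))                         # 19.36 profit if it resolves Yes
print(round(expected_profit(shares, price, 0.65), 2))  # 3.96 at the assumed 65%
print(round(implied_price(price, 0.28), 3))            # 0.717 share price at +28%
print(round(implied_price(price, 0.60), 3))            # 0.896 share price at +60%
```

The trade only has positive expected value when the estimated probability is above the entry price. The data feeding that estimate therefore matters more than the model doing the arithmetic.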
What this changes about how you build agents
If you're building anything in this niche — Polymarket bots, Hyperliquid perp agents, Kalshi prediction-market players — spend at least half your engineering time on the pipeline, not on the prompt. Specifically:
- Multi-source by default. One source is a guess. Three sources is a hypothesis. Five sources is a position.
- Fuse in unstructured text. Don't waste effort normalizing into a schema the LLM doesn't need. Append, section-header, move on.
- Browser automation is a real edge. The platforms that have the best data (X, Reddit, Polymarket UI) are the ones most hostile to scrapers. A logged-in browser agent like Surfagent gets past most of that with no API key.
- Pipeline is reusable, prompts are not. Same five sources, different keyword. Same five sources, different platform. Pipelines compound. Prompts don't.
This pipeline is also what powers the bets I've been showing in the Polymarket bot build, the Hyperliquid agent, and the Codex vs Claude head-to-head. Different platforms, different agents, same underlying data shape.
What's next
The five sources here are a baseline, not a ceiling. Pipelines I haven't covered yet but will in upcoming videos: on-chain order flow for Hyperliquid, options-flow scraping for stock-related prediction markets, and an LLM-graded sentiment layer that runs on top of the raw text dump to give the decision agent pre-scored signals instead of raw prose. If you want to follow the upgrades to this setup, the AI_automata Discord is where I post first.
Resources
- Agentic AI Trading guide — the full pillar covering the niche.
- Heartbeat split-agent setup — once you have data, this is how the decision loop runs on it.
- Polymarket AI trading bot from scratch — the agent this pipeline feeds.
- Hyperliquid AI agent trader — same pipeline shape, different venue.
- Surfagent browser automation — the browser layer that powers the X, Reddit, and Google sources.
- Agentic AI trading for beginners — start here if this is the first post you've landed on.
- Polymarket — the prediction-market venue used in the demo.
- Hyperliquid — the perp DEX used in related agent builds.
- AI_automata Discord — pipeline upgrades and longer-run results land here first.