All About AI

Improve Your Agentic AI Trading With a Great Data Pipeline

The model isn't the edge in agentic AI trading. The data pipeline is. If the agent is making decisions from stale prices, missing sentiment, and no whale data, it doesn't matter how good Codex 5.5 or Opus 4.7 is. Today I'm walking through the five-source pipeline I run behind my Polymarket agent, how the sources get fused into one unstructured master file, and the actual bets the agent picked from that data.

Why the data pipeline is the actual moat

Everyone wants to talk about the model. Which Codex, which Claude, which reasoning effort. That's the wrong end of the problem. LLMs by themselves know nothing about the real world right now. They need fresh, multi-source data to anchor any decision they make on a live market. The model is a calculator. The pipeline is what gives it numbers worth calculating on.

This is the same insight I leaned on in the heartbeat split-agent setup: cheap structured data goes through small/fast models, decisions go through the strong model. But before any of that, you need the data to exist in the first place. That's what this pipeline is.

The five sources

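For Polymarket-style prediction-market agents, the sources I run are:

  1. Google News, scrolled by Surfagent with the LLM extracting sentiment
  2. X search (latest, then top)
  3. Reddit, top posts from relevant subreddits
  4. Polymarket whales, recent large wallet activity by category
  5. Kalshi, related markets and pricing for comparison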

All five are sources you can swap out depending on the market. If you're trading Hyperliquid instead of Polymarket, drop the Polymarket whales source and add an on-chain Hyperliquid order-flow source. If you're trading sports, lean harder on news + odds-comparison sites. The shape stays the same.
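One way to keep "the shape stays the same" honest in code is to treat each source as a named collector behind a common signature. The sketch below is an illustration, not the setup from the video: the `Collector` type, the `placeholder` helper, `swap_source`, and the source names are all hypothetical, and the collectors return stand-in text instead of calling a browser agent or an API.

```python
from typing import Callable, Dict

# A collector takes a keyword and returns raw text (assumed signature).
Collector = Callable[[str], str]


def placeholder(name: str) -> Collector:
    """Build a stand-in collector that only labels its output (hypothetical)."""
    def collect(keyword: str) -> str:
        return f"[{name}] raw observations for '{keyword}'"
    return collect


# Baseline source set for Polymarket, in the order used in the live run.
POLYMARKET_SOURCES: Dict[str, Collector] = {
    name: placeholder(name)
    for name in ["google_news", "x_search", "reddit", "polymarket_whales", "kalshi"]
}


def swap_source(sources: Dict[str, Collector], drop: str, add: str,
                collector: Collector) -> Dict[str, Collector]:
    """Return a new source set with one collector replaced; the rest is untouched."""
    updated = {name: fn for name, fn in sources.items() if name != drop}
    updated[add] = collector
    return updated


# Trading Hyperliquid instead: drop the whale source, add an order-flow source.
HYPERLIQUID_SOURCES = swap_source(
    POLYMARKET_SOURCES,
    drop="polymarket_whales",
    add="hyperliquid_order_flow",
    collector=placeholder("hyperliquid_order_flow"),
)
```

Everything downstream only needs a name and a callable, so changing markets means changing one dictionary entry.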

The master unstructured file

Every source dumps into one place: master_unstructured.txt. That's the file the decision agent actually reads.

The naming is deliberate. It's not a structured database. It's not normalized JSON. It's just raw text appended from every pipeline run, with light section headers so the model can find what it needs. The reason: LLMs are good at unstructured text. They're great at finding signal in messy prose. Forcing the pipeline to produce a perfectly structured schema upfront wastes engineering time on something the model doesn't need.
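A minimal sketch of "raw text appended with light section headers" might look like the following. The header format, the timestamp, and the `append_section` name are my assumptions for illustration; the only thing taken from the setup above is the file name master_unstructured.txt and the append-only behavior.

```python
from datetime import datetime, timezone
from pathlib import Path

MASTER_FILE = Path("master_unstructured.txt")


def append_section(source: str, keyword: str, raw_text: str,
                   path: Path = MASTER_FILE) -> str:
    """Append one source's raw text under a light header and return the header."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Assumed header layout: just enough for the model to find each section.
    header = f"=== {source} | keyword: {keyword} | {stamp} ==="
    # Mode "a" creates the file if needed and never overwrites earlier runs.
    with path.open("a", encoding="utf-8") as f:
        f.write(f"{header}\n{raw_text.strip()}\n\n")
    return header
```

The text itself is stored as-is. The header is the only structure added, which keeps the collectors simple and leaves interpretation to the decision agent.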

The orchestration file is a small data.md that describes each pipeline component — the Kalshi pipeline markdown, the Surfagent pipeline markdown, the whale pipeline markdown. When I tell Codex "execute the full data pipeline for keyword X," it reads data.md, runs each component in sequence, and appends results to master_unstructured.txt.
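In my setup the coding agent does this orchestration by reading data.md, but the same control flow can be sketched in plain Python to make it concrete. Everything named here (`run_pipeline`, the `Component` tuple, the header format) is hypothetical, and recording a failed source instead of aborting the run is a design choice of this sketch, not something the original workflow specifies.

```python
from pathlib import Path
from typing import Callable, List, Tuple

# (name, collector) pairs; a collector takes a keyword and returns raw text.
Component = Tuple[str, Callable[[str], str]]


def run_pipeline(keyword: str, components: List[Component], master: Path) -> List[str]:
    """Run each component in order and append its output to the master file.

    Returns the names of components that raised, so one dead source
    does not stop the remaining sources from running.
    """
    failed: List[str] = []
    with master.open("a", encoding="utf-8") as f:
        for name, collect in components:
            try:
                text = collect(keyword)
            except Exception as exc:
                # Note the failure in the file so the decision agent can see the gap.
                failed.append(name)
                text = f"(source failed: {exc})"
            f.write(f"=== {name} | keyword: {keyword} ===\n{text}\n\n")
    return failed
```

Writing the failure into the file matters: a missing whale section and "no whale activity" are different facts, and the decision agent should be able to tell them apart.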

A live run on Bitcoin

The demo in the video: kick off the full pipeline with the keyword Bitcoin, then have the agent find a trade.

What the pipeline does in real time:

  1. Google News search for Bitcoin — Surfagent scrolls and the LLM extracts sentiment
  2. X search for Bitcoin (latest, then top) — same flow
  3. Reddit — scrolls through relevant subreddits, captures top posts
  4. Polymarket whale collector, category crypto — pulls recent large wallet activity on crypto markets
  5. Kalshi collector — Bitcoin-related markets and pricing on Kalshi for comparison

Pipeline summary from the run: roughly 60 observations with mixed sentiment; recent negative context around the sub-60K range and liquidation headlines; heavy whale activity in a short window on both the up and the down markets, so not a clean one-sided signal; and Kalshi pricing most upside-threshold markets as No-favored.

I then ran a /goal on the agent: "based on the master_unstructured file and the markets on Polymarket, look for a good price with expected value, do calculations, think hard." The agent pulled Polymarket markets via API, cross-referenced against the gathered data, and surfaced one trade:

Will Bitcoin reach $200K by December 31? Yes at 0.002 — ~97x upside if it hits.

Probably won't print, but the expected-value math at that price was non-terrible given the data. I put $10 on it for the demo and moved on. The pipeline did its job — surfaced a candidate from fused multi-source data, not from the model hallucinating about price action.
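For reference, the expected-value check on a binary market is short enough to write out. This is a generic sketch, not the agent's actual calculation: it assumes a Yes share pays $1 if the event happens and $0 otherwise, ignores fees and spread, and takes the probability estimate as an input that has to come from the gathered data. The function names are hypothetical.

```python
def expected_value_per_share(price: float, p_win: float) -> float:
    """EV in dollars of one Yes share bought at `price`, paying $1 on a win."""
    if not 0 < price < 1:
        raise ValueError("price must be strictly between 0 and 1")
    if not 0 <= p_win <= 1:
        raise ValueError("p_win must be between 0 and 1")
    win_profit = 1 - price   # profit per share if the market resolves Yes
    loss = price             # amount lost per share if it resolves No
    return p_win * win_profit - (1 - p_win) * loss


def expected_roi(price: float, p_win: float) -> float:
    """Expected return per dollar staked (0.5 means +50%)."""
    return expected_value_per_share(price, p_win) / price


# Hypothetical inputs, not from the run: a share priced at 0.10
# that the data suggests is closer to a 15% chance.
ev = expected_value_per_share(0.10, 0.15)
roi = expected_roi(0.10, 0.15)
```

The expression simplifies to `p_win - price`, so the market price is the break-even probability. A bet only has positive expected value if the pipeline gives you a reason to believe the true probability is higher than the price.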

The Formula 1 example — why pipeline beats vibes

Before the Bitcoin run, I'd done a similar pipeline run earlier in the day on sports and Formula 1. Output: Kimi Antonelli to win, priced at 0.56 on Polymarket. The pipeline surfaced it as the best available expected-value trade given collected data — historical pole-position win rates, recent qualifying form, sentiment.

I bought 44 Yes shares for about $25. Fifteen minutes after the start, the position was up 28%. By the time I checked again later, it was up 60%. The model didn't "know" F1. The pipeline gave it the priors it needed to compute the expected value correctly.
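The position numbers are easy to sanity-check. The sketch below assumes the quoted gains are mark-to-market moves in the Yes share price and ignores fees and spread; the helper names are hypothetical.

```python
def position_summary(shares: float, entry_price: float, current_price: float) -> dict:
    """Cost, current value, and unrealized return of a Yes position."""
    cost = shares * entry_price
    value = shares * current_price
    return {
        "cost": cost,
        "value": value,
        "pnl": value - cost,
        "return_pct": (value / cost - 1) * 100,
    }


def price_for_return(entry_price: float, return_pct: float) -> float:
    """Share price implied by a given unrealized return on the entry price."""
    return entry_price * (1 + return_pct / 100)


# 44 shares at the 0.56 entry price from the F1 trade.
entry = position_summary(shares=44, entry_price=0.56, current_price=0.56)
price_at_28 = price_for_return(0.56, 28)
price_at_60 = price_for_return(0.56, 60)
```

With a 0.56 entry, 44 shares cost $24.64, which is the "about $25" stake. A 28% gain implies the Yes price moved to roughly 0.72, and a 60% gain implies roughly 0.90.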

What this changes about how you build agents

If you're building anything in this niche (Polymarket bots, Hyperliquid perp agents, Kalshi prediction-market players), spend at least half your engineering time on the pipeline, not on the prompt.

This pipeline is also what powers the bets I've been showing in the Polymarket bot build, the Hyperliquid agent, and the Codex vs Claude head-to-head. Different platforms, different agents, same underlying data shape.

What's next

The five sources here are a baseline, not a ceiling. The pipelines I haven't covered yet but will in upcoming videos: on-chain order-flow for Hyperliquid, options flow scraping for stock-related prediction markets, and an LLM-graded sentiment layer that runs on top of the raw text dump to give the decision agent pre-scored signals instead of raw prose. If you want to follow the upgrades to this setup, the AI_automata Discord is where I post first.

Resources