All About AI

Karpathy's Autoresearch on My AI Polymarket Trading Bot

2026-03-11

Andrej Karpathy posted an autoresearch project recently — a small evolutionary loop that mutates code, evaluates it, keeps the better attempts, and discards the worse ones. He used it to train a nano GPT model. I wanted to take that same pattern and apply it to a totally different domain: my Polymarket arbitrage trading bot.

The bot tries to find arbitrage on the 5-minute Bitcoin up/down market. The strategy logic is hard to tune by hand because the data is noisy and the windows are short. So instead of tuning manually, I let the autoresearch loop tune it for me — and then ran it live for 20 minutes on real money.
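The core check behind that kind of arbitrage can be sketched in a few lines. This is a hedged illustration, not the bot's actual code: the function name, the fee parameter, and the prices are all hypothetical, and Polymarket's real API and fee schedule differ.

```python
def arb_edge(yes_price: float, no_price: float, fee: float = 0.0) -> float:
    """Return the guaranteed profit per $1 of payout from buying both sides.

    On a binary up/down market, exactly one side pays out $1. If one YES
    share plus one NO share (plus fees) costs less than $1, the difference
    is a risk-free edge.
    """
    cost = yes_price + no_price + fee
    return 1.0 - cost

# Example: YES at $0.52 and NO at $0.45 cost $0.97 combined,
# leaving a 3-cent edge per $1 of payout.
edge = arb_edge(0.52, 0.45)
print(f"edge: {edge:.2f}")  # edge: 0.03
```

The hard part is not this arithmetic; it is deciding when the fleeting, noisy 5-minute windows actually offer an edge worth taking, which is exactly what the loop below tunes.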

Watch the video:

The Autoresearch Loop, Adapted for Trading

The structure is the same as Karpathy's project, but the components are domain-specific.

Each experiment runs for 1 hour, then commits its result. If the score improves, the strategy code is kept; if not, it is discarded and the agent picks a different direction. This is the same pattern I used for the autoresearch security tester — different goal, same loop.
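The keep-or-discard loop described above can be sketched as follows. Everything here is a stand-in: `mutate` and `evaluate` are hypothetical placeholders for the agent's code mutations and the 1-hour trading evaluation, and the parameter being tuned is invented for illustration.

```python
import random

def mutate(strategy: dict) -> dict:
    """Propose a variant by perturbing one parameter (illustrative)."""
    variant = dict(strategy)
    variant["edge_threshold"] = max(
        0.0, strategy["edge_threshold"] + random.uniform(-0.01, 0.01)
    )
    return variant

def evaluate(strategy: dict) -> float:
    """Score a strategy. Stand-in for an hour of simulated trading."""
    # Toy objective: prefer thresholds near 0.03.
    return -abs(strategy["edge_threshold"] - 0.03)

def autoresearch(generations: int = 50) -> tuple[dict, float]:
    best = {"edge_threshold": 0.10}
    best_score = evaluate(best)
    for _ in range(generations):
        candidate = mutate(best)
        score = evaluate(candidate)
        if score > best_score:   # keep improvements...
            best, best_score = candidate, score
        # ...otherwise discard and let the agent pick a new direction
    return best, best_score

best, score = autoresearch()
print(best, score)
```

In the real loop each `evaluate` call costs an hour of wall-clock time, which is why the evaluator and the scoring function matter far more than the mutation mechanics.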

The Dashboard

I built a small dashboard to track everything. For a typical run it shows:

The strategies the agent has been trying include logic experiments, hypothesis-driven strategy functions, asymmetry filters, and spread-relative-to-edge filters. I had Claude Code and Codex collaborate on brainstorming experiment directions: Claude Code (Opus 4.6) handles the actual code mutations, while Codex (GPT 5.4) supplies strategy ideas, the same combination I covered in the autoresearch hacker post.
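To make one of those experiment directions concrete, here is a hedged sketch of what a spread-relative-to-edge filter could look like. The function name, the threshold, and the prices are my assumptions, not the bot's actual implementation; the idea is simply to skip trades where the bid-ask spread eats too much of the expected edge.

```python
def passes_spread_filter(edge: float, bid: float, ask: float,
                         max_spread_ratio: float = 0.5) -> bool:
    """Trade only if the spread is a small fraction of the expected edge."""
    spread = ask - bid
    if edge <= 0:
        return False
    return spread / edge <= max_spread_ratio

# A 0.004 spread against a 0.01 edge (ratio 0.4) passes;
# the same spread against a 0.005 edge (ratio 0.8) does not.
print(passes_spread_filter(0.01, 0.495, 0.499))   # True
print(passes_spread_filter(0.005, 0.495, 0.499))  # False
```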

How an Experiment Lifecycle Works

I let one experiment finish on camera so you can see what happens at the boundary. The 1-hour timer expires, the bot stops trading, and the evaluator runs and scores the result. Experiment 16 scored 0.07 below the best-kept strategy, with low trade frequency and a weak fit, so it was discarded. The agent wrote a note into the history ("both asymmetry experiments produce best nesting..."), picked a different direction (a spread-relative-to-edge filter), updated the strategy code, committed, and started experiment 17.
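The bookkeeping at that boundary is simple enough to sketch. This is an illustration under assumptions: the function name, the record fields, and the `history.jsonl` file are hypothetical, not the bot's actual schema, but the keep/discard decision and the history note mirror the lifecycle described above.

```python
import json
import time

def close_experiment(exp_id: int, score: float, best_score: float,
                     note: str, history_path: str = "history.jsonl") -> bool:
    """Score a finished experiment, decide keep vs. discard, log a note."""
    kept = score > best_score
    record = {
        "experiment": exp_id,
        "score": round(score, 4),
        "kept": kept,
        "note": note,          # the agent reads this when picking the next direction
        "ts": time.time(),
    }
    with open(history_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return kept

# An experiment scoring 0.07 below the incumbent is discarded.
kept = close_experiment(16, score=0.61, best_score=0.68,
                        note="asymmetry filter underperformed; try spread-relative-to-edge")
print(kept)  # False
```

Appending to a history file rather than overwriting state is what lets the agent reason over past attempts instead of retrying dead ends.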

All of this happens autonomously. Claude Code is the orchestrator, but you could run the same loop in any agent runtime. The pattern transfers.

Going Live: 20 Minutes on Real Money

For the video I switched to the best-scoring strategy and ran it live with a $5 per-trade size. Wallet at the start: $150.

I missed the first trade (it executed before I caught it on screen), but the result was a 5-cent edge, a successful trade, and the balance back up. The second trade had a small edge of 0.0010; the bot entered at 99 / 50, the balance dropped to $146, and the position resolved as a win.

The third trade was the interesting one: the bot got in at 97, which means a 15-cent margin per dollar. It resolved as a win and the balance jumped meaningfully. The fourth trade hit a similar pattern.

By the end of about 20 minutes the bot was up $2. That is not life-changing, but the win rate was 100%, the strategy held up live, and the per-trade size scales linearly. The point of running live for 20 minutes was just to confirm that the dry-mode results survived contact with reality. They did.

Why the Pattern Generalizes

The whole reason I love this loop is that it works for anything with a measurable goal, whether that is model loss, security findings, or trading profit.

Karpathy's framing is the right one: don't write the strategy by hand, let the agent search the space. Your job is to define the goal and the evaluator. The agent finds the path.

What's Next

I am going to keep running the experiments and see if the score keeps climbing. I would also like to try this on a different market — maybe 1-hour or daily windows where the arbitrage edges are smaller but the data less noisy. If that works, the pattern generalizes from Bitcoin micro-arb to broader crypto / sports / event markets.

I recommend going through Karpathy's autoresearch repo if you want to see the original implementation. The loop is small and clean, and you can adapt it to almost any domain in an afternoon.

Resources