Karpathy's Autoresearch on My AI Polymarket Trading Bot
Andrej Karpathy recently posted an autoresearch project — a small evolutionary loop that mutates code, evaluates it, keeps the better attempts, and discards the worse ones. He used it to train a nanoGPT model. I wanted to take that same pattern and apply it to a totally different domain: my Polymarket arbitrage trading bot.
The bot tries to find arbitrage on the 5-minute Bitcoin up/down market. The strategy logic is hard to tune by hand because the data is noisy and the windows are short. So instead of tuning manually, I let the autoresearch loop tune it for me — and then ran it live for 20 minutes on real money.
Watch the video:
The Autoresearch Loop, Adapted for Trading
The structure is the same as Karpathy's project, but the components are domain-specific:
- Repo — git, with one commit per experiment, so the agent has a memory of what was tried
- training_program.md — the playbook. Defines how experiments are chosen, how they are run, how they are scored, and the keep/discard logic
- Bot — the live environment. Polymarket 5-minute Bitcoin up/down, run in dry mode for testing
- Evaluator — a score function that judges each strategy variation
- Confirmation step — if a result looks unusually strong, re-run it once. Polymarket data is noisy enough that single-run scores lie.
Each experiment runs for 1 hour, then commits its result. If the score improves, the strategy code is kept; if not, it is discarded and the agent picks a different direction. This is the same pattern I used for the autoresearch security tester — different goal, same loop.
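To make the shape concrete, here is a minimal sketch of one loop iteration in Python. The callables are hypothetical stand-ins supplied by the agent harness; this is not Karpathy's code or my exact bot, just the structure under those assumptions:

```python
import subprocess
from typing import Callable

def run_experiment(
    propose: Callable[[], str],           # agent picks a direction to try
    mutate: Callable[[str], None],        # agent edits strategy.py in place
    run_dry: Callable[[float], object],   # dry-mode run for N hours, returns results
    evaluate: Callable[[object], float],  # the score function
    best_score: float,
    hours: float = 1.0,
    confirm_margin: float = 0.10,
) -> float:
    """One iteration: mutate, evaluate, confirm if suspiciously good, keep or discard."""
    idea = propose()
    mutate(idea)
    score = evaluate(run_dry(hours))

    # Confirmation step: single-run scores lie on noisy Polymarket data,
    # so an unusually strong result gets one re-run and we keep the worse score.
    if score > best_score + confirm_margin:
        score = min(score, evaluate(run_dry(hours)))

    kept = score > best_score
    if not kept:
        # Discard the mutation; the commit below still records that it was tried.
        subprocess.run(["git", "checkout", "--", "strategy.py"], check=True)
    with open("experiments.md", "a") as log:
        log.write(f"- {idea}: score={score:.2f} ({'kept' if kept else 'discarded'})\n")
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"exp: {idea} | score={score:.2f}"], check=True)
    return max(score, best_score)
```

One commit per experiment, kept or not, is what gives the agent its memory: the git log doubles as the record of dead ends.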
The Dashboard
I built a small dashboard to track everything. For a typical run it shows:
- Uptime (e.g. 37 minutes into a 1-hour window)
- Windows passed (each is 5 minutes, so 8 windows = 40 minutes)
- Trades executed and fill rate
- Win rate (always 100% for arbitrage, by design)
- Experiment history — every commit, its score, kept-or-discarded status, and a one-line description
- A score history graph so I can see whether we are trending up
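Nothing exotic is behind those numbers. A hypothetical sketch of the snapshot the dashboard renders, with field names that are mine rather than the bot's:

```python
from dataclasses import dataclass

@dataclass
class DashboardSnapshot:
    uptime_min: int       # minutes into the current 1-hour experiment
    windows_passed: int   # 5-minute windows elapsed (8 windows = 40 minutes)
    orders: int           # orders submitted
    fills: int            # orders actually filled
    wins: int             # filled trades that resolved profitably

    @property
    def fill_rate(self) -> float:
        return self.fills / self.orders if self.orders else 0.0

    @property
    def win_rate(self) -> float:
        # 100% by construction when every fill is a true arbitrage pair
        return self.wins / self.fills if self.fills else 0.0
```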
The strategies the agent has been trying include logic experiments, hypothesis-driven strategy functions, asymmetry filters, and spread-relative-to-edge filters. I had Claude Code and Codex collaborate on experiment directions — Claude Code (Opus 4.6) writing the actual code mutations, Codex (GPT 5.4) brainstorming strategy ideas — the same combination I covered in the autoresearch hacker post.
How an Experiment Lifecycle Works
I let one experiment finish on camera so you can see what happens at the boundary. Roughly: the 1-hour timer expires, the bot stops trading, and the evaluator runs and scores the result. Experiment 16 scored 0.07 below the best-kept strategy, with low trade frequency and a weak cross-window fit — discarded. The agent wrote a note ("both asymmetry experiments produce best nesting...") into the history, picked a different direction (a spread-relative-to-edge filter), updated the strategy code, committed, and started experiment 17.
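The score that decides keep-versus-discard is a single scalar. I won't reproduce the real evaluator here, but given that experiment 16 was penalized for low frequency, a plausible shape is a weighted blend like this (the terms and weights are illustrative assumptions; the real ones live in training_program.md):

```python
def evaluate(result) -> float:
    """Score a 1-hour dry run; higher is better.

    Illustrative weighting only. `result` is assumed to expose the
    run's gross captured edge, trade count, and windows elapsed.
    """
    edge_per_trade = result.gross_edge / max(result.trades, 1)
    frequency = result.trades / max(result.windows, 1)  # trades per 5-min window
    return 0.7 * edge_per_trade + 0.3 * frequency
```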
All of this happens autonomously. Claude Code is the orchestrator, but you could run the same loop in any agent runtime. The pattern transfers.
Going Live: 20 Minutes on Real Money
For the video I switched to the best-scoring strategy and ran it live with a $5 per-trade size. Wallet at the start: $150.
I missed the first trade — it executed before I caught it on screen — but the result was a 5-cent edge, a successful trade, and the balance back up. Second trade: an edge of 0.0010 (small), entered at 99 / 50; the balance dropped to $146 while the position was open, then it resolved as a win.
The third trade was the interesting one — it got in at 97, a 3-cent margin per dollar, or about 15 cents on the $5 trade. It resolved as a win and the balance jumped meaningfully. The fourth trade hit a similar pattern. By the end of about 20 minutes:
- 5 trades, 5 wins (arbitrage, by design)
- Balance: $150 → $152
- ~$2 profit on $5/trade sizing
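The margin math behind those trades is mechanical: buy both sides of the up/down pair, and any combined entry under $1.00 locks in the difference, because a resolved pair always pays out exactly $1.00. For the third trade:

```python
entry_combined = 0.97   # combined cost of the up + down pair, per $1 of payout
payout = 1.00           # a resolved pair always pays exactly $1.00
trade_size = 5.00       # dollars committed per trade in the live run

pairs_bought = trade_size / entry_combined
profit = pairs_bought * payout - trade_size
print(f"${profit:.2f}")  # ~$0.15, the roughly 15-cent margin on the $5 trade
```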
$2 in 20 minutes is not life-changing, but the win rate was 100%, the strategy held up live, and profit scales linearly with per-trade size. The point of running for 20 minutes was just to confirm that the dry-mode results survived contact with reality. They did.
Why the Pattern Generalizes
The whole reason I love this loop is that it works for anything with a measurable goal:
- Trading strategies (this video)
- Security testing (the vibecoded site breach attempt)
- Drawing replicas (the JS Paint experiment)
- Code optimization (the original Karpathy use case)
- Prompt engineering, model fine-tuning, hyperparameter search — anywhere you have a tight evaluator and a mutate-able artifact
Karpathy's framing is the right one: don't write the strategy by hand, let the agent search the space. Your job is to define the goal and the evaluator. The agent finds the path.
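If you want to port the pattern, the contract is small. A hypothetical minimal interface: you supply the two callables, the loop does the rest.

```python
from typing import Callable, TypeVar

Artifact = TypeVar("Artifact")  # strategy code, a prompt, a config, a drawing ...

def search(
    seed: Artifact,
    mutate: Callable[[Artifact], Artifact],  # your job: propose a variant
    evaluate: Callable[[Artifact], float],   # your job: the measurable goal
    steps: int = 50,
) -> Artifact:
    """Greedy hill-climb: keep a mutation only when it scores strictly better."""
    best, best_score = seed, evaluate(seed)
    for _ in range(steps):
        candidate = mutate(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```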
What's Next
I am going to keep running the experiments and see if the score keeps climbing. I would also like to try this on a different market — maybe 1-hour or daily windows, where the arbitrage edges are smaller but the data is less noisy. If that works, the pattern generalizes from Bitcoin micro-arb to broader crypto, sports, and event markets.
I recommend going through Karpathy's autoresearch repo if you want to see the original implementation. The loop is small and clean — you can adapt it to almost any domain in an afternoon.
Resources
- My GitHub — repos and code samples.