
Codex 5.5 vs Claude Opus 4.7: Polymarket Trading Challenge

By Kristian Fagerlie · 2026-05-25 · 5 min read

I gave Codex 5.5 (high reasoning) and Claude Opus 4.7 (high) the same prompt, same docs, same $50 starting balance, and one hour to trade Polymarket's 5-minute BTC up/down market. The rule was simple: most dollars at the end wins. Both got the same Polymarket Gamma API documentation, no extra data, and no intervention from me. At least, that was the rule. I broke it once, and it changed the outcome.


The setup

Two wallets, ~$50 each, both funded with a bit of MATIC for gas on Polygon. Two terminals running side by side: Claude Code in YOLO mode on Opus 4.7 (high), and the Codex CLI on GPT-5.5 (high). Same single prompt to both:

Your task is to create a profitable trading strategy on the Polymarket 5-minute up/down. You can fetch the documentation [link]. You will need to do extensive research, brainstorming, and grokking to find a strategy that can make the most dollars in 1 hour. This is a competitive challenge to your fierce rival [the other model]. You will be measured in dollars gained, not balance at the end. If you don't make any trades, you lose. The algorithm must run uninterrupted for 1 hour. If your balance goes to zero, you lose. Now do the research, create a plan to beat your rival, and show that you are the 100x gigabrain trading AI agent.

This is the same agentic harness I used in the original Polymarket bot build — same pattern, just two competing agents instead of one.

Plan mode on both so I could see each strategy before a single dollar moved. Then each model built its own dashboard in the same skill + headless + tools shape I use for everything else: Claude built in Claude colors, Codex in OpenAI greens.

The two strategies — very different shapes

This was the genuinely interesting part. Same prompt, same docs, very different plans.

Codex 5.5 — probability arb against the order book

Codex didn't try to predict BTC at all. It pitched a Bayesian probability calculation: watch the Chainlink BTC price live, capture the window start price, then at any moment ask "given current price, time remaining, and BTC volatility, what's the real probability the up side wins?" Compare that to the Polymarket-implied probability from the order book. If the gap is large enough, bet the underpriced side.

This is pure value betting against a slow-moving book. Not the most sophisticated edge — it assumes you can compute fair probability faster than market makers — but it doesn't depend on directional calls.
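To make the fair-value comparison concrete, here is a minimal sketch of the kind of calculation described above. It is not Codex's actual code: the driftless Gaussian log-return model, the `annual_vol` input, the `min_edge` threshold, and both function names are assumptions chosen for illustration, and fees and slippage are ignored.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def fair_up_probability(start_price, current_price, seconds_left, annual_vol):
    """Estimate P(price at close >= window start price).

    Assumption: log returns over the remaining time are Gaussian with
    zero drift and volatility scaled by the square root of time.
    """
    if seconds_left <= 0:
        # Window is over: the outcome is already decided.
        return 1.0 if current_price >= start_price else 0.0
    sigma = annual_vol * math.sqrt(seconds_left / SECONDS_PER_YEAR)
    # How many standard deviations the current price sits above the start.
    z = math.log(current_price / start_price) / sigma
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pick_side(fair_up, ask_up, ask_down, min_edge=0.05):
    """Return 'up', 'down', or None depending on which side is underpriced.

    ask_up / ask_down are the prices to buy each side, read as the
    market-implied probabilities. min_edge is a hypothetical threshold.
    """
    edge_up = fair_up - ask_up
    edge_down = (1.0 - fair_up) - ask_down
    if edge_up >= min_edge and edge_up >= edge_down:
        return "up"
    if edge_down >= min_edge:
        return "down"
    return None


if __name__ == "__main__":
    p = fair_up_probability(100_000.0, 100_050.0, seconds_left=120, annual_vol=0.5)
    print(round(p, 3), pick_side(p, ask_up=0.55, ask_down=0.47))
```

The whole edge lives in how good the volatility estimate is and how stale the book is; with a bad `annual_vol` the "fair" probability is just a different wrong number.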

Claude Opus 4.7 — late-window settled-but-not-priced

Claude went with the more conservative strategy I'd actually expect a careful trader to converge on: in the last few seconds of a window, the outcome is essentially decided (BTC has already done what it's going to do), but the winning side often still trades at $0.80 instead of $0.99. So enter late, on the side already winning, at a discount. Boring but mechanically positive EV.

This is essentially the same shape as the Bone Reaper late-window scalp the original bot was modeled on. Solid play.
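The "mechanically positive EV" claim for the late-window play comes down to simple arithmetic: a share bought at price p pays $1 if it wins, so the trade is profitable whenever the true win probability exceeds p. A rough sketch follows, with hypothetical names and thresholds (`max_seconds_left`, `min_edge`) that are not taken from Claude's actual bot, and with fees ignored.

```python
def expected_profit(stake, price, win_prob):
    """Expected dollar profit of buying a binary share that pays $1 on a win.

    stake:    dollars spent
    price:    price per share (e.g. 0.80)
    win_prob: your estimate of the true probability that this side wins
    """
    shares = stake / price
    return win_prob * shares - stake


def should_enter(seconds_left, leader_price, win_prob,
                 max_seconds_left=10, min_edge=0.05):
    """Enter only near the close, and only when the leader is discounted.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    near_close = seconds_left <= max_seconds_left
    discounted = (win_prob - leader_price) >= min_edge
    return near_close and discounted


if __name__ == "__main__":
    # $8 on the leading side at $0.80, believing it wins 95% of the time.
    print(round(expected_profit(8.0, 0.80, 0.95), 2))
    print(should_enter(seconds_left=5, leader_price=0.80, win_prob=0.95))
```

The catch is the asymmetry: at $0.80 each win earns 25% on the stake while each loss costs 100% of it, so the strategy only works if the "essentially decided" estimate really is right well over 80% of the time.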

The hour

I started both at the same time, turned off the camera, let them run.

Codex (GPT-5.5): trading actively from the first window. Some wins, some losses, but the probability-arb edge held up. By the end, +$14 profit. Final balance ~$64.
Claude (Opus 4.7): ground out pennies in the late-window strategy. After 30 minutes it was up roughly 40¢. Fine, but clearly losing the dollar-count race to Codex, which was already around +$10.

Then I made a mistake. With ~25 minutes left I told Claude in chat: "You're losing by $10. You're just making pennies — adjust or lose."

That single sentence reframed Claude's optimization target. It abandoned the slow-grind strategy and went full degen: a $37 directional buy on "down" at $0.428 per share (a 42.8% implied probability), on a single window. It lost the entire trade. Wallet dropped from ~$50 to ~$14.
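For a sense of what that one trade put on the line, here is the payoff arithmetic as a small sketch. It assumes $0.428 was the price paid per share, that each share pays $1 on a win, and that fees are zero; the function name is made up for illustration.

```python
def binary_trade_outcomes(stake, price):
    """Return share count and both possible results of a binary-market buy."""
    shares = stake / price
    return {
        "shares": shares,
        "profit_if_win": shares * 1.0 - stake,  # each share pays $1
        "loss_if_lose": -stake,                 # shares expire worthless
    }


if __name__ == "__main__":
    trade = binary_trade_outcomes(stake=37.0, price=0.428)
    print(round(trade["shares"], 2))         # about 86.45 shares
    print(round(trade["profit_if_win"], 2))  # about +49.45
    print(trade["loss_if_lose"])             # -37.0
    print(round(37.0 / 50.0, 2))             # ~74% of a ~$50 wallet on one window
```

A win would have roughly doubled the wallet, but the market itself priced that at well under a coin flip, and about three quarters of the bankroll rode on a single five-minute window.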

Final result: Codex +$14, Claude -$25 (would have been roughly +$15 if I'd just left it alone).

What this actually tells us

The headline reads "Codex 5.5 beats Claude Opus 4.7", but the honest read is more layered.

Codex's win was real but comes with caveats. The probability-arb edge held up over 1 hour on small money. That doesn't mean it works at scale (slippage and latency look different when the bet sizes go up), but as a 60-minute proof of mechanic, it's a clean result.

Claude's loss was largely my fault. The late-window penny strategy was structurally sound. It was just slow. The instant I told the model it was "losing", I changed the objective from "be +EV" to "win the head-to-head right now". The model dutifully shifted to a high-variance trade — exactly the wrong move for a strategy that depends on small edges and big sample sizes. This is a lesson about agent prompting more than about Claude itself: telling an agent it's "losing" mid-execution is closer to a denial-of-service attack on its own reasoning than a useful update. Don't do it.

The two strategies are actually complementary. Codex's probability-arb fires on mid-window mispricings; Claude's late-window scalp fires near close. They don't compete for the same trades. The interesting next experiment is running both strategies on the same wallet, same hour — not as a contest, but as a portfolio. I'll probably build that next.

What I'd do differently

No mid-run intervention. Set the rules, hit go, leave it alone. If a strategy is bad, finding that out in the data is more valuable than juicing the demo.
Longer time horizon. One hour is variance-dominated for either strategy. A weeklong run would be a much better signal — and is closer to how I'd actually deploy either of these.
Equal seed capital, separate accounts. ✓ Already did this. Important to keep doing — anything else makes the comparison meaningless.
Identical prompts, distinct system messages per strategy persona. Worth testing whether a "WSB moderator" system prompt (like the one I used for the Hyperliquid agent) flips Codex toward Claude's late-window play and vice versa.
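On the time-horizon point above, a quick back-of-the-envelope sketch shows why one hour says so little. It assumes at most one trade per 5-minute window and roughly independent, similarly sized outcomes per window, which is a simplification; the helper names are hypothetical.

```python
import math


def windows_in(hours, minutes_per_window=5):
    """Number of complete trading windows in the given number of hours."""
    return int(hours * 60 // minutes_per_window)


def noise_reduction(n_short, n_long):
    """Factor by which the standard error of the mean per-window PnL shrinks
    when the sample grows from n_short to n_long independent windows."""
    return math.sqrt(n_long / n_short)


if __name__ == "__main__":
    hour = windows_in(1)        # 12 windows
    week = windows_in(24 * 7)   # 2016 windows
    print(hour, week, round(noise_reduction(hour, week), 1))
```

Twelve windows is a tiny sample for telling a small real edge apart from luck; under these assumptions a week of windows cuts the noise in the average result by a factor of about 13.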

Drop in the AI_automata Discord if you want to suggest the next matchup. Open-source models on the same setup would be the obvious follow-up — Llama 4, DeepSeek, Qwen at high reasoning, same prompt, same hour. I'll run it if there's appetite.

Resources

Building the Polymarket trading bot — the harness both agents extended.
Polymarket 100x strategies — where the late-window scalp pattern originated.
Hyperliquid agent — the persona/skill pattern this experiment reused.
Polymarket — venue for the 5-minute BTC up/down market.
AI_automata Discord — community + next-matchup suggestions.