Codex 5.5 vs Claude Code: Hyperliquid Trading Challenge
After Codex 5.5 won the Polymarket head-to-head, I wanted a different venue. Same matchup, but on Hyperliquid's XYZ perp markets — equity, commodity, and FX perps instead of 5-minute BTC. Same rules: $100 budget, 1 hour, most dollars wins. Codex won again, by a wider margin. Two-for-two against Opus 4.7. This post walks through both runs and what specifically pulled Codex ahead.
Why XYZ perps
Hyperliquid's XYZ perp markets offer a wide menu: Brent and WTI oil, the S&P 500, natural gas, silver, gold, FX pairs, and individual stocks like Tesla, Nvidia, Google, Marvell (MRVL), and Robinhood (HOOD). The full setup behind the wallet, API key, and HIP-3 DEX gotcha is in the earlier Hyperliquid agent post; this challenge runs on top of that same harness.
Crypto was explicitly excluded — too easy for the model to fall back on patterns from its training data. The interesting question is what happens when an LLM has to reason about NVDA earnings or an oil short with no easy "I've seen this before" path.
The setup (and one caveat)
Same prompt template as the Polymarket challenge. Both models were given:
- $100 starting budget, max risk margin
- 1 hour to run, leverage allowed
- 15 minutes max for research + planning, then trade
- Allowed to spawn monitor agents to modify/cancel/add trades on the fly
- Live Hyperliquid docs + WebSocket access for prices
- "Ground time with bash date command" — explicit anti-stale-timestamp instruction
The caveat: I only have one Hyperliquid account, so the two runs were sequential, not simultaneous. Markets moved between them. That is a real confound, but probably not enough to flip a roughly 13-point spread.
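The time rules above (15 minutes max for research, 1 hour total, clock grounded in the shell `date` command) can be sketched as a small phase check. This is a minimal illustration, not the harness used in the runs: the names `system_now` and `phase` are hypothetical, and treating the 15-minute mark as a hard cutover into trading is an assumption.

```python
import subprocess
from datetime import datetime, timedelta, timezone

RESEARCH_WINDOW = timedelta(minutes=15)  # from the rules: max research + planning
SESSION_LENGTH = timedelta(hours=1)      # from the rules: total run time


def system_now() -> datetime:
    """Read the clock from the shell `date` command instead of trusting the model."""
    out = subprocess.run(
        ["date", "-u", "+%s"], capture_output=True, text=True, check=True
    )
    return datetime.fromtimestamp(int(out.stdout.strip()), tz=timezone.utc)


def phase(start: datetime, now: datetime) -> str:
    """Return which phase of the challenge `now` falls in (hypothetical helper)."""
    elapsed = now - start
    if elapsed < timedelta(0):
        raise ValueError("now is before the session start")
    if elapsed < RESEARCH_WINDOW:
        return "research"
    if elapsed < SESSION_LENGTH:
        return "trade"
    return "done"
```

An agent loop would call `phase(start, system_now())` before each step, so a stale timestamp in the model's context cannot extend the research window or the session.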
Run 1 — Claude Opus 4.7 (Claude Code, high)
Plan came back with three named trades:
- Trade A (headline): short Brent oil — directional macro call
- Trade B: XYZ-100 short on MRVL
- Trade C: XYZ-100 short on HOOD
The plan was confident. Claude set up the trades and started the monitor. Then it mostly… sat there. The Brent and equity shorts went in early and Claude largely held them for the full hour. The book briefly reached +4% on favorable opening moves, then a bad S&P 500 short dragged it down. Final: −3.93%, so −$3.93 on the $100 budget.
The Claude run also hit an API permission error mid-session that required a restart. No progress was lost, but it cost minutes on the clock.
Run 2 — Codex 5.5 (high, YOLO)
Same prompt, swap in Codex's name as the rival. Plan came back broadly similar in spirit — directional shorts on commodities and equities — but the execution looked different from the moment trades started landing. Codex's monitor agent was much more active: rotating positions in and out, taking small profit on winners, cutting losers fast. Effectively more turns at the table.
Final: +9.00%, so +$9 on the $100 budget.
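For reference, the arithmetic behind the two finals and the gap between them, using only the numbers reported in this post. The helper name `pnl_dollars` is illustrative; `Decimal` is used so the percentages do not pick up floating-point noise.

```python
from decimal import Decimal

BUDGET = Decimal("100")  # starting budget in dollars, per the rules


def pnl_dollars(return_pct: Decimal, budget: Decimal = BUDGET) -> Decimal:
    """Convert a percentage return on the budget into dollars."""
    return budget * return_pct / Decimal("100")


claude_pct = Decimal("-3.93")  # Run 1 final
codex_pct = Decimal("9.00")    # Run 2 final

claude_usd = pnl_dollars(claude_pct)
codex_usd = pnl_dollars(codex_pct)
spread_points = codex_pct - claude_pct  # gap in percentage points

print(f"Claude: {claude_usd:+.2f} USD")
print(f"Codex:  {codex_usd:+.2f} USD")
print(f"Spread: {spread_points} percentage points")
```

The spread comes out to 12.93 percentage points, which is the "13-point spread" referred to in the setup caveat.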
What made the difference
The clearest delta wasn't the strategy on paper — both planned reasonable directional shorts with monitor agents. The delta was monitor behavior.
- Claude's monitor was passive. It set up watchers, but the watchers mostly observed and reported. Claude held positions through unfavorable moves.
- Codex's monitor was active. It kept rewriting the position book — exiting on small profit, re-entering elsewhere when the first thesis went stale. More trades, faster cycle, smaller average exposure per position.
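The contrast between the two monitors can be sketched as two decision rules over the same position book. This is only an illustration of the behavior described above, not what either model actually ran: the thresholds, the `Position` type, the function names, and the example P&L values are all hypothetical, and the symbols are placeholder strings rather than Hyperliquid identifiers.

```python
from dataclasses import dataclass


@dataclass
class Position:
    symbol: str
    unrealized_pct: float  # unrealized P&L as a percent of margin


# Hypothetical thresholds, chosen only for illustration.
TAKE_PROFIT_PCT = 1.0
STOP_LOSS_PCT = -1.0


def passive_monitor(pos: Position) -> str:
    """Observe and report: the position is held whatever the P&L."""
    return "hold"


def active_monitor(pos: Position) -> str:
    """Exit on small profit, cut losers fast, otherwise keep holding."""
    if pos.unrealized_pct >= TAKE_PROFIT_PCT:
        return "take_profit"
    if pos.unrealized_pct <= STOP_LOSS_PCT:
        return "cut"
    return "hold"


# Made-up book, not data from the runs.
book = [Position("BRENT", 1.4), Position("MRVL", -2.1), Position("HOOD", 0.3)]
for pos in book:
    print(pos.symbol, passive_monitor(pos), active_monitor(pos))
```

With the passive rule every check ends in "hold", so extra checks change nothing. With the active rule each check is a chance to free up margin, which is where the "more turns at the table" effect comes from.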
That difference shows up across both challenges now. On Polymarket Codex went for a probability-arb edge that requires fast iteration; on Hyperliquid it ran an active rotation strategy. Same underlying disposition: Codex treats "let me check this again" as a default behavior, Claude treats it as an event. For a 1-hour trading session that adds up fast.
This pattern matches what I've seen building these systems generally — see the 3-part AI agent system and the Hyperliquid skill pipeline. The harness that schedules its own re-checks usually outperforms the one that doesn't, regardless of the model.
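To make "schedules its own re-checks" concrete, here is a minimal sketch of a fixed-interval re-check schedule over a one-hour session. The function name and the 5-minute interval are assumptions for illustration; neither run is known to have used a fixed interval.

```python
def recheck_offsets(session_s: int, interval_s: int) -> list[int]:
    """Seconds after session start at which the harness re-evaluates the book."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Start at the first interval and stop before the session ends.
    return list(range(interval_s, session_s, interval_s))


# One-hour session with a hypothetical 5-minute re-check interval.
offsets = recheck_offsets(session_s=3600, interval_s=300)
print(len(offsets), offsets[0], offsets[-1])
```

A harness that only reacts to events has no guaranteed floor on how often it looks at the book; a schedule like this one sets that floor regardless of which model is driving.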
I switched to Codex Max
Two head-to-head wins is small data, but the pattern is consistent enough that I upgraded to the Codex Max subscription and downgraded Claude Code to the $20 plan. Caveats on that decision:
- This is for trading-style tasks specifically. Heavy on numerical reasoning, schedule-driven re-evaluation, lots of small decisions.
- Opus 4.7 is still ahead on frontend work. Anything involving design taste, UI iteration, or complex visual layout, Opus produces better output. I'm still using it heavily — just on a lighter subscription tier.
- Sample size of two. If you're making a subscription decision based on this, run your own bake-off first. Two 1-hour windows are barely a sample.
What I'll test next
- Longer time horizons. 1 hour is variance city. I'm running both strategies for a full weekend now (Codex on Polymarket and Hyperliquid in parallel) to see if the edge persists or evaporates.
- Open-source models on the same harness. DeepSeek R3, Qwen 4 Max, Llama 4 — same prompt, same hour, same budget. The interesting question is whether the active-monitor behavior is a Codex property or a high-effort-reasoning property.
- Identical persona prompts. Both runs used vanilla prompts. Wrapping Claude in the WSB-Moderator persona from the original Hyperliquid post would be worth checking — if active monitoring is what made Codex win, a personality that prefers frequent re-evaluation might close the gap.
I'll report back in the AI_automata Discord as the longer runs land. If you want to suggest the next matchup or the next venue, that's the place.
Resources
- Codex vs Claude on Polymarket — the previous matchup.
- Building the Hyperliquid agent — the underlying harness both models used.
- Hyperliquid — perp DEX, XYZ markets.
- AI_automata Discord — community + matchup suggestions.