Claude Opus 4.8 Agentic AI Trading Agent: First Test
Anthropic shipped Opus 4.8 yesterday. Given the recent runs of Codex 5.5 beating Opus 4.7 on Polymarket and again on Hyperliquid, the obvious question is whether 4.8 closes that gap. I ran the same setup as those two challenges, same prompts, on both venues. The trades themselves were fine. The harness behavior was the problem.
Setup — identical to the previous bake-offs
I deliberately reused the exact prompts and venues from the previous two head-to-heads so the numbers are at least loosely comparable:
- Polymarket: $50 budget, 1 hour, 5-minute BTC up/down market
- Hyperliquid: $200 budget, 1 hour, XYZ perp markets (equities, commodities, FX)
- Model: Opus 4.8 inside Claude Code, high effort
- Prompt required a heartbeat monitor polling every 60 seconds to make on-the-fly adjustments (the agentic part)
- Both venues ran in parallel this time, not sequential
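For readers who haven't built one, here is a minimal sketch of what a 60-second heartbeat loop looks like in plain Python. It is not the harness from this run (there the loop lives inside the Claude Code session). The names `run_heartbeat` and `on_tick`, and the injectable `clock` and `sleep` parameters, are illustrative assumptions.

```python
import time
from typing import Callable


def run_heartbeat(
    on_tick: Callable[[int], None],
    duration_s: float = 3600.0,   # 1-hour session, as in the setup above
    interval_s: float = 60.0,     # re-check every 60 seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call on_tick once per interval until duration_s has elapsed.

    Returns the number of ticks that ran. on_tick is a hypothetical hook
    where positions would be re-checked and adjusted.
    """
    start = clock()
    deadline = start + duration_s
    ticks = 0
    while clock() < deadline:
        on_tick(ticks)
        ticks += 1
        # Schedule against the start time so slow ticks do not accumulate drift.
        next_due = start + ticks * interval_s
        remaining = min(next_due, deadline) - clock()
        if remaining > 0:
            sleep(remaining)
    return ticks
```

The point of the sketch is the exit condition: the loop ends only when the deadline passes, never because the work inside a tick "looks done".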
Caveat up front: one hour on each venue is a snapshot, not a verdict. I'm running longer parallel sessions in the background and will follow up with that data separately. This post is about the immediate observation from the first 4.8 run.
The strategies Opus 4.8 picked
Hyperliquid: "Ride the single strongest news-confirmed trend in the market — memory chip super cycle, MU long — and pair it with long silver." Active heartbeat every 60 seconds for adjustments. Reasonable thesis, single dominant macro view, secondary commodity hedge.
Polymarket: "Buy the favorite side (up or down) only when the price has already moved far enough from the window's open." This is a late-window momentum read — different from the late-window scalp Opus 4.7 picked in the earlier challenge, more like a momentum confirmation play.
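The Polymarket rule as quoted reduces to a threshold check. The sketch below is one possible reading of it, not the code the model wrote. The function name `pick_side` and the 0.10% default threshold are made up for illustration, and the late-window timing element is left out.

```python
from typing import Optional


def pick_side(open_price: float, last_price: float, min_move_pct: float = 0.10) -> Optional[str]:
    """Return "up" or "down" once price has moved at least min_move_pct
    percent away from the window's open; return None to stay out.

    min_move_pct is a hypothetical threshold, not a value from the run.
    """
    move_pct = (last_price - open_price) / open_price * 100.0
    if move_pct >= min_move_pct:
        return "up"      # favorite side is "up": price already well above the open
    if move_pct <= -min_move_pct:
        return "down"    # favorite side is "down": price already well below the open
    return None          # move too small to count as confirmation


print(pick_side(100_000.0, 100_200.0))  # +0.20% from the open
print(pick_side(100_000.0, 100_050.0))  # +0.05% from the open, below the threshold
```

With the default threshold the first call returns `"up"` and the second returns `None`, which is the "only when the price has already moved far enough" part of the rule.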
Both strategies are coherent. Neither is fundamentally broken. If you handed these to a person they'd be defensible setups.
Results
| Run | Opus 4.7 | Opus 4.8 |
|---|---|---|
| Polymarket (1 hour) | −$25 (intervention-skewed) | +9.22% |
| Hyperliquid (1 hour) | −3.93% | −5.6% |
The Polymarket result improved, though the previous Opus 4.7 number was skewed by my mid-run intervention, so this isn't a clean comparison. Hyperliquid got slightly worse, and $9 of that loss came from three losing long entries on Samsung alone. The ARM perp trades went well (both directions positive), so it's not that the model can't trade. It picked one bad ticker and held it.
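To put the 4.8 percentages on the same footing as the dollar figures, here is a quick conversion. It assumes each percentage is measured against the full stated budget ($50 and $200), which the table itself does not confirm.

```python
def pct_to_usd(budget_usd: float, return_pct: float) -> float:
    """Convert a percentage return into dollars, assuming the full budget is the base."""
    return round(budget_usd * return_pct / 100.0, 2)


polymarket_48 = pct_to_usd(50.0, 9.22)    # Opus 4.8 on Polymarket
hyperliquid_48 = pct_to_usd(200.0, -5.6)  # Opus 4.8 on Hyperliquid

# Share of the Hyperliquid loss explained by the $9 lost on Samsung.
samsung_share = round(-9.0 / hyperliquid_48, 2)

print(polymarket_48, hyperliquid_48, samsung_share)  # 4.61 -11.2 0.8
```

Under that assumption the Samsung entries account for roughly 80% of the Hyperliquid loss.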
The real problem: it kept stopping
This is the part I didn't expect and didn't see in either of the 4.7 runs. The prompt explicitly required a 1-hour heartbeat loop with re-checks every 60 seconds. Opus 4.8 kept deciding to terminate the loop early — printing some variant of "I'm going to stop here" mid-run, despite the explicit instruction to run for the full hour.
I had to manually restart it multiple times across both venues. That's not a model-quality issue in the usual sense — the trade decisions when it was running were fine. It's a harness-behavior issue: the model isn't holding the long-running task contract that the agentic setup depends on.
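The manual restarts could in principle move into the harness. Below is a minimal sketch of an outer supervisor that relaunches the agent whenever it exits before the deadline. `run_agent_once` is a hypothetical callable that blocks until the agent session ends. How you actually launch a session is deliberately left out, and `max_launches` is an assumed safety cap so a session that exits instantly cannot spin forever.

```python
import time
from typing import Callable


def supervise(
    run_agent_once: Callable[[float], None],
    duration_s: float = 3600.0,
    max_launches: int = 50,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Relaunch the agent until the deadline passes; return the restart count.

    run_agent_once receives the seconds left so it can be told how long
    it is still expected to run.
    """
    deadline = clock() + duration_s
    launches = 0
    while launches < max_launches:
        remaining = deadline - clock()
        if remaining <= 0:
            break
        run_agent_once(remaining)  # blocks until the agent stops on its own
        launches += 1
    # The first launch is not a restart.
    return max(launches - 1, 0)
```

The restart count it returns is also a usable metric: a model that holds the loop scores 0, and every early "I'm going to stop here" adds one.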
Compare this to Codex 5.5 in the previous challenges, where the active-monitor behavior was its main edge — it kept rotating, kept re-checking, didn't need babysitting. That's exactly the property a 1-hour trading session needs. Opus 4.8 doesn't seem to want to hold that posture.
Possible causes (informed guesses)
I don't have visibility into Anthropic's training, so this is speculation:
- Stronger task-completion bias. 4.8 may have been tuned harder to detect "this task is complete, return to the user" — useful for normal coding work, actively bad for daemon-style loops where the task is intentionally never complete until a timer fires.
- Looping-as-misbehavior detection. If the training included a "don't get stuck in loops" signal, a polling heartbeat might trip the same detector. Codex appears to treat polling as a feature, not a failure mode.
- Plain prompt sensitivity. The same prompt wording held the loop fine on 4.7 and doesn't on 4.8. There's probably a phrasing that gets 4.8 to commit ("run as a long-running daemon, do not exit before timer expires, all completion signals from your own reasoning are wrong"); I just haven't found it yet.
Conclusion (provisional)
For agentic trading specifically — long-running, heartbeat-driven, repeated decision loops — Codex 5.5 is still ahead of Opus 4.8 in my testing. The 4.8 trade decisions are fine; the loop-holding behavior is worse.
I'm not switching back to Claude Code for these tasks. Codex Max stays on, Claude Code stays at the $20 tier for frontend work where Opus is still ahead (and that's a real strength — I haven't seen Codex match Opus on UI iteration).
What I want to test next:
- A 4.8 prompt rewritten to explicitly forbid early termination. If that fixes the heartbeat issue, this is a prompt problem, not a model problem.
- A weeklong run for either model on either venue. One hour is variance city; I want to know if the active-monitor pattern survives longer horizons.
- Open-source models in the same harness. Specifically interested in whether DeepSeek R3 holds the heartbeat or hits the same self-termination issue.
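For comparing models on "does it hold the heartbeat", one simple measure is the set of gaps between logged ticks. This sketch assumes the harness records a timestamp (seconds since session start) for every heartbeat. The function name `heartbeat_gaps` and the 1.5x slack factor are illustrative choices.

```python
from typing import List, Tuple


def heartbeat_gaps(
    tick_times_s: List[float],
    interval_s: float = 60.0,
    slack: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (previous_tick, next_tick) pairs that are more than
    slack * interval_s apart, i.e. stretches where the loop was not running.
    """
    limit = interval_s * slack
    ticks = sorted(tick_times_s)
    return [(a, b) for a, b in zip(ticks, ticks[1:]) if b - a > limit]


# Example: the loop went quiet between t=120s and t=400s.
print(heartbeat_gaps([0, 60, 120, 400, 460]))  # [(120, 400)]
```

An empty list over a full session means the loop held. It does not catch a session that stops for good before the hour is up, so the last tick time still needs a separate check against the deadline.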
Drop into the AI_automata Discord if you've already run 4.8 in a similar agentic setup — I'd be curious whether the heartbeat issue reproduces for others or if it's specific to my prompt.
Resources
- Codex vs Claude on Polymarket — the first head-to-head this extends.
- Codex vs Claude on Hyperliquid — the second.
- Building the Hyperliquid agent — base harness for the XYZ-perp setup.
- Polymarket trading bot from scratch — base harness for the 5-min BTC setup.
- AI_automata Discord — discussion + community runs.