All About AI

Claude Opus 4.8 Agentic AI Trading Agent: First Test

Anthropic shipped Opus 4.8 yesterday. Given the recent runs of Codex 5.5 beating Opus 4.7 on Polymarket and again on Hyperliquid, the obvious question is whether 4.8 closes that gap. I ran the same setup as those two challenges, same prompts, on both venues. The trades themselves were fine. The harness behavior was the problem.

Setup — identical to the previous bake-offs

I deliberately reused the exact prompts and venues from the previous two head-to-heads, Polymarket and Hyperliquid for one hour each, so the numbers are at least loosely comparable.

Caveat up front: one hour on each venue is a snapshot, not a verdict. I'm running longer parallel sessions in the background and will follow up with that data separately. This post is about the immediate observation from the first 4.8 run.

The strategies Opus 4.8 picked

Hyperliquid: "Ride the single strongest news-confirmed trend in the market — memory chip super cycle, MU long — and pair it with long silver." Active heartbeat every 60 seconds for adjustments. Reasonable thesis, single dominant macro view, secondary commodity hedge.

Polymarket: "Buy the favorite side (up or down) only when the price has already moved far enough from the window's open." This is a late-window momentum-confirmation play, different from the late-window scalp Opus 4.7 picked in the earlier challenge.
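To make the Polymarket rule concrete, here is a minimal sketch of how I read it. This is my paraphrase, not the code the model wrote: `pick_side` and `min_move` are hypothetical names, and I'm assuming "price" means the underlying price tracked over the window.

```python
from typing import Optional


def pick_side(open_price: float, current_price: float, min_move: float) -> Optional[str]:
    """Return "up", "down", or None (no trade) for one market window.

    min_move is a hypothetical threshold in the same units as the prices.
    """
    move = current_price - open_price
    if abs(move) < min_move:
        return None  # price has not moved far enough from the open: stay out
    # Buy whichever side the move already favors.
    return "up" if move > 0 else "down"
```

The whole strategy lives in the threshold: set it too low and this degenerates into chasing noise, too high and it never trades inside the window.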

Both strategies are coherent. Neither is fundamentally broken. If you handed these to a person they'd be defensible setups.

Results

| Run | Opus 4.7 | Opus 4.8 |
| --- | --- | --- |
| Polymarket (1 hour) | −$25 (intervention-skewed) | +9.22% |
| Hyperliquid (1 hour) | −3.93% | −5.6% |

The Polymarket result improved, though the previous Opus 4.7 number was skewed by my mid-run intervention, so this isn't a clean comparison. Hyperliquid got slightly worse (−5.6% vs. −3.93%): three losing long entries on Samsung alone accounted for −$9. The ARM perp trades went well (both directions positive), so it's not that the model can't trade. It picked one bad ticker and held it.

The real problem: it kept stopping

This is the part I didn't expect and didn't see in either of the 4.7 runs. The prompt explicitly required a 1-hour heartbeat loop with re-checks every 60 seconds. Opus 4.8 kept deciding to terminate the loop early — printing some variant of "I'm going to stop here" mid-run, despite the explicit instruction to run for the full hour.

I had to manually restart it multiple times across both venues. That's not a model-quality issue in the usual sense — the trade decisions when it was running were fine. It's a harness-behavior issue: the model isn't holding the long-running task contract that the agentic setup depends on.
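One way to take the manual restarts out of the picture is to enforce the one-hour contract in the harness instead of trusting the model to hold it. The sketch below is not my actual setup: `agent_step` is a hypothetical callable that runs one model turn and returns `False` when the model tries to end the session, and the `clock` and `sleep` parameters are only there so the loop can be tested without waiting an hour.

```python
import time
from typing import Callable


def run_heartbeat(
    agent_step: Callable[[], bool],
    duration_s: float = 3600.0,
    interval_s: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call agent_step every interval_s seconds until duration_s has elapsed.

    agent_step (hypothetical) returns False if the model tried to stop early.
    The loop counts that and keeps going, so the deadline is enforced here
    and not by the model.
    """
    start = clock()
    ticks = 0
    early_stops = 0
    while clock() - start < duration_s:
        wants_to_continue = agent_step()
        ticks += 1
        if not wants_to_continue:
            early_stops += 1  # record it, but do not exit the loop
        remaining = duration_s - (clock() - start)
        if remaining <= 0:
            break
        sleep(min(interval_s, remaining))
    return {"ticks": ticks, "early_stops": early_stops}
```

With something like this, an "I'm going to stop here" becomes a counter increment rather than a dead session, and `early_stops` gives a number to compare across models.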

Compare this to Codex 5.5 in the previous challenges, where the active-monitor behavior was its main edge — it kept rotating, kept re-checking, didn't need babysitting. That's exactly the property a 1-hour trading session needs. Opus 4.8 doesn't seem to want to hold that posture.

Possible causes (informed guesses)

I don't have visibility into Anthropic's training, so this is speculation:

Conclusion (provisional)

For agentic trading specifically — long-running, heartbeat-driven, repeated decision loops — Codex 5.5 is still ahead of Opus 4.8 in my testing. The 4.8 trade decisions are fine; the loop-holding behavior is worse.

I'm not switching back to Claude Code for these tasks. Codex Max stays on, and Claude Code stays at the $20 tier for frontend work, where Opus is still ahead. That's a real strength: I haven't seen Codex match Opus on UI iteration.

What I want to test next:

  1. A 4.8 prompt rewritten to explicitly forbid early termination. If that fixes the heartbeat issue, this is a prompt problem, not a model problem.
  2. A weeklong run for either model on either venue. One hour is variance city; I want to know if the active-monitor pattern survives longer horizons.
  3. Open-source models in the same harness. Specifically interested in whether DeepSeek R3 holds the heartbeat or hits the same self-termination issue.
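For the longer runs and the cross-model comparison, "holds the heartbeat" needs to be a number rather than an impression. A rough sketch of how that could be scored, assuming the harness logs the time of every re-check as seconds since the start of the run; the function and field names are illustrative, not something I have in place yet.

```python
from typing import Sequence


def heartbeat_report(tick_times_s: Sequence[float], duration_s: float, interval_s: float) -> dict:
    """Summarise how well a run held its heartbeat.

    tick_times_s: seconds since run start at which a re-check actually fired,
    sorted ascending. All names here are illustrative.
    """
    expected = int(duration_s // interval_s)
    # Gaps between consecutive ticks, plus start -> first tick and last tick -> end.
    edges = [0.0, *tick_times_s, duration_s]
    longest_gap = max(b - a for a, b in zip(edges, edges[1:]))
    coverage = min(1.0, len(tick_times_s) / expected) if expected else 0.0
    return {
        "expected": expected,
        "actual": len(tick_times_s),
        "coverage": coverage,
        "longest_gap_s": longest_gap,
    }
```

Coverage says how many of the scheduled re-checks happened; the longest gap says how long a position could have sat unmonitored, which is the figure that matters for a live trade.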

Drop into the AI_automata Discord if you've already run 4.8 in a similar agentic setup — I'd be curious whether the heartbeat issue reproduces for others or if it's specific to my prompt.

Resources