All About AI

3 AI Agent Browser Automation Challenges That Keep Getting Harder

2026-03-08

I wanted to find the hardest UI I knew of and point my browser automation agent at it. The answer was easy: AWS console. The console is a maze even for experienced users, and most "control my browser" agents fall over the first time they try to navigate it. So I designed three escalating AWS challenges and let the agent fight through them.

The results were better than I expected — and revealing about where these agents are now.


The Three Challenges

  1. Level 1: Create an S3 bucket, upload an image, and launch a static web page that displays the image with some text.
  2. Level 2: Launch a free-tier Linux VM, make it accessible over a graphical remote desktop, get it online, and use its browser to open a YouTube video about Claude Code.
  3. Level 3: Build and publish a small web app where users can upload a video and then get a public page where the uploaded video can be played back.

The constraint: only the AWS console in the browser. No local CLI shortcuts allowed (in theory).

Level 1: S3 Static Site

Pure browser navigation. The agent went straight to S3, found "Create bucket," typed a bucket name, scrolled, and clicked create. Then it uploaded me.png and an index.html; at one point it deleted the image and had to re-add it, but it recovered. Properties tab → Static website hosting → Enable. It set index.html as the index document. Then it had to deal with public access settings, which on AWS is a minefield of its own.
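
For reference, those console clicks map onto a handful of API calls. Here is a minimal boto3 sketch of the same flow; the bucket name is hypothetical, and the agent did all of this through the UI, not code:

```python
import boto3

s3 = boto3.client("s3")
bucket = "agent-demo-static-site"  # hypothetical name

# Create the bucket and upload the two files the agent worked with.
# (Outside us-east-1 you would also pass CreateBucketConfiguration.)
s3.create_bucket(Bucket=bucket)
s3.upload_file("me.png", bucket, "me.png", ExtraArgs={"ContentType": "image/png"})
s3.upload_file("index.html", bucket, "index.html", ExtraArgs={"ContentType": "text/html"})

# Equivalent of Properties → Static website hosting → Enable.
s3.put_bucket_website(
    Bucket=bucket,
    WebsiteConfiguration={"IndexDocument": {"Suffix": "index.html"}},
)
```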

This is where it cheated — it gave up on the bucket policy editor and pivoted to AWS CloudShell for the policy commands. Strictly speaking that's still in the browser (CloudShell is a browser-based shell), so I let it pass. Total time: 40 minutes. Result: working static site at the bucket URL, image and text rendering.
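
The CloudShell pivot amounted to lifting the public-access block and attaching a read policy. The agent ran CLI commands, but the same operations look roughly like this in boto3 (bucket name hypothetical, as above):

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "agent-demo-static-site"  # hypothetical

# Lift the default "Block all public access" settings on the bucket.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": False,
        "IgnorePublicAcls": False,
        "BlockPublicPolicy": False,
        "RestrictPublicBuckets": False,
    },
)

# Standard public-read policy for a static site.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```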

40 minutes is slow. But here is the lesson: I told the agent to "save the learnings for next time you are on AWS." It distilled the run into an AWS skill — what worked, what to skip, where the buttons live. Subsequent runs would be much faster: the same pattern as the parallel browser automation approach, where the first run is exploration and every later run is execution.
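
I won't reproduce the exact skill file here, but the shape is something like this (purely illustrative, reconstructed from what the agent actually hit during the run):

```
# AWS console skill -- distilled from the Level 1 run
- Static hosting lives under bucket → Properties → Static website hosting.
- "Block public access" must be disabled before a public bucket policy will save.
- The console's bucket policy editor is painful; pivot to CloudShell early.
```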

Level 2: Linux VM with Remote Desktop

This was the hardest of the three. The agent loaded the AWS skill, went to EC2, and launched an Ubuntu free-tier instance. It set up credentials and SSH access, then attempted to install a graphical remote desktop and connect through CloudShell.
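
Console-wise this is many screens; API-wise it is roughly one call. A sketch of the launch, with user data that installs a desktop and an RDP server. The AMI ID, key name, and the xrdp/XFCE choice are all assumptions; the post doesn't record which stack the agent actually picked:

```python
import boto3

ec2 = boto3.client("ec2")

# Cloud-init script: lightweight desktop plus an RDP server.
# xrdp + XFCE is an assumption; the agent may have used something else.
user_data = """#!/bin/bash
apt-get update
apt-get install -y xfce4 xrdp firefox
systemctl enable --now xrdp
"""

ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxx",        # hypothetical Ubuntu AMI; region-specific
    InstanceType="t2.micro",           # free-tier eligible
    MinCount=1,
    MaxCount=1,
    KeyName="agent-demo-key",          # hypothetical key pair for SSH
    SecurityGroupIds=["sg-xxxxxxxx"],  # must open TCP 22 (SSH) and 3389 (RDP)
    UserData=user_data,
)
```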

It got most of the way there — the VM was running, I could see the full Ubuntu desktop over the remote session (a real graphical environment, not headless), and it pointed Firefox at YouTube. The page started to load but never fully rendered, probably because it hit free-tier memory limits. I gave it a pass anyway because the orchestration was right: it built the instance, set up access, opened a browser inside the VM, and pointed it at the URL. The last 5% was a hardware constraint, not an agent failure.

Level 3: Video Upload Web App

This one it cheated on. I told it to use the browser, walked away, and came back to find it had used CloudShell for basically everything: it wrote the HTML/CSS/server in the shell and deployed it with AWS CLI commands. Total elapsed: 3-4 minutes.

The result was a working app though. Drag-and-drop video upload, a public playback page with a direct link, error logging that I had to fix once when an upload failed. I tested it from my MacBook with a 200 MB video — uploaded fine, played back from the public URL on the original Mac mini.
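
The post doesn't record how the agent wired the upload path, but one plausible S3-backed shape is a presigned POST: the browser uploads straight to the bucket, and the playback page links to the resulting object. A sketch, with all names hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; ${filename} is replaced with the uploaded file's name.
presigned = s3.generate_presigned_post(
    Bucket="agent-demo-videos",
    Key="uploads/${filename}",
    Conditions=[["content-length-range", 0, 500 * 1024 * 1024]],  # allow up to 500 MB
    ExpiresIn=3600,
)
# presigned["url"] and presigned["fields"] feed the drag-and-drop
# front end's multipart/form-data POST.
```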

So I gave it a pass on technique grounds, but the lesson is clear: if you don't constrain the agent, it will use whatever shortcut is fastest. Sometimes that is what you want, sometimes not. For browser-only testing you need to enforce the constraint at the tool level, not the prompt level.

What This Tells Us About 2026 Agents

A year ago, getting an agent to even find the right S3 button reliably was a project. Today, the agent passes the equivalent of three AWS console certifications in under an hour. The unlock isn't just bigger models — it is the combination of:

  1. Reliable low-level browser control: finding buttons, filling forms, and recovering from its own missteps (the deleted-and-re-uploaded image).
  2. Persistent skills: distilling a run into reusable notes so the exploration cost is paid once.
  3. Pragmatic tool pivots: falling back to a browser-based shell like CloudShell when the UI path is too painful.

Combined, these primitives mean an autonomous agent can now navigate a system as hostile as the AWS console with reasonable success. That is a significant capability shift.

What's Next

The constraint problem is what I want to solve next. "Use only the browser" needs to be enforceable, not just polite. I think there is a way to gate the available tools per-task — only expose browser tools, no shell, no AWS CLI — and let the agent figure things out within the box. That gets us a fairer test of pure browser capability.
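
Concretely, I mean an allowlist applied before the tool definitions ever reach the model. A minimal sketch, with hypothetical tool names:

```python
# Per-task tool gating: only allowlisted tools are ever exposed to the agent.
BROWSER_ONLY = {
    "browser_navigate",
    "browser_click",
    "browser_type",
    "browser_screenshot",
}

def gate_tools(all_tools: list[dict], allowed: set[str] = BROWSER_ONLY) -> list[dict]:
    """Drop everything outside the allowlist: no shell, no AWS CLI."""
    return [tool for tool in all_tools if tool["name"] in allowed]
```

Enforced at this level, "use only the browser" stops being a polite request in the prompt and becomes a property of the environment.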

I also have a half-built theory about how these agents are evolving — something to do with selection pressure on skills that get reused vs. discarded. I'll do a video on that soon. Same energy as the vibe-coding-as-evolution post.
