How My Claude Code Sonnet 4.6 AI Agent Navigates Chrome Autonomously
A lot of people have asked how I actually control Chrome from my Claude Code AI agent. The answer is a single browser.js file connected to Chrome's debugging port via the Chrome DevTools Protocol. Today I want to walk through exactly how that setup works, because it is simpler than most people think.
This is the foundation under almost every browser automation video on the channel — it predates Surfagent and is what Surfagent eventually grew out of.
The Architecture: Two Phases
The whole thing is two phases:
- Phase 1: Launch Chrome in debug mode. A small shell script launches Chrome with the --remote-debugging-port=9222 flag. This opens a socket that any local process can connect to and drive the browser through.
- Phase 2: Remote control via JavaScript. The agent runs browser.js commands that connect to the debugging socket and issue Chrome DevTools Protocol calls. Click, navigate, list, type, screenshot — all CDP under the hood.
That is the whole stack. No Selenium, no Playwright managing browser binaries, no cloud automation services. Just Chrome's built-in debugging port and a JS file that knows how to talk to it.
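If you want to picture the moving parts, here is a rough sketch of that stack in Node.js. It is not the actual browser.js, just the shape of it: it assumes Chrome is already running with --remote-debugging-port=9222, Node 18+ for the global fetch, and the ws package for the WebSocket.

```js
// Rough sketch, not the actual browser.js. Assumes Node 18+ (global fetch),
// `npm install ws`, and Chrome already running with --remote-debugging-port=9222.
import WebSocket from "ws";

async function navigate(url) {
  // Phase 1 gave us the debug socket; /json/list enumerates the open tabs (targets).
  const targets = await (await fetch("http://localhost:9222/json/list")).json();
  const page = targets.find((t) => t.type === "page");

  // Phase 2: open a WebSocket to that tab and speak raw CDP.
  const ws = new WebSocket(page.webSocketDebuggerUrl);
  await new Promise((resolve) => ws.once("open", resolve));

  // Every CDP call is just JSON: an id, a method, and params.
  ws.send(JSON.stringify({ id: 1, method: "Page.navigate", params: { url } }));

  // Wait for the reply with the matching id, then hang up.
  const reply = await new Promise((resolve) => {
    ws.on("message", (data) => {
      const msg = JSON.parse(data);
      if (msg.id === 1) resolve(msg);
    });
  });
  ws.close();
  return reply.result;
}

navigate("https://news.ycombinator.com").then(console.log);
```

Everything else is variations on that send-a-method, wait-for-the-matching-id pattern.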
The Commands
Inside browser.js there is a small library of commands:
- browser.js list — list all open tabs (so the agent knows which tab index is which)
- browser.js open <url> — navigate the current tab to a URL
- browser.js elements — list all clickable elements on the current page (this is the one I use most)
- browser.js click <id> — click an element by its ID from the elements list
- browser.js content — return the page's content for the agent to reason about
- browser.js screenshot — capture the current state
Each one maps to a CDP method. The TypeScript / JavaScript files in the project are mostly just glue — receive args, format the CDP request, send, return result.
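To give a feel for how thin that glue is, here is a sketch of the command-to-CDP mapping. The names are mine and the real file may differ; send(method, params) is the same send-and-wait helper from the sketch above.

```js
// Sketch of the command-to-CDP glue; names are illustrative, the real file may differ.
// send(method, params) is the send-and-wait-for-the-matching-id helper shown earlier.
const commands = {
  // list is the odd one out: the tab list comes from the HTTP endpoint, not a CDP method
  list:       async () => (await fetch("http://localhost:9222/json/list")).json(),
  open:       (url) => send("Page.navigate", { url }),
  screenshot: () => send("Page.captureScreenshot", { format: "png" }),
  content:    () =>
    send("Runtime.evaluate", {
      expression: "document.body.innerText",
      returnByValue: true,
    }),
};
```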
Demo: Hacker News in Three Commands
Live walkthrough. I prompted Claude Code (running Sonnet 4.6 — and I want to flag that Sonnet 4.6 is strong on agentic work, sometimes preferable to Opus for these kinds of tasks):
"Use the browser.js command list to go to hackernews.com."
Three commands fired in sequence:
- browser.js list — confirms one tab is open
- browser.js open https://news.ycombinator.com — navigates that tab
- browser.js list again — confirms tab 0 is now on news.ycombinator.com
Then I asked: "Click on the first post." Claude Code reasoned through it: take a screenshot, get the page content, find the first post link, click element zero. Three commands again — screenshot, content, click 0. Done, with the browser now sitting on "Garment N Notional Language" on Hacker News.
Why This Beats Virtual Mouse Approaches
A lot of agent frameworks try to control the browser by simulating mouse movements and pixel-level clicks. That works, but it is slow and fragile. My approach goes through the DOM directly:
- Faster — no animations, no settle time, no waiting for the cursor to move
- More accurate — clicks the actual element by selector, not pixel coordinates that can shift with viewport size
- Adapts to any page — the elements command always returns whatever is currently visible and clickable
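To make that concrete, here is one plausible way an elements/click pair can sit on top of Runtime.evaluate: enumerate the clickable nodes once, then click by index rather than by coordinates. Again, a sketch rather than the published file.

```js
// One plausible elements/click pair built on Runtime.evaluate (sketch, not the published file).
// send(method, params) is the CDP helper from earlier.
const CLICKABLE = 'a, button, input, select, textarea, [role="button"], [onclick]';

async function elements() {
  // Enumerate visible clickable nodes and return stable indexes the agent can refer back to.
  const { result } = await send("Runtime.evaluate", {
    returnByValue: true,
    expression: `
      Array.from(document.querySelectorAll('${CLICKABLE}'))
        .filter((el) => el.offsetParent !== null)
        .map((el, i) => ({
          id: i,
          tag: el.tagName,
          text: (el.innerText || el.value || "").trim().slice(0, 80),
        }))
    `,
  });
  return result.value;
}

async function click(id) {
  // Click the element itself, not a pixel position, so viewport size and animations never matter.
  await send("Runtime.evaluate", {
    expression: `
      Array.from(document.querySelectorAll('${CLICKABLE}'))
        .filter((el) => el.offsetParent !== null)[${id}].click()
    `,
  });
}
```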
This DOM-first approach is what makes the parallel browser automation pattern feasible — see my parallel browser automation post for what happens when you scale this to multiple sub-agents.
Combining With Other Skills
The browser commands are primitives. They get composed into higher-level skills. For example, my X skill knows about post composition, scheduling, draft saving — but under the hood, every action it takes calls one of the browser.js primitives.
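The skill itself is a set of instructions Claude follows rather than a script, but the work it drives bottoms out in the same primitives. Here is a rough sketch of that stacking, assuming browser.js is invoked with node and that elements prints JSON; the helper names are mine.

```js
// Sketch of the stacking idea: a higher-level step composed from browser.js primitives.
// The real skill is instructions Claude follows, not a script; this assumes browser.js
// is invoked with node and that `elements` prints JSON.
import { execFileSync } from "node:child_process";

// Run one primitive exactly the way the agent does from bash.
const browser = (...args) =>
  execFileSync("node", ["browser.js", ...args], { encoding: "utf8" });

function openFirstResult(url) {
  browser("open", url);                              // navigate the current tab
  const clickable = JSON.parse(browser("elements")); // enumerate clickable elements
  browser("click", String(clickable[0].id));         // click the first one by id
  return browser("screenshot");                      // capture the result to confirm
}
```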
To demo this in the video I prompted: "Use the X skill to compose a draft of 'hello YouTube, this is my skillsmd.store page'." Claude went straight to compose/post, used the X skill flow (which is more efficient than having Claude figure out X from scratch every time), pasted the draft text, took a screenshot to confirm. End to end in a few seconds.
That stacking — primitives + skills — is why the agent gets fast over time. The first time Claude does something new, it figures out the page from scratch using the primitives. The second time, the saved skill makes it nearly instant.
Why I'm Showing This
The reason I keep getting questions about this is that "control your browser from an AI agent" sounds magical, but the actual implementation is small. If you want to set this up yourself:
- Launch Chrome with --remote-debugging-port=9222
- Write a small JS file that connects to localhost:9222 via CDP
- Expose a few commands (list, open, click, content, elements, screenshot)
- Have Claude Code call those commands as bash from inside its skills
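The last two steps are less code than they sound; the command surface is basically an argv switch in front of the CDP helpers. A sketch of that shape, reusing the commands table from earlier:

```js
// Sketch of the CLI entry point that lets Claude Code call everything as bash.
// `commands` is the command-to-CDP table sketched earlier.
const [, , cmd, ...args] = process.argv;

if (!commands[cmd]) {
  console.error(`unknown command: ${cmd}`);
  process.exit(1);
}

// Print JSON so the agent can read the result straight off stdout.
commands[cmd](...args)
  .then((result) => console.log(JSON.stringify(result, null, 2)))
  .catch((err) => {
    console.error(err.message);
    process.exit(1);
  });
```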
That's it. The whole setup is maybe 200 lines of JavaScript. No magic.
If enough people are interested, I'll publish the full browser.js with all commands at skillsmd.store so you don't have to write your own. Let me know in the YouTube comments.
Resources
- Skills MD store — agent setups and skills
- My GitHub — other repos and code samples