How My Claude Code Sonnet 4.6 AI Agent Navigates Chrome Autonomously
A lot of people have asked how I actually control Chrome from my Claude Code AI agent. The answer is a single browser.js file connected to Chrome's debugging port via the Chrome DevTools Protocol. Today I want to walk through exactly how that setup works, because it is simpler than most people think.
This is the foundation under almost every browser automation video on the channel — it predates Surfagent and is what Surfagent eventually grew out of.
The Architecture: Two Phases
The whole thing is two phases:
- Phase 1: Launch Chrome in debug mode. A small shell script launches Chrome with the --remote-debugging-port=9222 flag. This opens a socket that any local process can connect to and drive the browser through.
- Phase 2: Remote control via JavaScript. The agent runs browser.js commands that connect to the debugging socket and issue Chrome DevTools Protocol calls. Click, navigate, list, type, screenshot — all CDP under the hood.
That is the whole stack. No Selenium, no Playwright managing browser binaries, no cloud automation services. Just Chrome's built-in debugging port and a JS file that knows how to talk to it.
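If you want to picture the moving parts, here is a rough sketch of that stack in Node.js. It is not the actual browser.js, just the shape of it: it assumes Chrome is already running with --remote-debugging-port=9222, Node 18+ for the global fetch, and the ws package for the WebSocket.

```js
// Rough sketch, not the actual browser.js. Assumes Node 18+ (global fetch),
// `npm install ws`, and Chrome already running with --remote-debugging-port=9222.
import WebSocket from "ws";

async function navigate(url) {
  // Phase 1 gave us the debug socket; /json/list enumerates the open tabs (targets).
  const targets = await (await fetch("http://localhost:9222/json/list")).json();
  const page = targets.find((t) => t.type === "page");

  // Phase 2: open a WebSocket to that tab and speak raw CDP.
  const ws = new WebSocket(page.webSocketDebuggerUrl);
  await new Promise((resolve) => ws.once("open", resolve));

  // Every CDP call is just JSON: an id, a method, and params.
  ws.send(JSON.stringify({ id: 1, method: "Page.navigate", params: { url } }));

  // Wait for the reply with the matching id, then hang up.
  const reply = await new Promise((resolve) => {
    ws.on("message", (data) => {
      const msg = JSON.parse(data);
      if (msg.id === 1) resolve(msg);
    });
  });
  ws.close();
  return reply.result;
}

navigate("https://news.ycombinator.com").then(console.log);
```

Everything else is variations on that send-a-method, wait-for-the-matching-id pattern.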
The Commands
Inside browser.js there is a small library of commands:
- browser.js list — list all open tabs (so the agent knows which tab index is which)
- browser.js open <url> — navigate the current tab to a URL
- browser.js elements — list all clickable elements on the current page (this is the one I use most)
- browser.js click <id> — click an element by its ID from the elements list
- browser.js content — return the page's content for the agent to reason about
- browser.js screenshot — capture the current state
Each one maps to a CDP method. The TypeScript / JavaScript files in the project are mostly just glue — receive args, format the CDP request, send, return result.
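To give a feel for how thin that glue is, here is a sketch of the command-to-CDP mapping. The names are mine and the real file may differ; send(method, params) is the same send-and-wait helper from the sketch above.

```js
// Sketch of the command-to-CDP glue; names are illustrative, the real file may differ.
// send(method, params) is the send-and-wait-for-the-matching-id helper shown earlier.
const commands = {
  // list is the odd one out: the tab list comes from the HTTP endpoint, not a CDP method
  list:       async () => (await fetch("http://localhost:9222/json/list")).json(),
  open:       (url) => send("Page.navigate", { url }),
  screenshot: () => send("Page.captureScreenshot", { format: "png" }),
  content:    () =>
    send("Runtime.evaluate", {
      expression: "document.body.innerText",
      returnByValue: true,
    }),
};
```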
Demo: Hacker News in Three Commands
Live walkthrough. I prompted Claude Code (running Sonnet 4.6 — and I want to flag that Sonnet 4.6 is strong on agentic work, sometimes preferable to Opus for these kinds of tasks):
"Use the browser.js command list to go to hackernews.com."
Three commands fired in sequence:
- browser.js list — confirms one tab is open
- browser.js open https://news.ycombinator.com — navigates that tab
- browser.js list again — confirms tab 0 is now on news.ycombinator.com
Then I asked: "Click on the first post." Claude Code reasoned through it: take a screenshot, get the page content, find the first post link, click element zero. Three commands again — screenshot, content, click 0. Done, with the browser now sitting on "Garment N Notional Language" on Hacker News.
Why This Beats Virtual Mouse Approaches
A lot of agent frameworks try to control the browser by simulating mouse movements and pixel-level clicks. That works, but it is slow and fragile. My approach goes through the DOM directly:
- Faster — no animations, no settle time, no waiting for the cursor to move
- More accurate — clicks the actual element by selector, not pixel coordinates that can shift with viewport size
- Adapts to any page — the elements command always returns whatever is currently visible and clickable
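To make that concrete, here is one plausible way an elements/click pair can sit on top of Runtime.evaluate: enumerate the clickable nodes once, then click by index rather than by coordinates. Again, a sketch rather than the published file.

```js
// One plausible elements/click pair built on Runtime.evaluate (sketch, not the published file).
// send(method, params) is the CDP helper from earlier.
const CLICKABLE = 'a, button, input, select, textarea, [role="button"], [onclick]';

async function elements() {
  // Enumerate visible clickable nodes and return stable indexes the agent can refer back to.
  const { result } = await send("Runtime.evaluate", {
    returnByValue: true,
    expression: `
      Array.from(document.querySelectorAll('${CLICKABLE}'))
        .filter((el) => el.offsetParent !== null)
        .map((el, i) => ({
          id: i,
          tag: el.tagName,
          text: (el.innerText || el.value || "").trim().slice(0, 80),
        }))
    `,
  });
  return result.value;
}

async function click(id) {
  // Click the element itself, not a pixel position, so viewport size and animations never matter.
  await send("Runtime.evaluate", {
    expression: `
      Array.from(document.querySelectorAll('${CLICKABLE}'))
        .filter((el) => el.offsetParent !== null)[${id}].click()
    `,
  });
}
```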
This DOM-first approach is what makes the parallel browser automation pattern feasible — see my parallel browser automation post for what happens when you scale this to multiple sub-agents.
Combining With Other Skills
The browser commands are primitives. They get composed into higher-level skills. For example, my X skill knows about post composition, scheduling, draft saving — but under the hood, every action it takes calls one of the browser.js primitives.
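The skill itself is a set of instructions Claude follows rather than a script, but the work it drives bottoms out in the same primitives. Here is a rough sketch of that stacking, assuming browser.js is invoked with node and that elements prints JSON; the helper names are mine.

```js
// Sketch of the stacking idea: a higher-level step composed from browser.js primitives.
// The real skill is instructions Claude follows, not a script; this assumes browser.js
// is invoked with node and that `elements` prints JSON.
import { execFileSync } from "node:child_process";

// Run one primitive exactly the way the agent does from bash.
const browser = (...args) =>
  execFileSync("node", ["browser.js", ...args], { encoding: "utf8" });

function openFirstResult(url) {
  browser("open", url);                              // navigate the current tab
  const clickable = JSON.parse(browser("elements")); // enumerate clickable elements
  browser("click", String(clickable[0].id));         // click the first one by id
  return browser("screenshot");                      // capture the result to confirm
}
```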
To demo this in the video I prompted: "Use the X skill to compose a draft of 'hello YouTube, this is my skillsmd.store page'." Claude went straight to compose/post, used the X skill flow (which is more efficient than having Claude figure out X from scratch every time), pasted the draft text, took a screenshot to confirm. End to end in a few seconds.
That stacking — primitives + skills — is why the agent gets fast over time. The first time Claude does something new, it figures out the page from scratch using the primitives. The second time, the saved skill makes it nearly instant.
Why I'm Showing This
The reason I keep getting questions about this is that "control your browser from an AI agent" sounds magical, but the actual implementation is small. If you want to set this up yourself:
- Launch Chrome with --remote-debugging-port=9222
- Write a small JS file that connects to localhost:9222 via CDP
- Expose a few commands (list, open, click, content, elements, screenshot)
- Have Claude Code call those commands as bash from inside its skills
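The last two steps are less code than they sound; the command surface is basically an argv switch in front of the CDP helpers. A sketch of that shape, reusing the commands table from earlier:

```js
// Sketch of the CLI entry point that lets Claude Code call everything as bash.
// `commands` is the command-to-CDP table sketched earlier.
const [, , cmd, ...args] = process.argv;

if (!commands[cmd]) {
  console.error(`unknown command: ${cmd}`);
  process.exit(1);
}

// Print JSON so the agent can read the result straight off stdout.
commands[cmd](...args)
  .then((result) => console.log(JSON.stringify(result, null, 2)))
  .catch((err) => {
    console.error(err.message);
    process.exit(1);
  });
```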
That's it. The whole setup is maybe 200 lines of JavaScript. No magic.
If enough people are interested, I'll publish the full browser.js with all commands at skillsmd.store so you don't have to write your own. Let me know in the YouTube comments.
Resources
- Skills MD store — agent setups and skills
- My GitHub — other repos and code samples