UPDATE: AI Is Now Closer Than Ever to Automating Content Creation
I do this check-in roughly every six months — how good are AI models at automating short-form video content? The answer this time is meaningfully better than last time. The pipeline I have been running on my secondary channel (which just hit 20K subs and put up a clip with 8.9M views) is now stitched together cleanly enough that one prompt produces three publish-ready short-form clips from a one-hour podcast.
Today I want to walk through the pipeline end to end, then show four real examples — a podcast clip, a viral reaction video, an interview with multi-person face tracking, and a more traditional react-to-clip flow.
Watch the video:
The Pipeline
The whole flow is eight stages, fully automated:
- Source video in — your own video, a podcast, a stream, anything. URL or local file.
- Extract audio — FFmpeg pulls just the audio track, so transcription works on a small audio file instead of the full video. Saves a lot of time later.
- Transcribe with timestamps — local Whisper model on the Mac. Timestamps are critical because the moment-selection step needs them to map its picks back to the source video. (Stages 2 and 3 are sketched right after this list.)
- Pick viral moments — Opus 4.7 reads the timestamped transcript and picks the candidates. This is where the model does the most thinking.
- Face detection (YOLO) — finds every face in every frame. Necessary so we know what to keep in the reframe.
- Active speaker detection (Light ASD) — figures out who is currently speaking. This is what makes the multi-person podcast clips work — the camera follows whoever is talking.
- Reframe — turn the source 16:9 into vertical short-form, following the active speaker.
- Retention editing — Remotion (code-based, scriptable) handles captions, zoom punches, flash transitions, meme sound effects, optional background music.
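To make stages 2 and 3 concrete, here is a minimal sketch of the audio-extraction and transcription steps. It assumes the open-source openai-whisper package and an ffmpeg binary on the PATH; the file names and model size are placeholders, not the exact pipeline code.

```python
import json
import subprocess

import whisper  # pip install openai-whisper

SOURCE = "podcast.mp4"   # placeholder input video
AUDIO = "podcast.wav"    # extracted audio track

# Stage 2: pull a 16 kHz mono WAV so Whisper never has to touch the video file.
subprocess.run(
    ["ffmpeg", "-y", "-i", SOURCE, "-vn", "-ac", "1", "-ar", "16000", AUDIO],
    check=True,
)

# Stage 3: transcribe locally with segment-level timestamps.
model = whisper.load_model("medium")  # model size is a placeholder
result = model.transcribe(AUDIO, word_timestamps=True)

# Keep start/end/text per segment; the moment-selection prompt consumes this file.
segments = [
    {"start": round(s["start"], 2), "end": round(s["end"], 2), "text": s["text"].strip()}
    for s in result["segments"]
]
with open("transcript.json", "w") as f:
    json.dump(segments, f, indent=2)
```

The timestamped segments are the whole point: the model in stage 4 can only name a clip window if it knows where every sentence sits in the source.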
The whole pipeline is driven by a Claude Code skill — same general pattern as my 3-part AI agent automation framework: skill + headless Claude + tools. Drop a URL in, walk away, come back to three finished MP4s.
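For reference, the skill-plus-headless pattern looks roughly like this. The skill name, path, and step list below are hypothetical, written to match Claude Code's convention of a SKILL.md file with YAML frontmatter:

```markdown
<!-- ~/.claude/skills/clip-factory/SKILL.md (hypothetical name and contents) -->
---
name: clip-factory
description: Turn a long-form video URL or local file into three vertical short-form clips.
---

1. Download or locate the source video.
2. Extract the audio with FFmpeg and transcribe it locally with Whisper (timestamped).
3. Pick the three most viral 30-60 second windows from the transcript.
4. Run YOLO face detection, Light ASD active speaker detection, and the vertical reframe on each window.
5. Render captions, zoom punches, transitions, and sound effects with Remotion.
6. Write the finished MP4s to ./output/.
```

The headless part is then a single non-interactive call along the lines of `claude -p "I have a new video, make three clips: <url>"`, which is what makes the drop-a-URL-and-walk-away workflow possible.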
Demo 1: Diary of a CEO Podcast
I gave it a Diary of a CEO episode (89 minutes). The prompt: "I have a new video assigned, make three clips." It pulled the URL, kicked off the pipeline, and Opus started reading the transcript hunting for moments. About 10 minutes later I had three MP4s.
The third clip was a male fertility one — a doctor showing three vials representing fertility trajectory across decades. The clip itself was structurally fine but the framing got slightly thrown off because the focus subject was the vials in the doctor's hand rather than a face. So the active-speaker detection sometimes fights with non-face focal points. Worth knowing — the pipeline is great for talking-head content, slightly weaker for prop-heavy content.
Demo 2: Automated Upload via Surfagent
This is what I wanted to show. After clip generation, I prompted "upload clip 3, pick a good title, set it to private." Surfagent took over: opened YouTube Studio (already logged in via the persistent Chrome session), uploaded the file, picked a title ("A doctor just exposed what's happening to male fertility"), set visibility to private, hit save. End-to-end in under a minute.
API uploads via the YouTube Data API are also possible, but Surfagent's logged-in-browser approach skips OAuth setup entirely. Useful if you have a dedicated Mac mini that just exists to be your "agent's browser."
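For anyone who does want the API route, here is a minimal sketch of the same upload using google-api-python-client against the YouTube Data API. It assumes you have already finished the OAuth flow and hold a credentials object; the file path and title are placeholders.

```python
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload


def upload_private_clip(credentials, path: str, title: str) -> str:
    """Upload a clip as private via the YouTube Data API and return its video ID."""
    youtube = build("youtube", "v3", credentials=credentials)
    request = youtube.videos().insert(
        part="snippet,status",
        body={
            "snippet": {"title": title, "categoryId": "22"},  # 22 = People & Blogs
            "status": {"privacyStatus": "private"},
        },
        media_body=MediaFileUpload(path, chunksize=-1, resumable=True),
    )
    return request.execute()["id"]


# Usage (credentials come from your own OAuth flow):
# video_id = upload_private_clip(creds, "clip_3.mp4",
#                                "A doctor just exposed what's happening to male fertility")
```

The trade-off is exactly the one above: the API is cleaner to script and schedule, but it means an OAuth consent screen and quota management, which the logged-in-browser approach sidesteps.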
Demo 3: Charlie Moist Reddit Mod Reaction
Different style. I cleared context, gave it a Charlie Moist viral reaction video, ran the same pipeline. The number-one clip was excellent — the head tracking was clean throughout, the captions punched at the right moments, the cuts landed. One full minute of clip with consistent speaker focus. This pipeline handles single-presenter reaction content very well.
Demo 4: Multi-Person Interview Switching
The hardest test. I ran an interview clip with two speakers exchanging quickly. The combination of YOLO + Light ASD nailed the switching — when speaker A talks, the frame centers on A; when B answers, it switches to B. The transitions were tight enough that a casual viewer wouldn't notice the automated reframing.
I also tested a more traditional financial-advice-style interview ("$10 coffee compounded over 40 years") and the framing held throughout.
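The switching logic itself is conceptually simple once you have per-frame face boxes from YOLO and per-face speaking scores from Light ASD. Here is a minimal sketch of the reframe decision, with made-up data structures and a basic smoothing rule; the real pipeline's implementation will differ, but the idea is the same.

```python
from dataclasses import dataclass


@dataclass
class Face:
    cx: float        # face center x in source pixels (from YOLO)
    cy: float        # face center y
    speaking: float  # active-speaker score (from Light ASD); higher = more likely talking


def crop_window(faces: list[Face], prev_cx: float | None,
                src_w: int = 1920, src_h: int = 1080,
                smooth: float = 0.85) -> tuple[int, int, int, int]:
    """Return an (x, y, w, h) 9:16 crop centered on the active speaker."""
    crop_h = src_h
    crop_w = int(crop_h * 9 / 16)  # ~608 px wide for a 1080p source

    # Follow whoever is speaking; fall back to frame center if no face is detected.
    target_cx = max(faces, key=lambda f: f.speaking).cx if faces else src_w / 2

    # Exponential smoothing so a single-frame misdetection doesn't yank the camera.
    if prev_cx is not None:
        target_cx = smooth * prev_cx + (1 - smooth) * target_cx

    x = int(min(max(target_cx - crop_w / 2, 0), src_w - crop_w))
    return x, 0, crop_w, crop_h
```

A production version also wants a hold time, only cutting to the other speaker after they have been talking for a few hundred milliseconds, so back-channel noises and crosstalk don't cause rapid flip-flopping.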
Where We Are vs. 6 Months Ago
The combination that gets us here is:
- Whisper local — fast enough on M-series Macs that transcription stops being a bottleneck
- YOLO + Light ASD — the active-speaker piece is the unlock, not the face detection alone
- Remotion — code-based video editing that an LLM can drive directly
- Opus 4.7 for moment selection — strong enough at long-context comprehension to pick out the actually viral 30-second windows from an hour of conversation
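The moment-selection step is essentially long-context reading plus a constrained output format. A minimal sketch using the Anthropic Python SDK; the model ID and prompt wording are placeholders, not the production skill's prompt:

```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("transcript.json") as f:
    segments = json.load(f)  # [{"start": ..., "end": ..., "text": ...}, ...]

prompt = (
    "Below is a timestamped podcast transcript. Pick the 3 windows of roughly "
    "30-60 seconds with the highest short-form viral potential: a strong hook, "
    "a self-contained idea, and a surprising or emotional payoff. "
    'Reply as JSON only: [{"start": seconds, "end": seconds, "reason": "..."}].\n\n'
    + "\n".join(f"[{s['start']:.0f}-{s['end']:.0f}] {s['text']}" for s in segments)
)

response = client.messages.create(
    model="claude-opus-4-1",  # placeholder model ID; use whichever Opus build you run
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
moments = json.loads(response.content[0].text)
```

Each returned window then flows into face detection, the reframe, and the Remotion render.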
Stack that up and the pipeline is now reliable enough that I would trust it with a real content workflow. Six months ago there were too many manual fix-up steps. Now there are maybe one or two — and I have not even integrated retention-tested thumbnail generation yet, which is the obvious next step.
This is the same automation philosophy as my Claude Code passive income setup — small loops, each costing essentially zero on the Claude Max plan, each producing real output.
What's Next
I have not actually been posting these clips. That is the next step. I want to see what the publish-test results look like — does the pipeline-generated content perform comparably to manually edited clips? My guess is "yes, with a 10-20% gap that closes as I tune the moment-selection prompt." But that is hypothesis until I run it.
If you want a follow-up post on the publish results, leave a comment on the video. I'll do another check-in in 4-6 weeks once I have data.
Resources
- Hostinger n8n self-hosting — sponsor of this video. Use code ALLABOUTAI for 10% off yearly plans on top of their existing 70% off.
- Surfagent — my browser automation tool, used for the YouTube upload step.
- My GitHub — repos and code samples.
FAQ
How can AI fully automate short-form video content?
An eight-stage pipeline (audio extract → Whisper transcription → Opus moment-pick → YOLO face detection → Light ASD active-speaker detection → vertical reframe → Remotion editing → upload) produces three publish-ready clips from a one-hour podcast in about 10 minutes.
What is Light ASD in video editing?
Light ASD (a lightweight active speaker detection model) identifies which face in a multi-person video is currently speaking, enabling automated reframing that follows the speaker across cuts.
Can AI pick viral moments from a podcast automatically?
Yes — feeding the timestamped Whisper transcript to Claude Opus 4.7 and asking it to identify the most-engaging 30-second windows produces clip candidates that match what human editors would choose, with some misses on prop-heavy content.
How do you upload short-form clips automatically?
Use a browser automation tool like Surfagent driven by Claude Code to navigate YouTube Studio in your already-logged-in Chrome session — no OAuth setup required, just a real browser tab.