Nvidia Nemotron 3 Nano Omni: First Test and Impression
Nvidia just dropped another entry in their open-source Nemotron series, and this one finally goes properly multimodal. The Nemotron 3 Nano Omni is a 30B mixture-of-experts model with around 3B active parameters, and it can ingest images, audio, video, and PDFs out of the box — not just text.
I spent some time today building a quick "drop anything in, get text out" app to put it through its paces, and then took it over to opencode for a tool-calling test. Here is what I found.
What is Nemotron 3 Nano Omni?
Nemotron 3 Nano Omni is part of Nvidia's open-source Nemotron family. The "Nano" line is built to be runnable on your own hardware if you have the GPU for it, while still being genuinely capable. The "Omni" variant is the multimodal version — same base model, but with native support for visual, audio, and document inputs alongside text.
The architecture is a 30B MoE with roughly 3B active parameters per token, which keeps inference cost closer to that of a small dense model while still giving you the capacity of the full 30B parameter set. It also has a reasoning mode with an adjustable thinking budget, similar to what we have seen in other recent large language models.
For this test, I ran it via Nvidia's hosted API endpoint, but the same model weights are available on Hugging Face, so you can self-host if you have the hardware.
Building a Multimodal Drop-in App
I wanted something simple: a single React Vite app where I can drag in any file — image, audio clip, video, or PDF — and the model spits text back. I built it out using Claude Code in a few minutes. Nothing fancy, just a clean interface, the model name set to nemotron-3-nano-omni-reasoning-30b in the env, and a base URL pointing at Nvidia's API.
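Under the hood it is just an OpenAI-style chat completions call. Here is a minimal sketch of the client side: the env variable names are my own, the base URL is the usual NVIDIA API Catalog endpoint (double-check it against the model card), and the model id is the one from my .env.

```typescript
// Minimal chat call against an OpenAI-compatible endpoint (Node-side sketch).
// NVIDIA_BASE_URL / NVIDIA_API_KEY / MODEL_NAME are my own variable names.
const BASE_URL = process.env.NVIDIA_BASE_URL ?? "https://integrate.api.nvidia.com/v1";
const MODEL = process.env.MODEL_NAME ?? "nemotron-3-nano-omni-reasoning-30b";

export async function chat(messages: unknown[]): Promise<string> {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
    },
    body: JSON.stringify({ model: MODEL, messages }),
  });
  if (!res.ok) throw new Error(`API error: ${res.status} ${await res.text()}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```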
This kind of "everything in, text out" flow is incredibly useful inside an agent pipeline. If you are building autonomous AI agents and need them to handle arbitrary file inputs, having a single multimodal model do the heavy lifting beats wiring up separate Whisper, OCR, and vision models.
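To make that concrete, here is roughly the shape of the drop handler: take whatever file landed, turn it into a content part, and send a single message through the chat() helper above. The image_url and input_audio shapes follow the common OpenAI-style multimodal schema; whether Nvidia's endpoint wants audio in exactly this form is an assumption on my part, so treat this as a sketch rather than gospel.

```typescript
// Sketch: turn a dropped File into a multimodal content part for one chat message.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }
  | { type: "input_audio"; input_audio: { data: string; format: string } };

function toDataUrl(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}

async function fileToPart(file: File): Promise<ContentPart> {
  const dataUrl = await toDataUrl(file);
  if (file.type.startsWith("image/")) {
    // Images go in as base64 data URLs.
    return { type: "image_url", image_url: { url: dataUrl } };
  }
  if (file.type.startsWith("audio/")) {
    // Audio goes in as raw base64 plus a format hint ("audio/mpeg" -> "mp3").
    const base64 = dataUrl.split(",")[1];
    const format = file.type === "audio/mpeg" ? "mp3" : file.type.split("/")[1];
    return { type: "input_audio", input_audio: { data: base64, format } };
  }
  // Video and PDF need their own paths (see the PDF sketch further down).
  return { type: "text", text: `Unsupported file type in this sketch: ${file.type}` };
}

// Usage with the chat() helper above:
// const reply = await chat([
//   { role: "user", content: [await fileToPart(file), { type: "text", text: "Describe this file." }] },
// ]);
```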
Testing the Modalities
I ran the model through every input type the Omni supports.
Images
First test: a cyberpunk-style digital illustration with no text. The model returned a detailed, atmospheric description — colors, mood, composition, the works. Then I dropped in a slide from the Nemotron release deck, which mostly contained text. It pulled every line cleanly, including small details like "Available today, April 28th" and the Nvidia logo position. Very similar quality to what we have seen from GPT-4 Vision, but running on a much smaller open-source model.
Audio
Next, a short MP3 clip. Transcription came back fast and accurate — picked up the speaker discussing a Polish charity for children with cancer, including names and context. On cloud inference this was effectively instant; on local hardware your mileage will depend on your GPU.
PDFs
This was the most impressive of the four. I dropped a 35-page PDF in and the model started OCR'ing page by page, extracting text at speed. The integration could be cleaner — there was a bit of UI flicker — but the underlying speed and accuracy were solid.
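If you want the same page-by-page behaviour, the simplest route is to rasterize each page in the browser and feed the pages to the model as images. Below is a minimal sketch using pdfjs-dist; that is my choice of library for illustration, not necessarily what the app actually uses under the hood.

```typescript
// Rasterize each PDF page to a PNG data URL so the pages can be sent to the
// model as image_url parts. Uses pdfjs-dist; a real app also needs to point
// GlobalWorkerOptions.workerSrc at the pdf.js worker bundle.
import * as pdfjsLib from "pdfjs-dist";

async function pdfToPageImages(data: ArrayBuffer): Promise<string[]> {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const viewport = page.getViewport({ scale: 2 }); // 2x scale keeps small text readable
    const canvas = document.createElement("canvas");
    canvas.width = viewport.width;
    canvas.height = viewport.height;
    const ctx = canvas.getContext("2d");
    if (!ctx) throw new Error("Could not create 2D canvas context");
    await page.render({ canvasContext: ctx, viewport }).promise;
    pages.push(canvas.toDataURL("image/png"));
  }
  return pages;
}
```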
Video
Finally, a short MP4 of a girl skating in a skate park at dusk. The model handled the visual frames and the background audio together: it described the wide shot, her clothing, the trick she lands, the camera movement, and the energetic music. It is doing frame analysis and audio transcription as one combined pass, which is the kind of thing that used to take a multi-stage pipeline to pull off.
Reasoning Mode
Nemotron Nano Omni has a built-in reasoning mode with an adjustable thinking budget — you can dial up how many tokens it spends thinking before answering. I asked it to explain quantum computing to a five-year-old, with reasoning shown.
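Request-wise, the only change from a plain call is the thinking budget. I am assuming here that the Omni endpoint takes a max_thinking_tokens field the way other Nemotron reasoning endpoints do; the exact parameter name may differ, so check the model card before copying this. BASE_URL and MODEL are the constants from the earlier snippet.

```typescript
// Reasoning-mode sketch: same chat completions call, plus a cap on thinking tokens.
// max_thinking_tokens is an assumption borrowed from other Nemotron reasoning models.
const res = await fetch(`${BASE_URL}/chat/completions`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
  },
  body: JSON.stringify({
    model: MODEL,
    messages: [{ role: "user", content: "Explain quantum computing to a five-year-old." }],
    max_thinking_tokens: 4096, // budget for the reasoning phase before the final answer
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);
```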
It spent around 3,000 tokens reasoning, then produced: "A regular computer uses lights that are either on or off, but a quantum computer uses magic lights that can be both on and off at the same time. So it can try many possibilities together and when you look it picks the one answer."
That is a Schrödinger's cat metaphor delivered cleanly. Not bad for a 30B open model. I also tried my usual trick question — "It's a nice day, should I drive or walk to the car wash to wash the car?" — and as with every other reasoning model I have tested (Opus, GPT-5, etc.), it missed the obvious contradiction. Still an open problem.
Tool Calling in opencode
The last thing I wanted to verify was tool calling. I added Nemotron Nano Omni to my opencode config and asked it to build a single-file HTML page that calls OpenAI's GPT-image generation model and renders the result. Documentation and API key were provided in context.
It one-shotted it. Wrote the HTML, opened it in Chrome, accepted a prompt, and rendered the returned image cleanly. I tried a couple of different prompts (Jinx and Shaco from League of Legends in TCG card style) and the second result was actually nicely formatted. The point was to verify that function calling and tool use work reliably on this model — they do.
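If you want to sanity-check function calling without pulling in opencode, a bare chat completions request with a tools array is enough. The get_weather tool below is a placeholder I made up purely for the check, and BASE_URL and MODEL are again the constants from the first snippet.

```typescript
// Minimal tool-calling check: define one tool and see whether the model
// returns a tool_calls entry instead of a plain text answer.
const res = await fetch(`${BASE_URL}/chat/completions`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
  },
  body: JSON.stringify({
    model: MODEL,
    messages: [{ role: "user", content: "What's the weather like in Warsaw right now?" }],
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather", // placeholder tool, not something the model ships with
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
  }),
});
const data = await res.json();
// A working model should answer with a get_weather call carrying {"city": "Warsaw"}.
console.log(data.choices[0].message.tool_calls);
```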
My Take
Nemotron 3 Nano Omni feels like the strongest open-source multimodal model I have tested in a while, especially given the size. The "drop any file, get text" pattern is dead simple to build but enormously useful as a primitive. If you are working on agentic workflows, this kind of model belongs in your toolkit.
For local inference, if you have the hardware to run a 30B MoE, this is worth pulling down and trying. For everyone else, Nvidia's hosted API is the easy path.
Resources
- Nvidia Nemotron 3 Nano Omni on Hugging Face — model weights and inference instructions
- Surfagent — my browser automation agent project
- My GitHub — repos and code samples