Nemotron 3 Nano Omni: Open Multimodal AI Model Released

ai-technology · 2026-04-28

NVIDIA has released Nemotron 3 Nano Omni, a multimodal AI model that natively processes text, images, video, and audio. Built on the Nemotron 3 Nano 30B-A3B backbone, it improves accuracy over its predecessor, Nemotron Nano V2 VL, across all modalities, with leading results in document understanding, long audio-video comprehension, and agentic computer use. The model uses multimodal token-reduction techniques to lower inference latency and raise throughput. Checkpoints in BF16, FP8, and FP4 formats are released openly, along with portions of the training data and code, to support further research.

Key facts

Nemotron 3 Nano Omni is the first Nemotron model to natively support audio inputs.
It is built on the Nemotron 3 Nano 30B-A3B backbone.
The model delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities.
It achieves leading results in real-world document understanding, long audio-video comprehension, and agentic computer use.
Multimodal token-reduction techniques reduce inference latency and increase throughput.
Model checkpoints are released in BF16, FP8, and FP4 formats.
Portions of the training data and codebase are also released.
The model supports text, images, video, and audio inputs.
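
For a rough sense of what the three checkpoint formats mean in practice, the sketch below estimates weight storage at each precision. It assumes the "30B" in 30B-A3B denotes roughly 30 billion total parameters, and it counts weights only: quantization scale factors, any additional encoder parameters, activations, and the KV cache are ignored, so real checkpoint sizes will differ. The function and constant names are illustrative and do not come from the release.

```python
# Bytes per parameter for each released checkpoint format (weights only).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

# Assumption: "30B" in 30B-A3B is read as ~30 billion total parameters.
TOTAL_PARAMS = 30e9


def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores quantization scale factors, metadata, and runtime memory.
    """
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_footprint_gb(TOTAL_PARAMS, fmt):.0f} GB")
# BF16: ~60 GB
# FP8: ~30 GB
# FP4: ~15 GB
```

Under these assumptions, each step down in precision halves the weight footprint, which is the main reason lower-precision checkpoints fit on smaller accelerators.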

Entities

Institutions

NVIDIA
arXiv
Hugging Face
Megatron-LM
Transformer Engine
Megatron Energon
NeMo-RL
NeMo Gym
NeMo Data Designer
MediaPerf
MMLongBench-Doc
OCRBenchV2
WorldSense
DailyOmni
VoiceBench

Sources

arXiv cs.AI — 2026-04-29
Hugging Face Blog — 2026-04-28