Nemotron 3 Nano Omni: Open Multimodal AI Model Released
NVIDIA has released Nemotron 3 Nano Omni, a multimodal AI model that natively processes text, images, video, and audio. Built on the Nemotron 3 Nano 30B-A3B backbone, it improves accuracy over its predecessor, Nemotron Nano V2 VL, across all modalities, with leading results in document understanding, long audio-video comprehension, and agentic computer use. The model uses multimodal token-reduction techniques to lower inference latency and raise throughput. Checkpoints in BF16, FP8, and FP4 formats are openly released, along with portions of the training data and code, to support further research.
Key facts
- Nemotron 3 Nano Omni is the first Nemotron model to natively support audio inputs.
- It is built on the Nemotron 3 Nano 30B-A3B backbone.
- The model delivers consistent accuracy improvements over Nemotron Nano V2 VL.
- It achieves leading results in real-world document understanding, long audio-video comprehension, and agentic computer use.
- Multimodal token-reduction techniques reduce inference latency and increase throughput.
- Model checkpoints are released in BF16, FP8, and FP4 formats.
- Portions of the training data and codebase are also released.
- The model supports text, images, video, and audio inputs.
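To give a sense of what the three checkpoint formats mean in practice, here is a minimal back-of-the-envelope sketch of weight storage at each precision. It assumes the "30B" in the backbone name is the nominal total parameter count and that every parameter is stored at the nominal bit width. Real checkpoints usually keep some layers in higher precision and carry quantization scales, so actual file sizes will differ. The names `weight_gb` and `footprint_table` are illustrative and not part of any NVIDIA release.

```python
# Rough weight-memory estimate for a nominal 30B-parameter checkpoint.
# Assumption: all parameters use the nominal width of the format; this
# ignores quantization scales, higher-precision layers, and activations.

TOTAL_PARAMS = 30e9  # "30B" in the backbone name, taken as a nominal total

BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "FP4": 4}


def weight_gb(num_params: float, bits: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


def footprint_table(num_params: float = TOTAL_PARAMS) -> dict:
    """Map each checkpoint format to its estimated weight size in GB."""
    return {fmt: weight_gb(num_params, bits) for fmt, bits in BITS_PER_PARAM.items()}


if __name__ == "__main__":
    for fmt, gb in footprint_table().items():
        print(f"{fmt}: ~{gb:.0f} GB of weights")
```

Under these assumptions the estimate halves with each step down in precision: roughly 60 GB for BF16, 30 GB for FP8, and 15 GB for FP4.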
Entities
Institutions, tools, and benchmarks
- NVIDIA
- arXiv
- HuggingFace
- Megatron-LM
- Transformer Engine
- Megatron Energon
- NeMo-RL
- NeMo Gym
- NeMo Data Designer
- MediaPerf
- MMLongBench-Doc
- OCRBenchV2
- WorldSense
- DailyOmni
- VoiceBench