NVIDIA Nemotron 3 Nano Omni: Multimodal AI for Documents, Audio, and Video
NVIDIA has unveiled Nemotron 3 Nano Omni, a versatile model for document analysis, image reasoning, speech recognition, audio-video understanding, and general reasoning tasks. The newest addition to the Nemotron series integrates text, images, video, and audio, achieving leading accuracy on benchmarks such as MMLongBench-Doc and VoiceBench, and stands out as the most cost-efficient open video understanding model on MediaPerf. Its architecture pairs a hybrid Mamba-Transformer backbone with a C-RADIOv4-H vision encoder. Key advancements include dynamic-resolution processing, Conv3D temporal compression, and Efficient Video Sampling (EVS), delivering up to 9x higher throughput and 2.9x faster reasoning. Training combined staged multimodal alignment and reinforcement learning on NVIDIA H100 and B200 clusters, with checkpoints available on HuggingFace.
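The Conv3D temporal compression mentioned above can be illustrated with a toy sketch. The function below mean-pools patch embeddings across groups of consecutive frames, a crude stand-in for a learned Conv3D tubelet embedding; the shapes, stride, and function name are illustrative assumptions, not NVIDIA's actual implementation.

```python
import numpy as np

def tubelet_fuse(video: np.ndarray, t_stride: int = 2) -> np.ndarray:
    """Fuse every `t_stride` consecutive frames of patch embeddings into one
    token per patch position -- a mean-pooling stand-in for a learned Conv3D
    tubelet embedding (illustrative only).

    video: (T, P, D) array of T frames, P patch tokens each, D-dim embeddings.
    Returns (T // t_stride, P, D) fused tokens, cutting the video token
    count by a factor of t_stride.
    """
    T, P, D = video.shape
    assert T % t_stride == 0, "pad or trim the clip to a multiple of t_stride"
    # Group frames into non-overlapping tubelets along time, then average.
    return video.reshape(T // t_stride, t_stride, P, D).mean(axis=1)

# 16 frames x 256 patches = 4096 tokens; stride 2 halves that to 2048.
clip = np.random.default_rng(0).normal(size=(16, 256, 32))
fused = tubelet_fuse(clip, t_stride=2)
print(fused.shape)  # (8, 256, 32)
```

In the real model the fusion weights are learned by a Conv3D layer rather than fixed averaging, but the token-count arithmetic is the same: fusing frames along time is what shrinks long videos to a manageable sequence length.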
Key facts
- NVIDIA Nemotron 3 Nano Omni is a new omni-modal understanding model.
- It achieves top accuracy on MMLongBench-Doc, OCRBenchV2, WorldSense, DailyOmni, and VoiceBench.
- It is the most cost-efficient open video understanding model on MediaPerf.
- The architecture uses a hybrid Mamba-Transformer MoE backbone with C-RADIOv4-H vision encoder and Parakeet-TDT-0.6B-v2 audio encoder.
- Dynamic resolution supports 1,024 to 13,312 visual patches per image.
- Conv3D tubelet embedding fuses consecutive video frames to reduce tokens.
- EVS drops redundant video tokens during inference.
- Audio input can be up to 20 minutes, with LLM context supporting 5+ hours.
- Training used staged multimodal alignment, context extension, preference optimization, and RL.
- Checkpoints available on HuggingFace in BF16, FP8, and NVFP4 formats.
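Efficient Video Sampling, as described in the key facts above, drops redundant video tokens at inference time. A minimal sketch of the idea, assuming a cosine-similarity pruning rule (the actual EVS criterion is not specified here): keep a patch token only if its embedding differs enough from the same spatial patch in the previous frame.

```python
import numpy as np

def efficient_video_sampling(frames: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Illustrative token-pruning sketch (not NVIDIA's EVS implementation).

    frames: (T, P, D) array of T frames, P patch tokens each, D-dim embeddings.
    Returns a (T, P) boolean mask of tokens to keep: a token is dropped when
    its cosine similarity to the same patch in the previous frame exceeds
    `threshold`, i.e. when it is visually redundant.
    """
    T, P, D = frames.shape
    keep = np.ones((T, P), dtype=bool)  # the first frame is always kept
    # L2-normalize embeddings so dot products become cosine similarities.
    norms = np.linalg.norm(frames, axis=-1, keepdims=True)
    unit = frames / np.clip(norms, 1e-8, None)
    for t in range(1, T):
        # Similarity of each patch to the same patch one frame earlier.
        sim = np.sum(unit[t] * unit[t - 1], axis=-1)  # shape (P,)
        keep[t] = sim < threshold  # drop tokens that barely changed
    return keep

# A perfectly static clip: every frame after the first is pruned entirely.
static = np.tile(np.random.default_rng(0).normal(size=(1, 4, 8)), (5, 1, 1))
print(efficient_video_sampling(static).sum())  # 4 tokens kept (first frame only)
```

On mostly static footage this kind of pruning removes the bulk of the tokens before they reach the LLM, which is the intuition behind EVS's inference-time throughput gains.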
Entities
Institutions
- NVIDIA
- HuggingFace
Frameworks and tools
- Megatron-LM
- Transformer Engine
- Megatron Energon
- NeMo-RL
- NeMo Gym
- NeMo Data Designer
Benchmarks
- MediaPerf
- MMLongBench-Doc
- OCRBenchV2
- WorldSense
- DailyOmni
- VoiceBench