ARTFEED — Contemporary Art Intelligence

Nemotron 3 Nano Omni: Open Multimodal AI Model Released

ai-technology · 2026-04-28

NVIDIA has released Nemotron 3 Nano Omni, a multimodal AI model that natively processes text, images, video, and audio. Built on the Nemotron 3 Nano 30B-A3B backbone, it improves accuracy over its predecessor, Nemotron Nano V2 VL, across all modalities, with leading results in document understanding, long audio-video comprehension, and agentic computer use. The model uses multimodal token-reduction techniques to lower inference latency and raise throughput. Checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and code, are openly released to support further research.

Key facts

  • Nemotron 3 Nano Omni is the first Nemotron model to natively support audio inputs.
  • It is built on the Nemotron 3 Nano 30B-A3B backbone.
  • The model delivers consistent accuracy improvements over Nemotron Nano V2 VL.
  • It achieves leading results in real-world document understanding, long audio-video comprehension, and agentic computer use.
  • Multimodal token-reduction techniques reduce inference latency and increase throughput.
  • Model checkpoints are released in BF16, FP8, and FP4 formats.
  • Portions of the training data and codebase are also released.
  • The model supports text, images, video, and audio inputs.
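
To give a sense of what the three checkpoint formats mean in practice, the following is a minimal sketch of the weight-only storage each would imply. It assumes "30B" in the backbone name means roughly 30 billion total parameters, and that BF16, FP8, and FP4 store 16, 8, and 4 bits per parameter. The function and constant names are hypothetical, and the estimate ignores quantization scale factors, any additional vision or audio encoder weights, and runtime memory such as activations or caches.

```python
# Rough weight-only footprint per checkpoint format (illustrative estimate).
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "FP4": 4}

# Assumption: "30B" is read as ~30 billion total parameters.
ASSUMED_TOTAL_PARAMS = 30e9


def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for fmt, bits in BITS_PER_PARAM.items():
        size = weight_footprint_gb(ASSUMED_TOTAL_PARAMS, bits)
        print(f"{fmt}: ~{size:.0f} GB")
```

Under these assumptions, each halving of precision halves the weight storage, which is the main reason lower-precision checkpoints are easier to deploy on smaller hardware.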

Entities

Organizations, tools, and benchmarks

  • NVIDIA
  • arXiv
  • Hugging Face
  • Megatron-LM
  • Transformer Engine
  • Megatron Energon
  • NeMo-RL
  • NeMo Gym
  • NeMo Data Designer
  • MediaPerf
  • MMLongBench-Doc
  • OCRBenchV2
  • WorldSense
  • DailyOmni
  • VoiceBench

Sources