Valley3: Omni Foundation Model for E-commerce with Multilingual Audio

ai-technology · 2026-05-06

Valley3 is an omni multimodal large language model (MLLM) built for a range of global e-commerce applications, with unified understanding and reasoning across text, images, video, and audio. Its distinguishing feature is native multilingual audio capability for e-commerce, obtained by extending vision-language models to audio-visual tasks, particularly in short-video scenarios. The model goes through a four-stage omni e-commerce continued pre-training pipeline that progressively builds audio comprehension, cross-modal instruction following, e-commerce domain knowledge, and long-context reasoning. In post-training, Valley3 additionally gains long-chain reasoning with controllable reasoning modes: one non-thinking mode and three distinct thinking modes. The work is described in arXiv:2605.01278.

Key facts

Valley3 is an omni multimodal large language model for e-commerce.
It handles text, images, video, and audio.
It features native multilingual audio capability for e-commerce.
It is developed via a four-stage omni e-commerce continued pre-training pipeline.
Post-training enables long-chain reasoning with controllable modes.
There is one non-thinking mode and three distinct thinking modes.
Its audio-visual capability targets short-video scenarios in particular.
The work is published on arXiv with ID 2605.01278.
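
The stage and mode counts above can be written down as a small data structure. The following is only an illustrative sketch, not the authors' code: the source gives no names for the reasoning modes, so the identifiers below are placeholders, and pairing each pre-training stage with exactly one capability (in the order listed) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class ReasoningMode(Enum):
    # Placeholder names: the source states only the counts
    # (one non-thinking mode, three thinking modes).
    NON_THINKING = "non_thinking"
    THINKING_1 = "thinking_1"
    THINKING_2 = "thinking_2"
    THINKING_3 = "thinking_3"

    @property
    def is_thinking(self) -> bool:
        return self is not ReasoningMode.NON_THINKING


@dataclass(frozen=True)
class PretrainStage:
    order: int
    target_capability: str


# Assumption: one capability per stage, in the order the summary lists them.
PRETRAIN_STAGES = tuple(
    PretrainStage(order, capability)
    for order, capability in enumerate(
        [
            "audio comprehension",
            "cross-modal instruction following",
            "e-commerce domain knowledge",
            "long-context reasoning",
        ],
        start=1,
    )
)


def thinking_modes() -> list[ReasoningMode]:
    """Return only the modes that perform explicit reasoning."""
    return [mode for mode in ReasoningMode if mode.is_thinking]
```

Keeping the modes in an enum makes the one-versus-three split explicit and easy to check, without implying anything about how the actual model selects a mode.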
