Valley3: Omni Foundation Model for E-commerce with Multilingual Audio
Valley3 is an advanced multimodal large language model (MLLM) tailored for a variety of global e-commerce applications, demonstrating a cohesive comprehension and reasoning ability across text, images, video, and audio. A standout aspect is its inherent multilingual audio functionality for e-commerce, achieved by adapting vision-language models to handle audio-visual tasks, especially in short-video contexts. The model undergoes a four-phase omni e-commerce continued pre-training process, gradually developing audio comprehension, cross-modal instruction adherence, e-commerce expertise, and long-context reasoning. After training, Valley3 is further refined with capabilities for long-chain reasoning and adjustable reasoning modes, which include one non-thinking mode and three unique thinking modes. This research is detailed in arXiv:2605.01278.
Key facts
- Valley3 is an omni multimodal large language model for e-commerce.
- It handles text, images, video, and audio.
- It features native multilingual audio capability for e-commerce.
- It is developed via a four-stage omni e-commerce continued pre-training pipeline.
- Post-training enables long-chain reasoning with controllable modes.
- There is one non-thinking mode and three distinct thinking modes.
- Its audio-visual capability is aimed especially at short-video scenarios.
- The work is published on arXiv with ID 2605.01278.
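The staged pipeline and the reasoning modes listed above can be pictured with a small data model. This is only an illustrative sketch, not code from the Valley3 work: the one-to-one mapping of stages to capabilities is an assumption based on the order in which the summary lists them, and the mode names (`THINKING_A` and so on) and the helper `capabilities_after` are hypothetical placeholders, since the summary gives only the counts.

```python
from __future__ import annotations

from enum import Enum

# Capabilities in the order the summary lists them. Treating each one as
# exactly one pre-training stage is an assumption made for this sketch.
PRETRAINING_STAGES = (
    "audio comprehension",
    "cross-modal instruction adherence",
    "e-commerce expertise",
    "long-context reasoning",
)


class ReasoningMode(Enum):
    # Placeholder names: the summary states only that there is one
    # non-thinking mode and three thinking modes.
    NON_THINKING = "non_thinking"
    THINKING_A = "thinking_a"
    THINKING_B = "thinking_b"
    THINKING_C = "thinking_c"

    @property
    def thinks(self) -> bool:
        """True for every mode except the non-thinking one."""
        return self is not ReasoningMode.NON_THINKING


def capabilities_after(stage_count: int) -> tuple[str, ...]:
    """Return the capabilities accumulated after `stage_count` stages."""
    if not 0 <= stage_count <= len(PRETRAINING_STAGES):
        raise ValueError("stage_count must be between 0 and 4")
    return PRETRAINING_STAGES[:stage_count]
```

For example, `capabilities_after(2)` returns the first two capabilities, reflecting the idea that later stages build on earlier ones.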
Entities
Institutions
- arXiv