SU-01: A 30B-A3B Model Achieves Gold-Medal-Level Olympiad Reasoning via Simple Scaling
The new model SU-01 reaches gold-medal-level performance on problems from the International Mathematical Olympiad (IMO) and the International Physics Olympiad (IPhO) using a simple, unified recipe. The model is built on a 30B-A3B architecture, supervised fine-tuned (SFT) on roughly 340K trajectories of under 8K tokens each, and then trained for 200 reinforcement learning (RL) steps. The recipe combines a reverse-perplexity curriculum during SFT to encourage thorough proof search and self-verification, a two-stage RL pipeline that moves from verifiable rewards to proof-level RL, and test-time scaling to further boost solving performance. The work is described in arXiv paper 2605.13301, 'Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling,' and marks a notable step in automated mathematical and scientific problem solving.
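The paper does not spell out the curriculum mechanics here, but a reverse-perplexity curriculum can be sketched as follows: a minimal illustration assuming each SFT trajectory carries a precomputed perplexity under the base model, and that "reverse" means presenting higher-perplexity (harder-to-predict) trajectories first. The `Trajectory` class and the scoring are hypothetical stand-ins, not SU-01's actual data pipeline.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str
    perplexity: float  # assumed: precomputed under the base model

def reverse_perplexity_order(trajs: list[Trajectory]) -> list[Trajectory]:
    # Hypothetical reading of the curriculum: order SFT data from
    # highest to lowest perplexity, so harder trajectories come first.
    return sorted(trajs, key=lambda t: t.perplexity, reverse=True)

data = [Trajectory("a", 2.1), Trajectory("b", 9.7), Trajectory("c", 5.3)]
curriculum = reverse_perplexity_order(data)
print([t.text for t in curriculum])  # → ['b', 'c', 'a']
```

In an actual SFT run, the ordered trajectories would then be fed to the trainer in curriculum order rather than shuffled uniformly.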
Key facts
- SU-01 achieves gold-medal-level performance on IMO and IPhO problems.
- Model uses a 30B-A3B backbone.
- Trained on around 340K sub-8K-token trajectories.
- Training involved 200 RL steps.
- Recipe includes reverse-perplexity curriculum for SFT.
- Two-stage RL pipeline: verifiable rewards then proof-level RL.
- Test-time scaling is used to boost performance.
- Paper published on arXiv with ID 2605.13301.
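Test-time scaling, the last item above, is commonly realized as best-of-n sampling: draw several candidate solutions and keep the one a scorer prefers. The sketch below is a generic illustration of that pattern, not SU-01's specific method; `generate` and `score` are hypothetical stand-ins for model sampling and a verifier or reward model.

```python
import random

def generate(problem: str, seed: int) -> str:
    # Hypothetical stand-in for sampling one candidate solution.
    rng = random.Random(seed)
    return f"solution-{rng.randint(0, 9)}"

def score(problem: str, candidate: str) -> float:
    # Hypothetical stand-in for a verifier / reward-model score.
    return float(candidate.split("-")[1])

def best_of_n(problem: str, n: int = 8) -> str:
    # Test-time scaling: sample n candidates, return the highest-scoring one.
    candidates = [generate(problem, seed=i) for i in range(n)]
    return max(candidates, key=lambda c: score(problem, c))

print(best_of_n("IMO problem 1", n=4))
```

Spending more compute at inference (larger n, or longer reasoning traces) is what "test-time scaling" refers to in the summary above.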
Entities
Institutions
- arXiv