SEVO: Data-Centric Approach Boosts Robot Manipulation Robustness
A new data-centric method called SEVO (Semantic-Enhanced Virtual Observation) improves cross-environment manipulation robustness for Vision-Language-Action (VLA) and imitation-learning policies without modifying the policy architecture. Described in a preprint posted to arXiv, SEVO transforms the raw RGB camera stream using three mechanisms: body-fixed cameras that cover the full workspace, active red-spectrum illumination that normalizes object appearance, and a real-time YOLO segmentation overlay that provides background-invariant semantic cues. The approach addresses a critical failure mode: policies trained via community toolchains on low-cost hardware achieve high success rates under controlled backgrounds but near-zero transfer to new environments, as reported in the original ACT and SmolVLA benchmarks.
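To make the segmentation-overlay idea concrete, here is a minimal sketch of blending per-object masks onto an RGB frame before it reaches the policy. It is an illustration under stated assumptions, not the SEVO implementation: the function name `overlay_masks`, the `palette` mapping, and the `alpha` blend weight are hypothetical, and the boolean masks are assumed to come from some segmentation model such as YOLO, which is not invoked here.

```python
import numpy as np

def overlay_masks(frame, masks, class_ids, palette, alpha=0.5):
    """Alpha-blend boolean segmentation masks onto an RGB frame.

    frame:     (H, W, 3) uint8 RGB image
    masks:     iterable of (H, W) boolean arrays, one per detected object
    class_ids: iterable of class ids, aligned with masks
    palette:   dict mapping class id -> (R, G, B) color (hypothetical choice)
    alpha:     blend weight of the class color (assumed value, not from the paper)
    """
    out = frame.astype(np.float32)
    for mask, cls in zip(masks, class_ids):
        color = np.asarray(palette[cls], dtype=np.float32)
        # Blend only the pixels that belong to this object.
        out[mask] = (1.0 - alpha) * out[mask] + alpha * color
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy example: a uniform gray frame with one 2x2 object of class 0.
frame = np.full((4, 4, 3), 100, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
palette = {0: (0, 255, 0)}  # class 0 drawn in green (illustrative only)

observation = overlay_masks(frame, [mask], [0], palette, alpha=0.5)
```

Because the overlay color depends only on the object class and not on what lies behind the object, the blended pixels carry a cue that stays stable when the background changes, which is the property the summary attributes to the overlay.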
Key facts
- SEVO is a data-centric approach for VLA and imitation-learning policies
- It improves cross-environment robustness without modifying policy architecture
- Uses body-fixed cameras, active red-spectrum illumination, and YOLO segmentation
- Addresses failure of policies trained on low-cost hardware in new environments
- The original ACT and SmolVLA benchmarks show high success in controlled settings but near-zero transfer to new environments
- SEVO transforms the raw RGB camera stream in three ways
- Published on arXiv with ID 2605.11114
- arXiv announce type: cross (a cross-listing)
Entities
Institutions
- arXiv