SEVO: Data-Centric Approach Boosts Robot Manipulation Robustness

ai-technology · 2026-05-13

A new data-centric method called SEVO (Semantic-Enhanced Virtual Observation) improves cross-environment manipulation robustness for Vision-Language-Action (VLA) and imitation-learning policies without modifying the policy architecture. Described in a preprint posted to arXiv, SEVO transforms the raw RGB camera stream using three mechanisms: body-fixed cameras that cover the full workspace, active red-spectrum illumination that normalizes object appearance, and a real-time YOLO segmentation overlay that provides background-invariant semantic cues. The approach addresses a critical failure mode: policies trained via community toolchains on low-cost hardware achieve high success rates under controlled backgrounds, as reported in the original ACT and SmolVLA benchmarks, but show near-zero transfer to new environments.
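To make the segmentation-overlay idea concrete, the following is a minimal sketch of how per-object masks could be blended onto an RGB frame before it reaches a policy. It is not the authors' implementation: the function name `overlay_segmentation`, the `alpha` blending weight, and the fixed color per object are illustrative assumptions, and the masks are assumed to come from some upstream segmentation model (such as YOLO) that is not shown here.

```python
import numpy as np

def overlay_segmentation(frame, masks, colors, alpha=0.5):
    """Blend solid colors onto an RGB frame wherever a mask is set.

    Hypothetical helper for illustration only.
    frame:  (H, W, 3) uint8 RGB image
    masks:  list of (H, W) boolean arrays, one per detected object
    colors: list of (R, G, B) tuples, one per mask
    alpha:  blending weight of the overlay color (assumed value)
    """
    out = frame.astype(np.float32)  # work on a float copy; input is untouched
    for mask, color in zip(masks, colors):
        color_arr = np.array(color, dtype=np.float32)
        # Alpha-blend the object color into the masked pixels only.
        out[mask] = (1.0 - alpha) * out[mask] + alpha * color_arr
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Usage: a black 4x4 frame with one object mask in the top-left corner.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
augmented = overlay_segmentation(frame, [mask], [(200, 0, 0)], alpha=0.5)
```

Because the overlay color depends only on the detected object and not on what surrounds it, the masked pixels carry a cue that stays the same when the background changes, which is the intuition behind a background-invariant semantic signal.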

Key facts

SEVO is a data-centric approach for VLA and imitation-learning policies
It improves cross-environment robustness without modifying policy architecture
Uses body-fixed cameras, active red-spectrum illumination, and YOLO segmentation
Addresses failure of policies trained on low-cost hardware in new environments
Original ACT and SmolVLA benchmarks show high success in controlled settings but near-zero transfer
SEVO transforms the raw RGB camera stream in three ways
Published on arXiv with ID 2605.11114
arXiv announce type is "cross" (a cross-listing)
