SPECTRA Framework Enables Supervision-Free Agentic Capabilities for Small Vision-Language Models
Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA) is a new framework that targets two weaknesses of Small Vision-Language Models (SVLMs): visual brittleness and poor tool orchestration. SPECTRA bootstraps agentic capabilities through Coldstart Reinforcement Learning, avoiding expensive supervised trajectory tuning. It enforces Soft Structured Multi-turn Rollouts, a topological constraint that requires agents to sequence tool-derived evidence before synthesis, which grounds reasoning in visual observations. A multi-objective reward signal jointly maximizes task correctness, rollout structure, and tool utility, so agents can self-discover robust behaviors without human preference labels. The research also introduces Tool Instrumental Utility (TIU), a new metric for quantifying tool effectiveness. The work is detailed in arXiv preprint 2604.17475v1 and aims to reduce the dependence of SVLM training on human-labeled data.
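To make the multi-objective reward concrete, here is a minimal sketch of one way three component scores could be folded into a single scalar. The weighted-sum form, the weight values, the [0, 1] score range, and the names `RewardWeights` and `combined_reward` are all assumptions for illustration; the summary does not say how SPECTRA actually combines the terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the source does not state how the terms are balanced.
    correctness: float = 1.0
    structure: float = 0.5
    tool_utility: float = 0.5


def combined_reward(
    correct: float,
    structure: float,
    tool_utility: float,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of three component scores, each assumed to lie in [0, 1]."""
    components = (
        ("correct", correct),
        ("structure", structure),
        ("tool_utility", tool_utility),
    )
    for name, value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (
        w.correctness * correct
        + w.structure * structure
        + w.tool_utility * tool_utility
    )
```

With these illustrative weights, a rollout that answers correctly but ignores structure and tools scores lower than one that does all three, which is the kind of pressure a joint objective is meant to apply.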
Key facts
- SPECTRA is a supervision-free framework for Small Vision-Language Models (SVLMs)
- It addresses visual brittleness and poor tool orchestration in SVLMs
- The framework uses Coldstart Reinforcement Learning
- It enforces Soft Structured Multi-turn Rollouts to sequence tool-derived evidence before synthesis
- A multi-objective reward signal maximizes task correctness, rollout structure, and tool utility
- Agents can self-discover robust behaviors without human preference labels
- Tool Instrumental Utility (TIU) is a novel metric introduced in the research
- The work is detailed in arXiv preprint 2604.17475v1
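The ordering constraint listed above (tool-derived evidence before synthesis) can be illustrated with a small scoring function. This is a sketch under stated assumptions, not the preprint's definition: the turn labels `"tool"` and `"synthesis"`, the function name `structure_score`, and the reading of "soft" as partial credit rather than hard rejection are all hypothetical.

```python
from typing import List

TOOL = "tool"            # hypothetical label: a turn that calls a tool and records its output
SYNTHESIS = "synthesis"  # hypothetical label: a turn that states the final answer


def structure_score(turns: List[str]) -> float:
    """Soft score for whether a rollout gathers tool evidence before synthesizing.

    1.0: a tool turn precedes the synthesis, and the synthesis is the final turn.
    0.5: a tool turn precedes the first synthesis, but more turns follow it.
    0.0: no synthesis turn, or synthesis with no earlier tool evidence.
    """
    if SYNTHESIS not in turns:
        return 0.0
    first_synthesis = turns.index(SYNTHESIS)
    if TOOL not in turns[:first_synthesis]:
        return 0.0
    # Full credit only when the first synthesis is also the last turn.
    return 1.0 if first_synthesis == len(turns) - 1 else 0.5


if __name__ == "__main__":
    print(structure_score([TOOL, TOOL, SYNTHESIS]))
    print(structure_score([SYNTHESIS]))
```

A graded score like this gives a reinforcement learning signal even for imperfect rollouts, whereas a hard accept/reject filter would return nothing useful until the agent already follows the structure.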
Entities
Institutions
- arXiv