Skill Availability Boosts LLM Agent Performance in Controlled Study
A new study on arXiv (2605.31408) examines how the granularity at which skill documents are presented affects the task success of large-language-model agents. Using a pinned SkillsBench version with 30 tasks, two reasoning-enabled models (GPT-5.5 and DeepSeek V4-Flash), six skill conditions, and five trials per cell, the experiment generated 1,800 rows of data (900 per model). Skill availability was the strongest signal: compared with no skill, the skill conditions raised task-mean pass rate by 26.7–36.0 percentage points for GPT-5.5 and by 18.0–26.0 percentage points for DeepSeek V4-Flash. The primary contrasts between presentation formats showed smaller, uncertain effects. For analysis, the study first averages the five trials in each task-condition-model cell, then computes paired contrasts over the 30 tasks.
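The aggregate-then-contrast analysis described above can be illustrated with a minimal Python sketch. It is not the study's code: the function names, the condition labels `no_skill` and `with_skill`, and the toy pass/fail outcomes are all hypothetical, and only the order of operations (average trials within a cell, then take paired per-task differences) comes from the summary.

```python
from collections import defaultdict
from statistics import mean


def cell_pass_rates(rows):
    """Average the trials in each (task, condition) cell into one pass rate."""
    outcomes = defaultdict(list)
    for task, condition, passed in rows:
        outcomes[(task, condition)].append(1.0 if passed else 0.0)
    return {cell: mean(values) for cell, values in outcomes.items()}


def paired_contrast_pp(rates, treatment, baseline):
    """Mean per-task difference (treatment minus baseline), in percentage points."""
    tasks = sorted({task for task, _ in rates})
    diffs = [rates[(t, treatment)] - rates[(t, baseline)] for t in tasks]
    return 100.0 * mean(diffs), diffs


# Toy data, invented for illustration: 3 tasks, 2 conditions, 5 trials per cell.
toy = {
    "task_a": {"no_skill": [0, 0, 1, 0, 0], "with_skill": [1, 1, 1, 0, 1]},
    "task_b": {"no_skill": [1, 1, 1, 1, 1], "with_skill": [1, 1, 1, 1, 1]},
    "task_c": {"no_skill": [0, 0, 0, 0, 0], "with_skill": [0, 1, 0, 0, 1]},
}
rows = [
    (task, condition, bool(outcome))
    for task, by_condition in toy.items()
    for condition, trial_outcomes in by_condition.items()
    for outcome in trial_outcomes
]

rates = cell_pass_rates(rows)
effect_pp, per_task_diffs = paired_contrast_pp(rates, "with_skill", "no_skill")
print(f"Paired contrast: {effect_pp:.1f} pp over {len(per_task_diffs)} tasks")
```

Averaging within cells first means each task contributes one value per condition, so the contrast is paired over tasks rather than over individual trials.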
Key facts
- Study published on arXiv with ID 2605.31408
- Uses a pinned SkillsBench version with 30 domain-balanced tasks
- Tests two models: GPT-5.5 and DeepSeek V4-Flash
- Six skill conditions applied
- Five trials per task-condition-model cell
- 1,800 total data rows (900 per model)
- Skill availability increased task-mean pass rate by 26.7–36.0 pp for GPT-5.5, relative to no skill
- Skill availability increased task-mean pass rate by 18.0–26.0 pp for DeepSeek V4-Flash, relative to no skill
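The row counts listed above follow directly from the stated design. As a quick consistency check, the arithmetic can be written out in a few lines of Python; the constant names are illustrative, and the values are the ones reported in the summary.

```python
# Design parameters as reported in the summary.
TASKS = 30
MODELS = ("GPT-5.5", "DeepSeek V4-Flash")
SKILL_CONDITIONS = 6
TRIALS_PER_CELL = 5

# One cell is a task-condition pair for a given model.
cells_per_model = TASKS * SKILL_CONDITIONS
rows_per_model = cells_per_model * TRIALS_PER_CELL
total_rows = rows_per_model * len(MODELS)

print(cells_per_model, rows_per_model, total_rows)
```

After the five trials in each cell are averaged, each model is left with 180 cell-level pass rates, which is the input to the paired contrasts over tasks.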
Entities
Institutions
- arXiv