SkillEvolBench: Benchmarking LLM Skill Evolution from Experience

other · 2026-05-26

Researchers have introduced SkillEvolBench, a diagnostic benchmark that assesses whether large language model (LLM) agents can convert episodic experience into reusable procedural skills. The benchmark comprises 180 tasks across six real-world agent environments, organized into role-conditioned task families that share latent procedures. Agents first work through acquisition tasks, update an external skill library from compacted trajectories and verifier feedback, and then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. The study evaluates ten model configurations to determine whether accumulated episodic trajectories can be distilled into reusable skills.

Key facts

SkillEvolBench is a diagnostic benchmark for evaluating LLM skill evolution.
It contains 180 tasks across six real-world agent environments.
Tasks are organized into role-conditioned task families with shared latent procedures.
Agents learn from acquisition tasks and update an external skill library.
Deployment tasks test context shift, adversarial shortcuts, and composition.
The benchmark compares self-generated and curated-start skill evolution against controls.
Ten model configurations are tested.
The study was published on arXiv with ID 2605.24117.
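
The acquire-then-deploy protocol described above can be sketched as a small evaluation loop. This is only an illustrative sketch, not the benchmark's actual harness: the names `Task`, `SkillLibrary`, `run_condition`, `toy_solve`, and `toy_distill`, the condition labels, and the task families are all hypothetical, and the toy solver exists only to make the example runnable. Its scores carry no information about the paper's results.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical condition labels mirroring the four setups named in the summary.
CONDITIONS = ("no_skill", "raw_trajectory", "self_generated", "curated_start")


@dataclass(frozen=True)
class Task:
    task_id: str
    family: str  # role-conditioned task family
    phase: str   # "acquisition" or "deployment"


@dataclass
class SkillLibrary:
    skills: Dict[str, str] = field(default_factory=dict)
    frozen: bool = False

    def update(self, family: str, skill: str) -> None:
        if self.frozen:
            raise RuntimeError("skill library is frozen during deployment")
        self.skills[family] = skill


Solver = Callable[[Task, List[str]], Tuple[str, bool]]   # -> (trajectory, passed)
Distiller = Callable[[Task, str, bool], str]             # -> skill text


def run_condition(tasks: List[Task], solve: Solver, distill: Distiller,
                  condition: str,
                  curated: Optional[Dict[str, str]] = None) -> float:
    """Return the deployment success rate for one condition."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    start = dict(curated or {}) if condition == "curated_start" else {}
    library = SkillLibrary(start)
    traces: List[str] = []

    def context() -> List[str]:
        if condition == "no_skill":
            return []
        if condition == "raw_trajectory":
            return list(traces)  # direct reuse of episodic traces
        return list(library.skills.values())

    # Acquisition phase: the agent may accumulate traces or distilled skills.
    for task in (t for t in tasks if t.phase == "acquisition"):
        trajectory, passed = solve(task, context())
        if condition == "raw_trajectory":
            traces.append(trajectory)
        elif condition in ("self_generated", "curated_start"):
            library.update(task.family, distill(task, trajectory, passed))

    # Deployment phase: no further updates are allowed.
    library.frozen = True
    results = [solve(t, context())[1] for t in tasks if t.phase == "deployment"]
    return sum(results) / len(results) if results else 0.0


# Toy stand-ins (assumptions): the solver "passes" when any context entry
# mentions the task's family; a real agent and verifier would replace these.
def toy_solve(task: Task, context: List[str]) -> Tuple[str, bool]:
    passed = any(task.family in entry for entry in context)
    return f"trace {task.task_id} in {task.family}", passed


def toy_distill(task: Task, trajectory: str, passed: bool) -> str:
    return f"procedure for {task.family} (verifier passed: {passed})"


TASKS = [
    Task("a1", "triage", "acquisition"),
    Task("d1", "triage", "deployment"),
    Task("d2", "audit", "deployment"),
]

if __name__ == "__main__":
    for name in CONDITIONS:
        seed = {"audit": "procedure for audit"} if name == "curated_start" else None
        print(name, run_condition(TASKS, toy_solve, toy_distill, name, seed))
```

Running all four conditions on the same task set is what lets the comparison attribute a deployment gain to distilled skills rather than to the base model, a curated starting library, or replayed traces.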
