SWE-Chain: Benchmarking Coding Agents on Package Upgrades

other · 2026-05-16

SWE-Chain is a new benchmark for assessing coding agents driven by large language models on chained, release-level package upgrades. Whereas existing benchmarks emphasize resolving isolated issues, SWE-Chain evaluates continuous maintenance across successive version transitions, in which each upgrade starts from the codebase the agent produced in the previous step. The benchmark was built with a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for every version transition, so that upgrade specifications are grounded in actual code changes, relevant to agents, and practical to execute. It comprises 12 upgrade chains from 9 real Python packages, covering 155 version transitions and 1,660 grounded upgrade requirements, and addresses a gap in evaluating agents on realistic software evolution tasks that involve bundled changes.

Key facts

SWE-Chain is a benchmark for evaluating coding agents on chained release-level package upgrades.
It captures continuous maintenance across multiple version transitions.
The benchmark uses a divide-and-conquer synthesis pipeline to align release notes with code diffs.
SWE-Chain contains 12 upgrade chains across 9 real Python packages.
It includes 155 version transitions and 1,660 grounded upgrade requirements.
The benchmark focuses on realistic software evolution beyond isolated issue resolution.
Each upgrade transition builds on the agent's prior codebase.
The upgrade specifications are grounded in actual code changes.
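
The chained setup described above can be illustrated with a minimal Python sketch. It is not the SWE-Chain harness: the `Transition` structure, the `run_chain` function, the toy agent and checker, and the per-step score (fraction of requirements satisfied) are all hypothetical names and assumptions introduced here for illustration. The one property it is meant to show is taken from the description: each transition starts from the codebase the agent produced in the previous step, so earlier mistakes carry forward.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: file path -> file contents.
Codebase = Dict[str, str]


@dataclass
class Transition:
    """One version transition in an upgrade chain (hypothetical structure)."""
    from_version: str
    to_version: str
    requirements: List[str]  # grounded upgrade requirements for this step


Agent = Callable[[Codebase, Transition], Codebase]
Checker = Callable[[Codebase, str], bool]


def run_chain(start: Codebase, chain: List[Transition],
              agent: Agent, check: Checker) -> Tuple[Codebase, List[float]]:
    """Apply the agent to each transition in order, carrying its output forward."""
    codebase = dict(start)
    scores: List[float] = []
    for step in chain:
        # The next step starts from the agent's own output, not a reference codebase.
        codebase = agent(codebase, step)
        passed = sum(check(codebase, req) for req in step.requirements)
        # Assumed metric: fraction of this step's requirements that are satisfied.
        scores.append(passed / len(step.requirements) if step.requirements else 1.0)
    return codebase, scores


def toy_agent(codebase: Codebase, step: Transition) -> Codebase:
    """Toy agent that 'implements' only the first requirement of each step."""
    updated = dict(codebase)
    updated["VERSION"] = step.to_version
    if step.requirements:
        updated["notes"] = updated.get("notes", "") + step.requirements[0] + "\n"
    return updated


def toy_check(codebase: Codebase, requirement: str) -> bool:
    """Toy checker: a requirement counts as met if it is recorded in 'notes'."""
    return requirement in codebase.get("notes", "").splitlines()


chain = [
    Transition("1.0", "1.1", ["add-flag", "rename-api"]),
    Transition("1.1", "1.2", ["drop-py37"]),
]
final, scores = run_chain({"VERSION": "1.0"}, chain, toy_agent, toy_check)
print(final["VERSION"], scores)
```

In a real harness the checker would presumably run tests against the upgraded package rather than inspect a notes file; the toy versions here exist only so the loop can be run end to end.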
