SWE-QA: Benchmark for Multi-Hop Code Comprehension
Researchers have released SWE-QA, a dataset and benchmark for evaluating multi-hop code comprehension, designed to bridge the gap between simplified evaluation tasks and the complex reasoning required in real-world software development. The dataset comprises 9,072 multiple-choice questions systematically generated from 12 Python repositories in SWE-bench. It targets recurring reasoning patterns, including Declaration-and-Call questions, which link an entity's definition to its uses, and Interacting-Entity questions, which probe how collaborating components interact. Questions were constructed via parsing-based entity extraction combined with LLM-assisted generation, and distractors were carefully validated. The benchmark aims to distinguish genuine comprehension from superficial pattern matching.
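The announcement does not include the extraction code, but a minimal sketch of what parsing-based entity extraction for Declaration-and-Call questions could look like is shown below, using Python's standard `ast` module; `extract_declarations_and_calls` is a hypothetical helper name, not part of SWE-QA.

```python
import ast
from collections import defaultdict

def extract_declarations_and_calls(source: str) -> dict:
    """Map each function name to its definition line and the lines where it is called."""
    tree = ast.parse(source)
    declarations = {}          # function name -> line of its def
    calls = defaultdict(list)  # function name -> lines of its call sites
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            declarations[node.name] = node.lineno
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls[node.func.id].append(node.lineno)
    # Keep only entities that are both declared and called: candidate
    # anchors for Declaration-and-Call questions.
    return {name: (line, calls[name])
            for name, line in declarations.items() if name in calls}

if __name__ == "__main__":
    code = (
        "def area(r):\n"
        "    return 3.14159 * r * r\n"
        "\n"
        "def report(r):\n"
        "    print(area(r))\n"
    )
    print(extract_declarations_and_calls(code))  # {'area': (1, [5])}
```

Linking a definition to its call sites this way yields the "hops" such a question requires: answering it means resolving the entity at the call site back to its declaration.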
Key facts
- SWE-QA is a dataset and benchmark for multi-hop code comprehension.
- It addresses the gap between simplified evaluation tasks and real-world software development.
- The dataset contains 9,072 multiple-choice questions.
- Questions are generated from 12 Python repositories of SWE-bench.
- It evaluates reasoning patterns like Declaration-and-Call and Interacting-Entity questions.
- Generation uses parsing-based entity extraction and LLM-assisted question construction.
- Distractors are carefully validated (a sketch of such checks follows this list).
- The benchmark distinguishes genuine comprehension from superficial pattern matching.
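The source does not publish the item schema, so the following is only a hypothetical record layout, with the kind of structural checks a distractor-validation pass would at minimum enforce; `SWEQAItem` and all field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class SWEQAItem:
    """Hypothetical layout for one SWE-QA multiple-choice item."""
    repo: str            # source repository, one of the 12 SWE-bench Python projects
    pattern: str         # reasoning pattern, e.g. "declaration-and-call"
    question: str
    options: list[str]   # the correct answer plus validated distractors
    answer_index: int    # position of the correct answer in options

    def validate(self) -> None:
        # The answer index must be in range, and options must be pairwise
        # distinct so no distractor duplicates the correct answer.
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index out of range")
        if len(set(self.options)) != len(self.options):
            raise ValueError("duplicate options make the item ambiguous")

item = SWEQAItem(
    repo="example/repo",  # placeholder, not an actual SWE-bench repository
    pattern="interacting-entity",
    question="Which component does Scheduler notify when a job completes?",
    options=["Worker", "Logger", "Cache", "Parser"],
    answer_index=0,
)
item.validate()
```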