SWE-QA: Benchmark for Multi-Hop Code Comprehension
Researchers have released SWE-QA, a dataset and benchmark for evaluating multi-hop code comprehension, designed to bridge the gap between simplified evaluation tasks and the complex reasoning required in real-world software development. The dataset comprises 9,072 multiple-choice questions systematically generated from 12 Python repositories in SWE-bench. It targets recurring reasoning patterns, including Declaration-and-Call questions, which link an entity's definition to its uses, and Interacting-Entity questions, which probe how collaborating components interact. Questions were constructed via parsing-based entity extraction combined with LLM-assisted generation, and distractors were carefully validated. The benchmark aims to distinguish genuine comprehension from superficial pattern matching.
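The announcement does not include the extraction code, but a minimal sketch of what parsing-based entity extraction for Declaration-and-Call questions could look like is shown below, using Python's standard `ast` module; `extract_declarations_and_calls` is a hypothetical helper name, not part of SWE-QA.

```python
import ast
from collections import defaultdict

def extract_declarations_and_calls(source: str) -> dict:
    """Map each function name to its definition line and the lines where it is called."""
    tree = ast.parse(source)
    declarations = {}          # function name -> line of its def
    calls = defaultdict(list)  # function name -> lines of its call sites
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            declarations[node.name] = node.lineno
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls[node.func.id].append(node.lineno)
    # Keep only entities that are both declared and called: candidate
    # anchors for Declaration-and-Call questions.
    return {name: (line, calls[name])
            for name, line in declarations.items() if name in calls}

if __name__ == "__main__":
    code = (
        "def area(r):\n"
        "    return 3.14159 * r * r\n"
        "\n"
        "def report(r):\n"
        "    print(area(r))\n"
    )
    print(extract_declarations_and_calls(code))  # {'area': (1, [5])}
```

Linking a definition to its call sites this way yields the "hops" such a question requires: answering it means resolving the entity at the call site back to its declaration.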
Key facts
- SWE-QA is a dataset and benchmark for multi-hop code comprehension.
- It addresses the gap between simplified evaluation tasks and real-world software development.
- The dataset contains 9,072 multiple-choice questions.
- Questions are generated from 12 Python repositories of SWE-bench.
- It evaluates reasoning patterns like Declaration-and-Call and Interacting-Entity questions.
- Generation uses parsing-based entity extraction and LLM-assisted question construction.
- Distractors are carefully validated (a sketch of such checks follows this list).
- The benchmark distinguishes genuine comprehension from superficial pattern matching.
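The source does not publish the item schema, so the following is only a hypothetical record layout, with the kind of structural checks a distractor-validation pass would at minimum enforce; `SWEQAItem` and all field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class SWEQAItem:
    """Hypothetical layout for one SWE-QA multiple-choice item."""
    repo: str            # source repository, one of the 12 SWE-bench Python projects
    pattern: str         # reasoning pattern, e.g. "declaration-and-call"
    question: str
    options: list[str]   # the correct answer plus validated distractors
    answer_index: int    # position of the correct answer in options

    def validate(self) -> None:
        # The answer index must be in range, and options must be pairwise
        # distinct so no distractor duplicates the correct answer.
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index out of range")
        if len(set(self.options)) != len(self.options):
            raise ValueError("duplicate options make the item ambiguous")

item = SWEQAItem(
    repo="example/repo",  # placeholder, not an actual SWE-bench repository
    pattern="interacting-entity",
    question="Which component does Scheduler notify when a job completes?",
    options=["Worker", "Logger", "Cache", "Parser"],
    answer_index=0,
)
item.validate()
```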