ImmigrationQA: A Dataset for U.S. Immigration Law
U.S. immigration law is complex and changes often, which makes it hard to navigate for people without legal help. ImmigrationQA is a dataset built by researchers to address immigration-related questions. It contains 17,058 question-answer pairs spanning 13 immigration subdomains. The underlying corpus draws on 11 sources, including the USCIS Policy Manual, 8 CFR, and BIA precedent decisions, and comprises 10,056 validated canonical documents and 18,308 text chunks. The researchers generated the QA pairs with Claude Sonnet 4.6 using five mode-specific prompts, and rejected 22 pairs for insufficient overlap with their source spans. They then fine-tuned a Llama 3.2 3B Instruct model on this data using LoRA and evaluated it on 993 held-out pairs, with an LLM judge scoring answers on a scale out of 101.
Key facts
- The ImmigrationQA dataset has 17,058 QA pairs across 13 immigration subdomains.
- The corpus draws on 11 sources, including the USCIS Policy Manual, 8 CFR, and BIA precedent decisions.
- 10,056 validated canonical documents and 18,308 text chunks.
- QA pairs generated using Claude Sonnet 4.6 with five mode-specific prompts.
- 22 pairs rejected for insufficient source-span overlap.
- A Llama 3.2 3B Instruct model was fine-tuned using LoRA.
- Evaluated on 993 held-out pairs with LLM-as-judge scoring.
- U.S. immigration law is complex and high-stakes for unrepresented petitioners.
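The rejection of pairs for insufficient source-span overlap can be illustrated with a minimal sketch. The summary does not say how overlap was measured, so everything here is an assumption: the token-level overlap metric, the 0.5 threshold, the function names `span_overlap` and `filter_pairs`, the `answer`/`source` field names, and the sample texts are all hypothetical and are not the authors' implementation.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def span_overlap(answer: str, source: str) -> float:
    """Fraction of answer tokens that also occur in the source chunk.

    Assumed metric: the dataset's actual overlap measure is not specified.
    """
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    source_vocab = set(tokenize(source))
    hits = sum(1 for tok in answer_tokens if tok in source_vocab)
    return hits / len(answer_tokens)


def filter_pairs(pairs: list[dict], min_overlap: float = 0.5):
    """Split QA pairs into (kept, rejected) by overlap with their source.

    The 0.5 threshold is a placeholder, not a value from the dataset.
    """
    kept, rejected = [], []
    for pair in pairs:
        score = span_overlap(pair["answer"], pair["source"])
        (kept if score >= min_overlap else rejected).append(pair)
    return kept, rejected


# Made-up sample data, for illustration only.
SOURCE = "An applicant must file Form I-485 to adjust status."
PAIRS = [
    {"answer": "The applicant must file Form I-485.", "source": SOURCE},
    {"answer": "Processing always takes exactly ninety days.", "source": SOURCE},
]

if __name__ == "__main__":
    kept, rejected = filter_pairs(PAIRS)
    print(f"kept={len(kept)} rejected={len(rejected)}")
```

A lexical check like this only catches answers whose wording drifts away from the source text. It would not catch an answer that reuses the source vocabulary while stating something the source does not support, which is one reason a separate held-out evaluation is still useful.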
Entities
Institutions
- USCIS
- BIA
Locations
- United States