LLM-Generated Survey Data Tested for Population Synthesis

ai-technology · 2026-05-28

A study tests whether health survey data generated by zero-shot large language models (LLMs) can stand in for real survey data in geographically explicit population synthesis. Using GPT-4.1 and Gemini-2.5-Pro, the researchers generated synthetic datasets for Colorado and Mississippi, with the 2023 Behavioral Risk Factor Surveillance System (BRFSS) as the reference. The datasets were fed into an iterative proportional fitting (IPF) pipeline to build synthetic populations at the census tract level. Both LLMs captured the major state-level contrasts, indicating that zero-shot generation can produce geographically differentiated data. The study also notes that performance varied, pointing to both the potential and the limits of LLMs for generating synthetic demographic data.
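For readers unfamiliar with IPF, the following is a minimal sketch of the general technique, not the study's actual pipeline. It rescales a seed cross-tabulation (here standing in for survey responses) until its row and column totals match known margins (here standing in for tract-level census counts). The function name `ipf`, the variable names, and all numbers are hypothetical and chosen only for illustration.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Two-dimensional iterative proportional fitting (illustrative sketch).

    seed        -- 2-D list of non-negative counts (e.g. a survey cross-tab)
    row_targets -- desired row totals (e.g. census counts by age group)
    col_targets -- desired column totals (must sum to the same grand total)
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                factor = target / total
                table[i] = [v * factor for v in table[i]]
        # Column step: scale each column so it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Columns match after the column step, so check the rows for convergence.
        max_err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if max_err < tol:
            break
    return table


# Hypothetical example: age group (rows) by self-reported health status (columns).
seed = [[30, 20], [10, 40]]   # made-up survey cross-tab
row_targets = [600, 400]      # made-up tract margin for age group
col_targets = [450, 550]      # made-up tract margin for health status
fitted = ipf(seed, row_targets, col_targets)
```

The fitted table keeps the association pattern of the seed while matching both sets of margins, which is why the quality of the seed data (real or LLM-generated) matters for the resulting synthetic population.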

Key facts

Study uses zero-shot LLM-generated health survey data for population synthesis.
Models tested: GPT-4.1 and Gemini-2.5-Pro.
Data source: 2023 Behavioral Risk Factor Surveillance System (BRFSS).
Geographic focus: Colorado and Mississippi, U.S.
Method: Iterative proportional fitting (IPF) pipeline.
Outcome: LLMs captured major state-level contrasts.
Limitations: Performance not fully benchmarked.
Published on arXiv: 2605.27401.

Entities

Institutions

arXiv

Locations

Colorado
Mississippi
United States

Sources

arXiv cs.AI — 2026-05-28