ARTFEED — Contemporary Art Intelligence

LLM-Generated Survey Data Tested for Population Synthesis

ai-technology · 2026-05-28

A study assesses whether health survey data generated by zero-shot large language models (LLMs) can substitute for real survey data in geographically explicit population synthesis. Using GPT-4.1 and Gemini-2.5-Pro, the researchers generated synthetic datasets for Colorado and Mississippi, with the 2023 Behavioral Risk Factor Surveillance System (BRFSS) as the reference. The datasets were fed into an iterative proportional fitting (IPF) pipeline to build synthetic populations at the census tract level. Both LLMs captured the major state-level contrasts, indicating that zero-shot generation can produce geographically distinct data. The study also reports that performance varied, and it highlights both the potential and the limitations of LLMs for generating synthetic demographic data.
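For readers unfamiliar with IPF, the core idea can be sketched in a few lines of Python. This is a minimal two-dimensional illustration, not the study's pipeline: the function name `ipf`, the seed table, and the tract margins below are all hypothetical values chosen for the example. The seed stands in for a cross-tabulation from survey data (real or LLM-generated), and the margins stand in for census tract totals.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy, leave the seed untouched
    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target margin.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                factor = target / total
                table[i] = [v * factor for v in table[i]]
        # Column step: scale each column so it sums to its target margin.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Columns are exact after the column step, so only rows need checking.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table


# Hypothetical seed: survey counts by age group (rows) and health status (columns).
seed = [[30.0, 10.0],
        [20.0, 40.0]]
# Hypothetical tract margins; both must add up to the same population total.
tract_age = [600.0, 400.0]
tract_health = [700.0, 300.0]

fitted = ipf(seed, tract_age, tract_health)
for row in fitted:
    print([round(v, 1) for v in row])
```

The fitted table matches the tract margins while keeping the association structure of the seed (its odds ratio is unchanged), which is why the quality of the seed survey data matters for the resulting synthetic population.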

Key facts

  • Study uses zero-shot LLM-generated health survey data for population synthesis.
  • Models tested: GPT-4.1 and Gemini-2.5-Pro.
  • Data source: 2023 Behavioral Risk Factor Surveillance System (BRFSS).
  • Geographic focus: Colorado and Mississippi, U.S.
  • Method: Iterative proportional fitting (IPF) pipeline.
  • Outcome: LLMs captured major state-level contrasts.
  • Limitations: Performance not fully benchmarked.
  • Published on arXiv: 2605.27401.

Entities

Institutions

  • arXiv

Locations

  • Colorado
  • Mississippi
  • United States

Sources