Cognitive Digital Shadows: 190K-Record Corpus Tracks LLM Debate on Controversial Topics

ai-technology · 2026-05-01

A new synthetic corpus, Cognitive Digital Shadows (CDS), enables analysis of how large language models discuss divisive societal issues when prompted to shadow human personas or AI-assistant roles. The dataset comprises 190,000 records generated by 19 LLMs, covering vaccines, social media disinformation, gender gaps in science, and STEM stereotypes. Each persona-conditioned record encodes 17 sociodemographic and psychological attributes, so that prompts can be linked to the resulting language, stances, and reasoning. The texts are validated for topic anchoring and support emotional analysis via interpretable NLP techniques such as textual forma mentis networks. A pooling platform with user-friendly dashboards is provided for exploring the data. The research, detailed in arXiv:2604.27624, addresses the scarcity of datasets that control for social and contextual prompting when studying variation in LLM output.

Key facts

Cognitive Digital Shadows (CDS) contains 190,000 records.
Records generated by 19 different LLMs.
LLMs prompted to shadow human personas or AI-assistant roles.
Covers 4 controversial topics: vaccines/healthcare, social media disinformation, gender gap in science, STEM stereotypes.
Persona-conditioned records encode 17 sociodemographic and psychological attributes.
Texts validated for topic anchoring.
Supports emotional analysis via interpretable NLP (e.g., textual forma mentis networks).
Includes a pooling platform with user-friendly dashboards.
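
To make the record structure described above concrete, here is a minimal Python sketch of how one might represent a CDS-style record and tally stances per model and topic. It is an illustration under stated assumptions, not the dataset's actual schema: the class name, every field name, the role and stance labels, and the toy rows are all hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ShadowRecord:
    """Hypothetical shape of one record; field names are assumptions."""
    model: str    # which of the 19 LLMs produced the text
    topic: str    # one of the 4 controversial topics
    role: str     # assumed labels: "human_persona" or "ai_assistant"
    prompt: str   # prompt used to condition the model
    text: str     # generated response
    stance: str   # assumed stance label attached to the response
    # Persona-conditioned records would carry the 17 sociodemographic and
    # psychological attributes here; left empty for AI-assistant roles.
    persona: dict = field(default_factory=dict)


def stance_counts(records):
    """Count records per (model, topic, stance) combination."""
    counts = Counter()
    for record in records:
        counts[(record.model, record.topic, record.stance)] += 1
    return counts


# Toy rows invented purely for illustration; they are not taken from CDS.
sample = [
    ShadowRecord("model-a", "vaccines", "human_persona", "p1", "t1",
                 "supportive", {"age": "34", "education": "graduate"}),
    ShadowRecord("model-a", "vaccines", "ai_assistant", "p2", "t2",
                 "supportive"),
    ShadowRecord("model-b", "vaccines", "human_persona", "p3", "t3",
                 "skeptical", {"age": "61", "education": "secondary"}),
]

counts = stance_counts(sample)
for key, n in sorted(counts.items()):
    print(key, n)
```

Keying the tally on (model, topic, stance) is only one possible choice; grouping on a persona attribute instead would be the natural way to compare how the same model responds under different persona conditions.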

Sources

arXiv cs.AI — 2026-05-01