Simpson's Paradox Distorts Behavioral Curve Models in User Dynamics
A recent study indicates that aggregation systematically distorts behavioral curve models, a common technique in fields such as recommendation systems, advertising, and clinical dosing. The researchers demonstrate Simpson's paradox in behavioral curves using Goodreads data covering 3.3 million users across 9 genres. Individual users' curves peak at about 11 exposures, while the aggregate curve peaks at around 34, a roughly threefold gap that the study attributes to survival bias. The Amazon Electronics dataset (18 million reviews) shows a 5.3x distortion. MovieLens-25M acts as a negative control, supporting survival bias as the key mechanism. The study also introduces Synthetic Null Calibration to address a 32% false positive rate in per-user classification. The findings are relevant to any domain that estimates individual behavioral parameters from aggregated data.
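To make the survival-bias mechanism concrete, here is a minimal toy sketch in Python. It is not the study's method or data: the user population, the quadratic response curve, the `baseline` and `lifetime` fields, and every constant are assumptions chosen for illustration. Every simulated user peaks at exactly 11 exposures, but users with higher baseline engagement stay active longer, so the average over whoever is still active peaks later.

```python
# Toy illustration only: all names and constants are hypothetical.

def response(n, baseline, peak=11, curvature=0.002):
    """Inverted-U engagement curve; every user shares the same peak exposure."""
    return baseline - curvature * (n - peak) ** 2

# Assumption: more engaged users (higher baseline) survive to more exposures.
users = [{"baseline": 0.05 * i, "lifetime": 12 + i} for i in range(60)]

def individual_peak(user):
    """Exposure count at which this user's own curve is highest."""
    exposures = range(1, user["lifetime"] + 1)
    return max(exposures, key=lambda n: response(n, user["baseline"]))

def aggregate_curve(users):
    """Mean engagement at each exposure count over users still active there."""
    max_n = max(u["lifetime"] for u in users)
    curve = {}
    for n in range(1, max_n + 1):
        active = [u for u in users if u["lifetime"] >= n]
        curve[n] = sum(response(n, u["baseline"]) for u in active) / len(active)
    return curve

individual_peaks = [individual_peak(u) for u in users]
curve = aggregate_curve(users)
aggregate_peak = max(curve, key=curve.get)

print("individual peaks:", sorted(set(individual_peaks)))
print("aggregate peak:", aggregate_peak)
```

In this toy setup the aggregate peak lands later than 11 even though no individual curve peaks anywhere else. The size of the shift depends entirely on the assumed link between baseline and lifetime, so the sketch shows the direction of the effect and says nothing about the magnitudes reported for Goodreads or Amazon Electronics.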
Key facts
- Aggregation introduces systematic distortion in behavioral curve modeling.
- Simpson's paradox observed in behavioral curves on Goodreads (3.3M users, 9 genres).
- Individual users peak at ~11 exposures; aggregate peaks at ~34 exposures (3x gap).
- Amazon Electronics (18M reviews) shows 5.3x distortion.
- MovieLens-25M serves as a negative control, supporting survival bias as the mechanism.
- Distortion robust to category granularity, engagement operationalization, and classifier calibration.
- Synthetic Null Calibration developed to address a 32% false positive rate in per-user classification.
- Findings apply to any domain estimating individual behavioral parameters from aggregated data.
Entities
Institutions
- arXiv
- Goodreads
- Amazon
- MovieLens