OpenAI Models Show Recognition of Copyrighted O'Reilly Media Books

ai-technology · 2026-05-07

A study applying the DE-COP membership inference attack to a dataset of 34 copyrighted O'Reilly Media books found that OpenAI's GPT-4o exhibits patterns consistent with recognizing paywalled book content, with an AUROC score of 0.82 (95% CI: 0.60-0.96). GPT-4o Mini showed little recognition of non-public content, with an AUROC of 0.56 (95% CI: 0.28-0.83), which is close to the 0.5 expected from chance. The research highlights potential copyright issues in LLM training data, though the wide confidence intervals reflect the uncertainty that comes with a small number of books. Testing multiple models with the same cutoff date served as a partial control for language shifts over time, a potential source of bias.
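To make the reported metric concrete, the following is a minimal sketch of how an AUROC and a percentile bootstrap confidence interval can be computed from per-book scores. It is not the study's code: the summary does not say how the intervals were derived, so the bootstrap method is an assumption, and the scores, labels, and function names (`auroc`, `bootstrap_ci`) are hypothetical. Here a label of 1 stands for a book the model could have seen in training and 0 for one it could not.

```python
import random


def auroc(scores, labels):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    indices = range(len(scores))
    stats = []
    while len(stats) < n_resamples:
        # Resample books with replacement.
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        # Skip resamples that contain only one class; AUROC is undefined there.
        if 0 < sum(ys) < len(ys):
            stats.append(auroc([scores[i] for i in sample], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-book scores, invented for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]

point = auroc(toy_scores, toy_labels)
low, high = bootstrap_ci(toy_scores, toy_labels)
print(f"AUROC = {point:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

With only a handful of items, the resampled AUROC values spread widely, which is the same effect behind the wide intervals reported for the 34-book dataset.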

Key facts

Dataset of 34 copyrighted O'Reilly Media books used
DE-COP membership inference attack method applied
GPT-4o AUROC score: 0.82 (95% CI: 0.60-0.96)
GPT-4o Mini AUROC score: 0.56 (95% CI: 0.28-0.83) for non-public data
Wide confidence intervals due to limited number of books
Multiple models tested with same cutoff date as partial control
Study investigates whether OpenAI's LLMs show recognition of copyrighted content
Potential language shifts over time considered as bias factor
