A recent study has raised questions about OpenAI’s training-data practices, suggesting that its AI model, GPT-4o, may have been trained on copyrighted O’Reilly Media books without authorization.
Highlights
Conducted by the AI Disclosures Project, the research indicates that GPT-4o demonstrates an unusually strong recognition of paywalled O’Reilly content, adding to ongoing discussions about AI companies’ data usage and copyright compliance.
Findings and Methodology
AI models such as GPT-4o are developed using vast datasets, including books, articles, and other text sources. While OpenAI has licensing agreements with some publishers, it has also advocated for more flexible regulations regarding AI training data.
To investigate the presence of O’Reilly’s content in GPT-4o’s training data, researchers used DE-COP, a membership-inference technique designed to detect whether AI models have been trained on specific copyrighted material. The method works by quizzing a model: it is shown a verbatim passage alongside machine-generated paraphrases and asked to pick the original, with consistently above-chance accuracy taken as evidence that the passage appeared in training data.
By this measure, GPT-4o was significantly more effective at recognizing excerpts from non-public, paywalled O’Reilly books than its predecessor, GPT-3.5 Turbo.
This suggests that OpenAI’s latest model may have had access to these books during training, despite no known licensing agreement between OpenAI and O’Reilly Media.
The AI Disclosures Project, co-founded by media executive Tim O’Reilly and economist Ilan Strauss, analyzed over 13,000 paragraphs from 34 O’Reilly books published before and after GPT-4o’s training cutoff date.
Their findings indicate that GPT-4o’s recognition of content from these books was significantly higher than that of earlier OpenAI models, even after accounting for general improvements in model capability.
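The multiple-choice test at the heart of DE-COP can be sketched in a few lines. The helpers below are illustrative only, not the study’s actual code: the function names, the stand-in model, and the scoring are assumptions about how such a quiz could be built and evaluated. A real run would send each quiz to the LLM under test and compare its guess rate against the chance baseline.

```python
import random

def build_decop_quiz(verbatim: str, paraphrases: list[str], rng: random.Random):
    """Build one DE-COP-style item: the verbatim passage hidden among
    paraphrases, returning the shuffled options and the answer index."""
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(verbatim)

def guess_rate(model_pick, quizzes):
    """Fraction of quizzes where the model picks the verbatim option.
    For text the model never saw, this should hover near chance
    (1 / number of options); a rate well above chance suggests the
    passage was present in the training data."""
    correct = sum(1 for options, answer in quizzes
                  if model_pick(options) == answer)
    return correct / len(quizzes)

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy quizzes; a real evaluation would use book paragraphs and
    # LLM-generated paraphrases, as in the study's 13,000-paragraph sample.
    quizzes = [
        build_decop_quiz(
            "the original sentence",
            ["a paraphrase", "another paraphrase", "a third paraphrase"],
            rng,
        )
        for _ in range(100)
    ]
    # Stand-in "model" that always picks option 0: with shuffled options
    # its guess rate should sit near the 1-in-4 chance baseline.
    rate = guess_rate(lambda options: 0, quizzes)
    print(f"guess rate {rate:.2f} vs chance {1 / 4:.2f}")
```

Comparing guess rates on books published before versus after a model’s training cutoff, as the study does, gives a natural control: post-cutoff books cannot have been in the training data, so their rate estimates the chance baseline.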
Uncertainties and OpenAI’s Data Practices
Despite these findings, the study acknowledges that its methodology has limitations. One possibility is that GPT-4o’s knowledge of O’Reilly content originated from users copying and pasting excerpts into ChatGPT, rather than direct exposure during training.
Additionally, the study does not examine OpenAI’s most recent models, such as GPT-4.5, leaving some uncertainty about whether similar practices have continued.
OpenAI has not responded to the study’s claims, though the company has previously faced scrutiny over its data collection methods.
While OpenAI has secured agreements with news organizations and publishers to legally acquire training data, concerns remain about the extent to which AI models may incorporate copyrighted material without explicit authorization.
Broader Ethical and Legal Debates in AI Training
The findings contribute to the larger industry discussion about AI training ethics and the use of proprietary content.
Many AI companies have begun using AI-generated data to train new models, but access to real-world data remains crucial for improving model accuracy.
The use of copyrighted materials—whether intentional or not—poses a challenge in balancing AI development with legal and ethical considerations.
As AI firms seek to expand their datasets, Elon Musk and other industry figures have highlighted the increasing reliance on synthetic AI-generated data due to the limited availability of human-created text.
This shift has raised concerns about data quality and the risk of AI hallucinations, as synthetic data may introduce inaccuracies.
O’Reilly Media’s Approach to AI and Content Usage
O’Reilly Media has maintained a clear stance on generative AI (GenAI) and ethical content use. The company emphasizes the importance of responsible AI deployment and requires authors and content creators to track the use of AI-generated content while ensuring human oversight.
O’Reilly is developing a GenAI tool trained exclusively on trusted content, including its own publications and those from verified partners. This initiative is aimed at protecting copyrights and ensuring proper attribution for content creators.
Initiatives for Public Domain AI Training Data
Amid concerns over copyright and training data accessibility, institutions such as Harvard University are working on solutions.
Harvard, with funding from OpenAI and Microsoft, has launched a project to provide a dataset of nearly one million public-domain books for AI training. This effort aims to offer AI developers a legally safe and high-quality dataset while reducing reliance on copyrighted materials.
With legal scrutiny increasing and policymakers discussing AI regulations, studies like this could influence the evolving rules around training data transparency.