With the rise of generative artificial intelligence, tech companies are scrambling for training data to enhance their models, often taking it without permission.
A joint investigation by Proof News and Wired revealed that companies such as Apple, Nvidia, and Anthropic have used subtitles from tens of thousands of YouTube videos to train their AI models, despite YouTube’s policies against unauthorized downloading and use of its content.
These companies employed a dataset known as YouTube Subtitles, which contains transcripts from 173,536 YouTube videos across more than 48,000 channels. The videos range from educational content by Khan Academy and MIT to news outlets like The Wall Street Journal and popular creators such as MrBeast and Marques Brownlee.
Brownlee acknowledged the investigation in a post on X, noting that Apple sourced data from several companies, one of which gathered numerous transcripts from YouTube videos, including his own. He emphasized that while Apple technically isn’t at fault because it didn’t perform the scraping itself, this issue will likely persist.
Proof News also developed a tool allowing creators to check if their content is part of the dataset, which contains some videos from Quartz. Although the YouTube Subtitles dataset lacks video imagery, it does include translated subtitles in various languages such as German and Arabic.
The dataset in question was created by EleutherAI, a nonprofit AI research lab focused on promoting open science norms. It is part of the Pile, EleutherAI’s broader compilation of materials from various sources, including the European Parliament and English Wikipedia.
A spokesperson for Salesforce, one of the companies found using the dataset, stated, “The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes and was publicly available under a permissive license.”
Apple, Nvidia, and Anthropic did not immediately respond to requests for comment.
In April, YouTube CEO Neal Mohan told Bloomberg that using YouTube videos or transcripts to train AI models would be a “clear violation” of the platform’s policies. However, the New York Times reported shortly after that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.