With the rise of generative artificial intelligence, tech companies are seeking training data to enhance their models — and some are doing so without permission.
Apple, Nvidia, and Anthropic have been found to have trained their AI models on subtitles from more than 170,000 YouTube videos, despite YouTube’s rules against unauthorized downloading and use of its content, according to an investigation by Proof News in collaboration with Wired.
The investigation uncovered the use of a dataset called YouTube Subtitles, comprising transcripts from 173,536 YouTube videos across more than 48,000 channels. The videos range from educational content by Khan Academy and MIT to news outlets like The Wall Street Journal, as well as popular creators such as MrBeast and Marques Brownlee.
Marques Brownlee addressed the investigation in a post on X, saying that “Apple has sourced data for their AI from several companies,” and that one of those companies scraped large amounts of data and transcripts from YouTube, including from his videos. However, he noted that “Apple technically avoids ‘fault’ here because they’re not the ones scraping,” adding that the issue will persist for a long time.
Proof News also developed a tool that lets creators check whether their content is included in the dataset, which contains a few videos from Quartz as well. While the YouTube Subtitles dataset does not include any video imagery, it does include some subtitles translated into languages such as German and Arabic.
The dataset was created by EleutherAI, a non-profit AI research lab dedicated to promoting open science. It is part of the Pile, the nonprofit’s compilation of material from a wide range of sources, including the European Parliament and English Wikipedia.
A spokesperson for Salesforce, one of the companies named in the investigation for using this dataset, stated that “The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes,” and it was publicly available under a permissive license.
Apple, Nvidia, and Anthropic have not responded to requests for comment.
In April, YouTube CEO Neal Mohan told Bloomberg that using YouTube videos, transcripts, or video snippets to train AI models, such as OpenAI’s text-to-video generator Sora, would be a clear violation of the platform’s policies. Nonetheless, The New York Times reported that OpenAI had transcribed more than a million hours of YouTube videos to train its GPT-4 model.