As generative artificial intelligence gains momentum, tech companies are searching for training data to improve their models, at times using data without authorization.
Apple, Nvidia, and Anthropic are among firms that have reportedly trained AI models using subtitles from tens of thousands of YouTube videos, contrary to the platform’s rules against unauthorized content use, according to an investigation by Proof News co-published with Wired.
The investigation revealed that these companies utilized a dataset called YouTube Subtitles, incorporating transcripts from 173,536 YouTube videos across over 48,000 channels. This dataset includes content from educational sources like Khan Academy and MIT, news outlets such as The Wall Street Journal, and popular creators including MrBeast and Marques Brownlee.
“Apple has sourced data for their AI from several companies,” Brownlee stated in a post on X in response to the investigation. “One of them scraped tons of data/transcripts from YouTube videos, including mine.”
Brownlee added that while "Apple technically avoids 'fault' here because they're not the ones scraping," the situation "is going to be an evolving problem for a long time."
Proof News also developed a tool for creators to search for their content in the dataset, which included some videos from Quartz. The YouTube Subtitles dataset lacks video imagery but does contain translated subtitles in languages such as German and Arabic.
The dataset was created by EleutherAI, a non-profit AI research lab that aims to promote open science norms. It is part of the Pile, the lab's broader compilation of materials from other sources such as the European Parliament and English Wikipedia, Proof News reported.
A spokesperson for Salesforce, one of the companies identified in the investigation, stated, "The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes. The dataset was publicly available and released under a permissive license."
Apple, Nvidia, and Anthropic did not immediately respond to requests for comment.
In April, YouTube CEO Neal Mohan informed Bloomberg that using YouTube videos, including transcripts or excerpts, to train AI models like OpenAI’s text-to-video generator, Sora, would be a “clear violation” of the platform’s policies. Nonetheless, the New York Times reported shortly after that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.