Amid the surge in generative artificial intelligence, tech companies are actively seeking training data to enhance their models, with some appropriating the data without proper consent.
Major tech firms including Apple, Nvidia, and Anthropic used subtitles from thousands of YouTube videos to train their AI models, despite YouTube’s rules against downloading and using its content without authorization, according to an investigation by Proof News and Wired.
The investigation revealed that these companies utilized a dataset named YouTube Subtitles, containing transcripts from 173,536 YouTube videos across over 48,000 channels. This dataset included content from educational channels like Khan Academy and MIT, news outlets such as The Wall Street Journal, and prominent YouTube creators including MrBeast and Marques Brownlee.
“Apple has sourced data for their AI from several entities,” Brownlee mentioned in a post on X, responding to the investigation. “Some of them harvested a vast amount of data/transcripts from YouTube videos, including mine.” He further noted that even though “Apple technically avoids ‘fault’ here because they’re not the ones scraping,” this issue is likely to persist and evolve over time.
Proof News has also developed a tool that lets creators search for their content within the dataset, which includes a few videos from Quartz. Although the YouTube Subtitles dataset does not include video imagery, it does contain some translated subtitles in languages such as German and Arabic.
The dataset was created by EleutherAI, a nonprofit AI research lab dedicated to promoting open science norms, and is part of the nonprofit’s broader compilation of materials, known as the Pile, which also includes data from the European Parliament and English Wikipedia, according to Proof News.
A representative from Salesforce, one of the companies listed in the investigation, stated, “The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes. The dataset was publicly available and released under a permissive license.”
Apple, Nvidia, and Anthropic did not immediately respond to requests for comment.
In April, YouTube CEO Neal Mohan told Bloomberg that using YouTube videos, including transcripts or video snippets, to train AI models such as OpenAI’s text-to-video generator, Sora, clearly violates the platform’s policies. However, the New York Times reported days later that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.