Tech Giants Under Fire for Unauthorized Data Usage

With the rise of generative artificial intelligence, tech companies are on the lookout for training data to enhance their models, often taking it without permission.

An investigation by Proof News, in collaboration with Wired, revealed that firms including Apple, Nvidia, and Anthropic have used subtitles from thousands of YouTube videos to train AI models, despite YouTube's rules against using its content this way without permission.

The investigation found that the companies used a dataset called YouTube Subtitles, which contains transcripts from 173,536 YouTube videos drawn from more than 48,000 channels. The videos range from educational content from Khan Academy and MIT to news outlets like The Wall Street Journal and popular creators such as MrBeast and Marques Brownlee.

Marques Brownlee addressed the issue on social media, noting, “Apple has sourced data for their AI from several companies, one of which scraped transcripts from YouTube videos, including mine.” He pointed out that while Apple technically isn’t at fault as they did not directly scrape the data, this issue will likely persist for a long time.

Proof News also developed a tool that lets creators check whether their content is in the dataset, which includes a few Quartz videos. The dataset contains subtitles, including translations into languages like German and Arabic, but no video imagery.

The YouTube Subtitles dataset was created by EleutherAI, a nonprofit AI research lab dedicated to open science norms. The dataset is part of the nonprofit's broader data compilation known as the Pile, which also includes material from sources like the European Parliament and English Wikipedia.

A Salesforce spokesperson stated that the Pile dataset referenced in the research was used to train an AI model in 2021 for academic and research purposes, and that the dataset was publicly available under a permissive license.

Apple, Nvidia, and Anthropic have not yet responded to requests for comment.

Earlier this year, YouTube CEO Neal Mohan emphasized that using YouTube content to train AI models without permission is a “clear violation” of the platform’s policies. Despite this, reports indicated that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.
