Tech Giants Acquire YouTube Data for AI: Unauthorized Use?

The rise of generative artificial intelligence has tech companies on the hunt for data to refine their models, often acquiring it without permission.

A new investigation by Proof News, co-published with Wired, reveals that Apple, Nvidia, and Anthropic have used captions from thousands of YouTube videos to train AI models, despite the platform’s rules against unauthorized content downloading.

The companies utilized a dataset called YouTube Subtitles, containing transcripts of 173,536 YouTube videos from over 48,000 channels. These videos range from educational sources like Khan Academy and MIT, to news organizations such as The Wall Street Journal, and prominent creators including MrBeast and Marques Brownlee.

Commenting on the investigation on the social media platform X, Brownlee wrote, “Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine.” He added that although Apple is technically not at fault because it did not do the scraping itself, the issue is likely to persist.

Proof News provided a tool allowing creators to check if their content was part of the dataset, which included videos from a small number of outlets like Quartz. While the YouTube Subtitles dataset lacks video imagery, it does incorporate translated subtitles in languages such as German and Arabic.

The dataset was developed by EleutherAI, a non-profit AI research lab that promotes open science, and forms part of the lab's larger compilation of diverse sources, known as the Pile. The Pile also includes material from the European Parliament and English Wikipedia.

A spokesperson for Salesforce, which also used the Pile, said the company did so for academic and research purposes and that the dataset was publicly available under a permissive license.

Apple, Nvidia, and Anthropic did not immediately comment on the investigation.

In April, YouTube CEO Neal Mohan stated that using YouTube videos, including their transcripts, to train AI models like OpenAI’s Sora constitutes a clear policy violation. Yet, the New York Times reported that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.
