OpenAI trained GPT-4 on a million hours of YouTube audio

OpenAI trained GPT-4 on about a million hours of audio from YouTube videos. The AI giant did not ask Google’s permission to do so. Google, however, did not object, reportedly because it uses YouTube to train its own LLMs.

In 2021, OpenAI was running short of reliable English-language data available online for training its newest LLM, GPT-4. According to The New York Times, the company decided to tap into new data sources, specifically YouTube videos.

To do so, OpenAI developed its speech recognition model Whisper and used it to transcribe about a million hours of audio from YouTube videos. The resulting text transcripts were then used to train GPT-4.

Scraping was ‘fair use’

The OpenAI team that collected the data from YouTube videos included Greg Brockman, co-founder and president of the AI company. Although several employees objected that the data collection was ‘illegal’, the team went ahead anyway.

Sources told the American newspaper that OpenAI took the position that, even though scraping YouTube videos ran counter to Google’s copyright and terms of use, it still qualified as ‘fair use’ and was therefore permissible.

No objection from Google

Remarkably, Google itself has not objected to the use of YouTube to train GPT-4, although it recently indicated that using YouTube videos to train the AI video model Sora would most certainly violate the video service’s terms of use.

Tip: Confusion about training data model Sora that generates videos

According to The New York Times, the tech giant also uses data from YouTube to train its own models. Google is even said to have recently broadened the terms of use of several services, giving it access to public material for training its own LLMs. Examples include public documents in Google Docs, restaurant reviews on Google Maps, and YouTube videos.

Race for new data sources

The scraping of YouTube videos for training LLMs shows that big AI companies are frantically looking for new training data and are getting increasingly creative in doing so. For example, Meta is said to have considered acquiring major U.S. publisher Simon & Schuster to obtain data from its portfolio. The company is also said to be collecting copyrighted data from across the internet, even though this could lead to lawsuits.

Google, for its part, has struck a deal with Reddit to use content from the platform to train its AI models.

Also read: Google pays $60 million annually for content on Reddit through AI deal
