OpenAI trained GPT-4 on one million hours of audio from YouTube videos. The AI giant did not ask Google’s permission to do so. Google, however, did not object, reportedly because it uses YouTube to train its own LLMs.
In 2021, OpenAI was running short of reliable English-language data available online to train its newest LLM, GPT-4. According to The New York Times, OpenAI decided to tap new data sources, specifically YouTube videos.
To do so, OpenAI developed its speech recognition model Whisper and used it to transcribe about a million hours of audio from YouTube videos. The resulting text was then used to train GPT-4.
Scraping was ‘fair use’
The OpenAI team that collected the data from YouTube videos included Greg Brockman, the company’s co-founder and president. Although several employees objected to this ‘illegal’ data collection, the team went ahead anyway.
Sources told the American newspaper that OpenAI considered the scraping ‘fair use’, and thus permissible, even though it went against copyright and Google’s terms of use.
No objection from Google
Remarkably, Google has not objected to the use of YouTube to train GPT-4, although it recently indicated that using YouTube videos to train the AI video model Sora would most certainly violate the video service’s terms of use.
According to The New York Times, Google also uses data from YouTube to train its own models. Google is even said to have recently stretched the terms of use of several services, giving it access to public material for training its own LLMs. This material includes public documents in Google Docs, restaurant reviews on Google Maps and YouTube videos.
Race for new data sources
The scraping of YouTube videos for training LLMs shows that big AI companies are frantically looking for new training data and are getting increasingly creative in doing so. For example, Meta is said to have considered acquiring major U.S. publisher Simon & Schuster to obtain data from its portfolio. Meta is also said to be collecting copyrighted data from across the internet, even though this could lead to lawsuits.
Google, for its part, has struck a deal with Reddit to use content from that platform to train its AI models.