Sites can now block OpenAI data scraping, but should they?

OpenAI has revealed how others can identify its web crawler. From now on, sites can block the GPTBot user agent if they wish. By doing so, they can help ensure their content is not used to train a future OpenAI LLM, but is that advisable?

The documentation states that OpenAI uses the GPTBot agent to “improve future models,” and that paywalled sites are excluded from data collection. However, according to VentureBeat, sites such as The Verge and Clarkesworld have already started blocking the crawler. This is done by adding a directive to the site’s robots.txt file that keeps the bot from indexing its pages. Outlets can also choose to leave certain sections open to data gathering while keeping more valuable information out.
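According to OpenAI’s documentation, blocking GPTBot entirely comes down to two lines in robots.txt; the partial variant below uses placeholder directory names to illustrate the selective approach the outlets are taking:

```
# Block GPTBot from the entire site
User-agent: GPTBot
Disallow: /

# Or, selectively: allow some sections and block others
# (directory names here are placeholders)
User-agent: GPTBot
Allow: /public-articles/
Disallow: /premium/
```

Note that robots.txt is a voluntary convention: it signals a site’s wishes to well-behaved crawlers but does not technically enforce anything.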

Earnings model

Sam Altman’s company relies heavily on external data to train AI models. For the general public, that large data set is accessible through ChatGPT’s broad knowledge base, even though the chatbot’s knowledge ends in September 2021.

However, OpenAI has been busy in recent months signing contracts with numerous parties, from Microsoft to Salesforce, BuzzFeed and Atlassian. The knowledge that the GPT models house generates a lot of revenue. So as a website, you have to ask yourself whether you want to aid OpenAI in its pursuit of profitability.

What do you have to lose?

Those who block GPTBot will not notice anything at first. A GPT-4 application without Internet access relies on a data set that is already fixed, so blocking the crawler changes nothing for it. However, there are also GPT-based applications, such as Bing Chat, that can access the web. In that case, the tool is essentially a search engine. That happens to be one use case that shows the potential value a website can get from such a bot.

After all, news organizations, web shops and just about all online platforms are largely dependent on Google. The discoverability of news is so vital that Canadian outlets opposed the possible blocking of Google News a while ago.

Tip: Google won’t pay for Canadian news links

Thus, with the help of generative AI, a search engine may take on similar importance for websites in the near future. This will force parties to engage with the likes of OpenAI or Google, which can use the sites’ information for other purposes. The filtering of paywalled information and personal data already acts as a (relatively minor) restriction on the freedom of data collection.

In short, you have to ask yourself whether you should ban OpenAI on principle. An organization of a certain size may instead choose to make a deal with the company, perhaps even gaining access to more data in return. The Associated Press recently announced it had chosen this strategy in a cooperation with OpenAI. The advantage for the latter is that it can increasingly rely on reputable sources, as opposed to questionable information from platforms such as Reddit and X.

Also read: OpenAI may use Associated Press archives for AI training