There is plenty of free data available to train an English-language model. For a multilingual chatbot, however, the internet has to be scraped for longer, and the data comes from a wider variety of sources. Common Crawl provides the most extensive database, and developers use it to train their language models. Recent research shows that the Dutch portion of this data is fed primarily by a pirate site that has been found to be illegal.

ChatGPT speaks quite a few languages. The model presumably became multilingual using data freely available on the internet. Companies usually keep the composition of their training sets secret; it is unknown, for example, how the training data for GPT-3, the model behind ChatGPT, was assembled.

Multilingual datasets

What is public is Common Crawl, a database that amounts to a snapshot of much of the internet. Google derived the multilingual mC4 dataset from it, which proved far more challenging to build than its English-language counterpart. According to researchers at the tech giant, C4, the English-language dataset, could be built from the digital content available in a single snapshot from April 2019; for mC4, it was necessary to aggregate 71 monthly web scrapes from Common Crawl.

Google demonstrated the usefulness of the dataset with its Natural Language Processing (NLP) language model mT5. All the code and training sets are publicly available. The researchers motivate that choice as follows: “We release all code and pre-trained datasets used in this paper to facilitate future work on multilingualism research.”

Pirate site frontrunner

It would not be surprising if this dataset also forms the basis for GPT-3 and thus ChatGPT. As it turns out, multilingual datasets are difficult to compile and therefore scarce. De Groene Amsterdammer put this theory to the test and concluded that the mC4 dataset is in all likelihood behind OpenAI’s language model. The magazine also examined which Dutch-language websites form the basis of the training set. The top twenty contains some surprising entries, to put it mildly.

The largest source in the mC4 dataset, for example, is the controversial Dutch pirate site Docplayer, accounting for 3.6 per cent of the total. The website is a paradise for hackers, since private information such as documents containing evaluations of job applicants is freely available there. To gather this material, the site constantly scrapes the internet for files; it also hosts data from breaches, complete resumes and tax returns. The Dutch DPA and the National Cyber Security Center have both found the website to be illegal. Nevertheless, it is still up and running.

The top three also includes tripadvisor.nl (1.9%) and uitspraken.rechtspraak.nl (1.2%). Ads from private sellers made it into the dataset as well: 0.3 per cent comes from ebay.nl, which ranks eleventh, and marktplaats.nl has a 0.2 per cent share. As a result, the language model knows quite a few private phone numbers from ads on these websites.
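Percentages like these come down to a straightforward tally: count how many of the dataset’s page URLs belong to each domain. The sketch below illustrates that kind of measurement with Python’s standard library; the function name and the toy URL sample are illustrative, not taken from the actual research.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_shares(urls):
    """Return each domain's share of the total set of URLs, as a fraction."""
    counts = Counter(urlparse(u).netloc for u in urls)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}

# Toy sample standing in for the millions of page URLs in mC4's Dutch split
sample = [
    "https://docplayer.nl/doc/1",
    "https://docplayer.nl/doc/2",
    "https://www.tripadvisor.nl/review/1",
    "https://uitspraken.rechtspraak.nl/case/1",
]
print(domain_shares(sample)["docplayer.nl"])  # 0.5 for this toy sample
```

Run over the full dataset, a tally like this is what yields figures such as docplayer.nl’s 3.6 per cent share.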

On top of this, the dataset absorbs a lot of content from websites brimming with disinformation. The researchers found, for example, the neo-Nazi website Stormfront, the conspiracy site Vrijspeker and the anti-Islamic, Europhobic blog E.J. Bron.

Leaky quality filter

For the companies behind chatbots, non-English-language websites are difficult to check for reliability and relevance. Language models are usually developed in the United States, where the researchers are primarily English-speaking. They cannot judge which Dutch website definitely belongs in the dataset and which should be left out.

Moreover, the number of Dutch-language websites on the global internet is relatively small. A well-trained chatbot requires sufficient training material, and you cannot get that by including only the most reputable Dutch websites.

Put it all together and you get the problem that non-English NLP language models are trained on datasets full of disinformation, private data and copyrighted content. A mix of these elements ends up in a chatbot’s answers: language models reproduce the information they were trained on to understand your question and respond to your prompt.

OpenAI tried to address the problem by teaching the model to filter by source quality. Texts that score well according to the model are then used more often in the training material. The rules for rating a text as “good” were designed by the researchers themselves. As a result, GPT-3 draws more often on Wikipedia, on websites that are widely shared on the social media platform Reddit, and on a collection of books. Exactly which books are involved is not known.
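To make the idea of such a filter concrete, here is a minimal sketch of a heuristic quality score in Python. This is purely illustrative: OpenAI’s actual filter is a learned classifier whose details are not public, and the scoring rules and thresholds below are invented for the example.

```python
import re

def quality_score(text):
    """Crude heuristic score: the share of alphabetic characters in the
    text, damped for very short documents. Illustrative only - the real
    GPT-3 filter is a trained model, not a hand-written rule like this."""
    if not text:
        return 0.0
    words = re.findall(r"[A-Za-zÀ-ÿ]+", text)
    alpha_share = sum(len(w) for w in words) / len(text)
    length_bonus = min(len(words) / 100, 1.0)  # short docs score lower
    return round(alpha_share * length_bonus, 3)

def keep(text, threshold=0.05):
    """Admit a document into the training set if it scores above threshold."""
    return quality_score(text) >= threshold
```

A filter built this way inherits its designers’ assumptions about what “good” text looks like, which is exactly how biases toward particular sources and registers creep in.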

However, the study shows that for Dutch-language websites the filter is as leaky as a sieve. Otherwise, news media and other reliable sources of information would rank above docplayer.nl. Moreover, the filter proved to favor texts from a white, educated, American elite.

No pushback?

Can a major American company like OpenAI get away with this unscathed? After all, piracy and privacy violations should not go unpunished. The Dutch Data Protection Authority (DPA) is already sounding the alarm and has sent OpenAI a letter asking for more clarity about ChatGPT. In its own words: “Among other things, the DPA wants to know how OpenAI handles personal data when training the underlying system.”

The authority questions whether personal data appears in the training sets and what happens to questions (prompts) that contain personal information. Generative AI, including this language model, refines itself further by incorporating prompts into its training material. Finally, the DPA raises concerns about generated responses to questions about other people. “The generated content may be inaccurate, outdated, misleading, inappropriate, humiliating or offensive and may take on a life of its own,” the DPA states.

The DPA’s letter has only just been sent, and it remains to be seen whether and how OpenAI will respond. The study does underline the importance of regulating artificial intelligence. By the end of 2023, the European Union’s AI Act should be a reality, which should slow the spread of disinformation and personal data through AI-generated content. From then on, OpenAI will no longer be able to get away with piracy and privacy violations.

Also read: Are Google and OpenAI the right partners to regulate AI?