OpenAI wants to improve the quality of the data used to train its large language models (LLMs) and make that data set as broad as possible. To this end, the AI giant is looking to partner with public and private parties through its Data Partnerships program. Partners will receive no reward.
According to OpenAI, the quality of the data on which its models are trained is extremely important for making AI safe and ensuring that the technology is suitable for everyone to use.
According to the AI tech giant, AI models must properly “understand” all subjects, business sectors, cultures and languages. These models must, therefore, be trained on the widest possible data set.
To this end, OpenAI is now actively seeking the support of public and private third parties to supply this very broad training data for its AI models. By providing this data, the AI giant says, these parties can help ensure that its models know more about their specific domains.
Specific ‘human’ data
There are conditions attached to the supplied data, however. More specifically, within the Data Partnerships program, OpenAI is looking for data that concerns “human society” and is not currently publicly available online. Think of text, images, audio or video. The company is especially interested in data that captures ‘human expression,’ such as longer texts or conversations rather than short snippets or sound bites. The data can be in any language, on any topic and in any format.
OpenAI says it can help parties digitize these sources and data. Among other things, it offers optical character recognition (OCR) and automatic speech recognition (ASR) services for printed texts and spoken words. However, datasets must not contain sensitive or personal information, or information owned by a third party.
Data can remain private
Potential partners can participate in the OpenAI Data Partnerships program in two ways. The first is through an Open-Source Archive, in which partners help the AI giant create an open-source dataset for training LLMs. OpenAI would also like to use this data to explore how it can safely train models on open-source datasets.
The second way is through Private Datasets. These datasets are used to train the AI giant’s own AI models, including foundation models such as GPT-4 and GPT-3.5, as well as fine-tuned and custom models.
In this process, the partner’s supplied data remains private but is used to give the models more knowledge of the partner’s specific domain. According to OpenAI, the partner itself may also benefit, in time, if a custom language model launches. Other than that, there is nothing in it for partners. OpenAI therefore depends mainly on willing souls who are happy to share their data for free and, in the case of private datasets, with OpenAI alone.