OpenAI asks willing third parties for data that is not available online

OpenAI wants to improve the quality of the data used to train its LLMs. This dataset should be as broad as possible. To this end, the AI giant now wants to partner with public and private parties through its Data Partnerships program. Partners will receive no reward.

According to OpenAI, the quality of the data on which its models are trained is extremely important for making AI safe and ensuring that the technology is suitable for everyone to use.

According to the AI giant, its models must properly “understand” all subjects, business sectors, cultures and languages. They must therefore be trained on the widest possible dataset.

To this end, OpenAI is now actively seeking the support of public and private third parties to build this very broad training dataset for its AI models. Providing this data, the AI giant says, can ensure that its models know more about these parties’ specific domains.

Specific ‘human’ data

There are conditions attached to the supplied data, however. More specifically, within the Data Partnerships program, OpenAI is looking for data that concerns “human society” and is not currently publicly available online, such as text, images, audio or video. It is especially interested in data that captures human expression, such as longer texts or conversations rather than short snippets or sound bites. The data can be in any language, on any topic and in any format.

OpenAI says it can help parties digitize these sources and data. Among other things, it offers OCR for printed text and ASR for spoken words. However, datasets must not contain sensitive or personal information, or information owned by a third party.

Data can remain private

Potential partners can participate in the OpenAI Data Partnerships program in two ways. The first is through an Open-Source Archive, in which partners help the AI giant create an open-source dataset for training LLMs. OpenAI would also like to use this data to explore how it can safely train open-source models.

The second way is through Private Datasets. These datasets are used to train the AI giant’s own AI models, including foundation models such as GPT-4 and GPT-3.5, as well as fine-tuned and custom models.

In this process, the partner’s supplied data remains private but is used to gain more knowledge about the partner’s specific domain. According to OpenAI, the partner itself may also benefit in time, if a custom language model launches. Other than that, there is nothing in it for partners. OpenAI therefore depends mainly on willing parties who are happy to share their data free of charge and, in the case of private datasets, with OpenAI alone.
