
Data Provenance Initiative addresses transparency issues in AI

A group of leading institutions has launched the Data Provenance Initiative to address the “crisis in data transparency and its consequences.”

Participating in the initiative are universities such as MIT and Harvard Law School, as well as tech giant Apple and the nonprofit Cohere For AI. The 12 participating parties are also immediately launching an interactive platform, the Data Provenance Explorer.

The initiative conducted a major audit of the AI datasets used to train large language models. To date, the Data Provenance Initiative has reviewed more than 1,800 popular text-to-text finetuning datasets, which have collectively been downloaded tens of millions of times.

This involved cataloging each dataset's sources, licenses, creators and other metadata. Students and journalists, for example, can use this to check how transparently an AI dataset is documented and under which terms it may be used.
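The result of this cataloging can be pictured as a simple record per dataset. The sketch below is purely illustrative: the field names and values are assumptions for the sake of the example, not the Data Provenance Explorer's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetProvenance:
    """Hypothetical provenance record for one finetuning dataset."""
    name: str                    # dataset identifier
    sources: list[str]           # where the text originally came from
    license: str                 # license declared by the dataset creators
    creators: list[str]          # people or organizations that curated it
    languages: list[str] = field(default_factory=list)  # languages covered


# Example record with made-up values, only to show the shape of the metadata.
example = DatasetProvenance(
    name="example-finetuning-set",
    sources=["web crawl", "model generated"],
    license="CC BY-SA 4.0",
    creators=["Example Lab"],
    languages=["en"],
)

# A practitioner could, for instance, check the declared license before reuse.
print(example.license)
```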

Transparency problem

With the initiative, the parties aim to address the transparency problems they have identified. “Increasingly, widely used dataset collections are treated as monolithic, instead of a lineage of data sources, scraped (or model generated), curated, and annotated, often with multiple rounds of re-packaging (and re-licensing) by successive practitioners,” the Data Provenance Initiative writes in a paper. “The disincentives to acknowledge this lineage stem both from the scale of modern data collection (the effort to properly attribute it), and the increased copyright scrutiny.”

According to the Data Provenance Initiative, this has led to fewer datasheets being published and to training sources not being disclosed at all. On the latter point, the initiative is referring in particular to OpenAI, which rose to popularity with ChatGPT but does not disclose the large datasets used to train it. Ultimately, the initiative sees the overall understanding of training data declining.

“This lack of understanding can lead to data leakages between training and test data; expose personally identifiable information (PII), present unintended biases or behaviours; and generally result in lower quality models than anticipated,” the parties conclude. In addition, they see ethical and legal risks emerging.

Tip: Trustworthy AI starts before the first line of code