Wikidata unlocks its own knowledge base by vectorizing its data

Initiative for the benefit of open-source AI models

To better unlock the massive amount of data present in Wikidata, Wikimedia Deutschland, the German chapter of Wikimedia and the largest of its kind, is partnering with DataStax and Berlin-based Jina AI. The goal is to convert the mountain of data into semantic vectors that are readable by nonprofit AI applications.

Wikidata is a huge, central repository of data, facts, and references, used by Wikipedia, among others. It stores structured facts such as birth dates and locations and makes them machine-readable, so that the information is usable across platforms.

However, for developers looking to unlock these 112 million-plus entries, it can be quite a task to know where to start. Moreover, sifting through all the data is time-intensive. Large companies may have the resources for that, but smaller organizations far less so.

Machine-readable

The new initiative, announced during the Open Source Summit in Vienna, is meant to simplify data analysis by translating Wikidata entries into semantic vectors that machine learning applications can read. The idea, at least, is that this improves the accuracy of AI models, because the data on Wikidata is up to date and verified.
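To make the idea of semantic vectors concrete, here is a minimal sketch in Python. It is not the project's pipeline: the `embed` function is a toy word-count stand-in for a real embedding model, and the entries and their IDs are hypothetical examples rather than real Wikidata exports. It only illustrates the principle that text becomes a vector and that similar texts end up with similar vectors.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-written stand-ins for Wikidata entries (assumption).
ENTRIES = {
    "Q1": "Marie Curie physicist born 1867 in Warsaw",
    "Q2": "Vienna capital city of Austria on the Danube",
    "Q3": "Albert Einstein physicist born 1879 in Ulm",
}


def embed(text: str) -> Counter:
    """Toy embedding: a sparse word-count vector.

    A real embedding model returns dense vectors that capture meaning,
    not just shared words. This stand-in is an assumption for illustration.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Return the k entry IDs whose vectors are closest to the query vector."""
    q = embed(query)
    scored = [(eid, cosine(q, embed(text))) for eid, text in ENTRIES.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    for entry_id, score in search("physicist born in Warsaw"):
        print(entry_id, round(score, 3))
```

In a production setup, the vectors would be computed once and stored in a vector database, so that a query only has to be embedded and compared against an index instead of re-embedding every entry.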

Only open-source models may benefit from this initiative. Thanks to input from a multitude of independent, reliable sources, they should remain a viable alternative to closed models. The data in Wikidata should eventually become available for Retrieval-Augmented Generation (RAG) as well.
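For readers unfamiliar with RAG, the sketch below shows the basic pattern in Python: retrieve the facts most relevant to a question, then place them in the prompt that is sent to a language model. The function names and the prompt wording are illustrative assumptions, and the retrieval step uses simple word overlap as a stand-in for a vector search. No model is actually called.

```python
import re

# A few facts standing in for statements retrieved from a knowledge base.
FACTS = [
    "Douglas Adams was born on 11 March 1952 in Cambridge.",
    "Vienna is the capital of Austria.",
    "The Danube flows through Vienna.",
]


def tokens(text: str) -> set[str]:
    """Lowercased word set used for the toy relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, facts: list[str], k: int = 2) -> list[str]:
    """Rank facts by word overlap with the question (stand-in for vector search)."""
    q = tokens(question)
    ranked = sorted(facts, key=lambda fact: len(q & tokens(fact)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, facts: list[str], k: int = 2) -> str:
    """Assemble a prompt that grounds the model's answer in retrieved facts."""
    context = "\n".join(f"- {fact}" for fact in retrieve(question, facts, k))
    return (
        "Answer using only the facts below.\n\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("Which river flows through Vienna?", FACTS))
```

The point of the pattern is that the model answers from supplied, verifiable facts rather than from whatever it memorized during training, which is why an up-to-date source such as Wikidata is attractive for it.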

DataStax provides the vector database technology, while Jina AI provides the embedding model that vectorizes the text-based data. The direct semantic analysis this puts within reach should improve accuracy and also make vandalism quickly visible.

Difficult to unlock at scale

The project lead is Dr. Jonathan Fraine, head of software development at Wikimedia Deutschland. As one of the main reasons for starting the project, Fraine cited scale: the data in Wikidata is available, but its sheer volume makes it difficult to access at scale. Lydia Pintscher, Portfolio Lead Product Manager for Wikidata, added that improved access to this wealth of data helps open-source AI remain a realistic alternative to commercial generative AI models. The project is scheduled to go into beta in early 2025.

The choice of DataStax as partner-in-vectoring is understandable. The company now offers a slew of tools for developing AI models, such as Langflow and RAGStack. Langflow is an existing visual framework that the San Jose-based company acquired earlier this year. The framework is also available on the DataStax Cloud platform. RAGStack is an out-of-the-box RAG solution that bundles multiple building blocks for building AI software. Langflow can also be integrated into this solution.
