Wikidata unlocks its own knowledge base by vectorizing its data

To better unlock the massive amount of data present in Wikidata, the German branch of Wikimedia (the largest of its kind) is partnering with DataStax and Berlin-based Jina AI. The goal is to convert the mountain of data into semantic vectors that are readable by nonprofit AI applications.

Wikidata is a huge, central repository of data, facts, and references, used by Wikipedia, for example. It organizes all birth dates, locations, and the like and also makes them machine-readable so that the information is usable across platforms.
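To make "machine-readable" concrete: Wikidata records each fact as a statement that links an item to a value through a property. The sketch below models that shape in Python. The `Statement` class and the `facts_about` helper are hypothetical names chosen for illustration and are not part of any Wikidata API; the identifiers Q42 (Douglas Adams), P569 (date of birth), P19 (place of birth), and Q350 (Cambridge) are existing Wikidata IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Hypothetical, simplified model of one Wikidata statement."""
    item: str   # item ID, e.g. "Q42" (Douglas Adams)
    prop: str   # property ID, e.g. "P569" (date of birth)
    value: str  # a literal value or another item ID


# A tiny hand-written sample, not a live export from Wikidata.
STATEMENTS = [
    Statement("Q42", "P569", "1952-03-11"),  # date of birth
    Statement("Q42", "P19", "Q350"),         # place of birth: Cambridge
]


def facts_about(item_id, statements):
    """Collect {property ID: value} for a single item."""
    return {s.prop: s.value for s in statements if s.item == item_id}


print(facts_about("Q42", STATEMENTS))
```

Because every fact has the same item-property-value shape, software can look up a birth date or a location without parsing free text.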

However, developers who want to tap into these 112 million-plus entries may struggle to know where to start. Sifting through all the data is also time-intensive: large companies may have the resources for it, but smaller organizations far less so.

Machine-readable

The new initiative, announced during the Open Source Summit in Vienna, is meant to simplify the data analysis process by translating Wikidata entries into semantic vectors readable by machine learning applications. This could improve the accuracy of AI models because the data on Wikidata is up to date and verified. That is the idea, at least.

Only open-source models may benefit from this initiative. Thanks to input from a multitude of independent, reliable sources, such models should then remain a viable alternative to closed models. The data present in Wikidata should eventually become available for Retrieval-Augmented Generation (RAG) as well.


DataStax provides the vector database technology, while Jina AI provides the embedding model that vectorizes the text-based data. The direct semantic analysis this makes possible should improve accuracy and also make vandalism quickly visible.
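The division of labor between an embedding model and a vector database can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the project's implementation: `embed` is a hypothetical stand-in that merely counts words over a fixed vocabulary, whereas a real embedding model produces dense learned vectors that capture meaning rather than word overlap, and `search` is a brute-force loop standing in for what a vector database does at scale.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, vocab, top_k=1):
    """Brute-force nearest-neighbor search; a vector database indexes this step."""
    query_vec = embed(query, vocab)
    scored = [(cosine(query_vec, embed(doc, vocab)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hand-written sample descriptions, not real Wikidata entries.
DOCS = [
    "douglas adams english writer born in cambridge",
    "vienna capital city of austria",
]
VOCAB = sorted({word for doc in DOCS for word in doc.split()})

print(search("writer born in cambridge", DOCS, VOCAB))
```

The query is matched against every stored vector and the closest entry is returned first; the same retrieval step is what a RAG pipeline would use to fetch relevant facts before generating an answer.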

Difficult to unlock at scale

The project lead is Dr. Jonathan Fraine, head of software development at Wikimedia Deutschland. Fraine cited scale as one of the main reasons for starting the project: the vast amount of data in Wikidata is available, but difficult to access at scale. Lydia Pintscher, Portfolio Lead Product Manager of Wikidata, added that improved access to this wealth of data helps open-source AI remain a realistic alternative to commercial generative AI models. The project is scheduled to go into beta in early 2025.

The choice of DataStax as partner-in-vectoring is understandable. The company now offers a slew of tools for developing AI models, such as Langflow and RAGStack. Langflow is a visual framework that the San Jose-based company acquired earlier this year. The framework is also available on the DataStax Cloud platform. RAGStack is an out-of-the-box RAG solution that bundles multiple building blocks for AI software. It is also possible to integrate Langflow into this solution.

