
Microsoft commits to European AI language models

Microsoft is investing in European language technology with new AI initiatives focusing on multilingual models, open-source data, and cultural heritage. With GPT-NL, the Netherlands is demonstrating how this ambition is being realized locally through its own infrastructure and training data.

Microsoft announced a series of initiatives in Paris to better align AI with Europe’s linguistic and cultural diversity. Through investments in data access, cloud infrastructure, and local partnerships, the company aims to break the dominance of English-language AI systems. At the same time, the Netherlands is demonstrating how this ambition can be realized on a national scale with the GPT-NL project.

The core of Microsoft’s approach lies in improving multilingual representation within Large Language Models (LLMs). While English is spoken as a native language by only a small proportion of the world’s population, half of all web content consists of English text. That imbalance skews AI performance, particularly for models that rely on large-scale web data.

Microsoft has found that these models systematically perform worse in European languages such as Latvian, (modern) Greek, and Estonian, with accuracy differences of more than 25 percentage points. The company wants to tackle this problem by facilitating better access to high-quality, language-specific data.

To this end, Microsoft is deploying technical and organizational resources through its Open Innovation Center and the AI for Good Lab, both based in Strasbourg. The collaboration with the ICube laboratory at the University of Strasbourg will take the form of engineering capacity, Azure cloud credits, and the deployment of more than 70 specialists from Microsoft’s international network.

Multilingual datasets

The first step is to make multilingual datasets from Microsoft’s own sources available, including text corpora from GitHub and speech collections. These will be made accessible through collaboration with platforms such as Hugging Face and Common Crawl, with annotation provided by native speakers from the relevant language areas.
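
As an illustration of what such access could look like in practice, the sketch below streams a multilingual corpus via the Hugging Face datasets library. The dataset identifier and its "text" column are hypothetical placeholders, not corpora Microsoft has announced.

```python
# Minimal sketch: streaming a multilingual web corpus from the Hugging Face Hub.
# "example-org/multilingual-web-corpus" and its "nl" subset are hypothetical
# placeholders, not datasets Microsoft has published.
from datasets import load_dataset

dutch = load_dataset(
    "example-org/multilingual-web-corpus",  # hypothetical identifier
    "nl",                                   # language subset
    split="train",
    streaming=True,                         # iterate lazily instead of downloading everything
)

# Inspect a few records, e.g. before handing them to native speakers for annotation.
for i, record in enumerate(dutch):
    print(record["text"][:200])
    if i == 2:
        break
```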

From a technological perspective, Microsoft is focusing on two specific problems in training LLMs: script dependency and data quality. Many existing tokenization methods are optimized for the Latin alphabet, which leads to inaccurate segmentation of non-Latin scripts such as Cyrillic, Arabic, or Greek. This hampers what models can learn in those languages.
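
A small experiment makes that segmentation problem concrete. The sketch below uses the publicly available GPT-2 tokenizer (via the Hugging Face transformers library) as a stand-in for a Latin-optimized vocabulary; GPT-2 is an assumption chosen for illustration, not one of the models Microsoft refers to.

```python
# Token counts for comparable sentences in different scripts, using a BPE
# vocabulary learned mostly on English text (GPT-2) as a stand-in.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "The weather is nice today.",
    "Greek":   "Ο καιρός είναι καλός σήμερα.",
    "Russian": "Сегодня хорошая погода.",
}

for language, sentence in samples.items():
    n_tokens = len(tokenizer.tokenize(sentence))
    # More tokens for the same content means less effective context and a
    # harder learning problem for the non-Latin languages.
    print(f"{language:8s} {len(sentence):3d} characters -> {n_tokens:3d} tokens")
```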

Microsoft cites the development of script-independent tokenization—such as byte-level or unified token encoders—as a crucial step in reducing language-specific bias. In parallel, the company supports synthetic data generation, with an emphasis on privacy preservation and control over sensitive content.
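
To show what "byte-level" means here, the sketch below encodes text as raw UTF-8 bytes, so every script shares one fixed vocabulary of 256 values. It mirrors the idea behind byte-level encoders such as ByT5 and illustrates the principle only; it is not Microsoft's implementation.

```python
# Byte-level encoding: any script maps losslessly onto the same 256-value
# vocabulary, avoiding the Latin-centric segmentation shown above. The small
# ID offset for special tokens mimics what byte-level models like ByT5 do.
def byte_encode(text: str, offset: int = 3) -> list[int]:
    return [b + offset for b in text.encode("utf-8")]

def byte_decode(ids: list[int], offset: int = 3) -> str:
    return bytes(i - offset for i in ids).decode("utf-8")

for sentence in ["Het weer is mooi.", "Ο καιρός είναι καλός.", "Сегодня хорошая погода."]:
    ids = byte_encode(sentence)
    assert byte_decode(ids) == sentence  # round-trips for every script
    # Trade-off: no script is favoured, but non-ASCII text costs ~2 bytes per character.
    print(f"{len(sentence):2d} characters -> {len(ids):2d} byte tokens")
```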

Digitization of cultural heritage

The technical work does not stand alone. Microsoft is combining these model improvements with the digitization of cultural heritage. In collaboration with the French Ministry of Culture and the company Iconem, among others, work is underway on the digital replication of monuments such as Notre Dame. At the same time, datasets from national libraries and museums are being made available for educational and AI applications. These initiatives are the practical expression of Microsoft’s belief that AI systems are not neutral, but must serve the language, culture, and legal context in which they are used.

This approach is being implemented in the Netherlands through GPT-NL. Led by TNO, SURF, and NFI, this consortium is developing a language model specifically for the Dutch market. It was recently announced that news publishers and press agency ANP will make more than 20 billion tokens of news data available for training.

This will double the training corpus in one fell swoop. The model will be trained on legally obtained, copyright-protected data, and publishers will receive compensation for this, according to the NVJ. Technical agreements have been made to prevent source material from being traceable via model outputs. GPT-NL focuses on tasks such as summarizing, simplifying, and information extraction, and is intended as an alternative to generic, internationally trained models.