
Microsoft commits to European AI language models

Microsoft is investing in European language technology with new AI initiatives focusing on multilingual models, open-source data, and cultural heritage. With GPT-NL, the Netherlands is demonstrating how this ambition is being realized locally through its own infrastructure and training data.

Microsoft announced a series of initiatives in Paris to better align AI with Europe’s linguistic and cultural diversity. Through investments in data access, cloud infrastructure, and local partnerships, the company aims to break the dominance of English-language AI systems. At the same time, the Netherlands is demonstrating how this ambition can be realized on a national scale with the GPT-NL project.

The core of Microsoft’s approach lies in improving multilingual representation within Large Language Models (LLMs). While English is spoken as a native language by only a small proportion of the world’s population, half of all web content consists of English text. That imbalance skews AI performance, particularly for models that rely on large-scale web data.

Microsoft has found that these models systematically perform worse in European languages such as Latvian, (modern) Greek, and Estonian, with accuracy differences of more than 25 percentage points. The company wants to tackle this problem by facilitating better access to high-quality, language-specific data.

To this end, Microsoft is deploying technical and organizational resources through its Open Innovation Center and the AI for Good Lab, both based in Strasbourg. The collaboration with the ICube laboratory at the University of Strasbourg will take the form of engineering capacity, Azure cloud credits, and the deployment of more than 70 specialists from Microsoft’s international network.

Multilingual datasets

The first step is to make multilingual datasets from Microsoft’s own sources available, including text corpora from GitHub and speech collections. These will be made accessible through collaboration with platforms such as Hugging Face and Common Crawl, with annotation provided by native speakers from the relevant language areas.
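
As an illustration of what such access could look like in practice, the sketch below streams a multilingual corpus via the Hugging Face datasets library. The dataset identifier and its "text" column are hypothetical placeholders, not corpora Microsoft has announced.

```python
# Minimal sketch: streaming a multilingual web corpus from the Hugging Face Hub.
# "example-org/multilingual-web-corpus" and its "nl" subset are hypothetical
# placeholders, not datasets Microsoft has published.
from datasets import load_dataset

dutch = load_dataset(
    "example-org/multilingual-web-corpus",  # hypothetical identifier
    "nl",                                   # language subset
    split="train",
    streaming=True,                         # iterate lazily instead of downloading everything
)

# Inspect a few records, e.g. before handing them to native speakers for annotation.
for i, record in enumerate(dutch):
    print(record["text"][:200])
    if i == 2:
        break
```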

From a technological perspective, Microsoft is focusing on two specific problems in training LLMs: script dependency and data quality. Many existing tokenization methods are optimized for the Latin alphabet, which leads to inaccurate segmentation of non-Latin scripts such as Cyrillic, Arabic, or Greek. This hampers what models can learn in those languages.
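
A small experiment makes that segmentation problem concrete. The sketch below uses the publicly available GPT-2 tokenizer (via the Hugging Face transformers library) as a stand-in for a Latin-optimized vocabulary; GPT-2 is an assumption chosen for illustration, not one of the models Microsoft refers to.

```python
# Token counts for comparable sentences in different scripts, using a BPE
# vocabulary learned mostly on English text (GPT-2) as a stand-in.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "The weather is nice today.",
    "Greek":   "Ο καιρός είναι καλός σήμερα.",
    "Russian": "Сегодня хорошая погода.",
}

for language, sentence in samples.items():
    n_tokens = len(tokenizer.tokenize(sentence))
    # More tokens for the same content means less effective context and a
    # harder learning problem for the non-Latin languages.
    print(f"{language:8s} {len(sentence):3d} characters -> {n_tokens:3d} tokens")
```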

Microsoft cites the development of script-independent tokenization—such as byte-level or unified token encoders—as a crucial step in reducing language-specific bias. In parallel, the company supports synthetic data generation, with an emphasis on privacy preservation and control over sensitive content.
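
To show what "byte-level" means here, the sketch below encodes text as raw UTF-8 bytes, so every script shares one fixed vocabulary of 256 values. It mirrors the idea behind byte-level encoders such as ByT5 and illustrates the principle only; it is not Microsoft's implementation.

```python
# Byte-level encoding: any script maps losslessly onto the same 256-value
# vocabulary, avoiding the Latin-centric segmentation shown above. The small
# ID offset for special tokens mimics what byte-level models like ByT5 do.
def byte_encode(text: str, offset: int = 3) -> list[int]:
    return [b + offset for b in text.encode("utf-8")]

def byte_decode(ids: list[int], offset: int = 3) -> str:
    return bytes(i - offset for i in ids).decode("utf-8")

for sentence in ["Het weer is mooi.", "Ο καιρός είναι καλός.", "Сегодня хорошая погода."]:
    ids = byte_encode(sentence)
    assert byte_decode(ids) == sentence  # round-trips for every script
    # Trade-off: no script is favoured, but non-ASCII text costs ~2 bytes per character.
    print(f"{len(sentence):2d} characters -> {len(ids):2d} byte tokens")
```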

Digitization of cultural heritage

The technical work does not stand alone. Microsoft is combining these model improvements with the digitization of cultural heritage. In collaboration with the French Ministry of Culture and the company Iconem, among others, work is underway on the digital replication of monuments such as Notre Dame. At the same time, datasets from national libraries and museums are being made available for educational and AI applications. These initiatives are the practical expression of Microsoft’s belief that AI systems are not neutral, but must serve the language, culture, and legal context in which they are used.

This approach is being implemented in the Netherlands through GPT-NL. Led by TNO, SURF, and NFI, this consortium is developing a language model specifically for the Dutch market. It was recently announced that news publishers and press agency ANP will make more than 20 billion tokens of news data available for training.

This will double the training corpus in one fell swoop. The model will be trained on legally obtained, copyright-protected data, and publishers will receive compensation for this, according to the NVJ. Technical agreements have been made to prevent source material from being traceable via model outputs. GPT-NL focuses on tasks such as summarizing, simplifying, and information extraction, and is intended as an alternative to generic, internationally trained models.