Amazon is launching the MASSIVE dataset, with which it aims to raise Natural Language Understanding (NLU) to a higher level. Companies can use the dataset to help virtual assistants understand languages for which little training data exists.

MASSIVE is a parallel dataset, Amazon explains: it contains 1 million labelled utterances spanning 51 different languages, with the same utterances translated across all of them. Many of these are languages for which labelled data is scarce. With MASSIVE, developers get a dataset to train AI models with broader applicability. The goal is to support natural speech in less widely spoken languages at the same level already achieved for the widely spoken ones.
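To make "parallel" concrete, here is a minimal sketch of what records in such a dataset might look like: the same labelled utterance expressed in several locales, sharing one intent label. The field names (`id`, `locale`, `utt`, `intent`) and the example utterances are illustrative assumptions, not MASSIVE's documented schema.

```python
# Sketch of a parallel NLU dataset: one labelled utterance
# rendered in several locales. All field names are illustrative.
records = [
    {"id": 42, "locale": "en-US", "utt": "wake me up at seven", "intent": "alarm_set"},
    {"id": 42, "locale": "de-DE", "utt": "weck mich um sieben", "intent": "alarm_set"},
    {"id": 42, "locale": "nl-NL", "utt": "maak me om zeven uur wakker", "intent": "alarm_set"},
]

def parallel_versions(records, utterance_id):
    """Return every translation of one utterance, keyed by locale."""
    return {r["locale"]: r["utt"] for r in records if r["id"] == utterance_id}

# Because the dataset is parallel, the intent label is shared across
# locales, so a model can learn many surface forms for one meaning.
print(parallel_versions(records, 42))
```

The shared `intent` label across locales is what lets a single multilingual model be trained on all languages at once.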

Tackling the lack of training data

Models can reach that level with massively multilingual natural language understanding (MMNLU). An MMNLU model parses and understands inputs in many languages at once, and can transfer knowledge from languages with abundant training data to languages for which little data exists.

Amazon calls MASSIVE particularly suitable for improving spoken-language understanding, in other words, converting audio to text before applying NLU. Virtual assistants often use spoken-language understanding to interpret voice commands, but support only a limited number of languages due to the lack of training data.
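The two-stage pipeline described above can be sketched as speech-to-text followed by NLU. The functions below are hypothetical stand-ins for real ASR and NLU models, and the intent name is an illustrative assumption; only the pipeline shape reflects the article.

```python
from typing import Dict

def speech_to_text(audio: bytes, locale: str) -> str:
    """Hypothetical ASR stage: convert audio to a transcript.
    A real system would run a speech-recognition model here."""
    # Stand-in: pretend the audio decodes to a fixed command.
    return "turn off the lights"

def nlu(transcript: str, locale: str) -> Dict[str, str]:
    """Hypothetical NLU stage: map a transcript to an intent.
    A real system would run a (multilingual) NLU model here."""
    if "lights" in transcript and "off" in transcript:
        return {"intent": "lights_off", "locale": locale}
    return {"intent": "unknown", "locale": locale}

def understand(audio: bytes, locale: str) -> Dict[str, str]:
    """Spoken-language understanding: ASR followed by NLU."""
    return nlu(speech_to_text(audio, locale), locale)

print(understand(b"<audio bytes>", "en-US"))
```

A parallel dataset like MASSIVE targets the second stage: the NLU model can be trained on the same labelled intents across all 51 languages.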

MASSIVE is intended to address that data shortage with its 1 million utterances. Professional translators converted and localised the utterances to build the dataset. Models developed on it should eventually generalise easily to new languages.

Amazon has made MASSIVE immediately available on GitHub, along with accompanying tools. In addition, Amazon is launching a competition to encourage use of the dataset and the training of models with it.
