Amazon is launching the MASSIVE dataset, with which it aims to raise natural language understanding (NLU) to a higher level. Companies can use the dataset to help virtual assistants understand less widely spoken languages.
MASSIVE is a parallel dataset, Amazon explains: it contains 1 million utterances spanning 51 languages, with every utterance available in each of those languages. Labelled data is often scarce for many of these languages. MASSIVE is meant to give developers a dataset for training AI models with broader applicability. The goal is to bring natural language support for less widely spoken languages to the level already achieved for widely spoken ones.
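To make the idea of a parallel dataset concrete, here is a minimal Python sketch. The records, the field names (`id`, `locale`, `intent`, `utt`) and the helper functions are assumptions made for illustration, not Amazon's published schema or tooling. The sketch groups utterances by a shared identifier and checks that each one exists in every listed language.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative assumptions,
# not taken from the actual MASSIVE files.
records = [
    {"id": "1", "locale": "en-US", "intent": "alarm_set",
     "utt": "wake me up at nine am on friday"},
    {"id": "1", "locale": "de-DE", "intent": "alarm_set",
     "utt": "weck mich am freitag um neun uhr"},
    {"id": "2", "locale": "en-US", "intent": "weather_query",
     "utt": "what is the weather today"},
    {"id": "2", "locale": "de-DE", "intent": "weather_query",
     "utt": "wie ist das wetter heute"},
]


def group_parallel(rows):
    """Map each utterance id to a {locale: utterance text} dictionary."""
    parallel = defaultdict(dict)
    for row in rows:
        parallel[row["id"]][row["locale"]] = row["utt"]
    return dict(parallel)


def is_fully_parallel(parallel, locales):
    """Return True if every utterance id is present in every given locale."""
    expected = set(locales)
    return all(set(by_locale) == expected for by_locale in parallel.values())


parallel = group_parallel(records)
fully_parallel = is_fully_parallel(parallel, ["en-US", "de-DE"])
```

Because each identifier maps to the same request in every language, a label such as the intent only has to be defined once and can then be reused for all translations of that utterance.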
Tackling the lack of training data
Models can reach that level with massively multilingual natural language understanding (MMNLU). With MMNLU, a single model parses and understands inputs in many different languages. The model can also transfer what it learns from languages with abundant training data to languages for which little data exists.
Amazon calls MASSIVE particularly suitable for improving spoken-language understanding (SLU), in which audio is converted to text before NLU is applied. Virtual assistants often rely on SLU to interpret voice commands, but they support only a limited number of languages because training data is scarce.
Amazon hopes that MASSIVE, with its 1 million utterances, will address this shortage of data. Professional translators translated and localised the utterances to build the dataset. Models developed with it should eventually generalise easily to new languages.
Amazon has made MASSIVE available immediately on GitHub, along with the accompanying tools. In addition, Amazon is launching a competition to encourage developers to use the dataset and train models with it.