The model can customize voices and generate sounds, but Nvidia has no plans to make the technology available anytime soon.
Fugatto, in full Foundational Generative Audio Transformer Opus 1, promises to be especially valuable to music, game, and movie producers. Nvidia is thus entering a competitive market in which several players are already focusing on generating audio or video based on prompts. Earlier this year, for example, OpenAI unveiled the not-yet-available Sora model, which can create videos based on text. Nvidia’s model, however, stands out for its more advanced capabilities.
Fugatto can substantially transform an audio file. A piano recording, for example, can be converted into audio that sounds like a man singing, and the model can also change the accent of a spoken message. To train the new model, Nvidia used open-source data.
However, whether and how the model will reach the market is unknown. This is partly due to concerns about misuse of audio and video models. “Any generative technology always carries some risks, because people might use that to generate things that we would prefer they don’t,” said Nvidia vice president of applied deep learning research Bryan Catanzaro. “We need to be careful about that, which is why we don’t have immediate plans to release this.”