Nvidia combines speech, vision, and text in new AI model

Nvidia is introducing a new AI model that combines multiple forms of input into a single system. With the launch of Nvidia Nemotron 3 Nano Omni, the company is focusing on so-called multimodal AI: the simultaneous processing of text, audio, and visual information.

The model is designed for use in AI agents that perform tasks autonomously. According to the announcement, combining different data streams should enable such systems to reason more effectively and understand context better. Instead of using separate models for speech, image, and text, Nvidia is attempting to integrate these functions into a single architecture.

Nemotron 3 Nano Omni stands out because it is relatively compact compared to larger multimodal models. The company is thus targeting applications where efficiency and deployability in production environments are key. Developers can adapt the model to specific use cases, which aligns with a broader trend in which companies want more control over their AI infrastructure.

The integration of multiple modalities is intended to simplify processes. In practical scenarios, this could mean, for example, that a system analyzes audio clips, documents, and video footage simultaneously without requiring separate pipelines. This can reduce the complexity of implementations and potentially lower latency as well.

Performance and claims still to be verified

According to Nvidia, the model is optimized for performance in such combined tasks. The company highlights improvements in speed and accuracy compared to previous generations. Independent benchmarks and broader evaluations will need to determine how well these claims hold up across different applications.

The introduction of Nemotron 3 Nano Omni fits into a broader trend in which AI models are increasingly becoming multimodal. Major technology companies are investing in systems that are no longer limited to a single type of input but combine multiple information sources to achieve better results. With this model, Nvidia is explicitly seeking to position itself in that arena, with a focus on practical usability rather than scale alone.
