Microsoft Magma brings agentic AI to robotic systems

Microsoft Research presents Magma, an integrated AI base model that combines visual and language processing to control software interfaces and robotic systems.

Ars Technica reports. If the results hold up beyond Microsoft's internal testing, it would represent a major step forward for a versatile multimodal AI that can operate interactively in both the physical and digital worlds.

Microsoft claims Magma is the first AI model that not only processes multimodal data, such as text, images, and video, but can also act on it directly, whether that means navigating a user interface or manipulating physical objects. The project is a collaboration between researchers from Microsoft, KAIST, the University of Maryland, the University of Wisconsin-Madison, and the University of Washington.

There have been similar AI-driven robotics projects before, such as Google's PaLM-E and RT-2 or Microsoft's ChatGPT for Robotics, which used large language models (LLMs) as interfaces. But unlike many previous multimodal AI systems, which require separate models for perception and control, Magma integrates these capabilities into a single base model.

Step toward agentic AI

Microsoft is positioning Magma as a step toward agentic AI: a system that autonomously creates plans and can perform complex tasks on behalf of a human, rather than just answering questions about what it sees. According to Microsoft's research report, when a user describes a goal, Magma can formulate a plan and carry out the actions needed to achieve it.

Microsoft is not alone in pursuing agentic AI. OpenAI is experimenting with AI agents through projects such as Operator, an application that can perform UI tasks in a web browser, and Google is exploring the same direction with several projects, including Gemini 2.0.

More than a perceptual model

Magma builds on transformer-based LLM technology, in which training data is fed into a neural network. Yet it differs from earlier vision-language models such as GPT-4V: instead of focusing only on verbal intelligence, Magma also adds spatial intelligence. By training on a mix of images, videos, robotics data, and UI interactions, Microsoft claims, Magma becomes a truly multimodal agent and not just a perceptual model.
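
To make the "single base model" idea concrete, here is a minimal, hypothetical sketch in Python (PyTorch): one transformer backbone consumes both image patches and text tokens and emits language logits alongside discretized action logits. The class name, dimensions, and heads are illustrative assumptions for this article, not Microsoft's actual Magma architecture.

```python
# Illustrative sketch only: NOT Magma's real architecture.
# One shared transformer backbone handles perception (image patches + text)
# and produces both a verbal output and an action output.
import torch
import torch.nn as nn

class UnifiedAgentModel(nn.Module):
    def __init__(self, vocab_size=32000, n_actions=256, d_model=512):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)    # image patches -> embeddings
        self.token_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)    # "verbal" output: answers, plans
        self.action_head = nn.Linear(d_model, n_actions) # "spatial" output: UI clicks, robot moves

    def forward(self, patches, tokens):
        # Concatenate visual and textual embeddings into one token sequence.
        x = torch.cat([self.patch_embed(patches), self.token_embed(tokens)], dim=1)
        h = self.backbone(x)
        last = h[:, -1]  # use the final position as the decision state
        return self.lm_head(last), self.action_head(last)

model = UnifiedAgentModel()
patches = torch.randn(1, 196, 16 * 16 * 3)    # one image as 14x14 flattened patches
tokens = torch.randint(0, 32000, (1, 12))     # a short instruction, e.g. "pick up the block"
text_logits, action_logits = model(patches, tokens)
print(text_logits.shape, action_logits.shape) # torch.Size([1, 32000]) torch.Size([1, 256])
```

The point of the sketch is the contrast the article draws: instead of one model for perception handing results to a separate controller, a single backbone sees the scene and the instruction and directly scores both what to say and what to do.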