Microsoft Magma brings agentic AI to robotic systems

Microsoft Research presents Magma, an integrated AI base model that combines visual and language processing to control software interfaces and robotic systems.

Ars Technica reports. If the results hold up beyond Microsoft's internal testing, it would represent a major step forward for a versatile multimodal AI that can operate interactively in both the physical and digital worlds.

Microsoft claims Magma is the first AI model that not only processes multimodal data, such as text, images, and video, but can also act on it directly, whether that means navigating a user interface or manipulating physical objects. The project is a collaboration between researchers from Microsoft, KAIST, the University of Maryland, the University of Wisconsin-Madison, and the University of Washington.

There have been similar AI-driven robotics projects before, such as Google's PaLM-E and RT-2 or Microsoft's ChatGPT for Robotics, which used large language models (LLMs) as interfaces. But unlike many previous multimodal AI systems, which require separate models for perception and control, Magma integrates these capabilities into a single base model.

Step toward agentic AI

Microsoft is positioning Magma as a step toward agentic AI: a system that autonomously creates plans and can perform complex tasks on behalf of a human, rather than just answering questions about what it sees. According to Microsoft's research report, when a user describes a goal, Magma can formulate a plan and carry out the actions needed to achieve it.

Microsoft is not alone in pursuing agentic AI. OpenAI is experimenting with AI agents through projects such as Operator, an application that can perform UI tasks in a web browser, and Google is exploring the same direction with several projects, including Gemini 2.0.

More than a perceptual model

Magma builds on transformer-based LLM technology, in which training data is fed into a neural network. Yet it differs from earlier vision-language models such as GPT-4V: instead of focusing only on verbal intelligence, Magma also adds spatial intelligence. By training on a mix of images, videos, robotics data, and UI interactions, Microsoft claims, Magma becomes a truly multimodal agent and not just a perceptual model.
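
To make the "single base model" idea concrete, here is a minimal, hypothetical sketch in Python (PyTorch): one transformer backbone consumes both image patches and text tokens and emits language logits alongside discretized action logits. The class name, dimensions, and heads are illustrative assumptions for this article, not Microsoft's actual Magma architecture.

```python
# Illustrative sketch only: NOT Magma's real architecture.
# One shared transformer backbone handles perception (image patches + text)
# and produces both a verbal output and an action output.
import torch
import torch.nn as nn

class UnifiedAgentModel(nn.Module):
    def __init__(self, vocab_size=32000, n_actions=256, d_model=512):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)    # image patches -> embeddings
        self.token_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)    # "verbal" output: answers, plans
        self.action_head = nn.Linear(d_model, n_actions) # "spatial" output: UI clicks, robot moves

    def forward(self, patches, tokens):
        # Concatenate visual and textual embeddings into one token sequence.
        x = torch.cat([self.patch_embed(patches), self.token_embed(tokens)], dim=1)
        h = self.backbone(x)
        last = h[:, -1]  # use the final position as the decision state
        return self.lm_head(last), self.action_head(last)

model = UnifiedAgentModel()
patches = torch.randn(1, 196, 16 * 16 * 3)    # one image as 14x14 flattened patches
tokens = torch.randint(0, 32000, (1, 12))     # a short instruction, e.g. "pick up the block"
text_logits, action_logits = model(patches, tokens)
print(text_logits.shape, action_logits.shape) # torch.Size([1, 32000]) torch.Size([1, 256])
```

The point of the sketch is the contrast the article draws: instead of one model for perception handing results to a separate controller, a single backbone sees the scene and the instruction and directly scores both what to say and what to do.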