New AI agent from Microsoft can control software and robots

Magma, an AI foundation model that combines visual and language processing to control software interfaces and robotic systems, was unveiled by Microsoft Research on Wednesday. If the results hold up outside Microsoft's own tests, they could mark a significant step toward an all-purpose multimodal AI that can operate interactively in both real and digital environments.

Microsoft claims that Magma is the first AI model that not only processes multimodal data (such as text, images, and video) but can also natively act on it, whether navigating a user interface or manipulating physical objects. The project is a collaboration between researchers at Microsoft, KAIST, the University of Maryland, the University of Wisconsin-Madison, and the University of Washington.

Other significant language model-based robotics projects, such as Microsoft's ChatGPT for Robotics and Google's PaLM-E and RT-2, use LLMs as an interface. Unlike many earlier multimodal AI systems that required separate models for perception and control, however, Magma combines both capabilities in a single foundation model.

Rather than merely answering questions about what it sees, Microsoft is positioning Magma as a step toward agentic AI: a system that can autonomously create plans and perform multi-step tasks on a human's behalf.

Microsoft states in a research paper that "Magma can create plans and carry out actions to achieve a described goal." By efficiently transferring knowledge from publicly available visual and language data, the paper says, Magma bridges verbal, spatial, and temporal intelligence to handle complex tasks and environments.

The pursuit of agentic AI is not exclusive to Microsoft. OpenAI has been experimenting with AI agents through projects like Operator, which can perform user interface tasks in a web browser, and Google has explored several agentic projects with Gemini 2.0.

Spatial intelligence

Magma builds on Transformer-based LLM technology, in which a neural network is trained on a stream of tokens, but it differs from conventional vision-language models (such as GPT-4V) by incorporating what the researchers call "spatial intelligence" (planning and action execution) alongside "verbal intelligence." Trained on a diverse mix of images, videos, robotics data, and user interface interactions, Magma is, Microsoft says, a true multimodal agent rather than merely a perceptual model.

The Magma model introduces two technical components: Set-of-Mark, which identifies objects that can be acted upon in an environment by assigning them numeric labels (for example, clickable buttons in a user interface or graspable objects in a robotic workspace), and Trace-of-Mark, which learns movement patterns from video data. Microsoft says those capabilities allow the model to complete tasks like navigating user interfaces or directing robotic arms to grasp objects.
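Microsoft has not yet released Magma's code, but the core idea behind Set-of-Mark can be illustrated with a toy sketch: number the actionable regions a detector finds in a screenshot so a vision-language model can refer to them by label instead of raw pixel coordinates. All function and field names below are hypothetical, not Magma's actual API.

```python
# Hypothetical Set-of-Mark-style labeling sketch (illustrative only, not
# Magma's real implementation): assign numeric marks to candidate
# actionable regions so a model's action output can be "click mark 3".

def assign_marks(regions):
    """Label each detected region with a number in reading order
    (top-to-bottom, then left-to-right) so labels are deterministic."""
    ordered = sorted(regions, key=lambda r: (r["y"], r["x"]))
    return {i + 1: r for i, r in enumerate(ordered)}

# Example: bounding boxes a detector might return for a simple login screen.
ui_regions = [
    {"name": "submit_button", "x": 120, "y": 300, "w": 80, "h": 30},
    {"name": "username_field", "x": 100, "y": 100, "w": 200, "h": 25},
    {"name": "password_field", "x": 100, "y": 150, "w": 200, "h": 25},
]

marks = assign_marks(ui_regions)
for label, region in marks.items():
    print(f"mark {label}: {region['name']}")
```

In a real system, these numeric marks would be drawn onto the image itself before it is fed to the model, grounding the model's language output in concrete, clickable targets.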

Microsoft Magma researcher Jianwei Yang explained that the name "Magma" stands for "M(ultimodal) Ag(entic) M(odel) at Microsoft (Rese)A(rch)," after some readers pointed out that "Magma" is already the name of an existing matrix algebra library, which could cause confusion in technical discussions.

Reported improvements over earlier models

In its Magma write-up, Microsoft says Magma-8B performs competitively across benchmarks, showing strong results in tasks like user interface navigation and robot manipulation.

For example, it scored 80.0 on the VQAv2 visual question-answering benchmark, higher than GPT-4V's 77.2 but lower than LLaVA-Next's 81.8. Its POPE score of 87.4 was the highest of any model in the comparison. Magma also reportedly outperformed OpenVLA, an open source vision-language-action model, in a number of robot manipulation tasks.

As always, we urge caution with these results, since many AI benchmarks have not been scientifically validated as measuring practically useful properties of AI models. External verification of Microsoft's benchmark findings will become possible once the code is publicly released.

Like all AI models, Magma is not perfect. According to Microsoft's documentation, it still faces technical limitations in complex step-by-step decision-making that requires multiple steps over time. The company says it continues to work on improving these capabilities.

According to Yang, Microsoft will release Magma's training and inference code on GitHub next week, allowing outside researchers to build on the work. If Magma delivers on its promise, it could push Microsoft's AI assistants beyond simple text interactions, enabling them to operate software autonomously and carry out real-world tasks through robotics.

Magma also illustrates how quickly the culture around AI can shift. Just a few years ago, this kind of agentic talk alarmed many people who feared it might lead to an AI takeover of the world. While some people are still worried about that outcome, in 2025, mainstream AI research frequently focuses on AI agents without prompting calls to halt all AI development.
