AI Needs to Learn Where Things Are

Modern multimodal models are good at looking at images and explaining them. But a robot or screen-based agent needs more than description. It must understand where something is, how it can be affected, and what action should happen next.

This is the transition Microsoft Research is tackling with Magma, a foundation model for multimodal AI agents. The paper was published in 2025 and targets both digital and physical environments: user interfaces, videos, game-like tasks, and robotic manipulation.

In short: Magma tries to bridge the gap between seeing the world and acting in it.

A robot doesn't just need vision. It needs a language for action.

From Vision-Language to Vision-Language-Action

Traditional vision-language models can answer questions about images. Magma extends this toward agentic tasks. The model is trained on heterogeneous datasets drawn from images, videos, and robotics, and employs two key techniques: Set-of-Mark and Trace-of-Mark.

Set-of-Mark labels action-relevant objects in images, such as the buttons in a user interface. Trace-of-Mark labels movement traces in video, such as how a hand or robotic arm moves over time.

Together, these are intended to give the model both spatial and temporal intelligence.
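
To make the two ideas concrete, here is a minimal sketch of how marks and traces could be represented as data. It is an illustration under assumptions, not Magma's implementation: the names `Mark`, `set_of_mark`, and `trace_of_mark` are hypothetical, and the bounding boxes are supplied by hand where a real pipeline would need a detector to produce them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """A numbered, action-relevant region in one image (hypothetical structure)."""
    mark_id: int
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

    @property
    def center(self) -> tuple[float, float]:
        x_min, y_min, x_max, y_max = self.box
        return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def set_of_mark(boxes):
    """Assign a numeric label to each candidate region, starting at 1."""
    return [Mark(i, box) for i, box in enumerate(boxes, start=1)]


def trace_of_mark(frames):
    """Collect each mark's center position across frames, in time order.

    `frames` is a list of mark lists, one list per video frame.
    Assumes a given mark id refers to the same object in every frame.
    """
    traces = {}
    for marks in frames:
        for mark in marks:
            traces.setdefault(mark.mark_id, []).append(mark.center)
    return traces


# Toy data: two regions over three frames; the first one moves to the right.
frame_boxes = [
    [(0, 0, 10, 10), (50, 50, 70, 60)],
    [(4, 0, 14, 10), (50, 50, 70, 60)],
    [(8, 0, 18, 10), (50, 50, 70, 60)],
]
frames = [set_of_mark(boxes) for boxes in frame_boxes]
traces = trace_of_mark(frames)
print(traces[1])  # [(5.0, 5.0), (9.0, 5.0), (13.0, 5.0)]
```

The set of marks answers "what can I act on in this frame?", while the trace answers "how did that thing move?". That is the spatial and temporal split described above, reduced to its simplest form.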

Magma at a glance: paper published in 2025; 3 core domains (UI, video, robotics); 8B public model variant.

Why UI and Robotics Go Hand in Hand

Combining screen navigation and robotic arms in the same model might seem like an odd pairing. But they share a common core: the agent must observe an environment, understand a goal, select a point or object, and propose the next action.

In a user interface, that action might be clicking the right button. In robotics, it might be grasping the right object. Both require visual grounding. Both punish small errors.
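
The shared core can be sketched as a single function that works for both domains. This is a simplified illustration, not how Magma works internally: `Candidate`, `Action`, and `propose_action` are hypothetical names, and exact label matching stands in for the visual grounding that a learned model would perform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A selectable target in the observation (hypothetical structure)."""
    label: str
    point: tuple[float, float]  # screen pixels for a UI, workspace coordinates for a robot


@dataclass(frozen=True)
class Action:
    """A proposed next step: what to do and where."""
    kind: str                   # e.g. "click" or "grasp"
    target: tuple[float, float]


def propose_action(candidates, goal, action_kind):
    """Select the candidate matching the goal and propose one action on it.

    Exact label matching is a placeholder for visual grounding.
    Returns None when nothing matches, so the caller can stop
    instead of acting on the wrong target.
    """
    for candidate in candidates:
        if candidate.label == goal:
            return Action(action_kind, candidate.point)
    return None


# The same function serves a screen and a robot workspace (toy observations).
ui_observation = [Candidate("Cancel", (40.0, 300.0)), Candidate("Submit", (120.0, 300.0))]
robot_observation = [Candidate("red block", (0.25, 0.10)), Candidate("blue cup", (0.40, -0.05))]

print(propose_action(ui_observation, "Submit", "click"))
print(propose_action(robot_observation, "blue cup", "grasp"))
```

Only the coordinate system and the action type differ between the two calls. Returning `None` when no target is grounded reflects the point about small errors: a missed click or a wrong grasp is worse than no action.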

Relevance for Norwegian Industry

Norway has many sectors where digital and physical automation converge: maritime, manufacturing, energy, warehousing, medical equipment, and public operations. Magma is not a ready-made Norwegian industrial solution, but the research is relevant because it points toward more general-purpose agents for exactly these kinds of environments.

Rather than training one model for one robotic task and another for one screen-based system, future systems could start from the same multimodal foundation model and then adapt it to a specific domain, its safety requirements, and local procedures.

That doesn't mean a robot will suddenly be safe to deploy in production. Physical AI demands rigorous verification. But better spatial understanding could make it easier to build systems that learn faster and fail more visibly.

Multimodal AI becomes truly useful when it can point, plan, and act — not just describe.

The Major Limitation

Magma is still research. Getting a model to perform well on benchmarks and demos is one thing. Getting it to operate reliably in a cluttered warehouse, aboard a vessel, or in a hospital environment is another matter entirely.

Sensor noise, unexpected objects, poor lighting, safety requirements, and mechanical constraints make physical AI far harder than screen-based AI. Norwegian organizations should therefore view Magma as a research direction, not as a plug-and-play robot pilot.

Conclusion

Magma is compelling because it tackles one of the biggest gaps in today's AI agents: the transition from understanding to action. By connecting images, video, language, and action traces, Microsoft Research is trying to give agents a better sense of space.

For 24AI readers, the key takeaway is straightforward: the next wave of AI research isn't just about larger language models. It's about models that can orient themselves in the world — whether that world is a screen or a physical space.