The most important AI metric may be time

Benchmark scores can be impressive, but they often say little about the working day. A model may top the charts in math, coding, and knowledge, yet still fall apart when a task requires many small decisions over an extended period.

This is what METR is trying to capture with its AI time horizon. Instead of asking "what percentage does the model get right?", they ask: how long a task, measured in human working time, can an AI agent handle on its own before its probability of success drops too low?

It sounds simple. It is, in fact, a rather brutal question.

An agent that manages ten minutes is a tool. An agent that manages ten hours is starting to resemble a worker.

What the 50 percent time horizon means

METR defines the 50 percent time horizon as the length of task, measured by how long humans with relevant expertise take to complete it, that an AI system can complete with a 50 percent success rate.

In the 2025 paper, the researchers combined benchmarks including RE-Bench, HCAST, and new shorter tasks. They timed humans with relevant expertise, had AI agents attempt the same tasks, and modelled how quickly the success rate declined as tasks grew longer.
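To make the definition concrete, here is a minimal sketch of the idea: fit a logistic curve to agent success against the logarithm of human task time, then read off the task length at which predicted success is 50 percent. The run data, the function names, and the plain gradient-descent fit are all assumptions made for illustration; this is not METR's code or data.

```python
import math

# Hypothetical evaluation log: (human completion time in minutes, agent succeeded?).
# These numbers are invented for illustration and are not METR data.
RUNS = [
    (1, 1), (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (15, 0), (30, 1),
    (30, 1), (60, 1), (60, 0), (120, 0), (120, 1), (240, 0), (480, 0), (960, 0),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(runs, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a * log2(minutes) + b) by gradient descent."""
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [float(ok) for _, ok in runs]
    mean_x = sum(xs) / len(xs)
    centred = [x - mean_x for x in xs]  # centring keeps gradient descent stable
    a, c = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_c = 0.0
        for x, y in zip(centred, ys):
            err = sigmoid(a * x + c) - y
            grad_a += err * x
            grad_c += err
        a -= lr * grad_a / len(xs)
        c -= lr * grad_c / len(xs)
    return a, c - a * mean_x  # convert the intercept back to uncentred log2(minutes)


def time_horizon(a, b, p=0.5):
    """Task length in human minutes at which predicted success equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - b) / a)


a, b = fit_logistic(RUNS)
print(f"50% time horizon: about {time_horizon(a, b):.0f} human-minutes")
```

Calling time_horizon(a, b, p=0.8) gives the shorter task length at which the fitted curve still predicts an 80 percent success rate, which is often the more relevant figure when a failure is expensive.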

The result: frontier agents have improved dramatically. METR reports that the time horizon has doubled roughly every seven months since 2019, with signs of faster growth in 2024.
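The doubling claim is easy to turn into arithmetic. The sketch below assumes a clean exponential with a fixed seven-month doubling time and a starting point of about 50 minutes; the function names are hypothetical, and the output is an extrapolation of the reported trend, not a forecast from METR.

```python
import math


def projected_horizon(current_minutes: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Horizon after months_ahead if it keeps doubling every doubling_months."""
    return current_minutes * 2.0 ** (months_ahead / doubling_months)


def months_until(current_minutes: float, target_minutes: float,
                 doubling_months: float = 7.0) -> float:
    """Months needed to grow from current_minutes to target_minutes."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Starting point from the article: roughly 50 minutes for Claude 3.7 Sonnet.
targets = [("one working day (8 h)", 8 * 60), ("one working week (40 h)", 40 * 60)]
for label, target in targets:
    print(f"{label}: about {months_until(50, target):.0f} months")
```

Under these assumptions, going from 50 minutes to an eight-hour task takes a little over three doublings, or roughly 23 months. If growth really did speed up in 2024, the doubling_months parameter would be smaller and the dates would move closer.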

50 minutes: Claude 3.7 Sonnet's time horizon in the paper
7 months: historical doubling time
1 second to 16 hours: task range in METR-HRS

Why this matters for AI safety

Time horizon is not merely a productivity metric. It is also a safety metric. The longer an agent can operate autonomously, the more damage it can cause if its goals, tool access, or control boundaries are misconfigured.

A chatbot that gives a poor answer to a single question is annoying. An agent that can work for hours with files, a browser, code, and APIs can create real problems: erroneous changes, data leaks, runaway costs, or actions no human has approved.

Not all domains are equal

METR followed up with an analysis of how time horizons vary across domains. They note that software, reasoning, and research-adjacent tasks have significantly higher time horizons than visual computer use tasks such as OSWorld and WebArena.

This means "AI agent" is not a single thing. An agent can be strong at coding and weak at GUI navigation. It may answer scientific questions well, yet lose its way in a lengthy browser-based workflow.

For Norwegian organisations, this is critical. A banking agent, a municipal agent, or a support agent must be tested in its own environment. General figures are a map, not the terrain.

The same model can be impressive at code and brittle in a standard user interface.

The practical implication

If METR's trend holds, 2026 and 2027 will not merely be the years of better chat. They will be the years in which autonomous working duration becomes a competitive parameter. Vendors will not just sell "better answers", but "longer uninterrupted work".

That makes procurement harder. A vendor who shows a polished five-minute demo has not proven that the agent can handle a two-hour task. And an agent capable of working for extended periods must also come with better logging, kill switches, and governance policies.
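As a rough illustration of what a kill switch with logging can look like, the sketch below wraps an agent's steps in a wall-clock budget and a step budget. Everything here is hypothetical: run_with_budget, BudgetExceeded, and the idea that an agent exposes its work as a sequence of callable steps are assumptions for the example, not any vendor's API.

```python
import logging
import time

logger = logging.getLogger("agent_guard")


class BudgetExceeded(RuntimeError):
    """Raised when the agent runs past its wall-clock or step budget."""


def run_with_budget(steps, max_seconds, max_steps, clock=time.monotonic):
    """Run callables in order, stopping as soon as either budget is exhausted.

    The budgets are checked before each step, so a single step that hangs
    is not interrupted; that would need a separate timeout mechanism.
    """
    started = clock()
    completed = 0
    for step in steps:
        elapsed = clock() - started
        if elapsed > max_seconds:
            logger.warning("stopped after %.1f s: time budget exceeded", elapsed)
            raise BudgetExceeded(f"time budget of {max_seconds} s exceeded")
        if completed >= max_steps:
            logger.warning("stopped after %d steps: step budget exceeded", completed)
            raise BudgetExceeded(f"step budget of {max_steps} exceeded")
        logger.info("step %d starting after %.1f s", completed + 1, elapsed)
        step()
        completed += 1
    return completed


# Hypothetical usage: three harmless steps under a one-minute budget.
finished = run_with_budget([lambda: None] * 3, max_seconds=60, max_steps=10)
print(f"completed {finished} steps within budget")
```

The clock parameter is injectable so the guard can be tested without waiting in real time. In a real deployment the budgets would sit alongside access controls and human approval for irreversible actions, not replace them.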

Conclusion

METR's time horizon metric gives the AI debate a much-needed grounding in reality. It makes it possible to discuss agent capability in terms of working time, not just benchmark scores.

For Norway, this means organisations should start evaluating their agents on long-running, real-world, and reversible workflows. How long can they work? How often do humans need to intervene? And what happens when they go wrong after 47 minutes, not after 47 seconds?