Almost All Attempts Are Wasted

When an AI agent fails a task and tries again, one might assume that each retry brings it closer to the goal. But according to an analysis published by Towards Data Science, the reality is quite different: in a controlled benchmark of 200 tasks, a full 90.8 percent of all retry attempts were spent on error types that, by definition, can never succeed.

The reason is not that the underlying language models are too weak. The problem is architectural — and that means more training or better prompts will not solve it.

What Are ReAct Agents?

ReAct (Reasoning and Acting) is one of the most widely used paradigms for AI agents today. The system allows a language model to alternate between reasoning and actual actions in an iterative loop called Thought-Action-Observation (TAO). This makes agents flexible and capable of handling complex, composite tasks.
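The TAO loop can be sketched in a few lines of Python. This is an illustrative sketch only; the `llm_step` stub and the tool registry below are hypothetical stand-ins for a real language-model call, not part of the source analysis:

```python
def llm_step(task, history):
    """Stub: a real agent would ask a language model for the next step."""
    if not history:
        return {"thought": "Look it up", "action": "search", "args": {"q": task}}
    return {"thought": "Done", "action": "finish", "args": {"answer": history[-1][1]}}

TOOLS = {"search": lambda q: f"results for {q!r}"}

def react_loop(task, max_steps=5):
    """Thought-Action-Observation loop: reason, act, observe, repeat."""
    history = []  # (action, observation) pairs
    for _ in range(max_steps):
        step = llm_step(task, history)  # Thought plus proposed Action
        if step["action"] == "finish":
            return step["args"]["answer"]
        observation = TOOLS[step["action"]](**step["args"])  # Observation
        history.append((step["action"], observation))
    return None  # retry budget exhausted
```

Each iteration spends one step of the budget, which is why structurally impossible actions are so costly: the loop keeps burning iterations on calls that cannot succeed.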

But flexibility comes at a price.


Hallucinated Tool Calls Are the Core Problem

What makes the retry problem particularly serious is that the errors are not random. The analysis shows that the majority of failed attempts are due to the agent calling tools that do not exist, or using them in ways that are structurally impossible. These hallucinated tool calls can be repeated indefinitely without the agent ever succeeding — and they drain the system's retry budget without providing any value.
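One way to keep such calls from consuming the retry budget is to validate every proposed call against the tool registry before executing it. This is an illustrative sketch, not the method from the source; the `validate_call` helper and the tool names are hypothetical:

```python
import inspect

TOOLS = {
    "search": lambda query: f"results for {query!r}",
    "read_file": lambda path: f"contents of {path}",
}

def validate_call(name, args):
    """Return an error string for a structurally impossible call, else None."""
    if name not in TOOLS:
        return f"unknown tool {name!r}"  # hallucinated tool name
    expected = set(inspect.signature(TOOLS[name]).parameters)
    if set(args) != expected:
        return f"{name} expects {sorted(expected)}, got {sorted(args)}"
    return None

# A hallucinated call is rejected immediately instead of burning a retry:
assert validate_call("web_browser", {"url": "x"}) == "unknown tool 'web_browser'"
assert validate_call("search", {"query": "ReAct"}) is None
```

Feeding the error string back to the model as an observation, rather than retrying the impossible call, gives the agent a chance to choose a tool that actually exists.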

Prompt-tuning, which is the most common approach to improving agent behavior, has no effect on this type of error because the problem lies in the agent architecture itself.

Fine-tuning prompts on a structurally broken system is like adjusting the navigation on a ship with holes in its hull.

Three Structural Changes That Help

According to Towards Data Science, there are three types of architectural changes that can practically eliminate wasted retries. These are supported by broader research on alternative agent designs.

1. Reflection Before Action

The variant called REBACT incorporates a reflection step before each action phase, not after, allowing the agent to correct its course immediately instead of discovering the error afterward. On the ALFWorld benchmark it reaches a success rate of 98.51 percent — an increase of 24 percentage points over the baseline model — with a corresponding decrease in cumulative errors and API calls.
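The reflect-before-act pattern can be illustrated with a critic that inspects a proposed action and corrects it before it runs. This is a loose sketch of the idea, not the REBACT implementation; the `reflect` helper and the action format are invented here:

```python
TOOLS = {"search": lambda q: f"results for {q!r}"}

def reflect(proposed, tools):
    """Critique the proposed action before it runs; return a corrected one."""
    if proposed["tool"] not in tools:
        # Correct course immediately rather than observing a failure first.
        return {"tool": "search", "args": {"q": proposed["args"].get("q", "")}}
    return proposed

def act(proposed):
    """Reflection happens before the action phase, not after."""
    checked = reflect(proposed, TOOLS)
    return TOOLS[checked["tool"]](**checked["args"])
```

Because the critique runs before execution, a bad action never reaches the environment, which is where the saved errors and API calls come from.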

2. Focused Context and Early Stopping

"Focused ReAct" addresses a phenomenon called context drift, where the agent gradually loses sight of what it was originally asked to do. The solution is simple: repeat the original task at each reasoning step, and stop early if an action repeats itself. According to the research, this increases accuracy by up to 530 percent on the Gemma 2B model and reduces runtime by up to 34 percent.

3. Multi-Agent Architecture with Hierarchical Planning

The CoAct system uses a global planner to control local executing agents. Compared to a standard ReAct-based agent on GPT-3.5, the average success rate on complex tasks increased from 9.4 to 13.8 percent — an improvement of about 47 percent. Even more drastic results are reported from the GLM architecture, which combines multi-agent structure with graph-based reasoning.
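Hierarchical planning of this kind can be illustrated with a global planner that decomposes the task and dispatches subtasks to local executors. This is a minimal sketch under invented assumptions, not the CoAct implementation; both functions are stubs:

```python
def global_planner(task):
    """Decompose a task into ordered subtasks (here: a trivial split)."""
    return [subtask.strip() for subtask in task.split(";")]

def local_executor(subtask):
    """Run one subtask with its own short ReAct-style loop (stubbed)."""
    return f"done: {subtask}"

def hierarchical_run(task):
    """Global planner delegates; local agents execute one subtask each."""
    results = []
    for subtask in global_planner(task):
        results.append(local_executor(subtask))
        # A real planner would replan here if the executor reported failure.
    return results
```

The separation matters because a failed subtask triggers replanning at the global level instead of blind local retries.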

90.8%: wasted retries in the benchmark
530%: accuracy increase with Focused ReAct

What Does This Mean for Practice?

For teams building and operating production systems based on ReAct agents, the message from the analysis is clear: Efficiency efforts should be directed at the agent structure itself, not at prompt design or model selection. There is little to gain from improving the reasoning of an agent that systematically tries to perform impossible operations.

Research communities point towards a future where agent architecture, not just model size, becomes the decisive factor for performance and cost-effectiveness in AI systems. The GLM architecture illustrates the potential: a 95.7 percent reduction in token costs and 90.3 percent lower inference latency compared to comparable systems, according to the research basis for this article.

The source notes that these figures come from controlled benchmarks, and that how well they transfer to production environments will vary with task type and system configuration.