AI agents quickly become overwhelmed by tasks

AI agents seem capable of many tasks, but quickly become overwhelmed when given too many of them. LangChain conducted experiments to discover when and why their performance collapses in this way.

LangChain ran a series of experiments and found that a single agent has a limit to how much context and how many tools it can handle; once that limit is reached, performance drops. VentureBeat reported on the findings. The experiments contribute to a better understanding of the architecture needed to run agents and multi-agent systems effectively.

In a blog post, LangChain describes a series of experiments it conducted with a single ReAct agent and explains how it compared performance. Specifically, the company wanted to know at what point a single ReAct agent becomes overloaded with instructions and tools, causing its performance to drop.

LangChain chose the ReAct agent architecture because, as the company states, it is one of the most basic agent architectures.

Because benchmarking agent performance can easily produce misleading results, LangChain limited the test to two easily quantifiable agent tasks. The agent under test was an email assistant with two main areas of work: answering and scheduling meeting requests, and supporting customers with their questions.

LangChain primarily used pre-built ReAct agents from its LangGraph framework. These agents wrap large language models (LLMs) with the tool-calling capabilities that were part of the benchmark. The LLMs tested included Claude 3.5 Sonnet from Anthropic, Llama-3.3-70B from Meta, and three models from OpenAI: GPT-4o, o1, and o3-mini.
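For illustration, a prebuilt ReAct agent of this kind can be put together in a few lines with LangGraph's create_react_agent helper. The tool below is a placeholder stand-in rather than LangChain's actual benchmark setup, and the exact keyword arguments vary between LangGraph versions:

```python
# Minimal sketch of a prebuilt ReAct agent in LangGraph. The send_email
# tool is a placeholder; a real tool would call a mail API.
from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent


def send_email(to: str, subject: str, body: str) -> str:
    """Send an email to the given recipient (illustrative stub)."""
    return f"Email sent to {to}"


# One of the models from the benchmark, wrapped in a tool-calling agent loop.
model = ChatAnthropic(model="claude-3-5-sonnet-20241022")
agent = create_react_agent(model, tools=[send_email])

result = agent.invoke(
    {"messages": [("user", "Email alice@example.com to confirm Friday's demo.")]}
)
print(result["messages"][-1].content)
```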

Different steps

To properly assess the email assistant's performance, LangChain split the testing into several steps. First, it looked at the agent's customer service capabilities: how does the agent accept and respond to a customer's email?

LangChain first evaluated the trajectory of tool calls: which tools does the agent use, and in what order? If the agent followed the correct sequence, it passed the test. Next, an LLM judged the quality of the email response. For the second area of work, scheduling meetings, LangChain focused on the agent's ability to follow instructions.
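In essence, that first check is a trajectory match: a run passes only if the agent invoked exactly the expected tools in the expected order. A self-contained sketch of such a check, with hypothetical tool names and a stand-in message class, could look like this:

```python
# Illustrative tool-trajectory check. AIMessage is a minimal stand-in for
# a chat message carrying tool calls; real frameworks have richer types.
from dataclasses import dataclass, field


@dataclass
class AIMessage:
    content: str = ""
    tool_calls: list = field(default_factory=list)


def tool_trajectory(messages: list) -> list[str]:
    """Names of all tool calls, in the order the agent made them."""
    return [call["name"] for m in messages for call in m.tool_calls]


def trajectory_passes(messages: list, expected: list[str]) -> bool:
    """A run passes only if the tool-call sequence matches exactly."""
    return tool_trajectory(messages) == expected


# Example: a customer service run should triage the request, look up the
# ticket, and then send the reply email, in that order.
run = [
    AIMessage(tool_calls=[{"name": "triage_request", "args": {}}]),
    AIMessage(tool_calls=[{"name": "lookup_ticket", "args": {}}]),
    AIMessage(tool_calls=[{"name": "send_email", "args": {}}]),
]
assert trajectory_passes(run, ["triage_request", "lookup_ticket", "send_email"])
```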

Overloading the agent

Once the parameters were established, LangChain tried to overload the email assistant as much as possible. The company tested 30 tasks per domain (customer service and calendar scheduling), running each task three times for a total of 90 runs per domain. To evaluate the tasks more cleanly, LangChain created two separate agents: one for calendar management and one for customer service. The calendar agent had access only to the calendar domain, and the customer service agent only to the customer service domain.

Then more tasks and tools were added to the agents to expand their responsibilities. These ranged from HR tasks to technical quality control and legal compliance.
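Conceptually, the overload procedure boils down to growing the agent's tool set and instructions one domain at a time and re-running the same task suite at each size. The sketch below illustrates that loop; the domain definitions and the build_agent and run_task callables are hypothetical stand-ins, not LangChain's actual harness:

```python
# Illustrative overload loop: add one domain at a time, each bringing its
# own tools and instructions, and re-run every task three times per size.
DOMAINS = {
    "calendar": {"tools": ["check_availability", "schedule_meeting"],
                 "instructions": "Handle meeting requests."},
    "customer_service": {"tools": ["lookup_ticket", "send_email"],
                         "instructions": "Answer customer questions."},
    "hr": {"tools": ["file_hr_request"], "instructions": "Handle HR tasks."},
    "legal": {"tools": ["check_compliance"], "instructions": "Check legal compliance."},
}

RUNS_PER_TASK = 3  # each task is repeated three times


def run_overload_experiment(tasks, build_agent, run_task):
    """Return the mean task score for each number of active domains."""
    results = {}
    active = []
    for domain in DOMAINS.values():
        active.append(domain)  # expand responsibilities by one domain
        tools = [tool for d in active for tool in d["tools"]]
        instructions = "\n".join(d["instructions"] for d in active)
        agent = build_agent(tools=tools, instructions=instructions)

        scores = [run_task(agent, task)
                  for task in tasks
                  for _ in range(RUNS_PER_TASK)]
        results[len(active)] = sum(scores) / len(scores)
    return results
```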

All models dropped the ball

After testing, LangChain found that a single agent becomes overwhelmed when given too many tasks. As instructions and context increased, the agent began failing to invoke tools correctly or to complete its tasks at all.

The results showed that GPT-4o performed worse than Claude-3.5-Sonnet, o1, and o3-mini in calendar scheduling at larger contexts. GPT-4o's performance dropped to just 2 percent when the number of domains increased to seven or more.

Other models did not fare much better. Llama-3.3-70B forgot to invoke the email-sending tool and therefore failed every test case. Claude-3.5-Sonnet, o1, and o3-mini still managed to trigger the tool, but Claude-3.5-Sonnet did not perform as well as the OpenAI models, and o3-mini's performance decreased once irrelevant domains were added to the instructions.

The customer service agent had more tools at its disposal; in this test, Claude-3.5-Sonnet performed as well as o3-mini and o1, and its performance degraded less sharply as domains were added. However, Claude did perform worse when the context window was expanded. GPT-4o scored the worst of all models tested in almost every test.

LangChain is now investigating how to evaluate multi-agent architectures using the same overload method.