Arbor enables AI agents to learn from their own mistakes

Researchers from Microsoft Research and Renmin University of China have presented a new framework designed to help AI agents better optimize software and machine learning systems. According to their research, Arbor performs significantly better than existing solutions such as Claude Code and Codex.

According to VentureBeat, the project addresses a well-known problem. AI systems that perform well during development sometimes end up providing incorrect answers or overlooking important constraints in production. Improving such systems often involves a series of adjustments to prompts, search methods, and other settings, making it difficult to determine which change actually has an effect.

According to the researchers, what existing coding agents lack is not computational power but a way to retain and reuse insights from previous experiments. As a result, they risk repeating the same mistakes.

Arbor therefore opts for a different architecture. A central coordinator determines the direction of the research and evaluates results, while individual agents conduct experiments in isolated environments. This allows multiple solution paths to be tested in parallel without influencing one another.
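The coordinator-and-workers split described above can be illustrated with a minimal Python sketch. It is not Arbor's actual code: the configuration keys, the toy `evaluate` function, and the use of a thread pool with per-worker copies as a stand-in for isolated environments are all assumptions made for illustration.

```python
import copy
from concurrent.futures import ThreadPoolExecutor

# Hypothetical baseline settings; not taken from the Arbor paper.
BASE_CONFIG = {"top_k": 5, "chunk_size": 512}

def evaluate(config):
    """Toy stand-in for a real benchmark run (higher is better)."""
    return round(0.40 + 0.01 * config["top_k"] - 0.0001 * config["chunk_size"], 4)

def run_experiment(change):
    """Worker agent: applies one change to a private copy of the config."""
    config = copy.deepcopy(BASE_CONFIG)  # isolation: the shared baseline is never mutated
    config.update(change)
    return {"change": change, "score": evaluate(config)}

def coordinate(changes):
    """Coordinator: dispatches experiments in parallel and picks the best result."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_experiment, changes))
    best = max(results, key=lambda r: r["score"])
    return results, best

results, best = coordinate([{"top_k": 10}, {"chunk_size": 256}, {"top_k": 3}])
print(best)
```

Because every worker operates on its own copy, a change tried in one experiment cannot leak into another, which is the property the article attributes to Arbor's isolated environments.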

Knowledge Tree as Memory

Arbor is based on a structure known as Hypothesis Tree Refinement. Hypotheses, experiments, results, and conclusions are stored in a tree structure that grows during the optimization process.

Failed experiments do not disappear from view but are recorded as knowledge. This prevents the system from repeating the same mistake later. Successful results, on the other hand, can yield broader insights that guide new experiments.
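As a rough illustration of such a tree, the following Python sketch stores a hypothesis, its result, and a conclusion per node, and keeps failed experiments so they can be looked up later. The class name `HypothesisNode`, its fields, the helper `already_failed`, and all scores are hypothetical; the article does not describe Arbor's internal data model at this level of detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HypothesisNode:
    hypothesis: str
    result: Optional[float] = None   # measured score, None until tested
    conclusion: str = ""             # what was learned, also for failures
    failed: bool = False
    children: List["HypothesisNode"] = field(default_factory=list)

    def add_child(self, hypothesis):
        """Grow the tree with a new, more specific hypothesis."""
        child = HypothesisNode(hypothesis)
        self.children.append(child)
        return child

    def record(self, result, baseline, conclusion):
        """Store the outcome; failures are kept as knowledge, not deleted."""
        self.result = result
        self.failed = result <= baseline
        self.conclusion = conclusion

    def walk(self):
        """Yield this node and all of its descendants."""
        yield self
        for child in self.children:
            yield from child.walk()

def already_failed(root, hypothesis):
    """Check whether a hypothesis was tried before and did not help."""
    return any(n.failed and n.hypothesis == hypothesis for n in root.walk())

# Illustrative values only.
root = HypothesisNode("baseline pipeline", result=0.45)
a = root.add_child("increase top_k")
a.record(0.50, root.result, "more retrieved context helps")
b = root.add_child("shrink chunks")
b.record(0.42, root.result, "smaller chunks lose context")

print(already_failed(root, "shrink chunks"))
```

A lookup like `already_failed` is one simple way to avoid rerunning an experiment that has already been shown not to help.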

As an example, the researchers cite a retrieval-augmented generation (RAG) environment. Whereas traditional agents often adjust multiple components simultaneously, Arbor treats each change as a separate hypothesis. This makes it clearer which change actually improves performance.
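The difference between changing everything at once and testing one change per hypothesis can be shown with a small Python sketch. The settings (`reranker`, `top_k`) and the toy `score` function are invented for illustration and do not come from the researchers' experiments.

```python
# Hypothetical RAG settings and a toy scoring function.
BASELINE = {"reranker": False, "top_k": 5}

def score(config):
    """Toy benchmark: the reranker helps, a large top_k hurts."""
    s = 50.0
    if config["reranker"]:
        s += 6.0
    if config["top_k"] > 5:
        s -= 2.0
    return s

def isolated_effects(baseline, candidate_changes):
    """Test each change on its own against the baseline."""
    base = score(baseline)
    effects = {}
    for name, value in candidate_changes.items():
        trial = dict(baseline)   # copy so trials do not affect each other
        trial[name] = value
        effects[name] = score(trial) - base
    return effects

changes = {"reranker": True, "top_k": 20}

# Applying both changes together only reveals the net effect.
combined = score({**BASELINE, **changes}) - score(BASELINE)
# Testing them separately shows which one helps and which one hurts.
effects = isolated_effects(BASELINE, changes)

print(combined, effects)
```

In this toy setup the combined run reports a net gain and hides the fact that one of the two changes lowers the score, which only the per-change results reveal.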

Better Than Claude Code and Codex

For the tests, the researchers used GPT-5.5, Claude Opus 4.6, and Gemini-3-Flash, among others, as underlying models. Arbor was then compared to Claude Code and Codex under the same budget and computational conditions.

According to the results, Arbor achieved, on average, more than 2.5 times the performance improvement of the competing systems. In a search optimization benchmark, accuracy rose from 45.3 to 67.7 percent. Claude Code and Codex remained stuck at 53.3 and 50 percent, respectively.
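For the search optimization benchmark, the gains can be worked out from the reported figures. The short Python calculation below assumes that all three systems started from the same 45.3 percent baseline, which the article implies but does not state outright. It covers this single benchmark only and is not a reconstruction of the 2.5 times average.

```python
# Reported accuracy figures in percent; a shared baseline is an assumption.
baseline = 45.3
final = {"Arbor": 67.7, "Claude Code": 53.3, "Codex": 50.0}

# Gain in percentage points over the baseline.
gains = {name: round(acc - baseline, 1) for name, acc in final.items()}

# How many times larger Arbor's gain is than each competitor's gain.
ratios = {
    name: round(gains["Arbor"] / gain, 2)
    for name, gain in gains.items()
    if name != "Arbor"
}

print(gains)
print(ratios)
```

Under that assumption, Arbor gains 22.4 percentage points on this benchmark, compared with 8.0 for Claude Code and 4.7 for Codex.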

In addition, Arbor proved to be less susceptible to overfitting. Improvements identified during development held up better on independent test data.

The researchers see applications primarily in long-running optimization processes, such as AI pipelines, model training, and data synthesis. On the other hand, the approach consumes a relatively large amount of tokens, storage, and computing power.

Furthermore, the quality of the results remains dependent on the evaluation method used. According to the researchers, a flawed metric mainly means that the system optimizes faster, but toward the wrong goal.
