Researchers from Microsoft Research and Renmin University of China have presented Arbor, a new framework designed to help AI agents better optimize software and machine learning systems. According to their research, Arbor performs significantly better than existing solutions such as Claude Code and Codex.
According to VentureBeat, the project addresses a well-known problem. AI systems that perform well during development sometimes end up providing incorrect answers or overlooking important constraints in production. Improving such systems often involves a series of adjustments to prompts, search methods, and other settings, making it difficult to determine which change actually has an effect.
According to the researchers, what existing coding agents lack is not computational power but a way to retain and reuse insights from previous experiments. As a result, they risk repeating the same mistakes.
Arbor therefore opts for a different architecture. A central coordinator determines the direction of the research and evaluates results, while individual agents conduct experiments in isolated environments. This allows multiple solution paths to be tested in parallel without influencing one another.
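The division of labor described above can be illustrated with a minimal sketch. It is not Arbor's implementation: the configuration keys, the `toy_score` function, and the use of a thread pool are all assumptions made for illustration. A coordinator hands each candidate change to a worker, every worker operates on its own private copy of the baseline, and the coordinator ranks the results.

```python
import copy
from concurrent.futures import ThreadPoolExecutor

# Hypothetical baseline configuration; the keys are illustrative only.
BASELINE = {"top_k": 5, "chunk_size": 512}


def toy_score(config):
    """Hypothetical stand-in for a real evaluation; higher is better."""
    return 100 - abs(config["top_k"] - 8) * 2 - abs(config["chunk_size"] - 256) / 64


def run_experiment(change):
    """Worker: apply one change to a private copy of the baseline and score it."""
    config = copy.deepcopy(BASELINE)  # isolation: the shared baseline is never mutated
    config.update(change)
    return change, toy_score(config)


def coordinate(candidate_changes):
    """Coordinator: dispatch experiments in parallel, then rank the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_experiment, candidate_changes))
    return sorted(results, key=lambda item: item[1], reverse=True)


ranked = coordinate([{"top_k": 8}, {"chunk_size": 256}, {"top_k": 3}])
for change, score in ranked:
    print(change, score)
```

Because each worker copies the baseline before modifying it, one experiment cannot leak its settings into another, which is the property the article attributes to the isolated environments.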
Knowledge Tree as Memory
Arbor is based on a structure known as Hypothesis Tree Refinement. Hypotheses, experiments, results, and conclusions are stored in a tree structure that grows during the optimization process.
Failed experiments do not disappear from view but are recorded as knowledge. This prevents the system from repeating the same mistake later. Successful results, on the other hand, can yield broader insights that guide new experiments.
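A tree of this kind could be sketched roughly as follows. The class name `HypothesisNode`, its fields, and the helper `already_refuted` are hypothetical and chosen for illustration; the article does not describe Arbor's actual data model. The point of the sketch is only that refuted hypotheses stay in the tree and can be looked up before an experiment is repeated.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class HypothesisNode:
    """One node in a hypothetical hypothesis tree."""
    hypothesis: str
    result: Optional[float] = None
    conclusion: str = "untested"  # "supported", "refuted", or "untested"
    children: List["HypothesisNode"] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> "HypothesisNode":
        """Refine this node with a more specific follow-up hypothesis."""
        child = HypothesisNode(hypothesis)
        self.children.append(child)
        return child

    def record(self, result: float, supported: bool) -> None:
        """Store the outcome; failures are kept, not discarded."""
        self.result = result
        self.conclusion = "supported" if supported else "refuted"

    def walk(self) -> Iterator["HypothesisNode"]:
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def already_refuted(root: HypothesisNode, hypothesis: str) -> bool:
    """Check whether an identical hypothesis has already failed."""
    return any(
        node.hypothesis == hypothesis and node.conclusion == "refuted"
        for node in root.walk()
    )


# Illustrative usage with made-up hypotheses and scores.
root = HypothesisNode("Improve retrieval accuracy")
larger_k = root.add_child("Increase top_k")
larger_k.record(0.48, supported=True)
smaller_chunks = root.add_child("Use smaller chunks")
smaller_chunks.record(0.41, supported=False)
larger_k.add_child("Add a reranker after the larger top_k")
```

In this sketch, a successful node gets a child that builds on it, while the failed node remains available so that `already_refuted(root, "Use smaller chunks")` can stop the same experiment from being run again.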
As an example, the researchers cite a RAG environment. Whereas traditional agents often adjust multiple components simultaneously, Arbor treats each change as a separate hypothesis. This makes it clearer which change actually improves performance.
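The idea of testing one change per hypothesis can be shown with a small sketch. Everything here is invented for illustration: the `evaluate` function, the configuration keys, and the scores are arbitrary toy values, not figures from the paper. Each change is applied to the baseline on its own, so the measured difference can be attributed to that change alone.

```python
def evaluate(config):
    """Hypothetical scoring function standing in for a real RAG evaluation."""
    score = 0.50
    if config["reranker"]:
        score += 0.10
    if config["top_k"] > 5:
        score += 0.05
    if config["prompt"] == "verbose":
        score -= 0.03
    return round(score, 2)


# Hypothetical baseline and candidate changes.
baseline = {"reranker": False, "top_k": 5, "prompt": "concise"}
changes = {
    "enable reranker": {"reranker": True},
    "raise top_k": {"top_k": 10},
    "verbose prompt": {"prompt": "verbose"},
}

base_score = evaluate(baseline)
effects = {}
for name, change in changes.items():
    # Apply exactly one change to a fresh copy of the baseline.
    candidate = {**baseline, **change}
    effects[name] = round(evaluate(candidate) - base_score, 2)

for name, delta in effects.items():
    print(f"{name}: {delta:+.2f}")
```

Had all three changes been applied at once, only their combined effect would be visible, and the harmful prompt change would have been hidden behind the two helpful ones.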
Better Than Claude Code and Codex
For the tests, the researchers used GPT-5.5, Claude Opus 4.6, and Gemini-3-Flash, among others, as underlying models. Arbor was then compared to Claude Code and Codex under the same budget and computational conditions.
According to the results, Arbor achieved, on average, more than 2.5 times the performance improvement of the competing systems. In a search optimization benchmark, Arbor raised accuracy from 45.3 to 67.7 percent, while Claude Code and Codex stalled at 53.3 and 50 percent, respectively.
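For the search benchmark, the gap can be worked out from the reported figures. The calculation below assumes that the 67.7 percent result is Arbor's and that all three systems started from the same 45.3 percent baseline, which the article implies but does not state outright. It covers this one benchmark only, so it need not match the 2.5 times average reported across all tests.

```python
# Reported accuracy figures, in percent. Assumption: all systems share the 45.3 baseline.
baseline = 45.3
final = {"Arbor": 67.7, "Claude Code": 53.3, "Codex": 50.0}

# Absolute gain over the baseline, in percentage points.
gains = {name: round(score - baseline, 1) for name, score in final.items()}

# How many times larger Arbor's gain is than each competitor's gain.
ratios = {
    name: round(gains["Arbor"] / gain, 2)
    for name, gain in gains.items()
    if name != "Arbor"
}

print(gains)
print(ratios)
```

Under these assumptions, Arbor gains 22.4 percentage points, compared with 8.0 for Claude Code and 4.7 for Codex.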
In addition, Arbor proved to be less susceptible to overfitting. Improvements identified during development held up better on independent test data.
The researchers see applications primarily in long-term optimization processes, such as AI pipelines, model training, and data synthesis. On the other hand, the approach requires relatively large amounts of tokens, storage, and computing power.
Furthermore, the quality of the results remains dependent on the evaluation method used. According to the researchers, a flawed metric mostly just makes the system optimize faster toward the wrong result.