Anthropic has demonstrated how far autonomous AI development has come with a remarkable experiment: sixteen AI agents built a C compiler almost entirely on their own. The results show both technological progress and clear limitations.
The experiment comes at a time when several AI vendors are focusing on agentic systems. Both Anthropic and OpenAI recently introduced new tooling for multi-agent use, so the timing of the publication hardly seems coincidental, Ars Technica notes.
In the experiment, sixteen AI agents, all running on Claude Opus 4.6, were tasked with building a C compiler in Rust from scratch. After formulating the goal, the human supervisors largely withdrew. The agents worked in parallel on a shared Git repository, without central orchestration or a controlling main agent.
To make this possible, the company built dedicated technical infrastructure. Each AI agent ran in a separate Docker container and worked in an infinite loop, automatically starting a new session after completing a task. The agents coordinated tasks among themselves using simple lock files in the shared repository, so they did not interfere directly with one another's work.
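Anthropic has not published this harness code, so the following is only a minimal sketch of how such a worker loop could work; the task names, file layout, and CLI invocation are assumptions, not the company's actual setup.

```python
# Illustrative sketch only: task list, paths, and CLI call are hypothetical.
import os
import subprocess
import time

REPO = "/repo"                                   # shared Git checkout mounted into the container
LOCK_DIR = os.path.join(REPO, "locks")

def try_claim(task: str) -> bool:
    """Claim a task by atomically creating its lock file.
    O_CREAT | O_EXCL fails if another agent already holds the lock."""
    try:
        fd = os.open(os.path.join(LOCK_DIR, f"{task}.lock"),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def run_session(task: str) -> None:
    """Run one non-interactive agent session scoped to a single task
    (hypothetical invocation of the Claude Code CLI)."""
    subprocess.run(["claude", "-p", f"Work on: {task}"], cwd=REPO, check=False)

def main() -> None:
    os.makedirs(LOCK_DIR, exist_ok=True)
    while True:                                   # infinite worker loop, one session per task
        tasks = ["lexer", "parser", "codegen-x86"]  # invented examples; in reality read from the repo
        for task in tasks:
            if try_claim(task):                   # the lock persists, marking the task as taken
                run_session(task)
        time.sleep(10)                            # idle briefly before rescanning for new work

main()
```

Creating a file with O_CREAT | O_EXCL is atomic on a shared filesystem, which is why lock files are a common low-tech substitute for a central orchestrator.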
Two thousand Claude Code sessions
The project ran for almost two weeks and comprised roughly two thousand Claude Code sessions. The agents processed about two billion input tokens and generated around 140 million output tokens, at a cost of nearly twenty thousand dollars in API fees. The end result is a compiler of roughly one hundred thousand lines of code.
According to Anthropic, the compiler can build real-world software. For example, it successfully compiled a bootable Linux 6.9 kernel for the x86, ARM, and RISC-V architectures. Projects such as PostgreSQL, SQLite, Redis, FFmpeg, and QEMU also compiled successfully. On the GCC torture test suite, the compiler achieved a pass rate of approximately 99 percent. As an informal final test, it even compiled and ran the game Doom.
At the same time, external reports clearly question the degree of autonomy. Although the AI agents wrote the code independently, the experiment required considerable human preparation. According to Ars Technica, most of the work lay not in the programming itself but in designing test harnesses, CI pipelines, and feedback mechanisms tailored to the limitations of language models.
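What such a feedback mechanism might look like can be sketched as follows; this is not Anthropic's actual harness, and the compiler path and torture-style test layout are assumptions made for illustration.

```python
# Minimal sketch of a test harness whose failures could be fed back to an agent.
# The compiler binary name and test directory are hypothetical.
import pathlib
import subprocess

COMPILER = "./target/release/cc"      # assumed path to the agent-built compiler

def run_suite(test_dir: str) -> list[str]:
    """Compile and execute each .c test; collect descriptions of failures
    so they can be pasted into the next agent session as a prompt."""
    failures = []
    for test in sorted(pathlib.Path(test_dir).glob("*.c")):
        exe = test.with_suffix(".out")
        compile_res = subprocess.run([COMPILER, str(test), "-o", str(exe)],
                                     capture_output=True, text=True)
        if compile_res.returncode != 0:
            failures.append(f"{test.name}: compile error\n{compile_res.stderr}")
            continue
        run_res = subprocess.run([str(exe)], capture_output=True)
        if run_res.returncode != 0:   # torture-style tests signal success via exit code 0
            failures.append(f"{test.name}: wrong runtime behavior")
    return failures

if __name__ == "__main__":
    failing = run_suite("tests/torture")
    print(f"{len(failing)} failing tests")
    for f in failing[:20]:            # cap the feedback so it fits in a prompt
        print(f)
```

A loop like this gives a language model the objective, machine-checkable signal it needs to make progress, which is precisely the scaffolding work the external reports highlight.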
In that context, Anthropic emphasizes that the compiler was developed without direct external influence. The AI agents had no internet access during development and used only the Rust standard library. The company therefore refers to it as a clean-room implementation.
However, this qualification is contested. In the classic sense, a clean-room implementation is produced by developers who have demonstrably never seen the original code. Although the development environment was sealed off, the underlying language model was pre-trained on large amounts of publicly available source code, which almost certainly includes existing C compilers, test suites, and associated tooling. Anthropic's use of the term thus deviates from its established meaning in software development.
These limitations became particularly apparent as the project grew. As the codebase approached 100,000 lines, new bug fixes and extensions regularly began to break existing functionality. This pattern, familiar from large human-written codebases, also emerged among AI agents operating autonomously for long periods. The experiment thus suggests a practical scale limit for agentic software development with the current generation of models.
The complete source code is publicly available, and Anthropic explicitly presents the project as research. The experiment shows what is possible with current AI agents, but also where the practical limits of large-scale autonomous software development lie.