AI coding tool Lovable sees major improvements in the code produced by Claude. Claude Opus 4 and Sonnet 4 make fewer mistakes, work faster, and are therefore more useful programming assistants.
Claude creator Anthropic recently rolled out the two new models. Claude 4 is the long-awaited successor to the successful Claude 3.5 and 3.7 Sonnet. Those models were already favorites among programmers, although, like all other LLMs, they had plenty of shortcomings.
Impressive benchmarks
In a blog post, Anthropic stated that Claude Opus 4 achieved a score of 72.5 percent on SWE-bench (Software Engineering Benchmark). This benchmark is designed to test the software engineering capabilities of AI models. The task is multifaceted: an LLM must first understand a real GitHub issue before it can write code that resolves it.
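As a rough illustration of the shape of such an evaluation, here is a minimal, hypothetical sketch in the spirit of SWE-bench: the "model" receives an issue description, emits patched code, and the harness judges success by whether a previously failing test now passes. All names and the toy bug are invented for illustration; the real benchmark operates on full Git repositories and model-generated diffs.

```python
# Hypothetical sketch of a SWE-bench-style check (illustrative only; the
# real harness applies model-generated diffs to full Git repositories).

def solve(issue_text, buggy_source):
    """Stand-in for the LLM: returns patched source for a toy bug."""
    # A real model would read the issue and emit a unified diff; here we
    # hard-code the fix the issue describes (floor vs. float division).
    if "float division" in issue_text:
        return buggy_source.replace("a // b", "a / b")
    return buggy_source

def fail_to_pass(source):
    """Run the test that failed before the patch against the candidate code."""
    namespace = {}
    exec(source, namespace)                   # load the candidate module
    return namespace["divide"](1, 2) == 0.5   # passes only once the bug is fixed

issue = "Bug: divide() uses floor division; it should use float division."
buggy = "def divide(a, b):\n    return a // b\n"

patched = solve(issue, buggy)
resolved = fail_to_pass(patched)  # True: the patch makes the failing test pass
```

The "resolved" flag mirrors how SWE-bench scores an instance: not by reading the code, but by whether the repository's own tests flip from failing to passing.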
The tests show that Opus 4 performs excellently on long-running tasks that require sustained focus across thousands of steps. According to Anthropic, the latest model was even able to work on code for seven hours straight without any loss of quality. That’s a big claim: LLMs notoriously anchor on the initial input and drift as a session stretches on, with output quality steadily degrading.
The company is therefore positioning the new generation of models as a breakthrough in coding, advanced reasoning, and autonomous AI systems. According to Anthropic, this comes with a higher risk level than ever before: for the first time, it is activating AI Safety Level 3 protections. This should prevent Claude 4 from participating in malicious tasks that it could theoretically perform.
Practical improvements at Lovable
Lovable, a developer of “AI-driven prompt-based web and app builders” (read: vibe coding), has observed similar improvements after switching to Claude 4. The company uses Claude for its own solution.
In a post on X, Lovable reports that after implementing Claude 4, it has seen a 25 percent reduction in errors and an overall speed improvement of 40 percent. These improvements apply to both creating new projects and editing existing ones.
In a separate post, Lovable founder Anton Osika confirmed that “Claude 4 has eliminated most of Lovable’s errors,” referring primarily to syntax errors in coding.
Impact for developers
The improvements in Claude 4 are significant for developers who rely on AI coding tools. In theory, that is a large group: anyone who wants to build software without programming expertise depends on getting virtually error-free answers from an LLM. Syntax errors are among the most common problems in automatic code generation, and a 25 percent reduction can significantly increase productivity.
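To make concrete what a "syntax error" means here, consider an invented example of the kind of slip code generators commonly make, a bracket that is never closed, and how a parser catches it immediately. The snippets and helper below are illustrative, not taken from Lovable or Claude.

```python
import ast

# Illustrative only: a typical generated-code syntax slip (an unclosed
# bracket) next to the corrected version.
broken = "items = [1, 2, 3\nprint(sum(items))"   # missing ']'
fixed = "items = [1, 2, 3]\nprint(sum(items))"

def parses(source):
    """Check whether source is syntactically valid Python, as a linter would."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# parses(broken) -> False; parses(fixed) -> True
```

Because such errors are caught mechanically before the code ever runs, every one a model avoids is one less interruption for the user.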
The 40 percent speed improvement also means that developers (no matter how experienced) spend less time waiting for code to be generated or edited, leading to a more efficient development process.
Enough?
With these improvements, Anthropic demonstrates that the latest generation of LLMs not only performs better in controlled benchmark tests, but also offers concrete benefits in practical software development. The question is whether this is “enough,” and what “enough” would even mean. Minimizing errors is a prerequisite for scoring well on benchmarks, but the reality is that reliable, consistent code generation is still a long way off.
The ultimate goal of AI model builders is, as has been repeated ad nauseam, AGI. What this ‘Artificial General Intelligence’ should actually be able to do is unclear. Should it be as good as a human? As good as the best professionals? And at which tasks? All of this seems easier to define by tying the AGI requirement to a concrete task. Coding is a natural candidate: every compile or test run immediately shows whether the generated code makes sense. That, however, is not the practical limitation at the moment. Rather, Claude 4, like its predecessors, depends on high-quality prompts and a cooperative human to fix any remaining bugs.