
AI models are still unreliable code assistants

Open-source models in particular hallucinate a lot


AI models that act as programming assistants still hallucinate a lot. Commercial models produce hallucinated package references in 5.2 percent of cases; for open-source models, the figure is as high as 21.7 percent, according to research conducted by three American universities.

The researchers, from the University of Texas, the University of Oklahoma, and Virginia Tech, examined 16 LLMs widely used for code generation. They generated 576,000 code samples in JavaScript and Python, validated against the npm and PyPI package repositories, respectively.

They ran thirty tests, which produced 2.23 million package references. Almost twenty percent of these, 440,445 in total, were hallucinated. Among them, the LLMs made up 205,474 unique package names that did not exist at all in the repositories used.
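The study's detection pipeline is not reproduced here, but the core check is straightforward: a package name that does not resolve in the public registry is a candidate hallucination. Below is a minimal sketch of that idea, assuming the public PyPI and npm registry endpoints; the example package names are made up for illustration and are not taken from the study.

```python
# Minimal sketch: check whether package names referenced by an LLM actually
# exist in the public registries. Illustrates the idea of spotting
# hallucinated packages; it is not the researchers' methodology.
import urllib.request
import urllib.error

def package_exists(name: str, ecosystem: str = "pypi") -> bool:
    """Return True if the package is published on PyPI or npm."""
    if ecosystem == "pypi":
        url = f"https://pypi.org/pypi/{name}/json"
    else:  # npm
        url = f"https://registry.npmjs.org/{name}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # A 404 means the registry has never seen this package name.
        return False

# Hypothetical package names extracted from LLM-generated code
suggested = ["requests", "numpy", "totally-made-up-http-client"]
for pkg in suggested:
    status = "exists" if package_exists(pkg) else "possibly hallucinated"
    print(f"{pkg}: {status}")
```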

Less wrong than in previous study

A bright spot is that, according to this research, the models hallucinated less than in an earlier study by Lasso Security. For GPT-4, the figure is 5.76 percent versus 24.2 percent; for GPT-3.5, it is 4.05 percent versus 22.22 percent. (The paper lists the Lasso figures the other way around; they are given correctly here.)

To mitigate the hallucinations, the researchers applied Retrieval Augmented Generation (RAG) with the DeepSeek Coder 6.7B and CodeLlama 7B models to supply a list of valid package names. That improved the share of valid responses, but it came at the cost of overall code quality: 26.1 percent lower with DeepSeek and 3.1 percent lower with CodeLlama.
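The paper's RAG setup is more involved than can be shown here, but the underlying idea is to retrieve package names that are known to be valid and constrain the prompt with them. The sketch below assumes a toy package list, a naive keyword-overlap retriever, and an illustrative prompt format; none of these come from the paper.

```python
# Minimal sketch of RAG-style grounding for package names: retrieve valid
# packages relevant to the task and tell the model to use only those.
# The package list, retrieval heuristic, and prompt wording are assumptions.
KNOWN_PACKAGES = {
    "requests": "HTTP client for Python",
    "beautifulsoup4": "HTML and XML parsing",
    "pandas": "data analysis and manipulation",
    "flask": "lightweight web framework",
}

def retrieve_packages(task: str, top_k: int = 3) -> list[str]:
    """Naive retrieval: keyword overlap between the task and descriptions."""
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(desc.lower().split())), name)
        for name, desc in KNOWN_PACKAGES.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

def build_prompt(task: str) -> str:
    """Augment the generation prompt with the retrieved, verified packages."""
    valid = retrieve_packages(task)
    return (
        f"Write Python code for the following task: {task}\n"
        f"Only import packages from this verified list: {', '.join(valid)}\n"
    )

print(build_prompt("parse HTML from an HTTP response"))
```

The trade-off the researchers report also shows up in this simplified form: restricting the model to a retrieved list reduces invented package names, but it can steer the model away from the best tool for the job, which is one way code quality can suffer.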

“Hallucinations are outputs produced by LLMs that are factually incorrect, nonsensical, or completely unrelated to the input task,” the researchers note. Such hallucinations, they argue, are a “critical obstacle” to the effective and safe deployment of LLMs because of the inaccurate or misleading information they introduce.

Code mistakenly assessed as correct

Another study of code hallucinations by AI models also shows that they do not always produce reliable results. That study, from an AI institute at the Polytechnic University of Valencia (Spain), found that OpenAI’s GPT, Meta’s LLaMA, and BigScience’s open-source BLOOM hallucinate more as their parameter counts grow.

Put bluntly: the bigger the model, the more unreliable it becomes. GPT in particular stood out in this research, hallucinating freely in its eagerness to please its human users. The same study also found that human reviewers incorrectly marked faulty code as correct in 10 to 40 percent of cases.

Also read: DeepSeek Coder V2: Chinese open source model challenges America