
AI models are still unreliable code assistants

Open-source models in particular hallucinate a lot


AI models that act as programming assistants still hallucinate a lot. Commercial models produce hallucinated package references in 5.2 percent of cases; for open-source models, the figure is as high as 21.7 percent, according to research conducted by three American universities.

The researchers, from the University of Texas, the University of Oklahoma, and Virginia Tech, examined 16 LLMs widely used for code generation. They generated 576,000 code samples in JavaScript and Python, validated against the npm and PyPI package repositories, respectively.

They ran thirty tests, which produced 2.23 million package references. Almost twenty percent of these, 440,445 in total, were hallucinated. Among them, the LLMs made up 205,474 unique package names that did not exist at all in the repositories used.
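The study's detection pipeline is not reproduced here, but the core check is straightforward: a package name that does not resolve in the public registry is a candidate hallucination. Below is a minimal sketch of that idea, assuming the public PyPI and npm registry endpoints; the example package names are made up for illustration and are not taken from the study.

```python
# Minimal sketch: check whether package names referenced by an LLM actually
# exist in the public registries. Illustrates the idea of spotting
# hallucinated packages; it is not the researchers' methodology.
import urllib.request
import urllib.error

def package_exists(name: str, ecosystem: str = "pypi") -> bool:
    """Return True if the package is published on PyPI or npm."""
    if ecosystem == "pypi":
        url = f"https://pypi.org/pypi/{name}/json"
    else:  # npm
        url = f"https://registry.npmjs.org/{name}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # A 404 means the registry has never seen this package name.
        return False

# Hypothetical package names extracted from LLM-generated code
suggested = ["requests", "numpy", "totally-made-up-http-client"]
for pkg in suggested:
    status = "exists" if package_exists(pkg) else "possibly hallucinated"
    print(f"{pkg}: {status}")
```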

Less wrong than in previous study

A bright spot is that, according to this research, the models hallucinated less than in an earlier study by Lasso Security. For GPT-4, the figure is 5.76 percent versus 24.2 percent; for GPT-3.5, it is 4.05 percent versus 22.22 percent. (The paper lists the Lasso figures the other way around; they are given correctly here.)

To mitigate the hallucinations, the researchers applied Retrieval Augmented Generation (RAG) with the DeepSeek Coder 6.7B and CodeLlama 7B models to supply a list of valid package names. That improved the share of valid responses, but it came at the cost of overall code quality: 26.1 percent lower with DeepSeek and 3.1 percent lower with CodeLlama.
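The paper's RAG setup is more involved than can be shown here, but the underlying idea is to retrieve package names that are known to be valid and constrain the prompt with them. The sketch below assumes a toy package list, a naive keyword-overlap retriever, and an illustrative prompt format; none of these come from the paper.

```python
# Minimal sketch of RAG-style grounding for package names: retrieve valid
# packages relevant to the task and tell the model to use only those.
# The package list, retrieval heuristic, and prompt wording are assumptions.
KNOWN_PACKAGES = {
    "requests": "HTTP client for Python",
    "beautifulsoup4": "HTML and XML parsing",
    "pandas": "data analysis and manipulation",
    "flask": "lightweight web framework",
}

def retrieve_packages(task: str, top_k: int = 3) -> list[str]:
    """Naive retrieval: keyword overlap between the task and descriptions."""
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(desc.lower().split())), name)
        for name, desc in KNOWN_PACKAGES.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

def build_prompt(task: str) -> str:
    """Augment the generation prompt with the retrieved, verified packages."""
    valid = retrieve_packages(task)
    return (
        f"Write Python code for the following task: {task}\n"
        f"Only import packages from this verified list: {', '.join(valid)}\n"
    )

print(build_prompt("parse HTML from an HTTP response"))
```

The trade-off the researchers report also shows up in this simplified form: restricting the model to a retrieved list reduces invented package names, but it can steer the model away from the best tool for the job, which is one way code quality can suffer.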

“Hallucinations are outputs produced by LLMs that are factually incorrect, nonsensical, or completely unrelated to the input task,” the researchers note. Such hallucinations, they argue, are a “critical obstacle” to the effective and safe deployment of LLMs because of the inaccurate or misleading information they introduce.

Code mistakenly assessed as correct

Another study of code hallucinations by AI models also shows that they do not always produce reliable results. That study, from an AI institute at the Polytechnic University of Valencia (Spain), found that OpenAI’s GPT, Meta’s LLaMA, and BigScience’s open-source BLOOM hallucinate more as their parameter counts grow.

Put bluntly: the bigger the model, the more unreliable it becomes. GPT in particular stood out in this research, hallucinating freely in its eagerness to please its human users. The same study also found that human reviewers incorrectly marked faulty code as correct in 10 to 40 percent of cases.

Also read: DeepSeek Coder V2: Chinese open source model challenges America