ChatGPT makes errors 83 percent of the time when performing pediatric diagnoses, according to a study published in JAMA Pediatrics. Nevertheless, the researchers argue that generative AI remains a promising technology for health care.

The research focused specifically on ChatGPT in its GPT-4 guise. OpenAI offers this version of its chatbot as part of its ChatGPT Plus subscription service.

In the study, 100 pediatric medical cases were submitted to the chatbot. ChatGPT diagnosed 83 of them incorrectly, including 11 answers that were worded too broadly to be considered correct. According to the researchers, these misses show that the AI tool fails to form important relationships, such as the link between autism and vitamin deficiencies.

They also believe that the dataset used by OpenAI contains too many errors to be reliable. GPT-4 is trained on a large amount of Internet data that has not been extensively fact-checked, and LLMs, by their very nature, cannot reliably differentiate fact from fiction. By contrast, the researchers point to Med-PaLM 2, Google’s model trained on medical information, which they say could be a lot more promising.

Nothing new

ChatGPT has been shown to perform various tasks poorly before. For example, it was found to generate mostly insecure programming code, and it is generally not recommended as an aid for scholars. It’s worth noting, however, that OpenAI seems to know this all too well: anyone asking the chatbot a medical question is quickly referred to a medical professional. There are bound to be ways to circumvent such safeguards, but OpenAI’s clear aim is to prevent anyone from relying on ChatGPT to decide whether they need to see a doctor.
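That referral behavior is easy enough to observe for yourself. The snippet below is a minimal, illustrative sketch using the OpenAI Python SDK; the prompt is an invented example, not one of the study’s cases, and it assumes the API reply mirrors what the ChatGPT product does.

```python
# Minimal sketch: asking GPT-4 a medical question via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set; the prompt is hypothetical, not from the study.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "My child has had a rash and a fever for three days. What is the diagnosis?",
        }
    ],
)

# In practice, the reply tends to list possibilities and then advise
# consulting a pediatrician rather than committing to a diagnosis.
print(response.choices[0].message.content)
```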

Specialized models with far fewer parameters than GPT-4 (which is said to contain 1.8 trillion) may indeed offer better results. For example, Microsoft recently showed that Phi-2, a “small language model” with a relatively tiny 2.7 billion parameters, can still produce impressive and truthful output. The key ingredient: carefully curated, textbook-quality training data. It is becoming clear that smaller, high-quality datasets can let AI models deliver better results than an LLM trained on a huge amount of mostly unverified information.
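Phi-2 is openly available, which makes the contrast easy to try out. The sketch below shows one way to run the model locally via Hugging Face transformers; the prompt and generation settings are illustrative assumptions, not part of Microsoft’s demonstration.

```python
# Minimal sketch: prompting Microsoft's 2.7B-parameter Phi-2 locally.
# Requires the torch and transformers packages; settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain in two sentences why iron deficiency can cause fatigue."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```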

Medical world does benefit from AI

Promising medical applications of AI have been in the news before. IBM Watson aimed to shake up the health-care industry over ten years ago. It was said to be able to accelerate drug discovery and supply doctors with reliable diagnoses. Those promises, often not made by IBM itself, were never realized. In the end, IBM sold off a significant portion of its medtech business for more than a billion dollars in 2022.

Since then, Google in particular has been riding high with Med-PaLM. Despite renewed positive coverage for a new medical AI tool, and impressive benchmark scores to boot, the company seems to be somewhat cautious, reluctant to set expectations too high. “While Med-PaLM 2 reached state-of-the-art performance on several multiple-choice medical question answering benchmarks, and our human evaluation shows answers compare favorably to physician answers across several clinically important axes, we know that more work needs to be done to ensure these models are safely and effectively deployed.”

Medical diagnosis isn’t something AI excels at just yet. Currently, the most ambitious application would be to detect false negatives, flagging cases where a medical expert should take another look at patient data. However, using such data for AI applications is not easy, given the privacy issues it raises. AI can still be useful to the medical community in other ways. EHR (electronic health record) vendors are adding generative AI functionality to reduce administrative workloads and get professionals to their answers faster. It’s not a replacement for a doctor, but it may make repetitive tasks more manageable than ever before.
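As a rough illustration of that kind of administrative helper, the sketch below asks an LLM to turn a free-text visit note into a short chart summary. It is a hypothetical example with an invented note and an assumed model choice, not a description of any specific EHR vendor’s feature.

```python
# Minimal sketch: drafting a chart summary from a free-text visit note.
# Hypothetical example; the note is invented and the model choice is an assumption.
from openai import OpenAI

client = OpenAI()

visit_note = (
    "6-year-old male, 3 days of cough, low-grade fever, clear rhinorrhea. "
    "Lungs clear on auscultation. Likely viral upper respiratory infection. "
    "Advised fluids, rest; return if symptoms worsen."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Summarize this clinical note in two sentences for the chart."},
        {"role": "user", "content": visit_note},
    ],
)

print(response.choices[0].message.content)
```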