Avoiding a future where the ‘cause of death’ is an AI chatbot | Viewpoint

Opinion

AI chatbots can aid in diagnosis and care management. But there are several specific issues with current mainstream AI chatbots.

AI in healthcare is under intense scrutiny given the high stakes involved in patient care. With many doctors already using generative artificial intelligence, commonly called AI chatbots or GenAI, in clinical settings, I wanted to put the technology to the test.


Peter Bonis

The results were concerning, at best.

Turning to a well-known, free AI chatbot, I asked a relatively straightforward medical question. “How do you treat a urinary tract infection (UTI) in a patient with a penicillin allergy?”

To the untrained eye, the answer seemed convincing enough: “fluoroquinolones or levofloxacin” were listed among the options.

Mistake number one: levofloxacin is itself a fluoroquinolone, so the response was worded inaccurately and could lead a reader to believe the two were distinct options. I was far more concerned, however, with what I would deem a medically serious flaw: there was no caveat for a pregnant patient.

Put simply, fluoroquinolones can cause serious fetal harm. If the patient were pregnant and a clinician blindly followed the response, the consequences for the baby could have been severe. The AI chatbot failed to consider this basic context because it had no sense of how critical the detail was. And why would it? It has no real clinical experience with which to navigate the many checkpoints informed by medical training, nor the reasoning and judgment that come from years of clinical practice.

Still, when combined with other technologies, GenAI tools have great potential to augment clinical decision-making beyond what doctors and nurses alone are capable of.

At present, AI chatbots can aid in diagnosis and care management. They perform well on medical board exams according to a growing number of studies. But the data do not tell the whole story.

There are several specific issues with current mainstream AI chatbots that demonstrate the technology is not ready to have a major role in clinical decision-making on the frontlines of healthcare.

Hallucinations

As the most well-known risk of any GenAI platform, hallucinations persist despite the blistering pace of AI chatbot evolution.

And hallucinations are not always easy to recognize, even by clinicians who are convinced they can spot them. We cannot put the burden of separating the wheat from the chaff on a busy doctor who often has just seconds to make a judgment call.

GenAI models sometimes include references, but it is very unlikely clinicians will consistently verify whether the original source material is valid or hallucinated. I have already encountered references citing seemingly convincing studies that proved to be non-existent. In other cases, even valid references may come up short, not fully reflecting what is known about a given topic.

Inconsistency

Ask an AI chatbot a question one day and try the same prompt again later, and you will likely get two responses different enough to be interpreted in meaningfully different ways. At a basic statistical level, GenAI's outputs can vary greatly, which should be a serious concern for doctors and patients alike.

Biases

GenAI platforms are prone to a range of biases. Egregious examples of racial bias have shown that how a prompt is worded can not only influence the responses but also cause them to vary in clinically significant ways. (Study 1, study 2)

More subtle biases lurk in the decisions the models must make to prioritize various inputs. For example, how does the model consistently choose the most credible source of information when it encounters conflicting evidence in the research? For now, anyway, AI chatbots can’t perform critical peer review to evaluate evidence. A variety of human-like errors in judgment due to bias have also been reported.

Muscle memory lapses

AI chatbots are dazzling medical users with nearly instantaneous answers, even if those answers are sometimes inaccurate. While the doctor remains the final decision maker, they may grow accustomed to asking questions and implementing the responses, developing a sort of trusting muscle memory for the process and, over time, giving the recommendations less and less scrutiny and consideration.

This presents wholly new dimensions of decision-making for regulators to evaluate, not to mention uncharted legal territory for medical mistakes.

Real-world overload

Diagnosing and treating a disease can take several steps over time, as information evolves. While GenAI models can handle tidy board-exam clinical vignettes, they falter substantially with messy, real-world data evolving during actual patient care. (Study 1, study 2)

In a sector where one wrong response to a prompt can have grievous consequences, tackling these issues head-on will help build a roadmap for healthcare generative AI done right.

With proper vetting and processes to ensure clinical accuracy, AI chatbots combined with other technologies can improve physician-patient interactions in real time by guiding follow-up questions, keeping context in mind, and leveraging the best evidence to assist with care.

