New research suggests that medical AI chatbots are woefully unreliable at understanding how people actually communicate their health problems.
As detailed in a yet-to-be-peer-reviewed study presented last month by MIT researchers, an AI chatbot is more likely to advise a patient not to seek medical care if their messages contain typos. The errors that throw the AI off can be as seemingly inconsequential as an extra space between words, or the patient's use of slang or colorful language. Strikingly, women are disproportionately affected, being wrongly told not to see a doctor at a higher rate than men.
"Insidious bias can shift the tenor and content of AI advice, and that can lead to subtle but important differences" in how medical resources are distributed, Karandeep Singh at UC San Diego Health, who was not involved in the study, told New Scientist.
The work adds to the serious doubts about using AI models in a clinical setting, particularly in patient-facing roles. Hospitals and health clinics are already using chatbots to schedule appointments, field questions, and triage patients based on what they tell the chatbot, leaving their fate in the hands of a technology that often misinterprets information and fabricates facts.
We humans tend to be poor at explaining what's bothering us medically. We hem and haw about which symptoms we have and when they started, hedging our answers with "maybe"s and "kind of"s. The perils are heightened in writing, where typos and bad grammar prevail; they're greater still if someone is forced to communicate in a language that isn't their native tongue.
Your hypothetical medical AI is supposed to be unerring in the face of these hurdles, but is it actually? To find out, the MIT researchers evaluated four models, including OpenAI's GPT-4, Meta's open-source Llama-3-70B, and a medical AI called Palmyra-Med.
To test them, the researchers simulated thousands of patient cases using a combination of real patient complaints from a medical database, health posts on Reddit, and some AI-generated patient cases. Before giving these to the AI models, they added "perturbations" to the cases that could potentially throw the chatbots off. These included the use of exclamation marks, typing in all lower case, using colorful language, using uncertain language like "possibly," and using gender neutral pronouns. Crucially, these changes were made without affecting the clinical data in the patients' responses, the researchers said.
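To get a concrete sense of what such a "perturbation" might look like in practice, here is a minimal illustrative sketch in Python. The function name and the specific rewrite rules are assumptions based on the article's description, not the researchers' actual code.

import random

# Illustrative only: these rules approximate the perturbations described in the
# article (extra spaces, all lower case, exclamation marks, hedging language);
# the study's real pipeline is not shown here.
def perturb(message: str) -> str:
    perturbations = [
        lambda m: m.lower(),                            # type in all lower case
        lambda m: m.replace(".", "!"),                  # swap periods for exclamation marks
        lambda m: m.replace(" ", "  ", 1),              # add an extra space between words
        lambda m: "Possibly " + m[0].lower() + m[1:],   # add uncertain, hedging language
    ]
    return random.choice(perturbations)(message)        # apply one random surface edit

print(perturb("I have had sharp chest pain since yesterday."))

The point of keeping the edits this superficial is the study's central claim: none of them change the clinical content of the message, so a robust model should return the same triage advice either way.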
But for one reason or another, the AI models were clearly swayed by the nonstandard writing. Overall, when faced with these stylistic flourishes, they were between 7 and 9 percent more likely to suggest that a patient self-manage their symptoms instead of seeing a doctor.
One explanation is that the medical LLMs rely on training drawn from medical literature and can't make the leap to teasing out clinical information from a patient's vernacular language.
"These models are often trained and tested on medical exam questions but then used in tasks that are pretty far from that, like evaluating the severity of a clinical case. There is still so much about LLMs that we don't know," study lead author Abinitha Gourabathina, a researcher at the MIT Department of Electrical Engineering and Computer Science, said in a statement about the work.
The even uglier implication is that the AI is reflecting, if not exaggerating, biases already exhibited by human doctors, especially in regard to gender. Why were female patients told to self-manage more often than men? Could it have anything to do with the fact that real-life doctors often downplay women's medical complaints because the patients are seen as too emotional or "hysterical"?
Coauthor Marzyeh Ghassemi, an associate professor in MIT's EECS department, says the work "is strong evidence that models must be audited before use in health care." But ironing out these flaws won't be easy.