Researchers at BIDMC Found Physicians Using Chatbots Spent More Time on Patient Cases and Made Safer Decisions Compared to Colleagues Without AI Access
Boston, MA – It didn't take long for artificial intelligence (AI) to outperform human physicians in diagnostic reasoning, the first step, and a critical one, in clinical reasoning and patient care. Now, a study published in Nature Medicine suggests that physicians with access to large language models (LLMs), also known as chatbots, perform better on several patient care tasks than physicians without such access.
"Early implementation of AI into healthcare has largely been directed at clerical clinical workflows, such as portal messaging," said Adam Rodman, MD, MPH, Director of AI Programs at Beth Israel Deaconess Medical Center (BIDMC). "But one of the theoretical strengths of chatbots is their ability to serve as a cooperation partner, augmenting human cognition. Our findings demonstrate that improving physician performance, even in a task as complex as open-ended decision-making, represents a promising application. However, this will require rigorous validation to realize LLMs' potential for enhancing patient care."
Rodman and colleagues assessed 92 practicing physicians' decision-making processes as they worked through five hypothetical patient cases, each based on real, de-identified patient encounters. The researchers focused on the physicians' management reasoning, a step in clinical reasoning that encompasses decision-making around testing and treatment, balanced against patient preferences, social factors, costs, and risk.
"Unlike diagnostic reasoning, a task often with a single right answer which LLMs excel at, management reasoning may have no right answer and involves weighing trade-offs between inherently risky courses of action," said Rodman.
When their responses to the hypothetical patient cases were scored, physicians using the chatbot scored significantly higher than those using conventional resources only. Chatbot users also spent nearly two minutes more per case. Additionally, physicians who used LLMs provided responses with a lower likelihood of mild-to-moderate harm: potential for mild-to-moderate harm was observed in 3.7 percent of LLM-assisted responses, compared to 5.3 percent in the conventional-resources group. Potential for severe harm, however, was rated nearly identically across the two groups.
"The availability of an LLM improved physicians' management reasoning compared to conventional resources only, with comparable scores between physicians randomized to use AI and AI by itself. This suggests a future use for LLM's as a helpful adjunct to clinical judgment," said Rodman. "Further exploration into whether the LLM is merely encouraging users to slow down and reflect more deeply, or whether it is actively augmenting the reasoning process would be valuable."
Co-authors included Hannah Kerman, Jason A. Freed, Josephine A. Cool and Zahir Kanjee of Beth Israel Deaconess Medical Center; Ethan Goh, Eric Strong, Yingjie Weng, Neera Ahuja, Arnold Milstein, Jason Hom and Jonathan H. Chen of Stanford University; Robert Gallo of VA Palo Alto Health Care System; Kathleen P. Lane and Andrew P.J. Olson of the University of Minnesota Medical School; Andrew S. Parsons of the University of Virginia School of Medicine; Eric Horvitz of Microsoft; and Daniel Yang of Kaiser Permanente.
Rodman, Cool and Kanjee disclose funding from the Gordon and Betty Moore Foundation. Please see the publication for a complete list of disclosures and funders.