Using large language models to accurately analyze doctors' notes

Johns Hopkins University

The amount of digital data available is greater than ever before, including in health care, where doctors' notes are routinely entered into electronic health record systems. Manually reviewing, analyzing, and sorting all these notes requires a vast amount of time and effort, which is exactly why computer scientists have developed artificial intelligence and machine learning techniques to infer medical conditions, demographic traits, and other key information from this written text.

However, safety concerns limit the deployment of such models in practice. One key challenge is that the medical notes used to train and validate these models may differ greatly across hospitals, providers, and time. As a result, models trained at one hospital may not perform reliably when they're deployed elsewhere. Earlier seminal work by Johns Hopkins University's Suchi Saria, an associate professor of computer science at the Whiting School of Engineering, and researchers from other top institutions identified these "dataset shifts" as a major concern for the safety of AI deployment.

"As we increase our use of AI in real-world applications and learn about its strengths and weaknesses, it is important to develop tools that improve AI models' robustness and safety."

Suchi Saria

Associate professor, Whiting School

But a team of Johns Hopkins and Columbia University computer scientists has a plan to harness recent breakthroughs in large language models to combat the spurious correlations that may arise from AI-powered medical text analysis. They presented their new technique at the 37th Annual Conference on Neural Information Processing Systems in December 2023.

"We found that we can greatly improve the robustness of these text models across different settings by making them less sensitive to changes in writing habits and styles observed between different caregivers," says Yoav Wald, a postdoctoral fellow working on the project with Saria.

For example, doctors often use specialized templates, such as headings or tables, in their notes. These templates have no inherent link to the patient's condition. However, AI systems can incorrectly deduce associations between certain templates and specific diagnoses, as the same templates tend to be used by doctors treating certain subpopulations of patients, he explains. The same goes for doctors' writing styles, including word choice and grammar.

Though these style-related factors have nothing to do with the analysis being attempted, a model that relies on them can perform worse once it is deployed and produce inaccurate diagnoses.

One way of preventing models from learning these spurious correlations is to feed them the same medical note in many different writing styles. This way, the models learn to focus on the content rather than the writing style, the researchers say.

But rather than having each caregiver rewrite other physicians' notes—which would severely drain the already scarce resource of caregivers' time—the team used large language models to automate this process and create datasets that are resistant to the learning of faulty correlations based on writing style.

"Given a specific note that we wish to rewrite in the style of some caregiver—say, Dr. Beth—we instead ask an LLM, 'How would this note look had Dr. Beth written it?'" Wald explains.
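To make the idea concrete, here is a minimal Python sketch of how such a rewriting request might be assembled. It is an illustration only: the article does not give the team's actual prompt, so the function name `build_style_prompt`, the instruction wording, and the sample notes are all assumptions, and the resulting text would still need to be sent to an LLM of your choice.

```python
def build_style_prompt(note: str, caregiver: str, style_samples: list[str]) -> str:
    """Assemble a style-rewriting prompt (hypothetical wording, not the researchers')."""
    # Show the model a few notes by the target caregiver so it can pick up the style.
    samples = "\n\n".join(
        f"Example {i} by {caregiver}:\n{sample}"
        for i, sample in enumerate(style_samples, start=1)
    )
    return (
        f"Below are notes written by {caregiver}.\n\n"
        f"{samples}\n\n"
        f"How would the following note look had {caregiver} written it? "
        "Keep every clinical fact unchanged and change only the writing style.\n\n"
        f"Note:\n{note}"
    )


# Made-up example notes; no real patient data.
prompt = build_style_prompt(
    note="Pt c/o chest pain x2 days. No SOB.",
    caregiver="Dr. Beth",
    style_samples=["HISTORY: The patient reports a mild headache since Monday."],
)
print(prompt)
```

The instruction to keep clinical facts unchanged matters here: the rewrite is only useful if the style changes while the content the model should learn from stays the same.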

By using LLMs to generate such counterfactual data—data that do not exist in the real world, but that can be used to negate spurious correlations in existing data—the researchers say they can reduce the likelihood that an ML model makes inaccurate predictions based on irrelevant details.
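A small Python sketch can show why counterfactual copies weaken the link between writing style and diagnosis. Everything in it is assumed for illustration: the `Note` record, the caregiver "Dr. Adams", the toy notes, and `stub_rewrite`, which merely tags the text where a real pipeline would call an LLM. It is not the researchers' implementation.

```python
from collections import Counter
from typing import Callable, NamedTuple


class Note(NamedTuple):
    text: str
    label: str   # the diagnosis the model should predict
    author: str  # the caregiver whose style the text is written in


def augment_with_counterfactuals(
    notes: list[Note],
    authors: list[str],
    rewrite: Callable[[str, str], str],
) -> list[Note]:
    """Add one restyled copy of each note per other caregiver, keeping its label."""
    augmented = list(notes)
    for note in notes:
        for author in authors:
            if author != note.author:
                augmented.append(Note(rewrite(note.text, author), note.label, author))
    return augmented


def stub_rewrite(text: str, author: str) -> str:
    # Placeholder for an LLM call (assumption): it only marks the target style.
    return f"[in the style of {author}] {text}"


# Toy data: in the original notes, the author alone reveals the diagnosis.
notes = [
    Note("Pt febrile, productive cough.", "pneumonia", "Dr. Beth"),
    Note("Pt febrile, CXR shows infiltrate.", "pneumonia", "Dr. Beth"),
    Note("Patient reports ankle pain after a fall.", "fracture", "Dr. Adams"),
]
authors = ["Dr. Beth", "Dr. Adams"]

augmented = augment_with_counterfactuals(notes, authors, stub_rewrite)
before = Counter((n.author, n.label) for n in notes)
after = Counter((n.author, n.label) for n in augmented)
print(before)
print(after)
```

In the toy data, every pneumonia note is originally written by one caregiver and every fracture note by the other, so style alone would predict the label. After augmentation, each diagnosis appears equally often in both styles, and only the content distinguishes them.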
