Humans Outperform AI In Disease Coding Test

An artificial intelligence model designed to classify complex medical case documents has been bested by its human challengers – but researchers say the AI technology could still be of enormous benefit.

A recent James Cook University-led study put five human clinical document coders up against ChatGPT-based large language models to analyse 100 randomly selected, challenging clinical patient summaries spanning five major disease categories.

ChatGPT achieved 22 per cent accuracy, while the top human coder in the study achieved 47 per cent.

"We saw a couple of human coders perform better in almost all cases than the tool," said study lead author and JCU PhD candidate Akram Mustafa.

"Some of the coders performed worse, but if you put the combined five categories together, overall human coders do better."

Coders translate health records into standardised alphanumeric codes, which are then used for state and Commonwealth data reporting, health service planning and hospital funding models.
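For readers unfamiliar with clinical coding, the sketch below illustrates the basic idea in Python. The ICD-10 codes shown are real categories, but the lookup table and exact-match logic are a hypothetical simplification, not the tooling used in the study; real-world coding (for example, under Australia's ICD-10-AM standard) involves thousands of codes and considerable clinical judgment.

```python
# Minimal sketch of what clinical coders produce: free-text diagnoses
# mapped to standardised alphanumeric codes (ICD-10 categories shown).
# The lookup table and exact-match logic are illustrative only.

ICD10_EXAMPLES = {
    "type 2 diabetes mellitus": "E11",
    "acute myocardial infarction": "I21",
    "pneumonia, unspecified organism": "J18.9",
}

def code_diagnosis(diagnosis_text: str):
    """Return the ICD-10 code for an exact-match diagnosis, else None."""
    return ICD10_EXAMPLES.get(diagnosis_text.strip().lower())

print(code_diagnosis("Acute myocardial infarction"))  # -> I21
```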

Mr Mustafa said while previous studies had compared human coders to AI in classifying medical documents, this study went a step further.

"Some clinical cases are easy to classify where previous machine learning models or normal mapping tools can already do well. But we wanted to look at cases where it's challenging for those mainstream tools to classify clinical documents into different disease categories," he said.

"We wanted to see in these challenging cases, where some information may be missing, or the record doesn't show enough information, how a large language model AI tool would compare to human coders."

Study co-author and JCU Electronics and Computer Engineering Professor Mostafa Rahimi Azghadi said the team also compared the performance of ChatGPT 3.5 with ChatGPT 4 during the study, finding the latter produced far more consistent disease classifications when repeatedly fed the same clinical documents.

"ChatGPT 4 was much more stable. 86 per cent to 89 per cent of the time, it gave the exact same disease prediction," Prof Azghadi said.

"It's a similar process to giving a clinical record to one doctor and asking them for a diagnosis and then going in tomorrow and asking them the same question."

Prof Azghadi said the model should be viewed as a tool that could complement human coding, particularly in reducing inconsistencies and improving efficiency.

"Currently, all of these documents need to be coded by humans. They sit down and look at a large body of text, which includes information about the patient, their hospital assessment, treatment and progress, and what medication has been used," he said.

"A hybrid approach could be to leverage a large language model's speed and ability to flag difficult cases and combine it with human oversight for scenarios where the classification is more difficult. This may enhance coding accuracy and streamline the process."

Prof Azghadi said the next step would be to add more 'explainability' to the model, so it could provide a more detailed justification for why it has classified a patient with a particular condition.

Dr Usman Naseem from Macquarie University's School of Computing was also involved in the study.
