New Study Shows LLM Tools GPT-3.5 and GPT-4 Do Very Well in Clinical Reasoning

A new study suggests that GPT-3.5 and GPT-4 can perform well at clinical reasoning, furthering the case for the use of LLMs in healthcare.

In a recent study published in npj Digital Medicine, researchers developed diagnostic reasoning prompts to investigate whether large language models (LLMs) could simulate clinical diagnostic reasoning. The study found that with the right prompts, GPT-3.5 and GPT-4 did quite well at the task.

In this study, one of the latest of its kind, researchers assessed the diagnostic reasoning of GPT-3.5 and GPT-4 on open-ended clinical questions, hypothesizing that the models would perform better with diagnostic reasoning prompts than with conventional chain-of-thought (CoT) prompting.

The team used the revised MedQA United States Medical Licensing Exam (USMLE) dataset and the New England Journal of Medicine (NEJM) case series to compare conventional CoT prompting with several diagnostic reasoning prompts modeled on the cognitive processes clinicians use: forming a differential diagnosis, analytical reasoning, Bayesian inference, and intuitive reasoning.
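To make the comparison concrete, here is a minimal Python sketch of how one clinical case might be paired with each prompting strategy. The instruction wording, the strategy keys, and the helper names (build_prompt, build_all_prompts) are illustrative assumptions, not the prompts or code used in the study.

```python
# Hypothetical instruction text for each strategy. These are NOT the
# study's actual prompts; they only illustrate the idea of swapping the
# reasoning instruction while keeping the clinical case fixed.
STRATEGY_INSTRUCTIONS = {
    "conventional_cot": (
        "Think step by step, then state the most likely diagnosis."
    ),
    "differential_diagnosis": (
        "List a differential diagnosis, weigh each candidate against the "
        "findings, then state the most likely diagnosis."
    ),
    "analytical": (
        "Reason from the underlying pathophysiology to explain the "
        "findings, then state the most likely diagnosis."
    ),
    "bayesian": (
        "Start from a prior probability for each candidate diagnosis, "
        "update it with each new finding, then state the most likely "
        "diagnosis."
    ),
    "intuitive": (
        "Use pattern recognition on the key features of the case, then "
        "state the most likely diagnosis."
    ),
}


def build_prompt(strategy: str, case_text: str) -> str:
    """Return the case description followed by one strategy's instruction."""
    if strategy not in STRATEGY_INSTRUCTIONS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return f"{case_text.strip()}\n\n{STRATEGY_INSTRUCTIONS[strategy]}"


def build_all_prompts(case_text: str) -> dict[str, str]:
    """Build one prompt per strategy for the same case."""
    return {name: build_prompt(name, case_text) for name in STRATEGY_INSTRUCTIONS}
```

Each resulting prompt could then be sent to a model and the free-text answers graded against the reference diagnosis, which is roughly the kind of per-strategy comparison the study reports.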

They investigated whether LLMs could mimic clinical reasoning skills using specialized prompts that combine clinical expertise with advanced prompting techniques. The study found that GPT-4 could be prompted to mimic the clinical reasoning of clinicians without compromising diagnostic accuracy. This matters because an interpretable rationale lets physicians assess whether an LLM response is likely to be correct, which makes the output more trustworthy for patient care. The approach can help overcome the black box limitations of LLMs, bringing them closer to safe and effective use in medicine.

GPT-3.5 answered 46% of assessment questions correctly with standard CoT prompting and 31% with zero-shot, non-chain-of-thought prompting. Among the prompts modeled on clinical diagnostic reasoning, GPT-3.5 performed best with intuitive reasoning (48%, versus 46% for standard CoT).

The study’s findings showed that GPT-3.5 and GPT-4 have improved reasoning abilities, but some diagnostic reasoning prompts were still less accurate than conventional CoT prompting. GPT-4 performed similarly with conventional CoT and intuitive reasoning prompts but worse with analytical and differential diagnosis prompts. Bayesian inference chain-of-thought prompting also performed worse than classical CoT.

Despite these limitations, the researchers concluded, “We find that GPT-4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy. This is significant because an LLM that can imitate clinical reasoning to provide an interpretable rationale offers physicians a means to evaluate whether an LLM’s response is likely correct and can be trusted for patient care. Prompting methods that use diagnostic reasoning have the potential to mitigate the ‘black box’ limitations of LLMs, bringing them one step closer to safe and effective use in medicine.”

You can read the entire study here: https://www.nature.com/articles/s41746-024-01010-1

How BigRio Helps Bring LLM and Advanced AI Solutions to Healthcare

At BigRio, we are excited to see this kind of real-world evidence that helps make the case for LLMs in medicine, albeit with proper safeguards and human oversight in place. When it comes to leveraging generative AI (GAI) and LLMs for healthcare, there are two primary approaches: building your own model or utilizing existing public models developed by big tech companies like OpenAI.

Of course, it is much easier to use an off-the-shelf LLM solution. However, while publicly available GAI/LLM solutions like ChatGPT have gained significant attention across various fields, including healthcare, they are general-purpose by design, which limits their scope and ability in specialized settings.

What if you could build an LLM for your healthcare organization’s unique targets and needs? You can, with BigRio’s help!

Creating a large language model from scratch requires extensive resources: the expertise of AI developers, data scientists, and an MLOps team, plus substantial computational power. It involves training the model on massive datasets, fine-tuning it through multiple iterations, and optimizing its performance. This process demands considerable time as well as high-performance hardware and storage systems. The good news is that the BigRio team can offer you all of the above and more!

BigRio has long been a facilitator and incubator in leveraging AI to improve healthcare delivery, originally in the fields of diagnostics and research. We have recently been focusing our efforts on supporting startups and developing our own solutions that use LLMs and GAI to improve those areas of healthcare as well as direct patient interactions and customer relationship management.

You can read much more about how AI is redefining healthcare delivery and drug discovery in my new book Quantum Care: A Deep Dive into AI for Health Delivery and Research. It’s a comprehensive look at how AI and machine learning are being used to improve healthcare delivery at every touchpoint.

Rohit Mahajan is a Managing Partner with BigRio. He has particular expertise in the development and design of innovative solutions for clients in Healthcare, Financial Services, Retail, Automotive, Manufacturing, and other industry segments.

BigRio is a technology consulting firm empowering data to drive innovation and advanced AI. We specialize in cutting-edge Big Data, Machine Learning, and Custom Software strategy, analysis, architecture, and implementation solutions. If you would like to benefit from our expertise in these areas or if you have further questions on the content of this article, please do not hesitate to contact us.
