Traditional diagnostic decision support systems outperform generative AI for diagnosing disease
- Date:
- May 29, 2025
- Source:
- Mass General Brigham
- Summary:
- Researchers compared their long-standing diagnostic decision support system, DXplain, with modern large language models such as ChatGPT and Gemini and found that DXplain performed slightly better. They say the findings suggest that combining DXplain with LLMs could enhance clinical diagnosis and improve both technologies.
For decades, medical professionals have used artificial intelligence (AI) to streamline diagnoses through what are called diagnostic decision support systems (DDSSs). Computer scientists at Massachusetts General Hospital (MGH), a founding member of the Mass General Brigham healthcare system, first developed MGH's own DDSS, called DXplain, in 1984. The system relies on thousands of disease profiles, clinical findings, and data points to generate and rank potential diagnoses for use by clinicians. With the popularization and increased accessibility of generative AI and large language models (LLMs) in medicine, investigators at MGH's Laboratory of Computer Science (LCS) sought to compare the diagnostic capabilities of DXplain, which has evolved over the past four decades, to those of popular LLMs.
Their new research compares ChatGPT, Gemini, and DXplain at diagnosing patient cases, revealing that DXplain performed somewhat better, but the LLMs also performed well. The investigators envision pairing DXplain with an LLM as the optimal way forward, as it would improve both systems and enhance their clinical efficacy. The results are published in JAMA Network Open.
"Amid all the interest in large language models, it's easy to forget that the first AI systems used successfully in medicine were expert systems like DXplain," said co-author Edward Hoffer, MD, of the LCS at MGH.
"These systems can enhance and expand clinicians' diagnoses, recalling information that physicians may forget in the heat of the moment and isn't biased by common flaws in human reasoning. And now, we think combining the powerful explanatory capabilities of existing diagnostic systems with the linguistic capabilities of large language models will enable better automated diagnostic decision support and patient outcomes," said corresponding author Mitchell Feldman, MD, also of MGH's LCS.
The investigators tested the diagnostic capabilities of DXplain, ChatGPT, and Gemini using 36 patient cases spanning racial, ethnic, age, and gender categories. For each case, the systems had a chance to suggest potential diagnoses both with and without lab data. With lab data, all three systems listed the correct diagnosis most of the time: 72% for DXplain, 64% for ChatGPT, and 58% for Gemini. Without lab data, DXplain listed the correct diagnosis 56% of the time, outperforming ChatGPT (42%) and Gemini (39%), though the differences were not statistically significant.
The researchers observed that the DDSS and the LLMs each caught certain diseases the others missed, suggesting there may be promise in combining the approaches. Preliminary work building on these findings reveals that LLMs could be used to pull clinical findings from narrative text, which could then be plugged into DDSSs -- in turn synergistically improving both systems and their diagnostic conclusions.
Story Source:
Materials provided by Mass General Brigham. Note: Content may be edited for style and length.
Journal Reference:
- Mitchell J. Feldman, Edward P. Hoffer, Jared J. Conley, Jaime Chang, Jeanhee A. Chung, Michael C. Jernigan, William T. Lester, Zachary H. Strasser, Henry C. Chueh. Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses. JAMA Network Open, 2025; 8 (5): e2512994 DOI: 10.1001/jamanetworkopen.2025.12994