Can AI Support Oncology Workflows? Study Finds LLMs Outperformed Physicians in Several Reasoning Tasks
A large language model (LLM) outperformed physicians in multiple clinical tasks, including emergency room (ER) decision-making, diagnosis identification, and selection of next steps in patient management, according to a study published in Science. The findings raise new questions about how AI could eventually support complex areas of care, such as oncology. The investigators cautioned, however, that the findings do not suggest AI systems are ready to practice medicine independently or replace physicians in the diagnostic process.
The study evaluated the diagnostic and management reasoning capabilities of OpenAI’s o1 series model across six experiments. Researchers tested the model using The New England Journal of Medicine (NEJM) clinicopathologic conferences, NEJM Healer diagnostic cases, Grey Matters management cases, landmark diagnostic cases, diagnostic probabilistic reasoning cases, and real-world ER cases using AI for second opinions.
“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” co-senior study author Arjun K. Manrai, PhD, of Harvard Medical School, Boston, commented in an institutional press release.
Among the findings with potential relevance to oncology workflows, the o1-preview model included the correct diagnosis in its differential in 78.3% (95% confidence interval [CI] = 70.7%–84.8%) of NEJM clinicopathologic conference cases and included either the correct or a potentially helpful or very close diagnosis in 97.9% (95% CI = 94.0%–99.6%) of cases. On five Grey Matters management reasoning cases, the model achieved a median score of 89%, compared with 42% for OpenAI’s GPT-4, 41% for physicians with access to GPT-4, and 34% for physicians using conventional resources.
Additional Findings
The investigators also evaluated the o1 series model using 20 NEJM Healer clinical reasoning cases designed to assess clinical reasoning documentation using Revised-IDEA (R-IDEA), a validated 10-point scale for evaluating core domains of clinical reasoning. In 78 of 80 cases, o1-preview achieved a perfect R-IDEA score, compared with 47 of 80 cases for GPT-4, 28 of 80 cases for attending physicians, and 16 of 72 cases for resident physicians (all P < .0001).
In a separate analysis of six landmark diagnostic cases previously used to compare GPT-4 with 50 generalist physicians, the median score for o1-preview was 97%, compared with historical control data showing scores of 92% for GPT-4, 76% for physicians with access to GPT-4, and 74% for physicians using conventional resources.
Diagnostic probabilistic reasoning was additionally evaluated using five cases on primary care topics previously administered to 553 medical practitioners, including resident physicians, attending physicians, nurse practitioners, and physician assistants. The investigators reported that clinicians demonstrated substantially wider variability in probability estimates than both GPT-4 and o1-preview.
The study also included 76 ER cases from Beth Israel Deaconess Medical Center that were presented to the o1 model directly from the electronic health record without preprocessing for differential diagnoses. Across three diagnostic touchpoints (initial ER triage, ER physician evaluation, and admission to the medical floor or intensive care unit), the o1 model performed either nominally better than or on par with OpenAI’s GPT-4o and two attending physicians. During initial ER triage, the model identified the exact or a very close diagnosis in 67.1% of cases, compared with 55.3% and 50.0% for the two physicians. During the ER physician encounter, the o1 model identified the exact or a very close diagnosis in 72.4% of cases, compared with 61.8% and 52.6% for the two physicians. Upon admission to the medical floor or intensive care unit, the AI model achieved 81.6% accuracy, compared with 78.9% and 69.7% for physicians 1 and 2, respectively.
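To give a sense of why differences of this size across 76 cases are described as nominal, the short sketch below converts the reported triage percentages back to case counts and attaches a 95% Wilson score interval to each. This is an illustration only, not the study's analysis: the counts are inferred from the rounded percentages, the choice of the Wilson interval is an assumption, and the function names are hypothetical.

```python
import math

N_CASES = 76  # number of ER cases reported in the article

# Percentages reported for the initial ER triage touchpoint.
TRIAGE_PCT = {"o1": 67.1, "physician_1": 55.3, "physician_2": 50.0}


def implied_count(pct, n):
    """Integer count whose share of n rounds to pct (inferred, not reported data)."""
    return round(pct / 100 * n)


def wilson_interval(successes, n, z=1.96):
    """Two-sided Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for name, pct in TRIAGE_PCT.items():
    k = implied_count(pct, N_CASES)
    lo, hi = wilson_interval(k, N_CASES)
    print(f"{name}: {k}/{N_CASES} = {k / N_CASES:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

Under these assumptions, the interval for the model overlaps the intervals for both physicians at triage. Overlapping intervals are not a formal test, and because the model and the physicians were evaluated on the same cases, a paired analysis would be the more appropriate comparison.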
“In all experiments, the LLM outperformed physician baselines and displayed continued improvement from prior generations of AI clinical decision support,” the investigators concluded. “Our findings suggest that LLMs have now eclipsed most benchmarks of clinical reasoning, motivating the urgent need for human–computer interaction studies and prospective clinical trials to rigorously assess the potential of AI systems to improve clinical practice and patient outcomes.”
However, co-first author Peter G. Brodeur, MD, MA, of Beth Israel Deaconess Medical Center, Boston, stressed: “A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm. Humans should be the ultimate baseline when it comes to evaluating performance and safety.”
DISCLOSURES: The study was funded by the National Institutes of Health, the Harvard Medical School Dean’s Innovation Award for the Use of Artificial Intelligence, the Macy Foundation, the Moore Foundation, the Stanford Bio-X Interdisciplinary Initiatives Seed Grants Program, and a Stanford RAISE Health Seed Grant 2024. For disclosures of the study authors and code availability, visit science.org.
ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO.