
Study Finding That Foundation Models Outperform Clinical Tools on Medical Benchmarks Sparks Controversy

Findings from a head-to-head comparison of clinical AI tools and general-purpose large language models on medical knowledge questions, real-world clinical queries, and other benchmarks have stirred mixed reactions among clinicians who use AI in practice.

June 24, 2026 Lisa Astor 10 min read

Researchers conducted a comparative evaluation of clinical AI tools and general-purpose large language models (LLMs) to determine which performed better across three sets of medical benchmarks. The findings, published in Nature Medicine by lead study author Krithik Vishwanath of NYU Langone Health and The University of Texas at Austin, and colleagues at NYU Langone, have prompted clinicians who use AI in daily practice to reconsider which tools are best suited to specific tasks—and whether some models are advancing more rapidly than others.

Methods

The researchers compared the AI tools in three stages. First, they assessed medical knowledge using 500 multiple-choice questions similar to those on the U.S. Medical Licensing Examination. Second, they evaluated agreement with expert clinicians on 500 free-response items from HealthBench. Third, they assessed performance on 100 real-world clinical queries generated during routine clinical use of LLMs. Responses in the third stage underwent randomized, blinded review by 12 clinicians.

The clinical models OpenEvidence and UpToDate Expert AI were compared with leading general-purpose models from OpenAI, Google, and Anthropic, including GPT-5.2, Gemini 3.1 Pro Preview, and Claude Opus 4.6. Google Search’s AI Overview was also included as a real-world control.

The HealthBench responses were evaluated by a panel of LLM judges (GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6) to mitigate bias associated with any single model. Responses were assessed across seven thematic categories: emergency referrals, context seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, and response depth. They were also graded on five performance axes: accuracy, completeness, communication quality, context awareness, and adherence to instructions. An example of a HealthBench grading prompt is provided in the supplementary material.

Clinical queries were drawn from anonymous prompts submitted to NYU Langone Health’s instance of ChatGPT. Twelve blinded clinicians evaluated the models’ responses for clinical correctness, completeness, safety and harm avoidance, and clarity, with each response independently scored by three randomly assigned reviewers. AI Overview was included in this phase of the evaluation because clinicians often encounter it in practice, according to the study authors. Responses were scored on a scale of 1 to 4; an example is provided in the supplementary material.
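The review design described above can be sketched in a few lines of Python. This is a minimal illustration only: the article does not describe how reviewers were randomized or how the four dimension scores were combined, so the uniform random draw, the simple averaging, the names `assign_reviewers` and `aggregate_score`, and the example ratings are all assumptions made for illustration, not the study's method or data.

```python
import random
from statistics import mean

# The four dimensions named in the article; the short labels are our own.
DIMENSIONS = ("correctness", "completeness", "safety", "clarity")


def assign_reviewers(response_ids, reviewer_ids, per_response=3, seed=0):
    """Assign `per_response` distinct reviewers to each response at random.

    Assumption: a uniform draw without replacement; the study's actual
    randomization procedure is not described in the article.
    """
    rng = random.Random(seed)
    return {rid: rng.sample(reviewer_ids, per_response) for rid in response_ids}


def aggregate_score(ratings):
    """Average each reviewer's four 1-to-4 scores, then average across reviewers.

    Assumption: equal weighting of dimensions and reviewers.
    """
    for rating in ratings:
        for dim in DIMENSIONS:
            if not 1 <= rating[dim] <= 4:
                raise ValueError(f"{dim} score must be between 1 and 4")
    return mean(mean(rating[dim] for dim in DIMENSIONS) for rating in ratings)


# Twelve hypothetical reviewers and two hypothetical response identifiers.
reviewers = [f"clinician_{i}" for i in range(1, 13)]
assignments = assign_reviewers(["query1_modelA", "query1_modelB"], reviewers)

# Made-up ratings for one response, for illustration only.
example_ratings = [
    {"correctness": 4, "completeness": 3, "safety": 4, "clarity": 3},
    {"correctness": 3, "completeness": 3, "safety": 4, "clarity": 3},
    {"correctness": 4, "completeness": 4, "safety": 4, "clarity": 2},
]
score = aggregate_score(example_ratings)
print(assignments)
print(round(score, 2))
```

Blinding would be handled separately, for example by showing reviewers only an opaque response identifier rather than the name of the model that produced the answer.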

Findings

The general-purpose LLMs outperformed both clinical AI tools across all three stages of the evaluation.

On the medical examination–style questions, Gemini achieved the highest score at 97.4%, followed by ChatGPT (94.2%), Claude (90.2%), OpenEvidence (89.6%), and UpToDate (88.4%).

On the HealthBench evaluation, ChatGPT achieved the highest overall score at 88.0, followed by Gemini (79.3), Claude (77.0), OpenEvidence (62.6), and UpToDate (61.3).

Across the seven thematic categories, ChatGPT either ranked first or tied for first in every category, whereas OpenEvidence and UpToDate AI ranked last or tied for last throughout. Compared with ChatGPT, differences in performance were statistically significant across all categories (P ≤ .04), except for responding under uncertainty (P = 1.0).

On the real-world clinical queries benchmark, Gemini achieved the highest mean aggregate score at 3.62, followed by ChatGPT (3.54), Claude (3.52), Google Search’s AI Overview (3.27), OpenEvidence (3.24), and UpToDate AI (3.17).

The study authors noted that the “frontier models outperformed clinical tools on most individual questions, not just on average.”

After adjusting for differences in rater leniency, the clinical AI tools had 49% to 87% lower odds than Gemini of receiving a higher rating (odds ratio = 0.13–0.51).
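The percentages quoted above follow directly from the reported odds ratios: an odds ratio below 1 corresponds to a reduction in odds of (1 − odds ratio) × 100%. The short Python sketch below shows only that arithmetic. It does not reproduce the authors' leniency-adjusted model, and the function name `percent_lower_odds` is hypothetical.

```python
def percent_lower_odds(odds_ratio: float) -> float:
    """Return the percent reduction in odds implied by an odds ratio in (0, 1]."""
    if not 0 < odds_ratio <= 1:
        raise ValueError("expected an odds ratio greater than 0 and at most 1")
    return (1.0 - odds_ratio) * 100.0


# Odds-ratio range as reported in the article.
reported_range = (0.13, 0.51)
reductions = [round(percent_lower_odds(ratio)) for ratio in reported_range]
print(reductions)  # [87, 49]
```

Note that a reduction in odds is not the same as an equal reduction in probability, so these figures describe how the odds of a higher rating compare with Gemini's, not how often each tool was rated higher.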

Across the four evaluation dimensions in the real-world clinical queries assessment, Google’s AI Overview performed comparably to, or better than, both clinical AI tools. The greatest differences between models were observed in clarity, whereas clinical correctness showed the least variation.

“OpenEvidence scored lowest on clarity (mean = 2.84), suggesting its weakness was communication, not knowledge,” the study authors noted. They also reported that OpenEvidence generated more low-scoring responses than other models because of incomplete clinical content, safety-critical omissions, and answers that were disorganized or hard to follow.

The authors also noted that UpToDate AI declined to answer more queries than any other model, refusing 19% of prompts compared with 1% to 6% for the other systems.

Harmful responses were identified in 3% of Claude outputs, 2.5% of UpToDate outputs, and 1.0% of OpenEvidence outputs. No harmful responses were observed from the remaining models.

Hallucinations were reported in approximately 1% of responses from Gemini, OpenEvidence, and Google’s AI Overview, but not from the other models.

Discussion and Limitations

The study authors offered several possible explanations for the performance gap between the clinical and general-purpose models:

“Frontier LLMs may simply be better at the knowledge retrieval and reasoning that characterize most medical questions. They also benefit from faster iteration cycles, larger training corpora and greater alignment than specialist systems. The observed advantages of frontier general-purpose models may reflect the accelerated development and investment in these systems.”

“Our results should therefore be interpreted as a snapshot of a rapidly evolving landscape rather than a permanent ordering of approaches,” Vishwanath et al wrote. “In particular, deeply subspecialized medical tasks may favor more sophisticated, domain-specific adaptation.”

The authors noted that because the clinical AI tools do not offer public application programming interfaces (APIs), they had to query them through browser interfaces. This approach may have limited the sample size and introduced factors such as hidden prompts, retrieval behavior, and output formatting.

The authors also acknowledged that the models may have been exposed during training to U.S. Medical Licensing Examination–type questions and/or HealthBench items, but not to the real-world clinical queries used in the study.

They noted further that limited public information is available about HealthBench, an OpenAI–developed benchmark, including details of its development and evaluation methodology. As a result, the benchmark may have favored ChatGPT in scoring.

“Accordingly, we view the blinded clinician evaluation on the real clinical queries benchmark as the primary evidence in this study, while HealthBench should be interpreted as supplementary,” the authors wrote.

The frontier models also served as judges during the first two evaluation stages, which may have influenced how other models were scored. In addition, the study did not assess response latency or citation quality.

The authors emphasized the need for independent evaluation frameworks to minimize bias, and they argued that existing benchmarks may not adequately capture a model’s utility in real-world clinical settings.

They concluded that “[t]he path forward may ultimately lie with hospital-specific LLMs that leverage institutional data to mitigate external harm, along with careful use of frontier models for less-sensitive tasks.”

Responses and Insights

Following publication of the article, OpenEvidence posted a response on LinkedIn alleging that the Nature Medicine paper contained “undisclosed conflicts of interest and irredeemable methodological flaws.” The company also stated that it had previously declined a request to develop an API for an in-house medical AI system at NYU.

OpenEvidence also argued that the study relied on widely used benchmark questions for which the models may already have been exposed to both the questions and answers during training. The company further criticized HealthBench for scoring responses using what it described as arbitrary metrics and noted that neither the data set used for the real-world clinical queries evaluation nor the methodology used to create it was made publicly available.

The message concluded by calling for a “good faith” approach to evaluating both general-purpose and clinical AI systems—one that reflects real-world use and measures meaningful clinical impact.

“We should be careful not to overread the headline. This study does not show that general-purpose LLMs are categorically superior to specialized clinical AI tools. It shows that, in this benchmark, a small set of frontier models scored higher than two clinical products on selected tasks,” explained Sheila Bond, MD, Director of Clinical Content & AI Strategy at Wolters Kluwer Health in a LinkedIn post. “Clinical AI tools are not designed to win benchmarks. Their value depends on grounding in trusted medical content, governance, auditability, and reliable use in clinical environments.

“Medicine has learned this lesson before: pleasing and plausible answers are not enough. What matters is disciplined reasoning grounded in current evidence, with clear limits and known failure modes. The authors appropriately call their findings a ‘snapshot of a rapidly evolving landscape.’ That’s the right frame.

“The question for healthcare is not: Which model gives the highest-rated answer? It is: Which system can reliably preserve, deliver, and govern trusted medical expertise at the point of care? That is the harder question, and one of the most important for healthcare today.”

In a separate LinkedIn post, Jonathan H. Chen, MD, PhD, an associate professor and Director for Medical Education in Artificial Intelligence at Stanford University School of Medicine, highlighted a key question clinicians face when choosing among AI tools: “Do you want something grounded in references you can trust but verify, or something that just empirically gets more answers right than wrong (even if unverifiable and maybe the questions tested are really the kind you care about in practice)?”

“Understandably, there has been a lot of excitement and controversy surrounding these results,” Adam Rodman, MD, MPH, FACP, Assistant Professor of Medicine and Assistant Professor of Biomedical Informatics at Harvard Medical School and Director of AI Programs at the Shapiro Center for Research and Education, Beth Israel Deaconess Medical Center, told ASCO AI in Oncology.

“I think this study tells us something important, but there are a lot of questions that are still unanswered.

“Base language models are really good. I don't think any AI researchers are that surprised that the foundation models have really good outputs; all of the labs have been consistently improving their performance on medical tasks for years now.

“Reference quality was not evaluated. Both OpenEvidence and UpToDate really make their value proposition on their grounding. OpenEvidence has an evidence look up and has signed content deals with a bunch of journals and medical societies; UpToDate grounds everything in their expert database. The foundation models are almost certainly not grounding their responses as well. This study didn't evaluate reference quality, so if you're an oncologist who relies on OpenEvidence or UpToDate to query guidelines or the latest literature, this doesn't tell us a lot about that.

“Do doctors use the NYU secure GPT the same way they use OpenEvidence? I think this is an open question: because of HIPAA, we can't see the entire data set that was used (in fact, we can only see a single input). It's a strength that they use real inputs, but it's also fully possible that people use OpenEvidence/UpToDate in different ways than they use a secure ChatGPT instance.

“The evaluation is meaningful. There has been some, in my opinion, unwarranted pushback on the methods that the paper used. Their methods are in line with other papers on this subject. All studies have limitations, and this isn't how I personally would have run this evaluation, but I find the analysis convincing.

“This is hardly the last word on this subject, and hopefully this will spur more meaningful study (and benchmarks) of the performance of these systems. One of the challenges of doing these kinds of studies is that while the big labs make their systems available for evaluation, OpenEvidence and UpToDate (and all the other big players in this space) don't make APIs easily available to researchers so they can benchmark their systems. If anything, I hope this paper and the debate around it spurs physicians, regulators, and our societies and boards to push for independent standard setting and evaluation standards so physicians and our patients can truly understand how the different systems perform,” Dr. Rodman concluded.

DISCLOSURES: This work was supported by the Institute for Information & Communications Technology Planning and Evaluation (IITP) grant funded by the Ministry of Science and ICT (MSIT) of the Republic of Korea government. Dr. Oermann reported equity in MarchAI and Artisight, spousal employment by Eikon Therapeutics, and consulting for Sofinnova Partners and Google. The remaining authors declared no competing interests. For code and data availability from the study, visit nature.com.

ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO. For more information, see Policies.
