
Study Finds Differences in LLM Accuracy for Answering Patients' Clinical Trial Questions

March 23, 2026, by Julia Cipriano, MS, CMPP

Large language models (LLMs) showed differences in their ability to answer patient questions about clinical trials, with a leading proprietary model demonstrating superior reliability and safety over a widely used open-source counterpart in a direct comparative analysis published in JCO Oncology Advances.

According to lead author Jack Gallifant, MBBS, of the Artificial Intelligence in Medicine (AIM) Program at Mass General Brigham, Harvard Medical School, Boston, and colleagues (including ASCO AI in Oncology editorial advisory board member Danielle S. Bitterman, MD), the proprietary model (OpenAI GPT-4o-2024-08-06 [GPT-4o]) showed no instances of information fabrication in this evaluation, whereas the open-source model (Meta Llama-3.2-8B) produced fabricated claims in 14.5% of responses. One “particularly illustrative error” involved a response that incorrectly described principles from the Declaration of Helsinki; the investigators noted that “plausible-sounding misinformation, especially concerning ethical and regulatory standards, poses a significant safety concern.”

“These findings highlight the critical need for rigorous, comparative evaluation of LLMs before their deployment in patient-facing applications,” they commented. “To ensure patient safety, health-care systems should pair high-performing models with structured safety guardrails and continuous monitoring.”

To assess performance, the investigators constructed a data set of 349 unique common patient queries about clinical trial information, derived from 23 authoritative oncology and regulatory sources, including materials from the National Cancer Institute (NCI), the U.S. Food and Drug Administration (FDA), and NCI-designated cancer centers. A representative subset of these queries was posed to Llama-3.2-8B and GPT-4o with consistent prompts and fixed generation parameters to enable direct comparison across models.

A total of 374 responses (188 from GPT-4o; 186 from Llama-3.2-8B) were evaluated by two physicians blinded to model identity, using a structured rubric derived from the QUEST framework. The rubric assesses multiple dimensions, including accuracy, relevance, comprehensiveness, usefulness, clarity, potential for bias or harm, transparency, and trustworthiness.

More Performance Data

Both models showed high overall agreement with established facts and theories, with GPT-4o achieving 100% concordance in the Agreement category compared with 97.8% for Llama-3.2-8B. “These findings suggest LLMs’ ability to provide concordant responses to patient questions consistently,” the investigators wrote in their published report.

Differences were observed across other quality domains: GPT-4o responses were predominantly rated as “Agree” or “Strongly Agree” for Clarity and Usefulness, whereas Llama-3.2-8B responses more often received “Neutral” or “Disagree” ratings, the investigators reported. GPT-4o also demonstrated stronger performance in the Self-awareness category, they continued, with its responses acknowledging uncertainty and recommending consultation with the trial team.

Insights and Opportunities

In a related editorial published in JCO Oncology Advances, Peter P. Yu, MD, FACP, FASCO, and Tony K.W. Hung, MS, MD, MBA, FACP, of Hartford HealthCare Cancer Institute, Connecticut, compared LLMs to broadly applicable Swiss Army knives, writing, “just as a multitool cannot replace a surgical instrument, the value of LLMs depends on how precisely they are applied, the sources of truth they rely on, and the systems that govern their use.”

The editorial authors acknowledged both strengths and limitations of the study. It used a transparent, reproducible design with blinded evaluation and authentic patient inquiries, supporting the reliability and real-world relevance of the findings. Limitations included the use of single-turn questions; reliance on publicly derived queries that did not reflect the range of literacy levels, languages, or emotional states encountered in practice; and evaluation of a comparatively small open-source model.

Within this context, the editorial authors noted that model choice influenced the likelihood of fabricated or misleading content, emphasizing that such errors “are not benign” when explaining core trial concepts. They also wrote that “accuracy is a model attribute, but safety is a system attribute,” highlighting that a key question is not only which model performs better in isolation, but also how it is embedded within a system: which sources it may use, how responses are constrained, and how its performance is monitored over time. In addition, they noted that risk varies by task, with more complex applications involving individualized recommendations requiring closer oversight.

Drs. Yu and Hung concluded, “the study offers a careful evaluation of two language models in a clearly defined scenario and provides data that inform how patient-facing AI might be used cautiously in clinical research. It shows that differences in model architecture and training can influence the accuracy of responses and that unguarded deployment may lead to fabrication.”

“With careful governance and thoughtful design, LLMs can be refined into instruments that support informed participation in research rather than blunt tools that erode trust,” they added. “Responsible use will require attention not only to model capabilities but also to the systems—technical, clinical, and ethical—in which they operate.”

According to the study investigators, future research priorities include developing and validating publicly accessible web applications leveraging LLMs for trial information dissemination, along with establishing systems for prospective, real-world monitoring of LLM-generated content to assess and mitigate potential harm over time.

DISCLOSURES: Research was supported by the National Institutes of Health/National Cancer Institute, the American Cancer Society, the American Society for Radiation Oncology, and the Patient-Centered Outcomes Research Institute. For full disclosures of both the study and editorial authors, as well as funding and data sharing information, visit ascopubs.org.

ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO. For more information, see Policies.

