Consumer-Facing LLM ChatGPT Health Falls Short in Several High-Risk Triage Scenarios
Researchers conducted a structured stress test of ChatGPT Health’s triage recommendations and identified dangerous failures by the large language model (LLM) across multiple urgent medical situations. The model undertriaged 52% of gold-standard emergencies, including diabetic ketoacidosis and impending respiratory failure, while correctly triaging classic presentations such as stroke and anaphylaxis.
These findings, published in Nature Medicine, raise safety concerns about using LLMs and other AI tools as consumer-facing triage systems without clinician oversight.
Background
On January 7, 2026, OpenAI launched ChatGPT Health, a large language model designed to provide health and wellness information and process medical data for consumers. In announcing the launch, OpenAI noted that the model—trained in collaboration with more than 260 physicians over 2 years and evaluated using HealthBench benchmarks—can help patients determine how urgently to seek clinician follow-up and interpret lab results and care instructions.
Consumers often use such models as triage tools before seeking physician consultation. Some studies have shown that consumers seeking medical advice from AI systems may trust AI-generated recommendations—even over physician advice—despite the potential for harmful guidance. Additionally, these models may exhibit biases that lead to less urgent responses for certain patients.
Study Methods
The researchers analyzed 960 prompt-response pairs generated from 60 clinician-authored vignettes designed to stress-test ChatGPT Health’s triage recommendations across 21 medical domains.
Each vignette was based on one of 30 scenarios and was written in two formats: one presenting only subjective information, such as symptoms and medical history, and another including objective findings such as laboratory values and exam results. The vignettes were then tested to assess how anchoring, race, access barriers, and gender influenced the LLM’s recommendations. The reference condition was a White male without anchoring or access barriers.
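One plausible reading of these counts is a fully crossed design: 30 scenarios × 2 formats = 60 vignettes, each expanded across four binary variables into 16 variants, for 60 × 16 = 960 prompt-response pairs. The sketch below illustrates that expansion; the variable levels and function names are illustrative assumptions, not the authors’ code.

```python
from itertools import product

# Illustrative variable levels (assumed; the paper's exact wording may differ).
ANCHORING = [False, True]        # e.g., "I'm sure it's nothing serious"
RACE = ["White", "Black"]
ACCESS_BARRIER = [False, True]   # e.g., no insurance or transportation
GENDER = ["male", "female"]

def build_variants(vignette_text: str) -> list[dict]:
    """Expand one vignette into its 2 x 2 x 2 x 2 = 16 prompt variants."""
    variants = []
    for anchor, race, barrier, gender in product(ANCHORING, RACE, ACCESS_BARRIER, GENDER):
        variants.append({
            "prompt": vignette_text,  # demographic and anchoring edits would be applied here
            "anchoring": anchor,
            "race": race,
            "access_barrier": barrier,
            "gender": gender,
            # Reference condition per the paper: White male, no anchoring, no barriers.
            "is_reference": not anchor and race == "White" and not barrier and gender == "male",
        })
    return variants

# 60 vignettes x 16 variants each = 960 prompt-response pairs, matching the study's count.
```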
Three physicians independently determined gold-standard triage levels (A–D), ranging from nonurgent to most emergent, with strong inter-rater agreement. For half of the cases, only one triage level was considered correct (“clear”), whereas two adjacent triage levels were considered clinically reasonable for the remaining cases (“edge”).
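Under this scheme, grading a model response reduces to checking it against the acceptable level set, as in the minimal sketch below, which assumes the A–D levels are ordered from nonurgent to most emergent; the function is hypothetical, not drawn from the study’s code.

```python
LEVELS = "ABCD"  # A = nonurgent ... D = most emergent (assumed ordering)

def grade(response: str, acceptable: set[str]) -> str:
    """Score a model triage level against the physician gold standard.
    Clear cases pass a single level; edge cases pass two adjacent levels."""
    if response in acceptable:
        return "correct"
    least_acceptable = min(LEVELS.index(level) for level in acceptable)
    if LEVELS.index(response) < least_acceptable:
        return "undertriaged"  # less urgent than any acceptable level
    return "overtriaged"       # more urgent than any acceptable level

print(grade("B", {"C", "D"}))  # -> "undertriaged"
```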
Results
Clear cases included 8 nonurgent, 8 semiurgent, 10 urgent, and 4 emergency vignettes. The accuracy of ChatGPT Health’s triage responses was highest for semiurgent cases (93%), followed by urgent cases (76.9%). At the extremes, accuracy was lower: 35.2% for nonurgent cases and 48.4% for true emergencies. Accordingly, 51.6% of emergencies were undertriaged and 64.8% of nonurgent cases were overtriaged, although none of the overtriaged nonurgent cases were escalated to an emergency department visit.
Undertriage of asthma exacerbation accounted for 84.8% of undertriaged emergency cases. The study authors noted that the model correctly identified the warning signs of asthma exacerbation but subsequently dismissed their urgency. In emergencies involving diabetic ketoacidosis, the model likewise correctly identified the diagnosis but recommended outpatient management rather than emergency department care.
By contrast, classic emergencies—including stroke, anaphylaxis, meningitis, and aortic dissection—were all correctly triaged to the appropriate level. This finding “suggest[s] the model identifies classic presentations but fails when emergency status depends on clinical progression,” the study authors, led by corresponding co-author Ashwin Ramaswamy, MD, MPP, Senior Clinical Associate, The Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai and Mount Sinai Health System, noted.
Ninety-six percent of edge-case responses fell within acceptable triage ranges, although the less urgent acceptable option was chosen in 60.8% of cases. When acceptable options included either urgent or emergency care, the system chose urgent care over emergency department referral 72.7% of the time.
Of the tested variables, anchoring statements had the strongest effect on the system’s triage behavior. In edge cases, anchoring shifted triage recommendations 13.3% of the time, compared with 3.3% without anchoring (odds ratio [OR] = 11.7, 95% confidence interval [CI] = 3.7–36.6; Holm-adjusted P < .001). Just over half of all shifts (52.5%) moved toward less urgent care, but 93.8% remained within acceptable triage ranges.
Other variables did not have a statistically significant effect on the LLM’s recommendations. Notably, vignettes describing Black patients were undertriaged in 17% of cases, compared with 14.3% for White patients (OR = 1.96, 95% CI = 0.51–7.53; Holm-adjusted P = 1).
When reports and other objective findings were added to the prompts, triage accuracy increased from 54.6% to 77.9% (OR = 9.4, 95% CI = 4.9–18.0; P < .001). Objective findings prevented overtriage in 95.3% of nonurgent cases (OR = 37.5, 95% CI = 10.4–207; P < .001), but undertriage of emergency cases increased to 56.2% (OR = 0.69, 95% CI = 0.23–2.05; P = .62).
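For readers unfamiliar with the statistic, an odds ratio with a 95% Wald confidence interval can be computed from a 2 × 2 table as sketched below. The paper’s ORs appear to come from fitted models with Holm adjustment, so this raw formula will not reproduce them exactly; it only illustrates the quantity being reported.

```python
import math

def odds_ratio_wald(a: int, b: int, c: int, d: int) -> tuple[float, float, float]:
    """OR and 95% Wald CI for a 2x2 table:
         exposed:   a events, b non-events
         unexposed: c events, d non-events
    """
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(or_) - 1.96 * se_log_or)
    upper = math.exp(math.log(or_) + 1.96 * se_log_or)
    return or_, lower, upper
```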
Suicidal ideation was included among the vignettes and revealed a critical safety failure of the model. Crisis management was recommended only when objective findings were omitted from the prompt, not when they were included. Further testing with an additional 14 vignettes showed that crisis-management recommendations were inconsistent across prompts. The authors noted that safety guardrails appeared more reliably in prompts in which the patient did not identify a means of self-harm than in those in which a method was specified.
“The crisis guardrail finding may be the most consequential failure mode exhibited in the entire study,” the study authors wrote. “What we found was worse than simple suppression. Trust calibration requires predictable system behavior: when reliability is inconsistent, users cannot learn when to rely on the system and when to override it.”
“The implication is straightforward: consumer facing AI that functions as a front door for urgent medical decisions should not be deployed on trust alone,” the study authors concluded. “Our findings identify two engineering targets requiring immediate attention: emergency detection that accounts for clinical trajectory, not just snapshot presentation, and crisis guardrails that fire consistently rather than unpredictably. Given the direct patient-safety implications of missed emergencies, consumer health AI may warrant premarket safety evaluation requirements analogous to medical devices. At a minimum, these tools should demonstrate external safety for emergencies before widespread public deployment.”
DISCLOSURES: The authors reported no conflicts of interest. For access to the data and code used in the study, visit nature.com.
Editor’s Note: Karan Singhal, Head of Health AI at OpenAI, responded to the Nature Medicine paper in a post on LinkedIn.
AI in Practice: Have any of your patients mentioned using ChatGPT Health before, or come to you with AI-generated medical advice? Tell ASCO AI in Oncology about your experiences with consumer-facing health LLMs.
ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO. For more information, see Policies.