Are AI Tools in Pathology Learning True Biomarker Signals or Statistical Shortcuts?
Tools designed to detect molecular biomarker status from histologic images using AI may depend more on correlations with clinicopathologic features than on causal signals of the biomarker itself. According to Dawood et al, in a paper published in Nature Biomedical Engineering, this means the tools learn “shortcuts” to predict a biomarker rather than learning its underlying biology, making them potentially unreliable in patient care.
“This study highlights a critical point about the rollout of AI in medicine,” stated study author Nasir Rajpoot, PhD, Professor of Computational Pathology and the Founding Director of the Tissue Image Analytics Centre at the University of Warwick as well as the Chief Executive Officer of Warwick spin-out Histofy. “To deliver real and lasting impact, the value of AI-based clinically important predictions must be judged through rigorous, bias-aware evaluation, rather than relying solely on headline accuracies that fail to account for confounding effects.”
Study and Model Methods
The researchers analyzed 8,221 tissue samples from patients with breast, colorectal, lung, and endometrial cancers from six cohorts, including two independent validation cohorts. They gathered biomarker and gene mutation status information from cBioPortal and whole-slide images from The Cancer Genome Atlas and The Cancer Imaging Archive.
For breast cancer, they focused on hormone receptor and HER2 status, and for colorectal cancer cases, they collected information on microsatellite instability, hypermutation, chromosomal instability, and CpG island methylator phenotype pathway. Tumor mutational burden information was gathered for all cancer types and cohorts.
Then, the interdependency between these biomarkers and mutations was calculated using log odds ratios and two-sided Fisher’s exact tests.
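The interdependency calculation described above can be sketched as follows for a pair of binary biomarkers, using a 2×2 contingency table, a log odds ratio, and a two-sided Fisher’s exact test. The counts below are illustrative stand-ins, not study data.

```python
# Hypothetical sketch of quantifying interdependency between two binary
# biomarkers (e.g., a mutation and MSI-high status). Counts are made up.
import math
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = biomarker A (+/-), cols = biomarker B (+/-)
table = [[40, 10],    # A+: 40 cases also B+, 10 cases B-
         [15, 135]]   # A-: 15 cases B+, 135 cases B-

# fisher_exact returns the sample odds ratio (a*d)/(b*c) and a p-value
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
log_odds = math.log(odds_ratio)

print(f"odds ratio = {odds_ratio:.1f}, log OR = {log_odds:.2f}, p = {p_value:.3g}")
```

A large, statistically significant log odds ratio signals that the two biomarkers are interdependent, which is exactly the situation in which a model could learn one as a proxy for the other.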
From there, the researchers trained deep learning models to predict biomarker status from whole-slide images. They used three algorithms with different operating principles (CLAM, SlideGraph, and TITAN) that represent approaches that do not “explicitly consider interdependencies between prediction variables,” according to the study authors. CLAM is an attention-based model and SlideGraph is a graph neural network–based model; both require a patch-level encoder, so the researchers trained each with two different encoders: CTransPath, pretrained on histology images, and ShuffleNet, pretrained on ImageNet. TITAN is a whole-slide–level multimodal foundation model trained on 330,000 image–text pairs.
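The attention-based principle behind models such as CLAM can be illustrated with a minimal sketch (our construction, not the authors’ implementation): each patch embedding from the encoder receives an attention weight, and the slide-level representation is the attention-weighted sum of patch features. The dimensions and randomly initialized weights below are placeholders.

```python
# Minimal sketch of attention-based multiple-instance aggregation over
# whole-slide image patches. All weights are random stand-ins; a real model
# would learn V and w jointly with a downstream biomarker classifier.
import numpy as np

rng = np.random.default_rng(0)
n_patches, feat_dim, attn_dim = 200, 512, 64

patch_feats = rng.normal(size=(n_patches, feat_dim))  # per-patch encoder output
V = rng.normal(size=(feat_dim, attn_dim)) * 0.01      # attention projection
w = rng.normal(size=(attn_dim, 1)) * 0.01             # attention scoring vector

scores = np.tanh(patch_feats @ V) @ w                 # one score per patch
attn = np.exp(scores) / np.exp(scores).sum()          # softmax over patches
slide_feat = (attn * patch_feats).sum(axis=0)         # slide-level embedding

print(slide_feat.shape, float(attn.sum()))
```

Because the slide-level embedding pools whatever patch features best predict the label, nothing in this aggregation step prevents the model from latching onto features of a correlated biomarker rather than the target one.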
The researchers then used permutation testing and stratification analyses to show how a model’s accuracy in detecting one biomarker can depend on the status of other biomarkers and on confounding variables. For each model, they defined a prediction variable and stratification variables to measure the impact of confounding factors.
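The logic of a stratification analysis can be sketched with synthetic data (our illustration, not the study’s code): evaluate a predictor of biomarker A separately within subgroups defined by a correlated biomarker B. A “shortcut” model that actually tracks B shows high overall AUROC, but near-chance AUROC once B is held fixed.

```python
# Synthetic demonstration of stratified evaluation exposing shortcut learning.
import numpy as np

def auroc(labels, scores):
    """Rank-based AUROC (equivalent to the Mann-Whitney U statistic)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(1)
n = 2000
b = rng.integers(0, 2, n)                    # confounder (e.g., MSI status)
a = np.where(rng.random(n) < 0.8, b, 1 - b)  # target biomarker, 80% tied to b
shortcut_score = b + rng.normal(0, 0.1, n)   # model that only "sees" b

print("overall AUROC:", round(auroc(a, shortcut_score), 2))
for val in (0, 1):
    mask = b == val
    print(f"AUROC within B={val}:", round(auroc(a[mask], shortcut_score[mask]), 2))
```

The overall AUROC looks strong, yet within each stratum of the confounder the score carries no information about the target biomarker, which is the failure mode the stratification analyses are designed to reveal.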
Additionally, they compared the machine learning models’ biomarker detection against pathologist-assigned grades alone, using a support vector machine trained on one-hot encoded histological grades to predict clinical biomarkers, following the same protocols as the weakly supervised models.
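A grade-only baseline of this kind can be sketched as below (a hedged illustration with synthetic data, not the study’s code): a support vector machine sees only one-hot encoded grades, yet still attains above-chance accuracy whenever the biomarker correlates with grade.

```python
# Illustrative grade-only baseline: an SVM on one-hot encoded histological
# grades (no image features at all). Data and correlation are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 600
grades = rng.integers(1, 4, size=n)  # grades 1-3, one per case
# synthetic biomarker that is more often positive in higher-grade tumors
biomarker = (rng.random(n) < grades / 4).astype(int)

X = np.eye(3)[grades - 1]            # one-hot encode grades 1-3
clf = SVC(kernel="linear").fit(X, biomarker)

print("train accuracy:", round(clf.score(X, biomarker), 2))
```

If a deep learning model’s accuracy is not meaningfully better than such a baseline, its predictions may reflect grade-linked morphology rather than a biomarker-specific signal.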
Key Findings
Investigators determined that interdependencies between biomarkers can influence the predictive performance of machine learning models; when these relationships are ignored during development, the models learn the aggregated impact of interdependent biomarkers rather than the true patterns associated with a single biomarker.
“It’s a bit like judging a restaurant’s quality by the queue of people waiting to get in: it’s a useful shortcut, but it’s not a direct measure of what’s happening in the kitchen,” stated study author Fayyaz ul Amir Afsar Minhas, PhD, Associate Professor and Principal Investigator of the Predictive Systems in Biomedicine Lab in the Department of Computer Science, University of Warwick. “Many AI pathology models are doing the same thing, relying on correlations between biomarkers or on obvious tissue features, rather than isolating biomarker-specific signals. And when conditions change, these shortcuts often fall apart.”
When looking specifically at BRAF mutations in colorectal cancer samples, for example, they found that the AI tools exploited the relationship between BRAF and microsatellite instability (MSI) status to predict the presence of BRAF mutations, rather than identifying a true BRAF-specific signal. “A model that cannot disentangle MSI-[high status] from BRAF status may achieve high aggregate area under the receiver operating characteristic curve, but lacks clinical utility, as confusing the two would misguide treatment selection. This example underscores the broader need for bias-aware evaluation: predictors must be assessed not only for overall accuracy but also for their ability to distinguish correlated biomarkers with divergent therapeutic pathways,” the study authors wrote in their report.
They also suggested that if these confounding factors shift in a test cohort, model performance could degrade significantly within the specific patient subgroups where those factors differ.
The study authors did suggest that AI tools can still be valuable for cancer research and treatment decision-making, but that they should be used with caution. Going forward, they recommend that a stratification-based evaluation framework be used to report bias and to support the development of more rigorous, trustworthy models in cancer diagnostics.
“This research is not a condemnation of AI in pathology. It is a wake-up call. Current models may perform well in controlled settings but rely on statistical shortcuts rather than genuine biological understanding. Until more robust evaluation standards are in place, these tools should not be seen as replacements for molecular testing, and it is essential that clinicians and researchers understand their limitations and use them with appropriate caution,” Dr. Minhas concluded.
DISCLOSURE: Dr. Branson works for GSK. For full disclosures of the study authors, visit nature.com.
ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO. For more information, see Policies.