HER2 Scoring Interpretation Concordance Across Computational Pathology Models
Ten computational pathology AI models showed strong agreement in identifying HER2 3+ cases on whole-slide images from patients with breast cancer. Greater variability, however, was seen in borderline categories such as HER2-low expression. These findings, which demonstrate a need for further tool refinement to improve reproducibility, especially in low-intensity settings, were published in Modern Pathology.
More precise measurement of HER2 expression levels is needed to ensure that patients with breast cancer who have low or ultralow HER2 expression are considered for anti-HER2 antibody-drug conjugates. The HER2-low and HER2-ultralow categories, however, fall outside the standardized scoring guidelines developed by ASCO and the College of American Pathologists (CAP), which classify HER2 expression by immunohistochemistry staining as 0, 1+, 2+, or 3+.
“Importantly, as HER2-low and HER2-ultralow classifications expand, no universally accepted reference standard or clinically anchored threshold exists for these lower expression levels, highlighting the need for standardized data sets and outcome-linked benchmarks to support both pathologist- and artificial intelligence (AI)–based scoring,” the study authors—led by Brittany McKelvey, PhD, formerly the Director of Regulatory Affairs for the Friends of Cancer Research and now the Senior Director of Regulatory Policy at LUNGevity Foundation—wrote in their publication, suggesting that AI tools could improve diagnostic precision.
Study Methods
As part of the group’s Digital and Computational Pathology Tool Harmonization Project, Friends of Cancer Research convened a working group of several stakeholders to complete a large-scale comparative evaluation of HER2 immunohistochemical scorings across 10 computational pathology AI models. The group consisted of HER2 AI model developers, pharmaceutical companies, clinical pathologists, independent consultants, government regulators and scientists, and patient advocates.
The locked, unmodified AI models were assessed on a sample set of 1,124 whole-slide images from 733 patients who were diagnosed with breast cancer in 2021 at Ziekenhuis aan de Stroom Hospital in Antwerp, Belgium. The images consisted of paired hematoxylin and eosin (H&E)–stained and HER2 immunohistochemistry whole-slide images. Independent HER2 assessments from the models—including score, binary status, and classifications—were compared with those of three breast pathologists, who scored the images according to the 2018 ASCO-CAP guidelines.
Relative agreement between the models was determined using pairwise comparisons and measurements of overall percent agreement and Cohen's kappa coefficient.
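These two agreement measures can be sketched in a few lines of Python. This is an illustrative implementation with made-up example data, not the study's code or data:

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of samples on which two raters assign the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance, given two raters' categorical calls."""
    n = len(a)
    po = percent_agreement(a, b)  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected agreement if the raters assigned categories independently
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical ASCO-CAP scores from two models on eight slides
m1 = ["0", "1+", "2+", "3+", "1+", "0", "2+", "3+"]
m2 = ["0", "2+", "2+", "3+", "1+", "1+", "2+", "3+"]
print(percent_agreement(m1, m2))          # 0.75
print(round(cohens_kappa(m1, m2), 3))     # 0.667
```

A kappa of 0.51, as reported for the models, is conventionally read as moderate agreement beyond chance.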
Of the 10 models included in the analysis, only one had regulatory clearance. Seven of the models used only immunohistochemically stained images, two used only H&E-stained whole-slide images, and one model used both. Five of the models required human intervention. Seven of the models produced outputs using ASCO-CAP categorical scores, six reported H-scores, five reported percentages of tumor cells staining, and five reported invasive carcinoma cell counts.
Results
Quality control failure rates for the AI models, which ranged from 0.2% to 6%, were attributed to artifacts, poor staining quality, or inadequate tissue representation.
The median pairwise percent agreement was 65.1% for ASCO-CAP scores (interquartile range [IQR] = 60.3%–69.1%), and Cohen's kappa coefficient was 0.51 (IQR = 0.45–0.55). Overall percent agreement was similar across breast pathologist assessments (64.4%). Pathologists had higher numerical categorical agreement than the models, with a median overall percent agreement of 70.4% (IQR = 68.7%–74.6%).
Conditional agreement probabilities were highest between models for HER2 scores of 3+ and lowest for 1+ at 88% and 60%, respectively. The models demonstrated the highest agreement for distinguishing HER2 3+ scores from other categories, with a median overall percent agreement of 97.3% (IQR = 95.9%–97.9%) and a kappa coefficient of 0.86 (IQR = 0.82–0.90). The lowest agreement (79.9%) was found for scores of 0 and 1+ vs 2+ and 3+.
The highest positive agreement was reported for scores of 0 vs not 0 (91.3%), while negative agreement was highest for scores of not 3+ vs 3+ (98.5%). Discrepancies were most often found between adjacent categories, such as 1+ vs 2+.
Agreement varied depending on sample type, tumor grade, and biomarker status. Estrogen/progesterone receptor negativity, higher-grade disease, elevated Ki-67 levels, and biopsy specimen type were all associated with higher agreement between models.
When comparing sample-level categorical agreement with staining pattern percentages, agreement was highest among samples with higher staining pattern (SP) percentages of SP0 or SP3. In the fourth cluster, comprising mostly SP3 samples, the overall percent agreement was 97.6%. The intraclass correlation coefficient for reliability was also highest for SP3 and lower for SP0 (0.80 vs 0.50). The study authors suggested that strong staining in more than 10% of cells was probably recognized more consistently across all models than other categories.
Concordance correlation coefficients were high for the detection of total invasive carcinoma cells (median = 0.88) as well as stained carcinoma cells (median = 0.80) between the models. No clear pattern emerged to explain why discordant scores were found between model pairs.
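The concordance correlation coefficient (Lin's CCC) used here measures how closely two models' paired continuous measurements fall on the 45-degree identity line. A minimal sketch, using hypothetical cell counts rather than study data:

```python
import statistics

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired continuous
    measurements: 2*cov / (var_x + var_y + (mean_x - mean_y)^2).
    Equals 1.0 only when the pairs lie exactly on the identity line."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = sum((v - mx) ** 2 for v in x) / n   # population variance
    vy = sum((v - my) ** 2 for v in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Hypothetical invasive carcinoma cell counts from two models on 5 slides
m1 = [1200, 850, 400, 2300, 1750]
m2 = [1150, 900, 380, 2250, 1800]
print(round(lins_ccc(m1, m2), 3))  # 0.998
```

Unlike Pearson's correlation, the CCC penalizes systematic offsets between the two models, not just scatter, which is why it suits head-to-head cell-count comparisons.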
“These findings indicate strong model agreement in recognizing invasive carcinoma cells, suggesting that observed HER2 scoring discrepancies are more likely due to differences in staining interpretation or intensity assessment rather than tumor cell identification,” McKelvey et al wrote.
When analyzing H-scores across the models, the intraclass correlation coefficient for reliability was most comparable for intense membrane staining (SP3).
Some of the challenges associated with greater discordance between the models included sparse cellularity, the presence of Paget's disease, high tumor-infiltrating lymphocytes, and a fibroblast-rich stroma. Some models also overscored benign cells or ductal carcinoma in situ samples with positive HER2 staining.
DISCLOSURES: The work was funded by Friends of Cancer Research through unrestricted grants. For author disclosures, visit modernpathology.org.
ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO. For more information, see Policies.
Performance of a convolutional neural network in determining differentiation levels of cutaneous squamous cell carcinomas was on par with that of experienced dermatologists, according to the results of a recent study published in JAAD International.
“This type of cancer, which is a result of mutations of the most common cell type in the top layer of the skin, is strongly linked to accumulated [ultraviolet] radiation over time. It develops in sun-exposed areas, often on skin already showing signs of sun damage, with rough scaly patches, uneven pigmentation, and decreased elasticity,” stated lead researcher Sam Polesie, MD, PhD, Associate Professor of Dermatology and Venereology at the University of Gothenburg and Practicing Dermatologist at Sahlgrenska University Hospital, both in Gothenburg, Sweden.