
Two Studies Demonstrate Real-World Feasibility and Accuracy of AI-Integrated Breast Cancer Screening

May 04, 2026 | By Lisa Astor

In the United Kingdom, machine learning is being tested within radiology workflows to determine whether AI can help address a significant challenge. The National Health Service (NHS) Breast Screening Programme relies on a rigorous and time-consuming workflow. However, with a growing shortage of radiologists and an increasing population of women at risk, there may not be enough workforce capacity to sustain the program.

Two companion studies published in Nature Cancer evaluated the performance of AI integrated into the UK mammography workflow, with the potential to reduce radiologist workload while maintaining high breast cancer detection accuracy. Both were conducted as part of the UK NHS Health Research Authority’s AI in Mammography Screening (AIMS) project in collaboration with Google Research.

The first study assessed the accuracy of machine learning models compared with human specialists, and the second compared the standard double-read screening workflow with an AI-assisted workflow with AI serving as the second reader.

“Breast screening programmes rely on highly skilled specialists, but there is increasing pressure on the workforce. It was encouraging to find that a combination of human expertise and AI achieved a similar level of performance to two human readers,” stated Lucy M. Warren, PhD, AI Research Lead at the NHS Royal Surrey Foundation Trust during the study period and an author on both papers, including lead author of the second study. “The AIMS study was a success because of the collaborative work between multi-disciplinary teams from multiple trusts and institutions.”

Background

The NHS Breast Screening Programme (NHSBSP) is currently offered to women aged 50 to 70 years, with mammography performed every 3 years. Each mammogram is independently read by two readers to determine whether additional follow-up is needed. In cases of disagreement, the images are referred to an arbitration panel for a final recall decision. Some smaller centers refer all cases to arbitration for additional assessment.
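The double-read workflow described above can be sketched in a few lines of code. This is an illustrative simplification, not any service's actual software; the function and type names are hypothetical.

```python
# Minimal sketch of the NHSBSP double-read recall logic described above:
# two independent reads, with disagreements referred to arbitration, and
# an option for smaller centres that send every case to arbitration.

from enum import Enum

class Decision(Enum):
    NO_RECALL = 0
    RECALL = 1
    ARBITRATION = 2

def screening_decision(reader1_recall: bool, reader2_recall: bool,
                       refer_all_to_arbitration: bool = False) -> Decision:
    """Combine two independent reads into a single screening decision."""
    if refer_all_to_arbitration:
        return Decision.ARBITRATION
    if reader1_recall == reader2_recall:
        # Readers agree: the shared decision stands.
        return Decision.RECALL if reader1_recall else Decision.NO_RECALL
    # Readers disagree: the case goes to an arbitration panel.
    return Decision.ARBITRATION

print(screening_decision(True, False))  # Decision.ARBITRATION
```

In the AI-assisted workflow evaluated in the second study, the second reader's decision would come from the AI system rather than a human, with the same arbitration logic applied downstream.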

However, the radiology workforce in the United Kingdom is in crisis, with a growing shortage of clinical radiologists that is expected to reach 40% by 2028.

Prior research had demonstrated the potential of an AI system in a breast cancer screening workflow. In that study, Google’s AI system showed greater breast cancer detection accuracy than specialists, outperforming radiologist readers by an absolute margin of 11.5%.

“A systematic review of standalone AI performance for breast cancer detection at screening concluded that it performed as well [as] or better than radiologists. However, it is less clear how radiologists perform when interacting with AI tools in situations closer to real-world screening,” Warren et al wrote in the second study. “A pilot of an independent external validation process concluded that there needed to be a clinical validation of the impact of AI on the decisions made by radiologists during arbitration. This study addresses this issue.”

Researchers conducted the AIMS project to assess whether AI could be feasibly integrated into the NHSBSP and accurately and safely screen for cancer. They hypothesized that AI use would reduce radiologist workload and time to results, while increasing service capacity, improving accuracy, and enhancing patient experience.  

Model Methods

Both studies used Google’s mammography AI system (v1.2), an updated version of the model from prior research, to analyze 2D full-field digital mammograms, generating a binary determination and highlighting suspicious regions of interest. The AI model was trained on data from 76,142 women across a broad range of geographies, screening sites, vendors, and acquisition protocols, as well as key demographic and technical subgroups.

The v1.2 model builds on the three-model ensemble architecture of v1.0, combining unified functionality with upgraded backbones. It consists of three components: a global model that analyzes four mammogram views to generate a case-level prediction (cancer score); a detection model that identifies bounding boxes of lesions in each view; and a hybrid model that integrates features from the global model's final layer with detection outputs to assign a score to each box. The final score reflects the highest-scoring bounding boxes in the scan.

Operating points, or score cutoffs at which the AI flagged a finding for recall, were selected on a service-specific basis to enhance sensitivity without modifying model weights.
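The scoring and recall logic described in the two paragraphs above can be sketched as follows. This is a hedged illustration of the general pattern, not Google's implementation: the names, the blending rule, and the threshold values are all hypothetical, and the article does not specify how the global and box scores are actually combined.

```python
# Illustrative sketch of an ensemble mammography scorer: a case-level
# score driven by the highest-scoring lesion bounding box, blended with
# a global (whole-case) prediction, then thresholded at a
# service-specific operating point. All names/values are hypothetical.

from dataclasses import dataclass

@dataclass
class Box:
    view: str    # e.g. "L-CC", "R-MLO"
    score: float # hybrid-model suspicion score for this bounding box

def case_score(global_score: float, boxes: list[Box]) -> float:
    """Final case-level score; the real blending rule is not described
    in the article, so a simple max is used here."""
    if not boxes:
        return global_score
    top_box = max(b.score for b in boxes)
    return max(global_score, top_box)

def recall_decision(score: float, operating_point: float) -> bool:
    """Flag for recall when the score exceeds the service-specific
    operating point; the model weights themselves are untouched."""
    return score >= operating_point

boxes = [Box("L-CC", 0.31), Box("L-MLO", 0.72)]
print(recall_decision(case_score(0.40, boxes), operating_point=0.6))  # True
```

Tuning only the operating point, as the studies describe, shifts the sensitivity/specificity trade-off per service without any retraining.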

First Study

Design

Part A of the AIMS project was a retrospective, multicenter study assessing the accuracy and fairness of the AI system for breast cancer detection. It included approximately 125,000 women (about 115,000 after exclusions) aged 50 to 70 years undergoing routine screening across five services in the United Kingdom with three distinct workflows.

Women were randomly selected from those who underwent screenings in 2016 and had either a subsequent screening 24 to 39 months later or a cancer diagnosis documented within 39 months of the scan. Exclusion criteria included nonroutine screening practices, breast implants, and repeat imaging due to poor diagnostic imaging quality.

Data were collected from each service using the OPTIMAM mammography image database infrastructure.

In this part of the project, AI integration was observational only to avoid impacting assessments or recall rates. Researchers assessed the implications and nuances of integration into the clinical workflow through workflow mapping, codesign, workshops, and interviews to better understand emerging challenges.

The primary endpoint was AI sensitivity and specificity compared with those of first human readers, using a 5% noninferiority margin. Secondary endpoints included comparisons with human second readers or consensus readers, as well as breast-level analyses.

Results

The AI system demonstrated superior sensitivity and noninferior specificity compared with human first readers, second readers, and consensus decisions after arbitration, at both the case and breast levels (P < .001 for all).

Cancer detection rates were higher across all services with AI vs a human first reader (9.33 vs 7.54 per 1,000 women). The recall rate was also higher with AI (6.5% vs 5.5%).

The AI system showed particularly strong performance compared with human first readers among women undergoing screening for the first time. In this group, AI had the lowest recall rate at 7.1% vs 11.8% for human first readers and 8.5% for consensus reads. The cancer detection rate was 10.0 vs 9.19 per 1,000 women for AI and the human first reader, respectively.

Additionally, the AI system detected 25% of future interval cancer cases; of these, 88% were localized to the correct breast and 58.1% to the exact location. Another 25% of next-round cancers were correctly identified by AI.

No significant differences in performance were observed between the AI model and human first readers in an exploratory analysis of clinical and sociodemographic subgroups.

An analysis of reader time savings showed that using AI as a second reader would reduce reading time by 32.1% and increase the cancer detection rate from 17.7% to 20.2%.

Operating points were adjusted during the second study period to better align with target recall rates. “There was substantial week-to-week variation…highlighting the challenges in detecting drift in this type of low-prevalence screening population,” Kelly et al wrote.

Second Study

Design

Part B of the AIMS project was a prospective, multicenter technical feasibility study conducted at two screening services. Approximately 25,000 women per center were randomly selected for prospective evaluation (45,602 women after exclusions).

The study included two arms: a standard-of-care arm with human first and second reads of the scans and arbitration as needed, and an AI arm with a human first read and the AI system.

Standard arbitration policies were applied at each center. At the first center, all cases with disagreement between readers were reviewed, whereas at the second center, cases recalled by either reader were reviewed by the arbitration team; all arbitration reviews were conducted in pairs.  In the AI arm, readers were shown both the human and AI decisions simultaneously. Warren et al noted: “It was not possible to blind the AI arm to the readers because the AI output was overlaid on images and the human readers’ decisions on paperwork. However, this is clinically realistic because it is how the images would be read clinically.”

All readers were NHSBSP-accredited and trained by the AI vendor to interpret the tool. They were instructed to assign malignancy scores for each breast using the Royal College of Radiologists 5-point scale and breast density scores using Breast Imaging Reporting and Data System categories A–D.

Mammograms from all positive cases were annotated by expert radiologists who did not participate in the study to establish a ground truth reference.

The primary endpoint was noninferiority of the AI arm vs the human reader arm for both sensitivity and specificity, with a prespecified absolute margin of 5%. If noninferiority was demonstrated, the study design allowed for a subsequent one-tailed superiority test without affecting statistical power.
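A one-sided noninferiority comparison of two proportions with a 5% absolute margin can be sketched as below. The article does not state the study's exact statistical method, so this uses a standard Wald-type z-test as an illustration; the counts are hypothetical, chosen only to roughly match the reported specificities (96.8% vs 96.5%).

```python
# Hedged sketch of a one-sided noninferiority z-test for two proportions
# with an absolute margin. H0: p_ai <= p_ref - margin (inferior);
# H1: p_ai > p_ref - margin (noninferior). Not the study's actual method.

import math

def noninferiority_z(x_ai: int, n_ai: int, x_ref: int, n_ref: int,
                     margin: float = 0.05) -> float:
    """z statistic: large positive values favor noninferiority."""
    p_ai, p_ref = x_ai / n_ai, x_ref / n_ref
    se = math.sqrt(p_ai * (1 - p_ai) / n_ai + p_ref * (1 - p_ref) / n_ref)
    return (p_ai - p_ref + margin) / se

# Hypothetical counts giving specificities of 96.8% vs 96.5%;
# noninferiority is declared when z exceeds 1.645 (one-sided alpha = .05).
z = noninferiority_z(21296, 22000, 21230, 22000)
print(z > 1.645)  # True
```

With large denominators and a generous 5% margin, even a small observed difference yields a decisive z statistic, which is consistent with the very small P values the authors report.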

Results

After arbitration, the AI arm achieved a sensitivity of 49.2% vs 48.0% in the human arm (P < .0001) and a specificity of 96.8% vs 96.5% (P < .0001), demonstrating noninferiority. The study authors noted that a sensitivity of around 50% was expected due to the long follow-up period.

No significant difference was observed between the two arms in recall rate (3.9% vs 4.2%; P = .076) or cancer detection rate (7.8% vs 7.6%; P = .299).

The impact of the AI system on workload was assessed separately for each screening center due to differences in protocol. At both centers, the number of human reads was 50% lower in the AI arm vs the human arm. However, arbitration rates were significantly higher in the AI arm, increasing by 142% at the first center and 22% at the second.

The AI arm had a sensitivity of 92.3% for screen-detected cancers, 8.8% for interval cancers, and 8.1% for next-round cancers.

Before arbitration, sensitivity was higher in the AI arm than in the human arm for interval and next-round cancers, but the rates were similar after arbitration. “For interval cancers and next-round cancers, there was a larger reduction in sensitivity after arbitration and the decrease was larger for the AI arm than the human arm. Consequently, there was not a notable difference in sensitivity between the two arms for interval cancers and next-round cancers after arbitration. Therefore, after arbitration, replacing the second reader with AI did not result in cancers being detected earlier,” Warren et al wrote.

Several positive cases correctly recalled by AI (n = 93) were overruled during arbitration, including 13 screen-detected cancers, 28 interval cancers, and 52 next-round cancers. In about half of these cases, the AI did not localize the cancer correctly. Many had prior images available, which influenced arbitration decisions, but the AI tool did not analyze prior images.

Twenty-one of 22 readers completed post-study surveys, with most indicating they “somewhat trusted” the information provided by the AI tool. More than half reported that the AI was unreliable in that it overcalled calcifications and recalled cases for which prior images were available.

Collective Implications

Together, these studies show that AI-assisted breast cancer screening is feasible, with AI detecting more cancers than human readers alone while recalling fewer false positives. The overall impact of AI use in breast cancer screening was considered noninferior to the standard workflow.

“This is the closest AI has ever come to helping reduce breast cancer deaths within the NHS, so the potential for the NHS to take this forward is significant,” commented Hutan Ashrafian, BSc(Hons), MBBS, PhD, MBA, of the Institute of Global Health Innovation (IGHI) at Imperial College London, who was a co-corresponding author on both studies.

In the AIMS project summary, researchers noted that some cancers identified by the AI system were not followed up, underscoring the importance of human decision-making.

“AI-enabled screening has the potential to significantly reduce overall human reading workload and reading time, while increasing cancer detection rates, particularly for invasive cancers and first-time screens. However, realizing AI's full potential will require overcoming operational issues such as managing increased arbitration volumes, improving model explainability, and actively managing data drift through continuous performance monitoring and local threshold calibration,” concluded Lihong Xi and Daniel Golden of Google Research. “Ultimately, this work supports the idea that AI-enabled screening may enable a sustainable healthcare system, where technology and human expertise work in tandem to detect cancer earlier and, most importantly, save more lives.”

DISCLOSURES: The AIMS study was funded by a National Institute for Health and Care Research (NIHR) award from the Secretary of State for Health and Social Care. The research was carried out in partnership with Imperial College London, Imperial College Healthcare NHS Trust, St George’s University Hospitals NHS Foundation Trust, and Google Research. Several of the study authors from both papers are employees of Google or paid consultants of Google. For full disclosures of the study authors and access to the source data, visit nature.com.

ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO. For more information, see Policies.
