
3D Vision–Language Foundation Model Interprets Abdominal CT Scans, May Improve Clinical Assessment

March 31, 2026 | By Lisa Astor

A 3D vision–language foundation model trained to interpret abdominal computed tomography (CT) scans may help expedite clinical assessment, including diagnosis, staging, and other aspects of care. Beyond disease identification and classification, the multimodal model—named Merlin—also demonstrated potential for biomarker discovery and disease risk stratification using CT imaging. Details on the training, testing, and comparative performance of the Merlin model were published in Nature.

The researchers developed the model to address the shortage of radiologists in the United States and the growing workload in the field.

“With Merlin, you could potentially go beyond traditional radiology and jump straight from imaging to a possible diagnosis. And that’s just one potential use,” stated Louis Blankemeier, PhD, co-founder and CEO of Cognita, who conducted this work as a graduate student at Stanford University and was one of the study's lead authors.  

Model Methods

Merlin was trained on a single NVIDIA A6000 graphics processing unit using paired CT scans, electronic health record (EHR) data, and radiology reports from the Stanford University School of Medicine. The training data set included 25,528 CT scans (10,628,509 2D images), International Classification of Diseases (ICD)-9 and ICD-10 diagnostic codes (2,041,280 codes representing 16,553 unique values), and 10,051,571 tokens from radiology reports.
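To put those counts in perspective, a quick back-of-envelope calculation (in Python, using only the figures reported above) gives the per-scan scale of the training set:

```python
# Back-of-envelope scale of the training set, computed from the reported counts.
scans = 25_528
slices_2d = 10_628_509
icd_codes = 2_041_280

print(f"~{slices_2d / scans:.0f} 2D slices per CT scan")      # ~416
print(f"~{icd_codes / scans:.0f} diagnostic codes per scan")  # ~80
```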

Both the radiology reports and the diagnostic codes were incorporated into the training process through dedicated loss functions, enabling the model to learn without manually annotated labels. Because standard reports are long, preprocessing split each report into anatomical sections; full reports and segmented versions were alternated during training to evaluate performance differences. The researchers also applied PheWAS Phecode mapping to define hierarchical phenotypes, yielding 1,692 grouped phenotypes. Training of the Merlin model required approximately 160 hours, they noted.
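As a rough illustration only (the study's actual preprocessing code is not shown here), splitting a report into anatomical sections could look like the following sketch, in which the section headers are assumed:

```python
import re
from typing import Dict

# Hypothetical report splitter; the header list and report format are assumptions,
# not the study's actual parser.
SECTION_HEADERS = ["LIVER", "GALLBLADDER", "PANCREAS", "SPLEEN", "KIDNEYS", "BOWEL"]

def split_report(report: str) -> Dict[str, str]:
    """Return {section_name: section_text} for each recognized header."""
    pattern = "(" + "|".join(SECTION_HEADERS) + "):"
    parts = re.split(pattern, report)
    # re.split with a capturing group alternates [preamble, header, text, header, text, ...]
    return {header: text.strip() for header, text in zip(parts[1::2], parts[2::2])}

report = "LIVER: No focal lesion. KIDNEYS: Simple cyst in the left kidney."
print(split_report(report))
# {'LIVER': 'No focal lesion.', 'KIDNEYS': 'Simple cyst in the left kidney.'}
```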

The Merlin model comprised an inflated 3D (I3D) ResNet152 image encoder and a Longformer text encoder. The image encoder reused pretrained 2D model weights, extending them across the third dimension of the 3D filter. The Longformer encoder enabled longer context lengths compared with other pretrained biomedical masked language models. Architecture ablation studies were conducted to eliminate redundancies.
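Weight inflation of this kind follows a well-known recipe: each 2D convolution kernel is repeated along a new depth axis and rescaled so that activations keep roughly their original magnitude. A minimal sketch in PyTorch (not the authors' implementation):

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor, depth: int) -> torch.Tensor:
    """Inflate a 2D conv kernel of shape (out, in, k, k) into a 3D kernel of
    shape (out, in, depth, k, k), dividing by depth so a constant input yields
    roughly the same activations as the original 2D filter."""
    return w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth

w2d = torch.randn(64, 3, 7, 7)             # e.g., a first-layer ResNet conv kernel
w3d = inflate_conv_weight(w2d, depth=7)
print(w3d.shape)                           # torch.Size([64, 3, 7, 7, 7])
```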

The 3D vision–language model was evaluated across six categories of diagnostic, prognostic, and quality-related tasks, encompassing 752 individual tasks. Merlin was internally validated on 5,137 CT scans and externally validated on 44,098 scans from three external sites, as well as two publicly available abdominal CT data sets.

Model performance was then compared with that of alternative architectures, including fine-tuned 2D vision–language models, 2D-to-3D lifted vision–language models, and 3D vision-only models, to evaluate different training strategies.

Results

One nonadapted task involved zero-shot classification of findings, in which the model identified classes not seen during training, such as renal cysts or ascites on CT scans. Merlin achieved an unweighted mean F1 score of 0.741 (95% confidence interval [CI] = 0.727–0.755) in internal validation testing and an average F1 score of 0.647 (95% CI = 0.607–0.678) on external validation, outperforming 2D models.
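In zero-shot classification of this kind, the model scores a scan against short text prompts rather than trained class heads. A hypothetical sketch (the prompt wording and encoder interfaces are assumptions, not the study's code):

```python
import torch
import torch.nn.functional as F

def zero_shot_finding(ct_volume, finding, encode_image, encode_text) -> bool:
    """Decide presence/absence of a finding by comparing the scan embedding
    with embeddings of two candidate prompts (hypothetical interface)."""
    prompts = [f"{finding} is present", f"no {finding}"]
    img = F.normalize(encode_image(ct_volume), dim=-1)  # shape (1, d)
    txt = F.normalize(encode_text(prompts), dim=-1)     # shape (2, d)
    scores = img @ txt.T                                # cosine similarities
    return scores.argmax(dim=-1).item() == 0            # "present" prompt wins
```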

Performance was higher, as expected, for diseases with salient imaging features but lower for findings requiring finer-grained distinctions, such as metastatic disease or lymphadenopathy. When radiology report splitting (separating reports by anatomical structure for contrastive learning) was omitted, F1 scores were comparable to those observed in external validation.

Compared with other models, Merlin performed best in ablation analyses (F1 score = 0.741), followed by models using staged training of EHR and radiology report data (F1 score = 0.735), and those trained on radiology reports without EHR supervision (F1 score = 0.730).

Merlin was also evaluated for its ability to predict phenotypes based on ICD codes, including oncologic diagnoses. Across 692 phenotypes, the model achieved a macro-average area under the receiver operating characteristic curve (AUROC) of 0.812 (95% CI = 0.808–0.816), with values exceeding 0.9 for 15% of all phenotypes. Performance also improved with increasing training data.
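A macro-averaged AUROC is simply the unweighted mean of the per-phenotype AUROCs. An illustrative computation with scikit-learn (the arrays here are random stand-ins, not study data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_scans, n_phenotypes = 1000, 5
y_true = rng.integers(0, 2, size=(n_scans, n_phenotypes))  # binary phenotype labels
y_score = rng.random(size=(n_scans, n_phenotypes))         # predicted probabilities

per_phenotype = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(n_phenotypes)]
macro_auroc = float(np.mean(per_phenotype))                # unweighted mean across phenotypes
print(round(macro_auroc, 3))
```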

Zero-shot cross-modal retrieval was used to assess whether the model could match CT images with corresponding findings or sections of radiology reports, and vice versa. Merlin significantly outperformed contrastive language–image pretraining (CLIP) models—which learn visual concepts from natural language—in identifying correct image-to-findings matches among 64 cases (P < .001), as well as findings-to-image matches. However, retrieval performance declined on external test data. For example, Merlin achieved a score of 0.75 for detecting metastatic disease on internal data, compared with 0.45 on external data. The study authors also noted that report splitting did not improve model performance for retrieval tasks.
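Mechanically, cross-modal retrieval ranks candidates by embedding similarity. A minimal sketch of image-to-findings retrieval among 64 candidates, using random embeddings as stand-ins for encoder outputs:

```python
import torch
import torch.nn.functional as F

n, d = 64, 512                                   # 64 candidates; embedding size assumed
img = F.normalize(torch.randn(n, d), dim=-1)     # one embedding per CT scan
txt = F.normalize(torch.randn(n, d), dim=-1)     # one embedding per findings section

sim = img @ txt.T                                # (n, n) cosine-similarity matrix
top1 = (sim.argmax(dim=1) == torch.arange(n)).float().mean()
print(f"image-to-findings top-1 accuracy: {top1:.2f}")  # near 1/64 for random vectors
```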

“Moreover, we are particularly excited by the notion of multi-modal retrieval. Effectively, this allows us to search for similar patients either based on the imaging appearance or the radiology report. This could be helpful to find ‘patients like me’,” senior author Akshay Chaudhari, PhD, Professor of Radiology and Biomedical Data Science at Stanford University, told ASCO AI in Oncology.

The researchers fine-tuned Merlin to predict the development of several chronic diseases within 5 years in patients who were healthy at the time of their CT scan. Across six chronic diseases, the model achieved an average AUROC of 0.757 (95% CI = 0.743–0.772).
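One plausible way to structure such fine-tuning (the dimensions and module names here are assumptions, not the study's code) is a small risk head on top of the pretrained image encoder:

```python
import torch
import torch.nn as nn

class RiskHead(nn.Module):
    """Hypothetical fine-tuning head: the pretrained encoder feeds a linear
    classifier that outputs one 5-year risk per chronic disease."""
    def __init__(self, encoder: nn.Module, embed_dim: int, n_diseases: int = 6):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(embed_dim, n_diseases)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        features = self.encoder(volume)                    # (batch, embed_dim)
        return torch.sigmoid(self.classifier(features))    # per-disease risk in [0, 1]
```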

When evaluated for radiology report generation from CT images, Merlin consistently outperformed other multimodal models. The model accurately placed findings within the appropriate anatomical sections and matched findings to images, but often under-reported positive findings.

When compared with nnU-Net, a self-configuring deep learning framework for medical image segmentation, Merlin outperformed the framework when training data were limited, demonstrating its advantage in label-scarce settings.

Comparisons of model architectures revealed that vision–language pretraining outperformed vision-only approaches and that fine-tuning improved performance for both 2D and 2D-to-3D models. Merlin also outperformed recent CT embedding foundation models, as well as 2D and 2D-to-3D vision–language models.

“Our model and the data will provide the community [with] a robust backbone to build upon,” Dr. Chaudhari concluded. “From here, the sky’s the limit.” 

DISCLOSURE: The research was funded by the National Institutes of Health's (NIH) National Institute of Biomedical Imaging and Bioengineering (NIBIB); the Medical Imaging and Data Resource Center (MIDRC); the National Heart, Lung, and Blood Institute (NHLBI); and the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS). For full disclosures of the study authors, as well as access to the publicly available data and code used in the study, visit nature.com.

ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO. For more information, see Policies.
