Advancing Clinical Trials and Decision-Making With Synthetic Real-World Data
Synthetic real-world data generated by AI can mimic treatment patterns and clinical outcomes across large patient cohorts while accelerating clinical trials and drug development, according to Eddy Saad, MD, MS, a Research Fellow in Medicine at Dana-Farber Cancer Institute.
In a presentation, Dr. Saad outlined methods for generating synthetic real-world datasets and highlighted their clinical applications. He demonstrated that a model built using classification and regression trees produced survival outcomes comparable to those of the original patient cohort.
“Synthetic real-world data can strike the optimal balance between data utility and privacy. They faithfully reproduce univariate survival and multivariate patterns and have an acceptable re-identification risk,” Dr. Saad said at the European Society for Medical Oncology (ESMO) Congress 2025 (Abstract 3136O). “Synthetic datasets can therefore be leveraged to facilitate data sharing, to optimize clinical trial design, and to improve clinical decision-making.”
The Rise of Synthetic Datasets
Dr. Saad noted that the accelerating pace of cancer drug approvals by the U.S. Food and Drug Administration (FDA) over the past decade has created a “vicious cycle,” driving demand for more data and clinical trials on increasingly compressed timelines. “With each new therapy, we uncover new unmet needs, and these create an exponential demand for data…from clinical trials,” he said. “And these require enormous resources, perhaps most importantly, time.”
Real-world data have increasingly been considered as an alternative source of evidence to clinical trials, but their use raises privacy concerns and regulatory challenges. To address these issues, Dr. Saad pointed to AI-based generative models that can create synthetic versions of these datasets.
What Are Synthetic Real-World Data?
Synthetic real-world data are AI-generated datasets that replicate the statistical patterns and properties of real-world patient data without containing any individual patient identifiers.
This approach enables researchers to analyze data, as they would with standard real-world patient data, while protecting patient privacy. It also supports research in settings where resources—such as time and clinical trial participation—may be limited.
“These synthetic datasets are not meant to replace the original ones,” Dr. Saad clarified. “They’re meant to complement them in areas where it’s harder for us to do research right now.”
Establishing Synthetic Real-World Datasets
Dr. Saad and colleagues generated synthetic real-world datasets and evaluated different modeling approaches. All AI-based synthetic cohorts were derived from a source cohort of 19,164 patients with metastatic breast cancer in the Flatiron Health database who received first-line treatment between 2011 and 2023.
The first AI-based model was a conditional tabular generative adversarial network (CTGAN), which relies on a dynamic interaction between two components, a generator and a discriminator, that train against each other so that the synthetic data become increasingly accurate until they are indistinguishable from the source cohort. To maintain model stability and enforce privacy constraints, the researchers applied noise weight clipping, generating five versions of the synthetic CTGAN datasets: CTGAN base, CTGAN ln, CTGAN low, CTGAN medium, and CTGAN high.
The second model was a classification and regression tree (CART), which learns relationships within the data to replicate clinical patterns and decision-making, similar to flowcharts that guide treatment decisions. The model constructs decision trees and uses them to generate a synthetic version of the real-world dataset.
“CTGANs can be a sort of black box, so we wanted to offer a more transparent, clinician-friendly approach,” Dr. Saad said. “These rely on a series of branching, if-then decisions, [similar to] how a clinician would reason through a case.”
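The branching, if-then logic of a tree-based synthesizer can be illustrated with a minimal Python sketch. It is not the authors' implementation: the variable names, the toy data, the single numeric predictor, and the depth and leaf-size limits are all assumptions made for illustration. The tree is grown on real records, and each synthetic outcome is then drawn from the real outcomes stored in the matching leaf.

```python
import random

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def build_tree(xs, ys, depth=0, max_depth=2, min_leaf=2):
    """Grow a small classification tree on one numeric predictor.
    Leaves keep the observed outcomes so they can be sampled later."""
    best = None  # (weighted impurity, threshold)
    if depth < max_depth:
        for threshold in sorted(set(xs))[1:]:
            left = [y for x, y in zip(xs, ys) if x < threshold]
            right = [y for x, y in zip(xs, ys) if x >= threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
            if best is None or score < best[0]:
                best = (score, threshold)
    # Stop when no split is allowed or no split reduces impurity.
    if best is None or best[0] >= gini(ys):
        return {"leaf": list(ys)}
    threshold = best[1]
    lo = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    hi = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    return {
        "threshold": threshold,
        "left": build_tree([x for x, _ in lo], [y for _, y in lo],
                           depth + 1, max_depth, min_leaf),
        "right": build_tree([x for x, _ in hi], [y for _, y in hi],
                            depth + 1, max_depth, min_leaf),
    }

def sample_outcome(tree, x, rng):
    """Follow the if-then branches for x, then draw an outcome from the leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return rng.choice(tree["leaf"])

def synthesize(xs, ys, n, seed=0):
    """Create n synthetic (predictor, outcome) pairs from a fitted tree."""
    rng = random.Random(seed)
    tree = build_tree(xs, ys)
    records = []
    for _ in range(n):
        x = rng.choice(xs)  # simplification: resample the first variable
        records.append((x, sample_outcome(tree, x, rng)))
    return records

# Hypothetical toy data: age and first-line regimen label.
ages = [45, 50, 52, 58, 66, 70, 74, 81]
regimens = ["regimen A"] * 4 + ["regimen B"] * 4
tree = build_tree(ages, regimens)
synthetic = synthesize(ages, regimens, n=20)
```

In this toy example the tree learns a single rule (age below 66 versus 66 or older), so every synthetic record follows the same pattern as the source data. A real synthesizer would handle many variables in sequence and would not simply resample the first one.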
Testing the Models
The six resulting models were evaluated for fidelity to the original dataset and for protection of patient privacy.
Univariate and multivariate analyses were conducted to compare model performance with the original dataset. Absolute standardized mean difference analysis showed that models with higher levels of privacy, including CTGAN medium and CTGAN high, deviated more from the original dataset, whereas the CART model showed minimal differences from the source cohort.
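As a rough illustration of the fidelity metric, the absolute standardized mean difference for one continuous variable can be computed as in the sketch below. The pooled standard deviation formula is a common convention and is assumed here rather than taken from the study, and the age values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def abs_standardized_mean_difference(real, synthetic):
    """Absolute difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(real) + variance(synthetic)) / 2)
    if pooled_sd == 0:
        return 0.0
    return abs(mean(real) - mean(synthetic)) / pooled_sd

# Hypothetical ages in a source cohort and in two synthetic cohorts.
real_ages = [52, 61, 63, 70, 58, 66]
close_ages = [53, 60, 64, 69, 59, 66]    # high-fidelity synthetic cohort
shifted_ages = [45, 50, 55, 48, 52, 58]  # heavily perturbed synthetic cohort

smd_close = abs_standardized_mean_difference(real_ages, close_ages)
smd_shifted = abs_standardized_mean_difference(real_ages, shifted_ages)
```

A value near zero means the synthetic variable closely tracks the source variable; larger values indicate greater deviation, as reported for the higher-privacy models.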
Kaplan-Meier curves for progression-free survival (PFS) and overall survival showed greater divergence from the source cohort among models with higher levels of privacy, whereas the CART model closely overlapped with the source cohort.
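For readers unfamiliar with the method, a Kaplan-Meier curve can be computed as in the following minimal Python sketch. The follow-up times and event flags are hypothetical and are not taken from the study; the same function would be run on the source cohort and on each synthetic cohort before comparing the resulting curves.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with an event.
    events[i] is 1 if the event was observed and 0 if the patient was censored."""
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        observed = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(pairs) and pairs[i][0] == t:
            observed += pairs[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients both leave the risk set
    return curve

def median_survival(curve):
    """First time at which the survival probability falls to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical months of follow-up; 1 = progression observed, 0 = censored.
months = [2, 4, 4, 6, 9]
progressed = [1, 1, 0, 1, 0]
curve = kaplan_meier(months, progressed)
median_pfs = median_survival(curve)
```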
Multivariate analysis using regression models further demonstrated that the AI-based models—particularly the CART model—recapitulated the original variables and correlations. The CART model also showed strong agreement between the synthetic and real-world cohorts for PFS outcomes.
The study authors then assessed the risk of re-identification in the synthetic cohorts to determine whether any real patients could be identified from the synthetic records. With an acceptable threshold defined as 9%, all of the models demonstrated risk levels of 2% or lower, with the CART model exhibiting the highest risk among the six.
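The article does not specify how re-identification risk was measured. One simple proxy, shown here purely as an assumption, is the share of synthetic records that exactly match a unique real patient on a set of quasi-identifiers. The field names and records below are hypothetical.

```python
from collections import Counter

def unique_match_rate(real, synthetic, quasi_identifiers):
    """Share of synthetic records matching exactly one real record
    on the chosen quasi-identifiers (an assumed proxy for risk)."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    real_counts = Counter(key(r) for r in real)
    hits = sum(1 for s in synthetic if real_counts.get(key(s)) == 1)
    return hits / len(synthetic)

# Hypothetical records described by three quasi-identifiers.
fields = ["age_band", "ecog", "diagnosis_year"]
real = [
    {"age_band": "60-69", "ecog": 1, "diagnosis_year": 2021},
    {"age_band": "60-69", "ecog": 1, "diagnosis_year": 2021},
    {"age_band": "50-59", "ecog": 0, "diagnosis_year": 2019},
    {"age_band": "70-79", "ecog": 2, "diagnosis_year": 2015},
]
synthetic = [
    {"age_band": "60-69", "ecog": 1, "diagnosis_year": 2021},  # matches two patients
    {"age_band": "50-59", "ecog": 0, "diagnosis_year": 2019},  # matches one patient
    {"age_band": "40-49", "ecog": 0, "diagnosis_year": 2020},  # no match
    {"age_band": "70-79", "ecog": 1, "diagnosis_year": 2015},  # no match
]

risk = unique_match_rate(real, synthetic, fields)
ACCEPTABLE_RISK = 0.09  # threshold reported in the article
within_threshold = risk <= ACCEPTABLE_RISK
```

In this deliberately tiny example one of four synthetic records singles out a real patient, so the toy cohort would fail the 9% threshold, whereas all six models in the study were reported at 2% or lower.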
Clinical Applications
Dr. Saad then outlined several potential clinical applications for these synthetic models. “We wanted to find ways that would facilitate the sharing of data across multiple stakeholders [around the world]. And this is exactly what synthetic data sets do [and why we started this project in the first place]. They allow us to bypass some of the regulatory bottlenecks that are usually dealt with when we talk about real-world data,” he explained.
Additional applications include modeling real-world control arms for clinical trial design, as well as patient matching and outcome prediction. With modern synthetic control arms, “we could compare a new regimen and perhaps even accelerate the design of drugs and approval of these drugs,” Dr. Saad said.
With patient matching and outcome prediction, a patient’s clinical and disease characteristics could be compared with those of “nearest neighbors,” Dr. Saad’s term for patients with similar profiles, to evaluate the treatments they received and the associated outcomes.
He presented a hypothetical case of a 63-year-old woman with HR-positive, HER2-positive metastatic breast cancer, an ECOG performance status of 1, and a body mass index of 27.4. She was diagnosed with metastatic disease in 2021, when her cancer spread to the bone, lung, and distant lymph nodes. Her case was compared with 200 “nearest neighbors,” who generally received one of four treatment options: an aromatase inhibitor plus a CDK4/6 inhibitor (real-world PFS = 11.6 months); an aromatase inhibitor plus a HER2-targeted agent (PFS = 10.9 months); a HER2-targeted agent plus a taxane (PFS = 17.2 months); or an aromatase inhibitor, HER2-targeted agent, and a taxane (PFS = 24.2 months).
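The matching step in such a case can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used in the study: the three features, the z-score scaling, the Euclidean distance, the regimen labels, and all records are hypothetical, and the plain median used here ignores censoring, which a real analysis of PFS would need to handle.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median, pstdev

def nearest_neighbors(query, cohort, features, k):
    """Return the k cohort records closest to the query on z-scored features."""
    spread = {}
    for f in features:
        values = [record[f] for record in cohort]
        spread[f] = pstdev(values) or 1.0  # avoid dividing by zero

    def distance(record):
        return sqrt(sum(((record[f] - query[f]) / spread[f]) ** 2
                        for f in features))

    return sorted(cohort, key=distance)[:k]

def median_pfs_by_regimen(neighbors):
    """Group neighbors by regimen and take the median PFS (no censoring)."""
    groups = defaultdict(list)
    for record in neighbors:
        groups[record["regimen"]].append(record["pfs_months"])
    return {regimen: median(values) for regimen, values in groups.items()}

# Hypothetical synthetic cohort.
cohort = [
    {"age": 62, "ecog": 1, "bmi": 27.0, "regimen": "regimen A", "pfs_months": 12.0},
    {"age": 64, "ecog": 1, "bmi": 28.1, "regimen": "regimen B", "pfs_months": 20.0},
    {"age": 63, "ecog": 1, "bmi": 26.5, "regimen": "regimen A", "pfs_months": 10.0},
    {"age": 45, "ecog": 0, "bmi": 22.0, "regimen": "regimen B", "pfs_months": 30.0},
    {"age": 80, "ecog": 2, "bmi": 33.0, "regimen": "regimen A", "pfs_months": 5.0},
]

# Patient profile from the case described above.
patient = {"age": 63, "ecog": 1, "bmi": 27.4}

neighbors = nearest_neighbors(patient, cohort, ["age", "ecog", "bmi"], k=3)
summary = median_pfs_by_regimen(neighbors)
```

The three closest records share the patient's ECOG status and have similar age and body mass index, and the summary reports the median PFS for each regimen among those neighbors only.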
“Looking ahead, our vision is that of an integrated platform where we could continuously ingest real-world data from multiple sources, process them through these AI models, obtain their twin datasets, and use these to improve clinical research and clinical outcomes for patients,” Dr. Saad concluded.
DISCLOSURE: Dr. Saad reported receiving research funding from Genentech, EMD Serono, and Oncohost.
Expert Point of View
The Yin and Yang of AI in Oncology
Julien Vibert, MD, PhD, Drug Development Department, Sarcoma Team, National Precision Medicine Center in Oncology, Gustave Roussy, Paris, France, and an invited discussant at the European Society for Medical Oncology (ESMO) Congress 2025, provided additional context on how AI can transform the clinical practice of oncology.
Dr. Vibert clarified that synthetic real-world data differ from de-identified patient data, in which identifying information is removed. In contrast, synthetic cohorts are generated by AI algorithms that restructure the source data so that individual patients cannot be re-identified.
He noted a tradeoff between fidelity to the source data and preservation of patient privacy, with improvements in one often coming at the expense of the other.
Dr. Vibert highlighted that Dr. Saad’s synthetic cohort, derived from the Flatiron Health database, is one of the largest synthetic datasets (n = 19,164 patients) in oncology to date. The model demonstrated validated fidelity in terms of distribution, survival outcomes, and correlations, while maintaining a low re-identification risk and preserving patient privacy. He also noted that the models described by Dr. Saad, CART and CTGAN, are among the most advanced AI approaches currently in use.
Although he highlighted how these synthetic models can enable data sharing and facilitate external control arms in clinical trials, Dr. Vibert cautioned that synthetic datasets may amplify biases in the source data and carry risks of misuse or overconfidence if not properly validated. He added that synthetic real-world data can also pose risks related to security breaches and ethical concerns around data governance. These challenges, among others, have complicated the regulatory acceptance of synthetic cohorts to date.
Beyond the clinical applications described by Dr. Saad, Dr. Vibert added that synthetic cohorts can also be used to create digital twins—virtual representations of patients based on their clinical characteristics, history, and expected disease trajectory—to simulate outcomes.
“AI can potentially help us to design smarter trials,” Dr. Vibert said. He suggested that AI-based synthetic cohorts can reduce inefficiencies and maximize impact but cannot replace real-world data or clinical trials. Striking the right balance will be essential, he suggested, ensuring sufficient patient data for valid, reliable findings while leveraging AI to improve speed and efficiency. Clinical validation, he emphasized, will remain indispensable.
DISCLOSURE: Dr. Vibert reported no conflicts of interest.
ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO.