
Can AI-Extracted EHR Data Be Trusted? The VALID Framework Takes Aim at a Growing Problem

April 08, 2026 By Meg Barbor, MPH 5 min read

Structured clinical data extracted from electronic health records (EHRs) are an important tool in oncology research, regulatory submissions, and more. The rapid evolution of large language models (LLMs) and machine learning (ML) tools has created an opportunity to automate the extraction of real-world data at greater scale and efficiency. But as these tools are applied to increasingly large and complex data sets, a critical challenge emerges: ensuring the data are accurate, consistent, and fit for use.

In a study published in JCO Clinical Cancer Informatics, researchers from Flatiron Health proposed a new framework, Validation of Accuracy for LLM-/ML-Extracted Information and Data (VALID), to address that gap.

“By providing a rigorous and transparent method for assessing LLM-extracted real-world data, this framework advances industry standards and supports the trustworthy use of AI-powered evidence generation,” Estevez et al wrote in their published report.

The Challenge

LLMs offer clear advantages for data extraction due to their ability to process large volumes of unstructured clinical text. However, their outputs are not always stable or reliable.

These models can generate inconsistent results even when inputs do not change and may also produce hallucinations. Furthermore, they can struggle with the complexity and ambiguity of clinical documentation, an issue that even experienced human abstractors encounter due to conflicting or unclear records.

The researchers drew upon their experience validating LLM-/ML-based outputs to build an approach for evaluating the accuracy and completeness of the extracted clinical data from the stage of model development to data delivery.

“This paper aims to move beyond existing limited resources and provide a practical, transparent, and holistic framework for evaluating the quality of AI model–extracted oncology data…to raise the scientific community standard and foster greater confidence in the use of AI-powered RWD for research and decision making,” the study authors wrote.

A Three-Pronged Approach to Data Quality

The VALID framework takes a multidimensional approach to validation, built around three core components:

  • Variable-Level Performance Metrics: These assess how accurately individual data elements are extracted, using measures such as recall, precision, and F1 score. Importantly, the framework evaluates results relative to expert human abstraction, comparing both against a common reference standard that serves as the source of truth, providing essential context for how well the model performs on complex clinical concepts.

  • Verification Checks: These identify inconsistencies within the data set across three categories: conformance, plausibility, and consistency. These include patient-level errors (such as events occurring in an illogical order), as well as cohort-level patterns that deviate from expected clinical distributions or established National Comprehensive Cancer Network (NCCN) guidelines.

  • Replication and Benchmarking Analyses: These evaluate whether conclusions drawn from an LLM-extracted data set align with those derived from human-curated data or established external findings, such as the Surveillance, Epidemiology, and End Results (SEER) database. This step determines if a data set is truly "fit for purpose" for research or regulatory use.
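As a rough illustration of the first component, variable-level metrics can be computed from paired model and reference labels. The sketch below uses invented binary labels and a hand-rolled helper; it is not Flatiron's implementation, only the standard definitions of precision, recall, and F1 applied in the way the framework describes: scoring both the LLM and the human abstractor against a common reference standard.

```python
def precision_recall_f1(extracted, reference):
    """Variable-level metrics for binary labels against a reference standard."""
    tp = sum(1 for e, r in zip(extracted, reference) if e and r)
    fp = sum(1 for e, r in zip(extracted, reference) if e and not r)
    fn = sum(1 for e, r in zip(extracted, reference) if not e and r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: does each record document disease progression?
reference = [1, 1, 0, 1, 0, 0, 1, 0]  # common reference standard (source of truth)
llm       = [1, 1, 0, 0, 0, 1, 1, 0]  # LLM-extracted labels
human     = [1, 1, 0, 1, 0, 0, 1, 1]  # expert human abstraction

llm_scores = precision_recall_f1(llm, reference)      # (0.75, 0.75, 0.75)
human_scores = precision_recall_f1(human, reference)  # F1 ≈ 0.89
```

Scoring the model and the human abstractor against the same reference, rather than against each other, is what lets the framework report the F1 *difference* between the two, the quantity highlighted in the case study below.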

When Accuracy Is Not the Whole Story

One of the key insights of the framework is that small errors at the variable level do not necessarily translate into meaningful differences in clinical outcomes.

In a case study evaluating data extraction focused on real-world progression across 14 tumor types, agreement between the LLM and human curation ranged from 77% to 91%. Performance was similar between the model and human abstractors, with differences in F1 scores remaining under 10% for 12 of the 14 cancer types studied. The difference in F1 scores was lowest for small cell lung cancer (0.7%) and highest for hepatocellular cancer (11.2%).

Consequently, the resulting progression-free survival curves were nearly identical between the model and human abstractors. This highlights an important distinction: the clinical utility of a data set is determined not just by the frequency of model errors, but by how those errors propagate through downstream analyses. If the final scientific conclusions (such as hazard ratios or median survival) remain stable despite minor variable-level discrepancies, the data may still be considered fit for purpose.
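The distinction above can be made concrete with a toy example. The months-to-progression values below are invented for illustration; the point is only that a visible record-level disagreement rate can leave the downstream summary statistic essentially unchanged.

```python
import statistics

# Hypothetical months-to-progression for the same eight patients,
# as curated by humans vs. extracted by an LLM.
human = [4.1, 5.3, 6.0, 7.2, 8.8, 10.5, 12.0, 14.3]
llm   = [4.1, 5.3, 6.2, 7.2, 8.8, 10.5, 11.6, 14.3]  # two records disagree

# Record-level disagreement rate vs. its effect on the downstream estimate.
disagree = sum(1 for h, m in zip(human, llm) if h != m) / len(human)  # 0.25
delta_median = abs(statistics.median(human) - statistics.median(llm))  # 0.0
```

Here 25% of records disagree, yet the median time to progression is identical, because the errors fall away from the quantity the analysis actually reports. A full fitness-for-purpose check would of course compare complete survival curves and hazard ratios, as the study did.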

Addressing Bias and Real-World Variability

Beyond accuracy, the framework incorporates methods for assessing bias, helping ensure that conclusions drawn from real-world data are accurate. By stratifying performance metrics and verification checks across demographic, clinical, and practice-level subgroups, investigators can determine whether errors are evenly distributed or whether certain populations are disproportionately affected.
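A stratified check of this kind can be sketched as follows. The subgroup names, labels, and `f1` helper are invented for illustration; the idea is simply to compute the same variable-level metric separately within each subgroup and compare.

```python
from collections import defaultdict

def f1(pairs):
    """F1 score for (extracted, reference) binary label pairs."""
    tp = sum(1 for e, r in pairs if e and r)
    fp = sum(1 for e, r in pairs if e and not r)
    fn = sum(1 for e, r in pairs if not e and r)
    p = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * rec / (p + rec) if p + rec else 0.0

# Hypothetical records: (practice_setting, extracted_label, reference_label)
records = [
    ("academic", 1, 1), ("academic", 0, 0), ("academic", 1, 1), ("academic", 1, 0),
    ("community", 1, 1), ("community", 0, 1), ("community", 0, 1), ("community", 0, 0),
]

by_group = defaultdict(list)
for group, extracted, reference in records:
    by_group[group].append((extracted, reference))

stratified = {group: f1(pairs) for group, pairs in by_group.items()}
```

In this toy data the academic subgroup scores markedly higher than the community subgroup, the kind of uneven error distribution the framework is designed to surface before it biases downstream conclusions.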

The framework also accounts for variability in EHR systems, documentation practices, and clinical workflows, all of which can influence model performance in real-world settings.

What Comes Next

As LLM-extracted data sets are increasingly used to support real-world evidence generation, the need for standardized validation approaches is growing. The VALID framework offers a structured path forward, guiding model development, validation, and ongoing quality assessment while emphasizing not just accuracy, but also consistency, transparency, and clinical relevance.

As the authors noted, “the central question is no longer whether to use LLM-extracted data, but how to rigorously evaluate its reliability, fairness, and fitness for purpose.”

DISCLOSURES: This framework was supported by Flatiron Health, Inc., an independent member of the Roche Group. All study authors are employed by Flatiron Health. For full disclosures of the study authors, visit ascopubs.org.

ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO. For more information, see Policies.
