
ML Model Uses Methylation Patterns to Identify Tissue of Origin 

April 22, 2026 By Lisa Astor

A machine learning model integrating a limited set of DNA methylation patterns demonstrated strong performance in predicting tissue of origin in cancers of unknown primary.

“We developed a reduced CpG panel that can robustly predict tissue-of-origin site across different data sets and platforms. We were able to capture meaningful tumor biology with this set and [any] classification challenges that we saw mainly reflected biological complexity rather than technical limitations,” said presenting author Marco A. De Velasco, PhD, Faculty Member in the Department of Genome Biology, Kindai University, Japan, during a press conference at the American Association for Cancer Research (AACR) Annual Meeting 2026. “Together, these findings support the practical development of this tissue origin classifier for cancer of unknown primary classification.”

Dr. De Velasco noted that patients with a cancer of unknown primary site have a poor prognosis, as most receive nonspecific chemotherapy and have low median survival. He emphasized the need for improved tools to identify tissue of origin and enable site-directed therapy to improve outcomes.

DNA methylation was considered a more accurate tool for tissue-of-origin prediction than standard biomarkers and genomic profiling because methylation patterns are stable and tissue-specific, even in metastases.

“Our goal here was to develop a classifier that can predict tissue of origin using a focused set of CpG sites,” Dr. De Velasco said, adding that the classifier was intended to help improve site-directed therapy decision-making. “We wanted to improve practicality while maintaining strong performance.”

Study and Model Methods

The researchers analyzed 7,421 tumors across 21 cancer types using Infinium HumanMethylation450 data from The Cancer Genome Atlas (TCGA) and ovarian cancer data from the Gene Expression Omnibus. In a 70:30 split by cancer type, 5,210 tumors were used to train the model and 2,211 to test it.
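
A split "by cancer type" is usually a stratified split, in which each class is divided separately so that both cohorts keep the same mix of tumor types. The following is a minimal sketch of that idea, not the authors' code; the function name `stratified_split`, the seed, and the toy labels are assumptions for illustration. Per-class rounding also explains why the reported training cohort (5,210 of 7,421, about 70.2%) is not exactly 70%.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Round per class, so the overall ratio is only approximately 70:30.
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical toy cohort: 100 tumors of one type and 50 of another.
labels = ["colon"] * 100 + ["ovarian"] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the shuffle happens within each class, a rare cancer type cannot end up entirely in the training or the test cohort by chance.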

Using the training cohort, the researchers performed feature selection and trained the model with cross-validation. The feature set was then reduced from 485,000 to 1,000 CpG regions using gradient boosting and Shapley values, improving both accuracy and interpretability. Models incorporating ridge regression to reduce overfitting performed best among those tested.
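
The report does not describe how the Shapley values were aggregated into a ranking. One common convention, assumed here, is to score each feature by its mean absolute attribution across samples and keep the top k. The sketch below shows only that ranking step; the attribution matrix, the CpG names, and the function `top_k_features` are hypothetical, and in practice the attributions would come from a fitted gradient-boosting model.

```python
def top_k_features(attributions, feature_names, k):
    """Rank features by mean absolute attribution and keep the top k."""
    n_samples = len(attributions)
    scores = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in attributions) / n_samples
        scores.append((mean_abs, name))
    # Highest mean absolute attribution first.
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scores[:k]]


# Hypothetical attribution matrix: rows are tumors, columns are CpG sites.
feature_names = ["cg_a", "cg_b", "cg_c", "cg_d"]
attributions = [
    [0.40, -0.02, 0.10, 0.00],
    [-0.35, 0.01, 0.15, 0.00],
    [0.50, -0.03, -0.05, 0.01],
]
selected = top_k_features(attributions, feature_names, k=2)
print(selected)
```

Taking the absolute value matters: a site that pushes predictions strongly in opposite directions for different tumors would otherwise average out to zero and be discarded.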

The model was then tested on the remaining data set and validated in independent cross-platform cohorts.

Findings

The researchers assessed the biological relevance of the selected CpG sites using Pearson correlation, demonstrating distinct clustering by cancer type and high concordance between phenotypic clusters and tumor labels.
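
As a rough illustration of the Pearson-correlation check, the sketch below compares methylation profiles between pairs of samples. The beta values and sample names are invented for illustration, and the report does not say how the researchers set up their correlation analysis; the point is only that tumors of the same type should correlate more strongly on the selected sites than tumors of different types.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical methylation beta values (0 to 1) at five selected CpG sites.
profiles = {
    "colon_1": [0.90, 0.85, 0.10, 0.15, 0.50],
    "colon_2": [0.88, 0.80, 0.12, 0.20, 0.55],
    "ovarian_1": [0.10, 0.20, 0.90, 0.85, 0.45],
}
same_type = pearson(profiles["colon_1"], profiles["colon_2"])
cross_type = pearson(profiles["colon_1"], profiles["ovarian_1"])
print(round(same_type, 3), round(cross_type, 3))
```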

Unsupervised clustering largely corresponded to a single cancer type, although some overlap was observed among biologically related cancers, such as colon and rectum adenocarcinoma, as well as among gynecological cancers.

DNA methylation patterns were also consistent with gene expression profiles. “Gene expression profile also preserved the cancer type structure, and methylation was inversely associated with gene expression, which was what we would expect,” Dr. De Velasco said, noting that hypermethylation was associated with reduced gene expression. “This gives more evidence that this model reflects the underlying tumor biology, rather than just looking at statistical patterns.”

On internal validation, the model achieved an area under the curve of 0.998, classification accuracy of 0.954, an F1 score for classification performance of 0.953, and a Matthews correlation coefficient for classification quality of 0.951. The average precision over classes was 0.953, and recall was 0.954.

In the TCGA held-out test cohort, the model achieved an area under the curve of 0.998, classification accuracy of 0.947, an F1 score of 0.945, and a Matthews correlation coefficient of 0.943.

In the Kindai in-house cohort, average scores across classes were 0.993 for area under the curve, 0.871 for classification accuracy, 0.847 for F1 score, and 0.867 for the Matthews correlation coefficient.
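
For readers unfamiliar with these metrics, the sketch below computes accuracy, precision, recall, F1 score, and the multiclass Matthews correlation coefficient from true and predicted labels. The report does not state how per-class scores were averaged; because the reported recall equals the accuracy wherever both are given, which support-weighted averaging produces by construction, the sketch assumes weighted averages. The function name and toy labels are hypothetical, and area under the curve is omitted because it requires predicted probabilities rather than hard labels.

```python
import math
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy, support-weighted precision/recall/F1, and multiclass MCC."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    correct = sum(hits.values())

    precision = recall = f1 = 0.0
    for c in classes:
        p_c = hits[c] / pred_count[c] if pred_count[c] else 0.0
        r_c = hits[c] / true_count[c] if true_count[c] else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = true_count[c] / n  # weight each class by its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    # Multiclass generalization of the Matthews correlation coefficient.
    cov_tp = correct * n - sum(true_count[c] * pred_count[c] for c in classes)
    cov_tt = n * n - sum(true_count[c] ** 2 for c in classes)
    cov_pp = n * n - sum(pred_count[c] ** 2 for c in classes)
    denom = math.sqrt(cov_tt * cov_pp)
    mcc = cov_tp / denom if denom else 0.0

    return {"accuracy": correct / n, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical toy labels: one colon tumor and one rectum tumor are swapped.
y_true = ["colon"] * 4 + ["rectum"] * 3 + ["ovarian"] * 3
y_pred = ["colon", "colon", "colon", "rectum",
          "rectum", "rectum", "colon",
          "ovarian", "ovarian", "ovarian"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Unlike accuracy, the Matthews correlation coefficient accounts for how predictions are distributed across classes, so it stays informative when some cancer types are much rarer than others.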

In an expanded multicohort validation group, using Infinium Methylation EPIC v2.0 data, the model achieved a classification accuracy of 93.8% in identifying colorectal cancers, with errors reflecting biological similarities rather than model failure, according to Dr. De Velasco. When colon and rectum adenocarcinomas were combined to account for colorectal cancers, performance improved. In this expanded validation set with biologically similar subtypes merged, the F1 score was 0.899. The average area under the curve was 0.993, accuracy was 0.892, precision was 0.933, recall was 0.892, and the Matthews correlation coefficient was 0.882.
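
The effect of merging biologically similar classes can be sketched as a relabeling step applied to both the true and the predicted labels before scoring. The mapping, function names, and toy labels below are assumptions for illustration, not the study's data; the example shows only why colon-versus-rectum confusions stop counting as errors once the two are treated as colorectal cancer.

```python
# Hypothetical mapping from fine-grained labels to a merged class.
MERGE = {"colon": "colorectal", "rectum": "colorectal"}


def merge_labels(labels, mapping=MERGE):
    """Replace labels found in the mapping; leave all others unchanged."""
    return [mapping.get(label, label) for label in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels: the only errors are colon and rectum tumors mistaken for each other.
y_true = ["colon"] * 4 + ["rectum"] * 3 + ["ovarian"] * 3
y_pred = ["colon", "colon", "colon", "rectum",
          "rectum", "rectum", "colon",
          "ovarian", "ovarian", "ovarian"]

before = accuracy(y_true, y_pred)
after = accuracy(merge_labels(y_true), merge_labels(y_pred))
print(before, after)
```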

The researchers then explored the impact of class size and heterogeneity on model performance but found no significant association with either factor. “What this means is that performance differences are likely to be associated with intrinsic biological variation across the different cancer types,” Dr. De Velasco explained.

Going forward, Dr. De Velasco noted that the research team plans to evaluate the model in cohorts of patients with true cancers of unknown primary and to explore tissue-of-origin testing using a liquid biopsy to enable more informed, personalized treatment decision-making.

“I believe that cancers of unknown primary are definitely one of the more challenging problems in oncology because we don't know their site of origin. Their eligibility for clinical trials and their treatments are so challenging for oncologists, and so even now that we have molecular profiling, and there are some advances, we are still not able to predict the cancer type of origin,” commented AACR press conference moderator Ecaterina E. Dumbrava, MD, co-chair of the AACR Annual Meeting Clinical Trials Committee, and Assistant Professor, Department of Investigational Therapeutics, The University of Texas MD Anderson Cancer Center. “Integrating this would be very useful, like having a fingerprint of the tumor and trying to predict the origin [could] direct the treatments, especially early on, because the choices of treatments matter for that patient.”

Dr. De Velasco added that DNA methylation patterns can complement, rather than replace, genomic profiling.

DISCLOSURES: Funding for this study was provided by the Japan Society for the Promotion of Science. Dr. De Velasco reported no conflicts of interest. For full disclosures of the other study authors, visit abstractsonline.com.

ASCO AI in Oncology is published by Conexiant under a license arrangement with the American Society of Clinical Oncology, Inc. (ASCO®). The ideas and opinions expressed in ASCO AI in Oncology do not necessarily reflect those of Conexiant or ASCO.
