Author(s) :
Tudor I Oprea1,2,4 and Virgil Păunescu3,4
1 The University of New Mexico School of Medicine, Albuquerque, New Mexico, USA
2 Expert Systems Inc., San Diego, USA
3 “Victor Babeş” University of Medicine and Pharmacy, Timişoara, Romania
4 Oncogen Center for Gene and Cellular Cancer Therapies, Timișoara, Romania
Corresponding author: Tudor I Oprea, Email: tudorzinho@gmail.com
Published: IV, 1, 30 July 2024, v - viii DOI: 10.53011/JMRO.2024.01.01
For decades, scientists have approached cancer as a disease of the genome (1). Efforts to collect multi-faceted, heterogeneous data, such as tissue-based somatic mutations (2) and cancer cell line expression and perturbation (3), have contributed to breakthroughs such as the Hallmarks of Cancer (4,5) and The Cancer Genome Atlas (TCGA) (6). These efforts have framed our understanding of cancer at the molecular level and laid the foundational roadmap for drug target identification in oncology. The therapeutic management of cancer, an out-of-control process of cellular proliferation and dissemination, typically aims to selectively inhibit specific molecules or pathways crucial for tumor growth and survival (7). Targeting specific mutations, such as BRAF V600E and KRAS G12C, has resulted in clinically successful treatments for melanoma (e.g., vemurafenib, a BRAF inhibitor) and non-small cell lung carcinoma (e.g., sotorasib, a KRAS inhibitor) (8).
Target selection is a critical step in pharmaceutical research and development, as it remains the major driver for therapeutic efficacy and patient safety. As outlined elsewhere (8), target selection starts from identifying tumor-specific actionable mutations via next-generation sequencing (NGS), a nucleic acid sequencing technology that identifies common and rare genetic aberrations in cancer. Point-of-care diagnostic tools that combine sequential oligonucleotide capture, amplification, and NGS further support this process through mutational evaluation. In addition to patient-derived clinical data, pan-cancer analyses and biomedical literature are frequently used to understand molecular pathways affected by specific mutations, further guiding therapeutic target selection. Functional genomics (9), genome-wide association studies (GWAS), and polygenic scores (10) are increasingly incorporated in clinical model assessments of cancer therapeutic targets.
Despite the widespread usage of these methodologies, several limitations have become apparent. First, cancer is a complex disease, with a subtle interplay between the environmental and genetic factors that govern tumor growth and survival. Intra-tumor heterogeneity studies improve our understanding of the evolutionary forces driving subclonal selection (11), whereas genetic (clonal) and non-genetic adaptive reprogramming events can explain primary and secondary drug resistance in cancer (12). Furthermore, elucidating the exact mechanism of action (MoA) targets of drugs in cancer is not trivial, as many anti-cancer drugs continue to exhibit tumoricidal activity even after the (suspected) MoA targets have been knocked out (13). Indeed, off-target effects often confound the interpretation of biological phenotypes (e.g., loss of cell viability or slowing of tumor growth) (14). Against this backdrop, large-scale data integration coupled with artificial intelligence and machine learning (AIML) (15) can improve target selection in oncology.
AIML technologies can rapidly process a diverse set of oncology-related resources such as TCGA (6), COSMIC (2), DepMap (16), and others by coalescing large datasets into a seamlessly integrated platform. This is particularly true if large language models (LLMs) such as GPT-4 (17) are incorporated into the data ingestion workflow. From genomic and transcriptomic data to real-world evidence, AIML can sift through layers of evidence and produce models faster than traditional methods. This potential gain in efficiency, together with the ability to develop multiple models in parallel, can yield testable hypotheses.
The ability to integrate and analyze vast datasets with AIML techniques holds promise for uncovering novel insights and therapeutic targets in various fields of medicine. These technologies can be applied to most complex diseases, not just cancer. For instance, neurodegenerative diseases such as Alzheimer’s disease present similar challenges due to their multifactorial nature and the interplay between genetic and environmental factors.
Recognizing the potential of AIML in complex disease biology modeling, we integrated a set of 17 different resources focused on expression data, pathways, functional terms, and phenotypic information with XGBoost (18), an optimized gradient boosting (machine learning) algorithm, and MetaPath (19), a feature-extraction technique, to seek novel genes associated with Alzheimer’s disease (20). Of the top-20 ML-predicted genes not previously associated with Alzheimer’s pathology, five were experimentally confirmed using multiple methods. The same set of integrated resources, combined with MetaPath and XGBoost, resulted in the temporally validated identification of seven top-20 and two bottom-20 genes associated with autophagy (21).
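To make the idea of meta-path-based feature extraction more concrete, the following is a minimal sketch in Python. It is not the MetaPath implementation cited above: the toy knowledge graph, the gene, pathway, and tissue names, and the helper functions are all hypothetical, and the feature is a plain path count rather than any weighted score. It only illustrates how a gene can be described by the number of two-step paths that link it, through a shared pathway or tissue, to genes already labelled as disease-associated. Vectors of this kind could then be passed to a gradient boosting classifier.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph as (source, edge_type, target) triples.
# None of these names come from the resources described in the text.
EDGES = [
    ("GENE_A", "in_pathway", "PATHWAY_1"),
    ("GENE_B", "in_pathway", "PATHWAY_1"),
    ("GENE_C", "in_pathway", "PATHWAY_2"),
    ("GENE_D", "in_pathway", "PATHWAY_2"),
    ("GENE_D", "in_pathway", "PATHWAY_1"),
    ("GENE_A", "expressed_in", "TISSUE_X"),
    ("GENE_C", "expressed_in", "TISSUE_X"),
]

# Assumed set of genes already labelled as disease-associated.
KNOWN_DISEASE_GENES = {"GENE_A"}


def build_index(edges):
    """Index neighbours by (node, edge_type), treating edges as undirected."""
    index = defaultdict(set)
    for source, edge_type, target in edges:
        index[(source, edge_type)].add(target)
        index[(target, edge_type)].add(source)
    return index


def metapath_count(index, gene, edge_type, labelled_genes):
    """Count paths gene - X - labelled gene that use one edge type twice."""
    count = 0
    for intermediate in index.get((gene, edge_type), ()):
        for other in index.get((intermediate, edge_type), ()):
            if other != gene and other in labelled_genes:
                count += 1
    return count


def feature_vector(index, gene, edge_types, labelled_genes):
    """One path-count feature per edge type, in the order given."""
    return [metapath_count(index, gene, et, labelled_genes) for et in edge_types]


INDEX = build_index(EDGES)
EDGE_TYPES = ["in_pathway", "expressed_in"]
FEATURES = {
    gene: feature_vector(INDEX, gene, EDGE_TYPES, KNOWN_DISEASE_GENES)
    for gene in ["GENE_B", "GENE_C", "GENE_D"]
}
```

In this toy graph, GENE_B and GENE_D each share one pathway with the labelled gene, while GENE_C shares only a tissue with it, so the three candidate genes receive different feature vectors even though none of them is directly annotated with the disease.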
Building on our success in Alzheimer’s and autophagy research, we used this integrated approach (the above dataset and algorithms) to develop 41 distinct blood cancer AIML models starting from primary tumor type and histology (22). We contrasted 725 cancer-specific genes curated in the COSMIC cancer gene census, serving as the positive set, with 440 manually curated housekeeping genes that served as the negative set. The 41 AIML models identified the expected “frequent hitters,” such as GAPDH, AKT1, HRAS, TLR4, and TP53, all having well-understood roles in cancer. Other genes, such as IRAK3, EPHB1, ITPKB, ACVR2B, and CAMK2D, were predicted to be relevant in 10 or more hematology/oncology malignancies. In contrast, some genes were associated with just one cancer: for example, LPAR5, GPR18, and FCER2 are predicted to be relevant only in primary bone diffuse large B cell lymphoma (22). Cell-based validation studies for some of these genes are ongoing.
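The positive-versus-negative set design described above can be sketched as follows. This is an illustrative assumption about how such a labelled set might be assembled and split for cross-validation, not a description of the procedure used in (22). The gene identifiers are placeholders, and the two helper functions, `build_labelled_set` and `stratified_folds`, are hypothetical names introduced here.

```python
import random


def build_labelled_set(positive_genes, negative_genes):
    """Return {gene: label}, dropping genes that appear in both lists."""
    positives, negatives = set(positive_genes), set(negative_genes)
    ambiguous = positives & negatives
    labels = {gene: 1 for gene in positives - ambiguous}
    labels.update({gene: 0 for gene in negatives - ambiguous})
    return labels


def stratified_folds(labels, n_folds, seed=0):
    """Split genes into n_folds lists, keeping class ratios roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in (0, 1):
        # Sort first so the shuffle is reproducible for a given seed.
        genes = sorted(g for g, y in labels.items() if y == label)
        rng.shuffle(genes)
        # Deal genes of this class round-robin across the folds.
        for i, gene in enumerate(genes):
            folds[i % n_folds].append(gene)
    return folds


# Placeholder identifiers standing in for curated positive and negative genes.
POSITIVE = [f"POS_{i}" for i in range(10)] + ["SHARED_1"]
NEGATIVE = [f"NEG_{i}" for i in range(5)] + ["SHARED_1"]

LABELS = build_labelled_set(POSITIVE, NEGATIVE)
FOLDS = stratified_folds(LABELS, n_folds=5)
```

Dropping genes that occur in both lists is one simple way to avoid contradictory labels, and stratifying the folds keeps the positive-to-negative ratio similar in every held-out subset when the two sets differ in size.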
Although AI-based target selection in oncology primarily relies on gene-phenotype association models, it also offers other potential applications: 1) processing oncology biomarkers for therapeutic targeting; 2) enhancing the understanding of gene variants of uncertain significance (VUS) through in-depth context and real-world evidence; and 3) improving animal and preclinically validated model interpretation by incorporating human pathology and physiology.
Challenges and limitations of AIML technologies include: 1) data and information quality, where the maxim “garbage in, garbage out” underscores the importance of data veracity; 2) model interpretability, which is increasingly addressed through “explainable AI” to ensure that AIML models can be interpreted by humans and can aid decision-making in research and clinical development; and 3) awareness of data bias and leakage as well as ethical considerations, to prevent discriminatory practices and ensure fairness in model development.
The future of target selection in oncology is likely to incorporate AIML technologies. By processing vast datasets more rapidly and efficiently and by offering enhanced context for gene VUS, somatic mutations, and biomolecular pathways, AIML models are poised to improve target identification and validation for common and rare cancers.
Abbreviations
AIML – artificial intelligence and machine learning
GWAS – genome-wide association studies
LLMs – large language models
MoA – mechanism of action
NGS – Next-Generation Sequencing
TCGA – The Cancer Genome Atlas
VUS – variants of uncertain significance
Statements
Authors’ contributions: TIO drafted the paper and VP reviewed it.
Consent for publication: As the corresponding author, I confirm that the manuscript has been read and approved for submission by all named authors.
Conflict of interests: Tudor I Oprea is CEO of Expert Systems Inc. and Virgil Păunescu is Director and Founder of OncoGen Cancer Research Center.
Funding Sources: None
Statement of Ethics: This study was not subject to ethical review and approval due to its article format.
References
- Varmus, H. Of oncogenes and open science: an interview with Harold Varmus. Dis. Model. Mech. 12, (2019).
- Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 47, D941–D947 (2019).
- Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019).
- Hanahan, D. & Weinberg, R. A. The hallmarks of cancer. Cell 100, 57–70 (2000).
- Hanahan, D. Hallmarks of Cancer: New Dimensions. Cancer Discov. 12, 31–46 (2022).
- Liu, J. et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell 173, 400–416.e11 (2018).
- Hoelder, S., Clarke, P. A. & Workman, P. Discovery of small molecule cancer drugs: successes, challenges and opportunities. Mol. Oncol. 6, 155–176 (2012).
- Waarts, M. R., Stonestrom, A. J., Park, Y. C. & Levine, R. L. Targeting mutations in cancer. J. Clin. Invest. 132, (2022).
- O’Loughlin, T. A. & Gilbert, L. A. Functional Genomics for Cancer Research: Applications In Vivo and In Vitro. Annual Review of Cancer Biology 3, 345–363 (2019).
- Yang, X., Kar, S., Antoniou, A. C. & Pharoah, P. D. P. Polygenic scores in cancer. Nat. Rev. Cancer 23, 619–630 (2023).
- Black, J. R. M. & McGranahan, N. Genetic and non-genetic clonal diversity in cancer evolution. Nat. Rev. Cancer 21, 379–392 (2021).
- Marine, J.-C., Dawson, S.-J. & Dawson, M. A. Non-genetic mechanisms of therapeutic resistance in cancer. Nat. Rev. Cancer 20, 743–756 (2020).
- Lin, A. et al. Off-target toxicity is a common mechanism of action of cancer drugs undergoing clinical trials. Sci. Transl. Med. 11, (2019).
- Kaelin, W. G., Jr. Common pitfalls in preclinical cancer target validation. Nat. Rev. Cancer 17, 425–440 (2017).
- Hasselgren, C. & Oprea, T. I. Artificial Intelligence for Drug Discovery: Are We There Yet? Annu. Rev. Pharmacol. Toxicol. 64, 527–550 (2024).
- Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564–576.e16 (2017).
- OpenAI et al. GPT-4 Technical Report. arXiv cs.CL.
- Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, New York, NY, USA, 2016).
- Fu, G. et al. Predicting drug target interactions using meta-path-based semantic network analysis. BMC Bioinformatics 17, 160 (2016).
- Binder, J. et al. Machine learning prediction and tau-based screening identifies potential Alzheimer’s disease genes relevant to immunity. Commun. Biol. 5, 125 (2022).
- Ranjbar, M., Yang, J. J., Kumar, P., Byrd, D. R., Bearer, E. L. & Oprea, T. I. Autophagy dark genes: Can we find them with machine learning? Natural Sciences 3, e20220067 (2023).
- Quazi, M. et al. Abstract 3535: Seeking novel therapeutic targets in oncology using machine learning. Cancer Res. 84, 3535–3535 (2024).