Merge pull request #4 from jjc2718/background_header

Chapter 1: intro and data/applications section
greenelab · Sep 4, 2023 · 00639a0
2 parents fe81603 + a797617
commit 00639a0

Showing 16 changed files with 19,126 additions and 0 deletions.
diff --git a/content/10.background_header.md b/content/10.background_header.md
@@ -0,0 +1,8 @@
+## Chapter 1: Background
+
+This chapter was formatted for this dissertation to provide background information and context for the following chapters. Some elements of the second subsection on machine learning modeling techniques were previously published in _Current Opinion in Biotechnology_ as "Incorporating biological structure into machine learning models in biomedicine" (https://doi.org/10.1016/j.copbio.2019.12.021).
+
+**Contributions:**
+For the unpublished parts of this chapter, I was the sole author.
+For the published parts of this chapter, I wrote the original draft of the review paper, which was edited based on feedback from Casey S. Greene and anonymous reviewers.
+
diff --git a/content/11.introduction.md b/content/11.introduction.md
@@ -0,0 +1,28 @@
+### Introduction
+
+Precision oncology, or the selection of cancer treatments based on molecular or cellular features of patients' tumors, has become a fundamental part of the standard of care for some cancers [@doi:10.1093/annonc/mdx707].
+Although each tumor is unique, the successes of precision oncology reinforce the idea that there are commonalities that can be understood and therapeutically targeted.
+Targeted therapies that have been successfully applied across cancer types and patient subsets include _HER2_ (_ERBB2_) inhibitors in breast and stomach cancer [@doi:10.1093/jnci/djp341], BTK inhibitors in various hematological malignancies [@doi:10.1186/s13045-022-01353-w], and _EGFR_ inhibitors across a variety of carcinomas [@doi:10.1186/s13045-022-01311-6].
+The genes and mutations that drive cancer are often specific to a given cancer type or subtype, but they tend to converge on a few pathways [@doi:10.1016/j.cell.2018.03.035; @doi:10.1016/j.cell.2020.11.045], making more general targeted treatments possible.
+
+The past decade has seen an expansion in the size and diversity of cancer genomics datasets, both publicly available and otherwise.
+The Cancer Genome Atlas (TCGA) Pan-Cancer Atlas [@pancanatlas] is a large, public dataset of human tumor samples, containing more than 10,000 samples from 33 different cancer types, each profiled for varying -omics types with associated clinical information.
+There are also public datasets containing model system data, including the Cancer Cell Line Encyclopedia (CCLE), which contains -omics data from human-derived cancer cell lines [@doi:10.1038/s41586-019-1186-3], and the Genomics of Drug Sensitivity in Cancer (GDSC) dataset, which contains drug sensitivity data for many of the same cell lines across hundreds of drugs [@doi:10.1093/nar/gks1111].
+These datasets exhibit heterogeneity on multiple levels.
+Overall, they vary in size, with TCGA containing about an order of magnitude more tumor samples than CCLE has cell lines.
+Going a level deeper, the cancer types within them vary in size as well: TCGA has 1,218 breast cancer samples with gene expression data, but only 265 soft tissue sarcoma samples, and only 45 cholangiocarcinoma samples.
+
+In modern machine learning research using text and images, there is a trend toward bigger models capable of solving broader arrays of tasks.
+Foundation models, trained on large datasets to generalize to new tasks with no or minimal task-specific fine-tuning, are in many cases competitive with task-specific models [@arxiv:2205.09911], although they are not without unique caveats [@arxiv:2108.07258].
+Similarly, in genomics, early examples of foundation models are beginning to appear [@arxiv:2306.15794; @doi:10.1101/2023.04.30.538439; @doi:10.1101/2023.05.29.542705].
+Training foundation models on pan-cancer, pan-omics data would be a natural extension of these ideas, and could improve power to detect correlations between biomarkers and phenotypes of interest or to identify drug susceptibilities in patient sub-populations.
+
+As a whole, this dissertation explores ways in which the structure of large, public pan-cancer datasets can present unexpected challenges and caveats for machine learning.
+TCGA and CCLE both contain data from various -omics types (feature groups) and samples from diverse cancer types/tissues of origin (sample groups).
+There are additional, less obvious forms of structure in these data, such as patient sub-populations and sample collection locations, which we do not address directly in this dissertation but which can also affect model training and performance.
+
+This chapter, Chapter 1, describes existing work at the intersection of cancer -omics and machine learning, which will provide context for the following chapters.
+In Chapter 2, we show that the choice of optimization method can affect model selection and tuning when predicting from cancer transcriptomic data.
+Chapter 3 explores the relative information content of -omics types/feature groups in TCGA, showing that gene expression tends to contain the most information about cancer driver mutations, but most -omics types can serve as effective, and likely somewhat redundant, readouts.
+In Chapter 4, we test generalization across cancer types in TCGA and across datasets (CCLE to TCGA and vice versa), showing that smaller models do not tend to generalize better across contexts, and that cross-validation performance is in most cases a sufficient model selection criterion.
+Finally, in Chapter 5, we conclude by summarizing the implications of these results and discussing future directions.
diff --git a/content/12.data_review.md b/content/12.data_review.md
@@ -0,0 +1,43 @@
+## Cancer -omics data and applications
+
+### Publicly available cancer -omics data resources
+
+A wealth of public cancer genomics and multi-omics human sample resources have been generated in the past decade.
+As mentioned in the introduction, the TCGA Pan-Cancer Atlas [@pancanatlas] contains data spanning 33 cancer types and multiple -omics data types, including mutation, copy number variation (CNV), gene expression, miRNA, DNA methylation, and reverse phase protein array (RPPA) proteomics data, as well as clinical outcome data [@doi:10.1016/j.cell.2018.02.052].
+The International Cancer Genome Consortium (ICGC) data portal is an initiative to unite and harmonize data from many worldwide cancer projects including TCGA, mostly focused on DNA/somatic mutation data but containing some gene expression and other -omics data [@doi:10.1038/s41587-019-0055-9].
+The Pan-Cancer Analysis of Whole Genomes (PCAWG) project attempts to expand from the whole-exome sequencing provided by TCGA to whole-genome sequencing, providing data and analysis for 2,658 whole-genome cancer samples [@doi:10.1038/s41586-020-1969-6].
+The American Association for Cancer Research (AACR)'s Project GENIE (Genomics, Evidence, Neoplasia, Information, Exchange) is another large-scale initiative to share genomic data, with the intention of complementing TCGA and allowing for external validation of methods and biological findings [@doi:10.1158/2159-8290.CD-17-0151].
+Unlike TCGA, which contains whole-exome sequencing data, the GENIE dataset consists primarily of data from targeted sequencing panels covering a subset of cancer-relevant genes.
+
+In addition to samples derived from human tumors or neoplasms, data from model systems such as cancer cell lines and mouse models are an important element of therapeutic development.
+The Cancer Cell Line Encyclopedia (CCLE) contains a variety of uniformly processed -omics data across more than 1,000 human-derived cell lines, including somatic mutations, CNV data, gene fusion information, and gene expression [@doi:10.1038/s41586-019-1186-3].
+The Cancer Dependency Map (DepMap) complements CCLE with information about cancer cell line vulnerabilities, derived from CRISPR and RNAi knockout screens [@doi:10.1038/s41467-021-21898-7; @doi:10.1038/s41588-021-00819-w].
+The Connectivity Map (CMap) and LINCS L1000 project aims to catalog the responses of cell lines to both genetic and pharmacological perturbations, identifying the changes to gene expression and protein expression that result [@doi:10.1016/j.cell.2017.10.049].
+The GDSC and PRISM drug screening datasets provide cell viability dose-response readings for many of the cell lines in CCLE, after perturbation with small molecules [@doi:10.1093/nar/gks1111; @doi:10.1038/s43018-019-0018-6].
+Aside from cell lines, the PDX Encyclopedia is a dataset of patient-derived xenograft (PDX) mouse model data, including more than 1,000 models with mutation, CNV, and gene expression data for each [@doi:10.1038/nm.3954].
+The National Cancer Institute's Patient-Derived Models Repository (PDMR) also contains mutation and gene expression profiles for mouse models and patient-derived tumor organoids, or tumoroids [@doi:10.1038/s41467-021-25177-3; @pdmr], although it is still under development.
+
+### Applications of machine learning in cancer genomics
+
+Historically, one common use of -omics data in cancer has been to define subtypes, or clinically relevant patient subsets that may have similar prognosis or respond similarly to therapy.
+Many studies have sought to distinguish tumor samples from control/normal samples, to identify subtypes of a particular cancer type, or to distinguish samples of a particular cancer type/tissue of origin from samples of other cancer types (e.g. [@doi:10.1186/s12920-020-0677-2; @doi:10.3389/fbioe.2020.00737; @doi:10.1109/TBME.2012.2225622; @doi:10.1186/s13073-023-01176-5]).
+External validation is difficult, however, since samples in TCGA were taken from patients who had already been clinically diagnosed with a particular cancer type or subtype, i.e. without using machine learning.
+A potentially more clinically relevant way to frame the problem is to classify cancers of unknown primary (CUP), which are metastatic cancers for which the primary site cannot be identified in the clinic.
+Machine learning approaches have identified cell lineages and developmental trajectories for CUP samples [@doi:10.1158/2159-8290.CD-21-1443] and integrated electronic health record (EHR) data and genomic data to suggest targeted therapies for CUP patients [@doi:10.1038/s41591-023-02482-6].
+Relatedly, distinguishing primary samples from metastatic samples, or predicting the metastatic potential of primary samples, is another classification problem for which -omics data have been used [@doi:10.1038/s41467-019-13825-8; @doi:10.1101/2020.09.07.286583; @doi:10.1371/journal.pcbi.1009956].
+
+Prediction of drug response from genomic data, often combined with clinical features or other metadata, is a machine learning problem with clear clinical applications.
+Given the availability and uniformity of the cell line data in CCLE, and drug response data in GDSC, PRISM and other cell line datasets, many method development efforts have centered on these data sources.
+Examples include prediction of drug response from integrated multi-omics data [@doi:10.1093/bioinformatics/btz318], prediction of drug response using perturbation modeling via CMap as an intermediate step [@doi:10.1093/bioinformatics/btz158], and prediction of drug response via single-cell transcriptomic data [@doi:10.1101/2022.01.11.475728], among many others reviewed in [@doi:10.1093/bib/bbab294; @doi:10.1038/s41467-022-34277-7; @doi:10.1038/s41598-023-39179-2].
+Large datasets of human-derived genomic data with associated drug response annotations are more difficult to find.
+Still, there have been attempts to develop and/or validate models on human data, including for prediction of response to immunotherapy, which is applied across a wide range of cancer types [@doi:10.1038/s41587-021-01070-8; @doi:10.1016/j.ccell.2023.06.006; @doi:10.1101/2020.09.03.260265].
+Prognosis or patient survival prediction from multi-omics data is another area of modeling that leverages widely available clinical metadata, reviewed in detail in many existing papers [@doi:10.1186/1471-2288-12-102; @doi:10.1093/bib/bbu003; @doi:10.1186/s12885-021-08796-3; @doi:10.1016/j.csbj.2014.11.005].
+
+Much of our work, described later in this dissertation, stems from the idea of predicting the mutation status of key driver genes in cancer samples, based on functional readouts such as gene expression [@doi:10.1158/1078-0432.CCR-13-1943; @doi:10.1016/j.celrep.2018.03.046; @doi:10.1186/s13059-020-02021-3; @doi:10.1371/journal.pone.0241514].
+At first consideration, an accurate mutation status classifier may not seem particularly useful, since for a patient sample a clinician could simply sequence the genome, or select genes in the genome, to determine driver mutation status.
+One application of accurate mutation status classifiers, however, is to identify samples with a similar phenotype to those with a driver mutation, but _without_ the mutation being present in DNA sequencing data.
+Observed examples of this phenomenon include the "BRCAness" phenotype in tumors without observed _BRCA1_/_BRCA2_ mutations [@doi:10.1038/nrc.2015.21], and the "Ph-like" leukemia phenotype in the absence of the Philadelphia chromosome fusion [@doi:10.1182/asheducation-2016.1.561], among others.
+Following this line of reasoning, algorithms have been developed to identify mutations that "phenocopy" known cancer drivers [@doi:10.1142/9789811215636_0031; @doi:10.1101/2022.07.28.501874], and to integrate this information into drug response prediction pipelines to define larger and more accurate patient subgroups [@doi:10.1038/s41525-022-00328-7].
+Related machine learning approaches to genomic prediction/phenotype identification include methods for identifying DNA damage repair deficiencies based on genomic data [@doi:10.1038/nm.4292; @doi:10.1038/s43018-022-00474-y] and for identifying synthetic lethal relationships for use in targeted therapy selection [@doi:10.1016/j.cell.2021.03.030].
+Such methods could be useful for defining patient groups that are broader and more representative than would be possible based solely on somatic mutation status, and whose members may exhibit similar tumor phenotypes or respond to similar therapies.
+For example, in "basket" clinical trials where patients are included across cancer types based on the presence or absence of individual molecular markers [@doi:10.1200/jco.2014.58.2007], including "phenocopies" could improve efficacy for some targeted therapies.