greenelab · jjc2718 · Aug 29, 2023
diff --git a/content/20.header.md b/content/20.header.md
@@ -2,5 +2,6 @@
 
 This chapter has been posted as a preprint on bioRxiv (https://www.biorxiv.org/content/10.1101/2023.06.26.546586v1) and submitted for publication at Bioinformatics Advances as "Optimizer’s dilemma: optimization strongly influences model selection in transcriptomic prediction".
 
-_Contributions_: I designed and ran the experiments, created the figures, wrote the initial draft of the manuscript, and edited the manuscript. Maria Chikina gave feedback on an initial version of the manuscript, gave guidance on experimental design, and edited the manuscript. Casey S. Greene gave feedback and guidance on experiments, and edited the manuscript.
+**Contributions**:
+I designed and ran the experiments, created the figures, wrote the initial draft of the manuscript, and edited the manuscript. Maria Chikina gave feedback on an initial version of the manuscript, gave guidance on experimental design, and edited the manuscript. Casey S. Greene gave feedback and guidance on experiments, and edited the manuscript.
 
diff --git a/content/34.results.md b/content/34.results.md
@@ -2,15 +2,15 @@
 
 #### Using diverse data modalities to predict cancer alterations
 
-We collected five different data modalities from cancer samples in the TCGA Pan-Cancer Atlas, capturing five steps of cellular function that are perturbed by genetic alterations in cancer (Figure {@fig:overview}A).
+We collected five different data modalities from cancer samples in the TCGA Pan-Cancer Atlas, capturing five steps of cellular function that are perturbed by genetic alterations in cancer (Figure {@fig:omics_overview}A).
 These included gene expression (RNA-seq data), DNA methylation (27K and 450K Illumina BeadChip arrays), protein abundance (RPPA data), microRNA expression data, and patterns of somatic mutation (mutational signatures).
-To link these diverse data modalities to changes in mutation status, we used elastic net logistic regression to predict the presence or absence of mutations in cancer genes, using these readouts as predictive features (Figure {@fig:overview}B).
-We evaluated the resulting mutation status classifiers in a pan-cancer setting, preserving the proportions of each of the 33 cancer types in TCGA for eight train/test splits (4 folds x 2 replicates) in each of approximately 250 cancer genes (Figure {@fig:overview}C).
+To link these diverse data modalities to changes in mutation status, we used elastic net logistic regression to predict the presence or absence of mutations in cancer genes, using these readouts as predictive features (Figure {@fig:omics_overview}B).
+We evaluated the resulting mutation status classifiers in a pan-cancer setting, preserving the proportions of each of the 33 cancer types in TCGA for eight train/test splits (4 folds x 2 replicates) in each of approximately 250 cancer genes (Figure {@fig:omics_overview}C).
 
 We sought to compare classifiers against a baseline where mutation labels are permuted (to identify genes whose mutation status correlates strongly with a functional signature in a given data type) and also to compare classifiers trained on true labels across different data types (to identify data types that are more or less predictive of mutations in a given gene).
 To account for variation between dataset splits in making these comparisons, we treat classification metrics from the eight train/test splits as performance distributions, which we compare using _t_-tests.
 We summarize performance across all genes in our cancer gene set using a similar approach to a volcano plot, in which each point is a gene.
-In our summary plots, the x-axis shows the magnitude of the change in the classification metric between conditions, and the y-axis shows the _p_-value for the associated _t_-test (Figure {@fig:overview}C).
+In our summary plots, the x-axis shows the magnitude of the change in the classification metric between conditions, and the y-axis shows the _p_-value for the associated _t_-test (Figure {@fig:omics_overview}C).
 
 ![
 **A.** Cancer mutations can perturb cellular function via a variety of cellular processes.
@@ -20,7 +20,7 @@ Note that this does not reflect all possible relationships between cellular proc
 In this study, we use functional readouts from TCGA as predictive features and the presence or absence of mutation in a given gene as labels.
 This reverses the primary direction of information flow shown in Panel A.
 **C.** Schematic of evaluation pipeline.
-](images/omics/figure_1.png){#fig:overview}
+](images/omics/figure_1.png){#fig:omics_overview}
 
 #### Selection of cancer-related genes improves predictive signal
 

diff --git a/content/40.header.md b/content/40.header.md
@@ -0,0 +1,6 @@
+## Chapter 4: Smaller models do not exhibit superior generalization performance
+
+This chapter has been posted as a preprint on bioRxiv (TODO) under the title "Smaller models do not exhibit superior generalization performance".
+
+**Contributions**:
+I designed and ran the experiments, created the figures, wrote the initial draft of the manuscript, and edited the manuscript. Casey S. Greene gave feedback and guidance on experiments, and edited the manuscript.
diff --git a/content/41.abstract.md b/content/41.abstract.md
@@ -0,0 +1,11 @@
+### Abstract
+
+Existing guidelines in statistical modeling for genomics hold that simpler models have advantages over more complex ones.
+Potential advantages include cost, interpretability, and improved generalization across datasets or biological contexts.
+In cancer transcriptomics, this manifests as a preference for small "gene signatures", or groups of genes whose expression is used to define cancer subtypes or suggest therapeutic interventions.
+To test the assumption that small gene signatures generalize better, we examined the generalization of mutation status prediction models across datasets (from cell lines to human tumors and vice versa) and contexts (holding out entire cancer types from pan-cancer data).
+We compared two simple procedures for model selection, one that exclusively relies on cross-validation performance and one that combines cross-validation performance with regularization strength.
+We did not observe that more regularized signatures generalized better.
+This result held across multiple prediction problems and for both linear models (LASSO logistic regression) and non-linear ones (neural networks).
+When the goal of an analysis is to produce generalizable predictive models, we recommend choosing the ones that perform best on held-out data or in cross-validation, instead of those that are smaller or more regularized.
+
diff --git a/content/42.introduction.md b/content/42.introduction.md
@@ -0,0 +1,26 @@
+### Introduction
+
+Gene expression datasets are typically "wide", with many gene features and relatively few samples.
+These feature-rich datasets present obstacles in many aspects of machine learning, including overfitting, multicollinearity, and challenges in interpretation.
+To facilitate the use of feature-rich gene expression data in machine learning models, feature selection and/or dimension reduction are commonly used to distill a more condensed data representation from the input space of all genes [@doi:10.1093/bioinformatics/btg062; @doi:10.1186/s13059-019-1861-6].
+The intuition is that many gene expression features are likely irrelevant to the prediction problem, redundant, or contain no meaningful variation across samples, so transforming them or selecting a subset can generate a more reliable predictor.
+
+In cancer transcriptomics, this preference for small, parsimonious sets of genes can be seen in the popularity of "gene signatures".
+These are groups of genes whose expression levels are used to define cancer subtypes or to predict prognosis or therapeutic response [@doi:10.1038/nrg.2017.96; @doi:10.1016/j.ejca.2013.02.021].
+Many studies specify the size of the signature in the paper's title or abstract, suggesting that the fewer genes in a gene signature, the better, e.g. [@doi:10.1056/NEJMoa060096; @doi:10.1158/0008-5472.CAN-08-0436; @doi:10.1056/NEJMoa1602253].
+Clinically, there are many reasons why a smaller gene signature may be preferable, including cost (fewer genes may be less expensive to profile or validate, whereas a large signature likely requires a targeted array or NGS analysis [@doi:10.1586/erm.09.32]) and interpretability (it is easier to reason about the function and biological role of a smaller gene set than a larger one, since even disjoint gene signatures tend to converge on common biological pathways [@doi:10.1056/NEJMe068292; @doi:10.1038/nrclinonc.2011.125]).
+There is also an underlying assumption that smaller gene signatures tend to be more robust: that for a new patient or in a new biological context, a smaller gene set or more parsimonious model will be more likely to maintain its predictive performance than a larger one.
+This assumption has rarely been explicitly tested in genomics applications, but it is often included in guidelines or rules of thumb for statistical modeling or machine learning in biology, e.g. [@doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961], and it is related in spirit to information-theoretic model selection approaches such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) [@doi:10.1109/TAC.1974.1100705; @doi:10.1214/aos/1176344136].
+
+In this study, we sought to test the robustness assumption directly by evaluating model generalization across biological contexts, inspired by previous work on domain adaptation and transfer learning in cancer transcriptomics [@doi:10.1038/s43018-020-00169-2; @doi:10.1038/s42256-021-00408-w; @doi:10.1073/pnas.2106682118].
+We used two large, heterogeneous public cancer datasets: The Cancer Genome Atlas (TCGA) for human tumor sample data [@doi:10.1038/ng.2764], and the Cancer Cell Line Encyclopedia (CCLE) for human cell line data [@doi:10.1038/s41586-019-1186-3].
+These datasets contain overlapping -omics data types derived from distinct data sources, allowing us to quantify model generalization across data sources.
+In addition, each dataset contains samples from a wide range of different cancer types/tissues of origin, allowing us to quantify model generalization across cancer types.
+We trained both linear and non-linear models to predict mutation status (presence or absence) from RNA-seq gene expression for approximately 70 cancer driver genes, across varying levels of model simplicity and degrees of regularization, resulting in a variety of gene signature sizes.
+We compared two simple procedures for model selection, one that combines cross-validation performance with model parsimony and one that only relies on cross-validation performance, for each classifier in each context.
+
+Our results suggest that, in general, mutation status classification models that perform well in cross-validation within a biological context also generalize well across biological contexts.
+There are some individual genes and some individual cancer types where more regularized well-performing models outperform the best-performing model.
+However, we do not observe a systematic generalization advantage for smaller/more regularized models across all genes and cancer types.
+These results provide evidence that good cross-validation performance within a biological context (data source or cancer type) is a sufficient proxy for robust performance across contexts.
+