greenelab · jjc2718 · Aug 28, 2023 · Aug 17, 2023 · Aug 28, 2023 · Aug 28, 2023
diff --git a/.github/workflows/manubot.yaml b/.github/workflows/manubot.yaml
@@ -45,6 +45,7 @@ jobs:
       # Set SPELLCHECK to true/false for whether to check spelling in this action.
       # For workflow dispatch jobs, this SPELLCHECK setting will be overridden by the user input.
       SPELLCHECK: true
+      BUILD_DOCX: true
     defaults:
       run:
         shell: bash --login {0}

diff --git a/build/pandoc/defaults/docx.yaml b/build/pandoc/defaults/docx.yaml
@@ -2,7 +2,7 @@
 # Load on top of common defaults.
 to: docx
 output-file: output/manuscript.docx
-reference-doc: build/themes/default.docx
+reference-doc: build/themes/upenn-dissertation-template.docx
 resource-path:
  - '.'
  - content
diff --git a/build/themes/upenn-dissertation-template.docx b/build/themes/upenn-dissertation-template.docx
diff --git a/content/02.delete-me.md b/content/02.delete-me.md
diff --git a/content/02.introduction.md b/content/02.introduction.md
@@ -0,0 +1,5 @@
+## Chapter 1
+
+* Modeling strategies (copy existing review)
+
+* Machine learning for cancer transcriptomics (add some new text)
diff --git a/content/20.header.md b/content/20.header.md
@@ -0,0 +1,6 @@
+## Chapter 2: optimization strongly influences model selection in transcriptomic prediction
+
+This chapter has been posted as a preprint on bioRxiv (https://www.biorxiv.org/content/10.1101/2023.06.26.546586v1) and submitted for publication to Bioinformatics Advances as "Optimizer’s dilemma: optimization strongly influences model selection in transcriptomic prediction".
+
+_Contributions_: I designed and ran the experiments, created the figures, wrote the initial draft of the manuscript, and edited the manuscript. Maria Chikina gave feedback on an initial version of the manuscript, gave guidance on experimental design, and edited the manuscript. Casey S. Greene gave feedback and guidance on experiments, and edited the manuscript.
+
diff --git a/content/21.abstract.md b/content/21.abstract.md
@@ -0,0 +1,19 @@
+### Abstract
+
+#### Motivation
+
+Most models can be fit to data using various optimization approaches.
+While model choice is frequently reported in machine-learning-based research, optimizers are not often noted.
+We applied two implementations of LASSO logistic regression from Python's scikit-learn package, each using a different optimization approach (coordinate descent and stochastic gradient descent), to predict driver mutation presence or absence from gene expression across 84 pan-cancer driver genes.
+Across varying levels of regularization, we compared performance and model sparsity between optimizers.
+
+#### Results
+
+After model selection and tuning, we found that coordinate descent (implemented in the `liblinear` library) and SGD tended to perform comparably.
+`liblinear` models required more extensive tuning of regularization strength, performing best for high model sparsities (fewer nonzero coefficients), but did not require selection of a learning rate parameter.
+SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased.
+Given these tradeoffs, we believe that the choice of optimizers should be clearly reported as a part of the model selection and validation process, to allow readers and reviewers to better understand the context in which results have been generated.
+
+#### Availability and implementation
+
+The code used to carry out the analyses in this study is available at <https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification>. Performance/regularization strength curves for all genes in the Vogelstein et al. 2013 dataset are available at <https://doi.org/10.6084/m9.figshare.22728644>.
diff --git a/content/22.introduction.md b/content/22.introduction.md
@@ -0,0 +1,17 @@
+### Introduction
+
+Gene expression profiles are widely used to classify samples or patients into relevant groups or categories, both preclinically [@doi:10.1371/journal.pcbi.1009926; @doi:10.1093/bioinformatics/btaa150] and clinically [@doi:10.1200/JCO.2008.18.1370; @doi:10/bp4rtw].
+To extract informative gene features and to perform classification, a diverse array of algorithms exist, and different algorithms perform well across varying datasets and tasks [@doi:10.1371/journal.pcbi.1009926].
+Even within a given model class, multiple optimization methods can often be applied to find well-performing model parameters or to optimize a model's loss function.
+One commonly used example is logistic regression.
+The widely used scikit-learn Python package for machine learning [@url:https://jmlr.org/papers/v12/pedregosa11a.html] provides two classes for fitting logistic regression classifiers: `LogisticRegression`, which, when configured with the `liblinear` solver, uses a coordinate descent method [@url:https://www.jmlr.org/papers/v9/fan08a.html] to find parameters that optimize the logistic loss function, and `SGDClassifier`, which uses stochastic gradient descent [@online-learning] to optimize the same loss function.
+
+Using scikit-learn, we compared the `liblinear` (coordinate descent) and SGD optimization techniques for prediction of driver mutation status in tumor samples, across a wide variety of genes implicated in cancer initiation and development [@doi:10.1126/science.1235122].
+We applied LASSO (L1-regularized) logistic regression, and tuned the strength of the regularization to compare model selection between optimizers.
+We found that across a variety of models (i.e., varying regularization strengths), the training dynamics of the optimizers differed considerably: models fit using `liblinear` tended to perform best at fairly high regularization strengths (100-1000 nonzero features in the model) and overfit easily at low regularization strengths.
+On the other hand, after tuning the learning rate, models fit using SGD tended to perform well across both higher and lower regularization strengths, and overfitting was less common.
+
+Our results caution against viewing optimizer choice as a "black box" component of machine learning modeling.
+The observation that LASSO logistic regression models fit using SGD tended to perform well for low levels of regularization, across diverse driver genes, runs counter to conventional wisdom in machine learning for high-dimensional data, which generally holds that explicit regularization and/or feature selection is necessary.
+Comparing optimizers or model implementations directly is rare in applications of machine learning for genomics, and our work shows that this choice can significantly affect both the generalization and the interpretation of the model.
+Based on our results, we recommend considering the appropriate optimization approach carefully based on the goals of each individual analysis.