
Commit

fix max feature count
jjc2718 committed Nov 8, 2023
1 parent b4cb1c7 commit e6c6500
Showing 22 changed files with 6,624 additions and 3,249 deletions.
6 changes: 3 additions & 3 deletions content/02.main-text.md
@@ -84,7 +84,7 @@ The `optimal` learning rate schedule is used by default.

When we compared these four approaches, we used a constant learning rate of 0.0005, and an initial learning rate of 0.1 for the `adaptive` and `invscaling` schedules.
We also tested a fifth approach that we called "`constant_search`", in which we tested a range of constant learning rates in a grid search on a validation dataset, then evaluated the model on the test data using the best-performing constant learning rate by validation AUPR.
-For the grid search, we used the following range of constant learning rates: {0.000005, 0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.01}.
+For the grid search, we used the following range of constant learning rates: {0.00001, 0.0001, 0.001, 0.01}.
Unless otherwise specified, results for SGD in the main paper figures used the `constant_search` approach, which performed the best in our comparison between schedulers.
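
As a rough illustration of the `constant_search` approach, the sketch below grid-searches constant learning rates for a scikit-learn `SGDClassifier` and selects the best by validation AUPR (average precision). The synthetic dataset and the l1 penalty are assumptions for illustration (the grid values come from the updated text above); the paper's actual training code may differ.

```python
# Minimal sketch of "constant_search" (assumed scikit-learn setup, not
# necessarily the paper's pipeline): grid-search constant learning rates
# for SGD logistic regression and pick the best by validation AUPR.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=100, random_state=42)

search = GridSearchCV(
    SGDClassifier(
        loss="log_loss",           # logistic regression objective ("log" in older sklearn)
        penalty="l1",              # sparsity-inducing penalty (an assumption here)
        learning_rate="constant",  # fixed learning rate, set via eta0
        random_state=42,
    ),
    param_grid={"eta0": [0.00001, 0.0001, 0.001, 0.01]},  # rates from the text above
    scoring="average_precision",   # AUPR on held-out validation folds
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```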

### DepMap gene essentiality prediction
@@ -116,11 +116,11 @@ Previous work has shown that pan-cancer classifiers of Ras mutation status are a
We first evaluated models for KRAS mutation prediction.
As model complexity increases (more nonzero coefficients) for the `liblinear` optimizer, we observed that performance increases then decreases, corresponding to overfitting for high model complexities/numbers of nonzero coefficients (Figure {@fig:optimizer_compare_mutations}A).
On the other hand, for the SGD optimizer, we observed consistent performance as model complexity increases, with models having no nonzero coefficients performing comparably to the best (Figure {@fig:optimizer_compare_mutations}B).
-In this case, top performance for SGD (a regularization parameter of 10^-1^) is slightly better than top performance for `liblinear` (a regularization parameter of 1 / 3.16 x 10^2^): we observed a mean test AUPR of 0.722 for SGD vs. mean AUPR of 0.692 for `liblinear`.
+In this case, top performance for SGD (a regularization parameter of 3.16 x 10^-3^) is slightly better than top performance for `liblinear` (a regularization parameter of 1 / 3.16 x 10^2^): we observed a mean test AUPR of 0.725 for SGD vs. mean AUPR of 0.685 for `liblinear`.
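
A hedged sketch of the kind of regularization sweep behind panels A and B: fit l1-regularized logistic regression with each optimizer across a range of regularization strengths, recording test AUPR and the number of nonzero coefficients. The synthetic data and parameter grids are illustrative placeholders, not the study's settings.

```python
# Illustrative regularization sweep (synthetic data; grids are placeholders).
# Note the parametrizations differ: liblinear's C is inverse regularization
# strength, while SGD's alpha is direct regularization strength.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=100, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in np.logspace(-3, 3, 7):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X_tr, y_tr)
    aupr = average_precision_score(y_te, clf.decision_function(X_te))
    print(f"liblinear C={C:.3g}: AUPR={aupr:.3f}, nonzero={np.count_nonzero(clf.coef_)}")

for alpha in np.logspace(-5, 1, 7):
    clf = SGDClassifier(loss="log_loss", penalty="l1", alpha=alpha,
                        random_state=0).fit(X_tr, y_tr)
    aupr = average_precision_score(y_te, clf.decision_function(X_te))
    print(f"SGD alpha={alpha:.3g}: AUPR={aupr:.3f}, nonzero={np.count_nonzero(clf.coef_)}")
```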

To determine how relative performance trends with `liblinear` tend to compare across the genes in the Vogelstein dataset at large, we looked at the difference in performance between optimizers for the best-performing models for each gene (Figure {@fig:optimizer_compare_mutations}C).
The distribution is centered around 0 and more or less symmetrical, suggesting that across the gene set, `liblinear` and SGD tend to perform comparably to one another.
-We saw that for 52/84 genes, performance for the best-performing model was better using SGD than `liblinear`, and for the other 32 genes performance was better using `liblinear`.
+We saw that for 58/84 genes, performance for the best-performing model was better using SGD than `liblinear`, and for the other 25 genes performance was better using `liblinear`.
In order to quantify whether the overfitting tendencies (or lack thereof) also hold across the gene set, we plotted the difference in performance between the best-performing model and the largest (least regularized) model; classifiers with a large difference in performance exhibit strong overfitting, and classifiers with a small difference in performance do not overfit (Figure {@fig:optimizer_compare_mutations}D).
For SGD, the least regularized models tend to perform comparably to the best-performing models, whereas for `liblinear` the distribution is wider suggesting that overfitting is more common.
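
The per-gene comparison in panel C reduces to a simple difference of best-model scores per gene. Below is a sketch under assumed, hypothetical names for a results table (`gene`, `optimizer`, `test_aupr`), with toy values (the KRAS pair borrows the means quoted in the text above):

```python
# Sketch of the panel C summary; the results table and its column names
# ("gene", "optimizer", "test_aupr") are hypothetical stand-ins.
import matplotlib.pyplot as plt
import pandas as pd

results = pd.DataFrame({
    "gene":      ["KRAS", "KRAS", "TP53", "TP53"],
    "optimizer": ["liblinear", "sgd", "liblinear", "sgd"],
    "test_aupr": [0.685, 0.725, 0.80, 0.78],  # toy values for illustration
})

best = results.pivot(index="gene", columns="optimizer", values="test_aupr")
diff = best["liblinear"] - best["sgd"]  # positive = liblinear better
print((diff > 0).sum(), "genes favor liblinear;", (diff < 0).sum(), "favor SGD")

diff.plot.hist(bins=20)  # distribution of per-gene differences, as in panel C
plt.show()
```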

2 changes: 1 addition & 1 deletion content/91.supp-info.md
@@ -1,6 +1,6 @@
## Supplementary Material {.page_break_before}

-![Number of nonzero coefficients (model sparsity) across varying regularization parameter settings for KRAS mutation prediction using SGD and `liblinear` optimizers.](images/supp_figure_1.png){#fig:compare_sparsity tag="S1" width="100%"}
+![Number of nonzero coefficients (model sparsity) across varying regularization parameter settings for KRAS mutation prediction using SGD and `liblinear` optimizers, and averaged across all genes for both optimizers. In the "all genes" plot, the black dotted line shows the median parameter selected for `liblinear`, and the grey dotted line shows the median parameter selected for SGD.](images/supp_figure_1.png){#fig:compare_sparsity tag="S1" width="100%"}

![Distribution of performance difference between best-performing model for `liblinear` and SGD optimizers, across all 84 genes in Vogelstein driver gene set, for varying SGD learning rate schedulers. Positive numbers on the x-axis indicate better performance using `liblinear`, and negative numbers indicate better performance using SGD.](images/supp_figure_2.png){#fig:compare_all_lr tag="S2" width="100%" .page_break_before}

Binary file modified content/images/figure_1.png
