
Linear and Nonlinear Signals

This repo contains the code to reproduce the results of the manuscript "The Effects of Nonlinear Signal on Expression-Based Prediction Performance". In short, we compare linear and nonlinear models on multiple prediction tasks and find that their predictive ability is roughly equivalent. This similarity holds despite the fact that predictive nonlinear signal exists in the data for each of the tasks.

[Figure: model comparison]

Installation

Python dependencies

The Python dependencies for this project are managed via Conda. To install them and activate the environment, use the following commands in bash:

conda env create --file environment.yml
conda activate linear_models
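
To confirm the environment was created and activated, you can use standard Conda commands:

conda env list    # linear_models should appear in the list
which python      # should point into the linear_models environment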

R setup

The R dependencies for this project are managed via renv. To set up renv for the repository, run the commands below in R from the linear_signal repo:

install.packages('renv')
renv::init()
renv::restore()
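
To double-check that the project library matches the lockfile, renv's status report (standard renv functionality, not specific to this repo) can also be run from the shell:

# from the repo root, report any packages out of sync with renv.lock
Rscript -e 'renv::status()'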

Sex prediction setup

Before running scripts involving sex prediction, you need to download the Flynn et al. labels from this link and put the downloaded labels in the saged/data directory. Because of the settings on the figshare repo, it isn't possible to incorporate that part of the data download into the Snakefile, so this step has to be done manually.
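
A sketch of the manual step, assuming the download produced a file called flynn_labels.csv (a hypothetical name; use whatever filename figshare actually gives you):

# hypothetical filename; substitute the file you downloaded from figshare
mkdir -p saged/data
mv ~/Downloads/flynn_labels.csv saged/data/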

Neptune setup

If you want to log training results, you will need to sign up for a free Neptune account here.

  1. The neptune module is already installed as part of the linear_models conda environment, but you'll need to grab an API token from the website.
  2. Create a neptune project for storing your logs.
  3. Store the token in secrets.yml in the format neptune_api_token: "<your_token>", and update the neptune_config file to use your info.
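
A minimal sketch of step 3, writing the token into secrets.yml in the format above (replace <your_token> with the token copied from the Neptune website):

# create secrets.yml at the repo root containing the API token
cat > secrets.yml <<'EOF'
neptune_api_token: "<your_token>"
EOF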

Reproducing results

The pipeline that downloads all the data and produces all the results shown in the manuscript is managed by Snakemake. To reproduce all results files and figures, run

snakemake -j <NUM_CORES>
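
Before launching the full run, a Snakemake dry run (a standard Snakemake flag, not specific to this repo) lists the jobs that would execute without actually running them:

# -n performs a dry run; -j 8 is only an example core count
snakemake -n -j 8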

Successfully running the full pipeline takes a few months on a single machine. For reference, my machine has 64 GB of RAM, an AMD Ryzen 7 3800XT processor, and an NVIDIA RTX 3090 GPU. You can get by with less RAM, VRAM, and fewer processor cores by reducing the degree of parallelism. I expect the analyses can fit comfortably on a machine with 32 GB of RAM and a ~1080 Ti GPU, but I haven't tested the pipeline in such an environment.

If you want to speed up the process and see similar results, you can run the pipeline without hyperparameter optimization with

snakemake -s no_hyperopt_snakefile -j <NUM_CORES>

If you are going to run the pipeline in a cluster environment, it may be helpful to read through the file slurm_snakefile. This blog post might also be useful.
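
As one illustration of what a cluster launch can look like (a sketch only; slurm_snakefile and the post linked above are the authoritative references, and the exact sbatch flags depend on your cluster):

# assumes the rules define threads and a mem_mb resource; adjust flags for your cluster
snakemake -s slurm_snakefile -j 50 \
    --cluster "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}"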

Intermediate steps

When running the full pipeline via Snakemake, the required data will be downloaded automatically (excluding the sex prediction labels mentioned in the section above). If you'd like to skip the data download (and save yourself about a week of downloading and processing), you can rehydrate this Zenodo archive into the data/ dir.
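
A sketch of what rehydrating the archive might look like, assuming a zip download (the actual filename and URL come from the Zenodo record linked above):

# placeholder URL and filename; substitute the values from the Zenodo record
wget -O linear_signal_data.zip "<zenodo_archive_url>"
unzip linear_signal_data.zip -d data/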

Likewise, if you'd like to download the results files, they can be found here. If you only need the saved models, they can be found here.

Directory Layout

File/dir          Description
Snakefile         Contains the rules Snakemake uses to run the full project
environment.yml   Lists the Python dependencies and their versions in a format readable by Conda
neptune.yml       Lists information for Neptune logging
secrets.yml       Stores the Neptune API token (see the Neptune setup section)
data/             Stores the raw and intermediate data files used for training models
dataset_configs/  Stores config information telling Dataset objects how to construct themselves
figures/          Contains images visualizing the results of the various analyses
logs/             Holds serialized versions of trained models
model_configs/    Stores config information for models, such as default hyperparameters
notebook/         Stores notebooks used for visualizing results
results/          Records the accuracies of the models on various tasks
src/              The source code used to run the analyses
test/             Tests for the source code (runnable with pytest; see the example below)
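
For example, the test suite can be run from the repo root with pytest after activating the conda environment:

conda activate linear_models
pytest test/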
