RSC-RP/rnaseq_count_nf

RNA-seq Alignment, QC, and Quantification Nextflow Pipeline

This pipeline uses publicly available modules from nf-core along with some locally created modules. Its primary function is to run the workflow on tens to thousands of samples in parallel on the Seattle Children's Cybertron HPC, using the PBS job scheduler and containerized scientific software.

First, follow the steps on this page to make a personal copy of this repository. Then follow the step-by-step instructions in workflow_docs/workflow_run.md to run the workflow.

About the Workflow

This workflow is designed to output gene expression counts from the STAR aligner using --quantMode. It also computes general QC statistics on the FASTQ files with FastQC and on the alignments with RSeQC. Finally, the QC reports are collected into a single file using MultiQC.

A DAG (directed acyclic graph) of the workflow is shown below:

Set-up the Environment

Code Repository

First, fork the repository from Children’s bitbucket. Do this by clicking the “create fork” symbol from the bitbucket web interface and fork it to your personal bitbucket account, as illustrated below.

Next, clone your personal repository to your home directory on Cybertron. See the image below for where to find the correct URL on your forked bitbucket repo.

Copy that URL to replace https://childrens-atlassian/bitbucket/scm/~jsmi26/rnaseq_count_nf.git below.

# on a terminal on the Cybertron login nodes
cd ~

# your fork should have your own userID (rather than jsmi26)
git clone https://childrens-atlassian/bitbucket/scm/~MY_USERID/rnaseq_count_nf.git
cd ~/rnaseq_count_nf

Once inside the code repository directory, use git to switch to the latest release branch, or to the same release that was used for a prior analysis.

git fetch
git branch -a

The git branch -a command will show all available branches, including remote branches, like:

* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/dev
  remotes/origin/main
  remotes/origin/release/1.1.2

Check out the most recent release branch, which is the one with the highest version number (e.g. use release/1.2.0 if available). For example, to check out release/1.0.0:

git checkout release/1.0.0

Git will report that you are now on the release/1.0.0 branch and that it is tracking the release branch in your personal repository.

Checking out files: 100% (55/55), done.
Branch release/1.0.0 set up to track remote branch release/1.0.0 from origin.
Switched to a new branch 'release/1.0.0'
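
Picking the highest release number by eye gets harder as release branches accumulate. The following is a minimal sketch, not part of the pipeline, that selects it from the git branch -a output. It assumes a sort that supports version ordering (-V, as in GNU coreutils) and that release branches are named release/X.Y.Z under remotes/origin, as in the listing above. The function name latest_release is hypothetical.

```shell
# Hypothetical helper: read `git branch -a` output on stdin and print the
# release branch with the highest version number (e.g. release/1.1.2).
latest_release() {
  # Keep only lines of the form "remotes/origin/release/<version>",
  # strip the "remotes/origin/" prefix, then sort by version.
  sed -n 's|^[* ]*remotes/origin/\(release/[0-9][0-9.]*\)$|\1|p' \
    | sort -V \
    | tail -n 1
}

# Usage inside the repository (assumes `git fetch` was run first):
#   git branch -a | latest_release
#   git checkout "$(git branch -a | latest_release)"
```

Version sorting matters here because a plain lexical sort would place release/1.10.0 before release/1.2.0.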

Conda Environment

Find your project code by listing all your projects on the Cybertron terminal.

# lists all HPC project names that you have access to use
project info

Start an interactive session on a compute node and activate the conda environment. It is also best practice to use tmux or screen so that, if the session is disconnected, your nextflow workflow (if running) won't end with a SIGKILL error.

Change the QUEUE and NAME variables in the code chunk below to be accurate for your Cybertron projects.

tmux new-session -s nextflow
# the variable 'NAME' will be an HPC project that you have access to
NAME="RSC_adhoc"
QUEUE="paidq"
qsub -I -q $QUEUE -P $(project code $NAME) -l select=1:ncpus=1:mem=8g -l walltime=8:00:00
cd ~/rnaseq_count_nf

If you don’t have conda installed yet, please follow these directions. You may stop following the directions after the conda deactivate step.

Next, for the conda environment to be solved, you will also need to set channel_priority to flexible in your conda configuration. To read more about conda environments and their configuration, check out the documentation.

# check config settings
conda config --describe channel_priority # print your current conda settings
conda config --set channel_priority flexible # set to flexible if not already done

# Create the environment only once. Skip this step if you've already created the environment
conda env create -f env/nextflow.yaml
# Activate the conda environment. 
conda activate nextflow
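
Because the environment should only be created once, it can help to check for it before running conda env create. The sketch below is one way to do that by parsing the output of conda env list. The function name env_exists is hypothetical, and the check assumes the environment name appears as the first column of that output.

```shell
# Hypothetical helper: read `conda env list` output on stdin and succeed
# (exit status 0) only if an environment with the given name is listed.
env_exists() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage (creates the environment only when it is missing):
#   if ! conda env list | env_exists nextflow; then
#     conda env create -f env/nextflow.yaml
#   fi
#   conda activate nextflow
```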

Optional: Conda/Mamba at SCRI

SCRI uses a TLS and/or SSL certificate to inspect web traffic, and it is specific to SCRI. Nextflow itself orchestrates many types of downloads, such as genomic references, scientific software images from public repositories, and conda packages.

If you are running into SSL errors, you will need to configure your conda installation to use SCRI certificates.

Please see Research Scientific Computing for more help in getting set-up and this bitbucket repo for the current certificates.
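
One common way to make conda trust an institutional certificate is to point its ssl_verify setting at the certificate file. The sketch below illustrates that approach under stated assumptions and is not SCRI's documented procedure: the function name use_cert_for_conda and the example path are hypothetical, and the certificate is assumed to be in PEM format.

```shell
# Hypothetical helper: run basic sanity checks on a certificate file, then
# tell conda to use it when verifying SSL connections.
use_cert_for_conda() {
  cert="$1"
  if [ ! -r "$cert" ]; then
    echo "certificate not readable: $cert" >&2
    return 1
  fi
  # PEM files contain at least one BEGIN CERTIFICATE marker
  if ! grep -q 'BEGIN CERTIFICATE' "$cert"; then
    echo "not a PEM certificate: $cert" >&2
    return 1
  fi
  conda config --set ssl_verify "$cert"
}

# Usage (the path is a placeholder, not a real SCRI location):
#   use_cert_for_conda "$HOME/certs/my_org_ca.pem"
```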

Run the pipeline

Open the step-by-step instructions to run the workflow in workflow_docs/workflow_run.md.

Authors

Acknowledgements

This pipeline was generated using the nf-core tools CLI suite and publicly available modules from nf-core.

The nf-core project came about at the start of 2018. Phil Ewels (@ewels) was the head of the development facility at NGI Stockholm (National Genomics Infrastructure), part of SciLifeLab in Sweden.

The NGI had been developing analysis pipelines for use with its genomics data for several years and started using a set of standards for each pipeline created. This helped other people run the pipelines on their own systems; typically Swedish research groups at first, but later on other groups and core genomics facilities too, such as QBIC in Tübingen.

As the number of users and contributors grew, the pipelines began to outgrow the SciLifeLab and NGI branding. To try to open up the effort into a truly collaborative project, nf-core was created and all relevant pipelines moved to this new GitHub Organisation.

The early days of nf-core were greatly shaped by Alex Peltzer (@apeltzer), Sven Fillinger (@sven1103) and Andreas Wilm (@andreas-wilm). Without them, the project would not exist.
