Running the HMF pipeline with oncoanalyser

Overview

The recommended way to run hartwigmedical/hmftools workflows or components is with oncoanalyser, a Nextflow implementation of the HMF pipeline.

A principal aim of oncoanalyser is to provide the HMF pipeline in a highly accessible form that is usable with a minimal set of inputs. This is achieved through flexible predefined configuration for individual tools, prebuilt Docker images retrieved at runtime for each process, and automated on-demand staging of reference genomes and resource files. The only required input to run an analysis with oncoanalyser is a samplesheet listing the sample inputs.

Both the WGS/WTS and targeted sequencing workflows are available in oncoanalyser. The targeted sequencing workflow has built-in support for the TSO500 panel and can also analyse any custom panel data where the required panel-specific normalisation data is available.

As oncoanalyser is written using Nextflow, it supports a range of compute environments including AWS, Azure, GCP, and HPC. Other features include continuous checkpointing with run resuming and the ability to integrate with Seqera Platform, a user-friendly monitoring and management service for Nextflow pipelines.

Further information on Nextflow, including generic configuration options, is available in the Nextflow documentation.

Supported workflows

Workflow inputs

The starting input for oncoanalyser is either FASTQ or BAM files. If alignment and BAM processing are performed outside oncoanalyser, one of the aligners below must be used and the listed requirements must be met:

| Sequence Type | Aligner | Requirements |
| --- | --- | --- |
| DNA | BWA-MEM, BWA-MEM2, DRAGEN | Supplementary alignment soft-clipping (-Y); duplicate marking with hmftools MarkDups |
| RNA | STAR | Several essential STAR settings for WGTS; duplicate marking with the Picard algorithm; Ensembl v74 annotations for GRCh37; Ensembl v105 annotations for GRCh38 |

Warning

BAMs are expected to have been generated by aligning to the Hartwig-distributed GRCh37 or GRCh38 reference genomes.

Duplicate read marking and UMI processing

The hmftools workflows are optimised to analyse reads processed by MarkDups, which has specialised approaches for duplicate read marking and UMI processing that are distinct from other common tools (e.g. Picard, Sambamba, UMI-tools). Hence, it is strongly recommended that externally generated BAMs are processed with MarkDups. This is particularly important where there are high rates of read duplicates or where UMIs have been used.

WGTS workflow

[Figure: HMF pipeline diagram, WGTS workflow]

Available analysis types

Required inputs are shown as ✅ for each available analysis.

| Analysis name | Tumor DNA (FASTQ/BAM) | Normal DNA (FASTQ/BAM) | Tumor RNA (FASTQ/BAM) |
| --- | --- | --- | --- |
| Tumor/normal WGTS | ✅ | ✅ | ✅ |
| Tumor/normal WGS | ✅ | ✅ | - |
| Tumor only WGS | ✅ | - | - |
| Tumor only WTS | - | - | ✅ |

Targeted sequencing workflow

[Figure: HMF pipeline diagram, targeted sequencing workflow]

Available analysis types

Required inputs are shown as ✅ for each available analysis.

| Analysis name | Tumor DNA (FASTQ/BAM) | Tumor RNA (FASTQ/BAM) |
| --- | --- | --- |
| Tumor only | ✅ | optional |

Usage

Software requirements

Note

Docker on Windows and macOS can perform poorly, so running oncoanalyser is currently recommended only on Linux.

Input samplesheet

Running an analysis with oncoanalyser requires a samplesheet describing input files and samples. The samplesheet contains information that allows oncoanalyser to appropriately group samples (e.g. tumor/normal pairs), locate input files, and select relevant tools to run.

Each entry in the samplesheet represents a single input file (or, in the case of paired FASTQ, the forward and reverse FASTQ files) and is connected with metadata such as sample/group identifiers, sample type (tumor/normal), sequence type (DNA/RNA), and filetype. All entries with the same group_id value will be grouped together for processing, and the composition of a group determines the type of analysis run.

An example samplesheet for the WGTS workflow with BAM inputs is shown:

group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829_example,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829_example,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829_example,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam

In this example, there is a single group (COLO829_example) that contains paired tumor/normal DNA BAMs and an RNA BAM, so a full tumor/normal WGTS analysis will be run. For further details on workflow inputs and impact on execution, you can refer to the WGTS workflow inputs and targeted sequencing workflow inputs sections.

Multiple groups can also be provided in a single samplesheet:

group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829_example,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829_example,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829_example,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
SEQC_example,SEQC,SEQCT,tumor,dna,bam,/path/to/SEQCT.dna.bam

Here the SEQC_example group has been added to the previous example. Since only a tumor DNA BAM is provided for this additional group, only a tumor-only WGS analysis is run for the SEQC sample.
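The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of how group composition maps to an analysis type, not how oncoanalyser implements it. The `ANALYSES` lookup and the `infer_analyses` function are hypothetical names, and the lookup covers only the four WGTS analysis types named in this README (two of which are confirmed by the examples above; the other two are inferred from their names).

```python
import csv
import io
from collections import defaultdict

SAMPLESHEET = """\
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829_example,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829_example,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829_example,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
SEQC_example,SEQC,SEQCT,tumor,dna,bam,/path/to/SEQCT.dna.bam
"""

# Hypothetical lookup (an assumption for illustration): the set of
# (sample_type, sequence_type) pairs present in a group, mapped to the
# analysis names used in this README.
ANALYSES = {
    frozenset({("tumor", "dna"), ("normal", "dna"), ("tumor", "rna")}): "Tumor/normal WGTS",
    frozenset({("tumor", "dna"), ("normal", "dna")}): "Tumor/normal WGS",
    frozenset({("tumor", "dna")}): "Tumor only WGS",
    frozenset({("tumor", "rna")}): "Tumor only WTS",
}


def infer_analyses(samplesheet_text):
    """Return {group_id: analysis name} for each group in the samplesheet."""
    groups = defaultdict(set)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        # All entries sharing a group_id contribute to the same group.
        groups[row["group_id"]].add((row["sample_type"], row["sequence_type"]))
    return {
        group_id: ANALYSES.get(frozenset(inputs), "unrecognised combination")
        for group_id, inputs in groups.items()
    }


if __name__ == "__main__":
    for group_id, analysis in infer_analyses(SAMPLESHEET).items():
        print(f"{group_id}: {analysis}")
```

A check like this can be handy before launching a large batch, to confirm that each group contains the inputs you intended.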

Note

Input filepaths can be absolute local paths, URLs, or S3 URIs

Warning

BAM indexes are expected to exist alongside the respective input BAM, but can also be provided as a separate samplesheet entry using the bai filetype.

Given the importance of processing input BAMs with MarkDups prior to commencing analysis with the hmftools workflow, oncoanalyser will run MarkDups by default in order to apply specialised duplicate read marking, read consensus, and unmapping of low-quality reads. See MarkDups for more info.

The MarkDups step can be skipped where an input BAM has previously been processed by setting the samplesheet filetype as bam_markdups instead of bam.

FASTQ inputs

An analysis can also be started from FASTQ inputs, in which case oncoanalyser aligns reads against the selected reference genome using bwa-mem2 (DNA reads) or STAR (RNA reads) and then applies all necessary post-alignment processing. Continuing with the previous example, we can provide FASTQ files for COLO829:

group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
COLO829_example,COLO829,COLO829R,normal,dna,fastq,library_id:COLO829R_library;lane:001,/path/to/COLO829R.dna.001_R1.fastq.gz;/path/to/COLO829R.dna.001_R2.fastq.gz
COLO829_example,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:001,/path/to/COLO829T.dna.001_R1.fastq.gz;/path/to/COLO829T.dna.001_R2.fastq.gz
COLO829_example,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:002,/path/to/COLO829T.dna.002_R1.fastq.gz;/path/to/COLO829T.dna.002_R2.fastq.gz
COLO829_example,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:003,/path/to/COLO829T.dna.003_R1.fastq.gz;/path/to/COLO829T.dna.003_R2.fastq.gz
COLO829_example,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:004,/path/to/COLO829T.dna.004_R1.fastq.gz;/path/to/COLO829T.dna.004_R2.fastq.gz
COLO829_example,COLO829,COLO829T_RNA,tumor,rna,fastq,library_id:COLO829T_RNA_library;lane:001,/path/to/COLO829T.rna.001_R1.fastq.gz;/path/to/COLO829T.rna.001_R2.fastq.gz
SEQC_example,SEQC,SEQCT,tumor,dna,bam_markdups,,/path/to/SEQCT.markdups.dna.bam

Importantly, we have now added the info column to the samplesheet so that we can provide the required lane and library data for FASTQ entries, with each field delimited by a semicolon. The forward and reverse FASTQ files are set in the filepath column and are also separated by a semicolon. They are strictly ordered, with forward reads in position one and reverse reads in position two.

Note

Only gzip-compressed, non-interleaved paired-end FASTQs are currently supported.
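To make the layout of a FASTQ entry concrete, the following Python sketch splits the info and filepath fields as described above. The function name `parse_fastq_entry` is hypothetical, and the `.gz` suffix check is only a rough heuristic for gzip compression assumed for this example; oncoanalyser performs its own input handling.

```python
def parse_fastq_entry(info, filepath):
    """Split the info and filepath fields of a fastq samplesheet entry."""
    # info holds key:value fields separated by semicolons,
    # e.g. "library_id:COLO829T_library;lane:001"
    fields = dict(item.split(":", 1) for item in info.split(";") if item)
    missing = {"library_id", "lane"} - fields.keys()
    if missing:
        raise ValueError(f"info is missing: {', '.join(sorted(missing))}")

    # filepath holds the forward then the reverse FASTQ, separated by a semicolon
    paths = filepath.split(";")
    if len(paths) != 2:
        raise ValueError("expected forward and reverse FASTQ separated by ';'")
    forward, reverse = paths

    # Heuristic only (an assumption): treat a .gz suffix as gzip-compressed
    for path in (forward, reverse):
        if not path.endswith(".gz"):
            raise ValueError(f"FASTQ does not look gzip-compressed: {path}")

    return {
        "library_id": fields["library_id"],
        "lane": fields["lane"],
        "forward": forward,
        "reverse": reverse,
    }


if __name__ == "__main__":
    entry = parse_fastq_entry(
        "library_id:COLO829T_library;lane:001",
        "/path/to/COLO829T.dna.001_R1.fastq.gz;/path/to/COLO829T.dna.001_R2.fastq.gz",
    )
    print(entry)
```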

Samplesheet column descriptions

| Column | Description |
| --- | --- |
| group_id | Group ID for a set of samples and inputs |
| subject_id | Subject/patient ID |
| sample_id | Sample ID |
| sample_type | Sample type: tumor, normal |
| sequence_type | Sequence type: dna, rna |
| filetype | File type: fastq, bam, bam_markdups, bai, etc |
| info | For fastq file types, specify library id and lane, e.g. library_id:COLO829_library;lane:001 |
| filepath | Absolute filepath to input file (can be local filepath, URL, S3 URI) |
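As a rough pre-flight check, the column descriptions above can be turned into a small validator. The sketch below is an illustration under stated assumptions rather than oncoanalyser's own validation: `check_samplesheet` is a hypothetical helper, it checks only the sample_type and sequence_type values listed above, and it deliberately does not restrict filetype because the list of file types given here is not exhaustive.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "group_id", "subject_id", "sample_id", "sample_type",
    "sequence_type", "filetype", "filepath",
]
# Allowed values taken from the column descriptions in this README
ALLOWED = {
    "sample_type": {"tumor", "normal"},
    "sequence_type": {"dna", "rna"},
}


def check_samplesheet(text):
    """Return a list of human-readable problems found in samplesheet text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems

    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column, allowed in ALLOWED.items():
            if row[column] not in allowed:
                problems.append(
                    f"line {line_no}: {column} '{row[column]}' not in {sorted(allowed)}"
                )
        # fastq entries need library and lane details in the info column
        if row["filetype"] == "fastq" and not row.get("info"):
            problems.append(f"line {line_no}: fastq entries need the info column")
    return problems


if __name__ == "__main__":
    with open("samplesheet.csv") as handle:
        for problem in check_samplesheet(handle.read()):
            print(problem)
```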

The identifiers provided in the samplesheet are used to set output file paths:

  • group_id: top-level output directory for analysis files e.g. output/COLO829_example/
  • tumor sample_id: output prefix for most filenames e.g. COLO829T.purple.sv.vcf.gz
  • normal sample_id: output prefix for some filenames e.g. COLO829R.cobalt.ratio.pcf

Example command

To launch oncoanalyser you must provide at least the input samplesheet, the reference genome used for read alignment, and the desired workflow. When running the targeted sequencing workflow the applicable panel name is also required.

Note

Setting -revision to use a specific version of oncoanalyser is strongly recommended to improve reproducibility and stability.

Warning

It is recommended to run oncoanalyser only with Docker, which is done with -profile docker.

WGTS workflow command

nextflow run nf-core/oncoanalyser \
  -profile docker \
  -revision 0.4.5 \
  --mode wgts \
  --genome GRCh38_hmf \
  --input samplesheet.csv \
  --outdir output/

Targeted sequencing workflow command

nextflow run nf-core/oncoanalyser \
  -profile docker \
  -revision 0.4.5 \
  --mode targeted \
  --panel tso500 \
  --genome GRCh38_hmf \
  --input samplesheet.csv \
  --outdir output/

Argument descriptions

| Argument | Group | Description |
| --- | --- | --- |
| -profile | Nextflow | Profile name: docker (no other profiles supported at this time) |
| -revision | Nextflow | Specific oncoanalyser version to run |
| -resume | Nextflow | Use cache from existing run to resume |
| --input | oncoanalyser | Samplesheet filepath |
| --outdir | oncoanalyser | Output directory path |
| --mode | oncoanalyser | Workflow name: wgts, targeted |
| --panel | oncoanalyser | Panel name (only applicable with --mode targeted): tso500 |
| --genome | oncoanalyser | Reference genome: GRCh37_hmf, GRCh38_hmf |
| --max_cpus | oncoanalyser | Enforce an upper limit of CPUs each process can use |
| --max_memory | oncoanalyser | Enforce an upper limit of memory available to each process |
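When launching many runs from a script, it can help to assemble the command programmatically and catch argument mistakes early. The following Python sketch builds the same command lines shown above and enforces the rule that a panel is required for the targeted workflow. `build_command` is a hypothetical helper written for this example; the flags and allowed values are taken from the table above.

```python
import shlex

MODES = {"wgts", "targeted"}
GENOMES = {"GRCh37_hmf", "GRCh38_hmf"}


def build_command(samplesheet, outdir, mode, genome, revision, panel=None, resume=False):
    """Return the nextflow invocation as a list of arguments."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if genome not in GENOMES:
        raise ValueError(f"unknown genome: {genome}")
    if mode == "targeted" and panel is None:
        raise ValueError("--panel is required with --mode targeted")
    if mode != "targeted" and panel is not None:
        raise ValueError("--panel is only applicable with --mode targeted")

    cmd = [
        "nextflow", "run", "nf-core/oncoanalyser",
        "-profile", "docker",
        "-revision", revision,
        "--mode", mode,
    ]
    if panel is not None:
        cmd += ["--panel", panel]
    cmd += ["--genome", genome, "--input", samplesheet, "--outdir", outdir]
    if resume:
        cmd.append("-resume")
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "samplesheet.csv", "output/", "targeted", "GRCh38_hmf", "0.4.5", panel="tso500"
    )
    # Print a shell-safe rendering; pass the list to subprocess.run to execute it
    print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell-quoting problems if the list is later passed to `subprocess.run`.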

Outputs

The selected results files are written to the output directory and arranged into their corresponding groups by directories named with the respective group_id value from the input samplesheet. Within each group directory, outputs are further organised by tool.

All intermediate files used by each process are kept in the Nextflow work directory (default: work/). Once an analysis has completed this directory can be removed.

Sample reports

| Report | Path | Description |
| --- | --- | --- |
| ORANGE | <group_id>/orange/<tumor_sample_id>.orange.pdf | PDF summary report of key findings of the HMF pipeline |
| LINX | <group_id>/linx/MDX210176_linx.html | Interactive HTML report of all SV plots |

Pipeline reports

| Report | Path | Description |
| --- | --- | --- |
| Execution | pipeline_info/execution_report_*.html | HTML report of execution metrics and details |
| Timeline | pipeline_info/execution_timeline_*.html | Timeline diagram showing process execution (start/duration/finish) |
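After a run completes, a short script can confirm that the reports listed above were written. The sketch below is a minimal example that assumes the report paths are relative to the directory passed as --outdir; `find_reports` is a hypothetical helper. The LINX report is left out because its filename pattern is not spelled out in general form here.

```python
from pathlib import Path


def find_reports(outdir, group_id, tumor_sample_id):
    """Return existing report files under outdir, keyed by report name."""
    outdir = Path(outdir)
    info_dir = outdir / "pipeline_info"
    candidates = {
        "ORANGE": [outdir / group_id / "orange" / f"{tumor_sample_id}.orange.pdf"],
        # Pipeline report filenames carry a run-specific suffix, so match by pattern
        "Execution": sorted(info_dir.glob("execution_report_*.html")),
        "Timeline": sorted(info_dir.glob("execution_timeline_*.html")),
    }
    return {
        name: [path for path in paths if path.is_file()]
        for name, paths in candidates.items()
    }


if __name__ == "__main__":
    for name, paths in find_reports("output/", "COLO829_example", "COLO829T").items():
        status = ", ".join(str(p) for p in paths) if paths else "not found"
        print(f"{name}: {status}")
```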

Future Improvements

The following improvements are planned for the next few releases:

  • longitudinal analysis of patient samples including ctDNA samples
  • cloud-specific instructions and optimisations (i.e. for AWS, Azure, and GCP)

Acknowledgements

The oncoanalyser pipeline was written by Stephen Watts at the University of Melbourne Centre for Cancer Research with the support of Oliver Hofmann and the Hartwig Medical Foundation Australia.