The recommended way to run hartwigmedical/hmftools workflows or components is with oncoanalyser, a Nextflow implementation of the HMF pipeline.
A principal aim of oncoanalyser is to provide the HMF pipeline in a highly accessible form that is usable with a minimal set of inputs. This is achieved through flexible predefined configuration for individual tools, prebuilt Docker images retrieved at runtime for each process, and automated on-demand staging of reference genomes and resource files. The only required input to run an analysis with oncoanalyser is a samplesheet listing the sample inputs.
Both the WGS/WTS and targeted sequencing workflows are available in oncoanalyser. The targeted sequencing workflow has built-in support for the TSO500 panel and can also analyse any custom panel data where the required panel-specific normalisation data is available.
As oncoanalyser is written using Nextflow, it supports a range of compute environments including AWS, Azure, GCP, and HPC. Other features include continuous checkpointing with run resuming and the ability to integrate with Seqera Platform, a user-friendly monitoring and management service for Nextflow pipelines.
Further information on Nextflow can be found here and generic configuration options are well described in the Nextflow documentation.
The starting input for oncoanalyser is either FASTQ or BAM files. If alignment and BAM processing are performed outside oncoanalyser, one of the aligners below must be used and the listed requirements must be met:
Sequence Type | Aligner | Requirements |
---|---|---|
DNA | • BWA-MEM • BWA-MEM2 • DRAGEN | • Supplementary alignment soft-clipping (-Y) • Duplicate marking with hmftools MarkDups |
RNA | • STAR | • Several essential STAR settings for WGTS • Duplicate marking with the Picard algorithm • Ensembl v74 annotations for GRCh37 • Ensembl v105 annotations for GRCh38 |
Warning
BAMs are expected to have been generated by aligning to the Hartwig-distributed GRCh37 or GRCh38 reference genomes.
The hmftools workflows are optimised to analyse reads processed by MarkDups, which has specialised approaches for duplicate read marking and UMI processing that are distinct from other common tools (e.g. Picard, Sambamba, UMI-tools). Hence, it is strongly recommended that externally-generated BAMs are processed with MarkDups. This is particularly important where there are high rates of read duplicates or where UMIs have been used.
Required inputs are shown as ✅ for the available analyses
Analysis name | Tumor DNA (FASTQ/BAM) | Normal DNA (FASTQ/BAM) | Tumor RNA (FASTQ/BAM) |
---|---|---|---|
Tumor/normal WGTS | ✅ | ✅ | ✅ |
Tumor/normal WGS | ✅ | ✅ | - |
Tumor only WGS | ✅ | - | - |
Tumor only WTS | - | - | ✅ |
Required inputs are shown as ✅ for the available analyses
Analysis name | Tumor DNA (FASTQ/BAM) | Tumor RNA (FASTQ/BAM) |
---|---|---|
Tumor only | ✅ | optional |
- Nextflow >=22.10.5 (instructions)
- Docker (instructions)
Note
Docker on Windows and macOS can perform poorly, so running oncoanalyser is currently recommended only on Linux.
Running an analysis with oncoanalyser requires a samplesheet describing input files and samples. The samplesheet contains information that allows oncoanalyser to appropriately group samples (e.g. tumor/normal pairs), locate input files, and select relevant tools to run.
Each entry in the samplesheet represents a single input file (or, in the case of paired FASTQ, the forward and reverse
FASTQ files) and is connected with metadata such as sample/group identifiers, sample type (tumor/normal), sequence type
(DNA/RNA), and filetype. All entries with the same group_id value will be grouped together for processing, and the
composition of a group determines the type of analysis run.
An example samplesheet for the WGTS workflow with BAM inputs is shown:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829_example,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829_example,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829_example,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
In this example, there is a single group (COLO829_example) that contains paired tumor/normal DNA BAMs and an RNA BAM,
so a full tumor/normal WGTS analysis will be run. For further details on workflow inputs and impact on execution, you
can refer to the WGTS workflow inputs and targeted sequencing workflow inputs sections.
Multiple groups can also be provided in a single sample sheet:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829_example,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829_example,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829_example,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
SEQC_example,SEQC,SEQCT,tumor,dna,bam,/path/to/SEQCT.dna.bam
Here the SEQC_example group has been added to the previous example. Since only a tumor DNA BAM is provided for this
additional group, only a tumor-only WGS analysis is run for the SEQC sample.
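The relationship between group composition and analysis type can be sketched in Python. This is only an illustration of the WGTS inputs table above, not oncoanalyser's own implementation; the function name infer_analyses and the "unsupported combination" label are hypothetical.

```python
import csv
import io
from collections import defaultdict

SAMPLESHEET = """\
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829_example,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829_example,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829_example,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
SEQC_example,SEQC,SEQCT,tumor,dna,bam,/path/to/SEQCT.dna.bam
"""

# Group composition -> analysis name, transcribed from the WGTS inputs table.
ANALYSES = {
    frozenset({("tumor", "dna"), ("normal", "dna"), ("tumor", "rna")}): "Tumor/normal WGTS",
    frozenset({("tumor", "dna"), ("normal", "dna")}): "Tumor/normal WGS",
    frozenset({("tumor", "dna")}): "Tumor only WGS",
    frozenset({("tumor", "rna")}): "Tumor only WTS",
}


def infer_analyses(samplesheet_text):
    """Return a {group_id: analysis name} mapping for a WGTS samplesheet."""
    groups = defaultdict(set)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        # Each group is summarised by its set of (sample type, sequence type) pairs.
        groups[row["group_id"]].add((row["sample_type"], row["sequence_type"]))
    return {
        group_id: ANALYSES.get(frozenset(composition), "unsupported combination")
        for group_id, composition in groups.items()
    }


for group_id, analysis in infer_analyses(SAMPLESHEET).items():
    print(f"{group_id}: {analysis}")
```

Running a check like this before launching can catch a missing normal or a mistyped group_id that would otherwise silently change which analysis is run.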
Note
Input filepaths can be absolute local paths, URLs, or S3 URIs
Warning
BAM indexes are expected to exist alongside the respective input BAM but can also be provided as a separate
samplesheet entry by using the bai filetype.
Given the importance of processing input BAMs with MarkDups prior to commencing analysis with the hmftools workflow, oncoanalyser will run MarkDups by default in order to apply specialised duplicate read marking, read consensus, and unmapping of low-quality reads. See MarkDups for more info.
The MarkDups step can be skipped where an input BAM has previously been processed by setting the samplesheet filetype as
bam_markdups instead of bam.
An analysis can also be started from FASTQ inputs where oncoanalyser will perform alignment against the selected reference genome using bwa-mem2 (DNA reads) or STAR (RNA reads) then subsequently apply all necessary post-alignment processing. Continuing with the previous example, we can provide FASTQ files for COLO829:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
COLO829_example,COLO829,COLO829R,normal,dna,fastq,library_id:COLO829R_library;lane:001,/path/to/COLO829R.dna.001_R1.fastq.gz;/path/to/COLO829R.dna.001_R2.fastq.gz
COLO829_example,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:001,/path/to/COLO829T.dna.001_R1.fastq.gz;/path/to/COLO829T.dna.001_R2.fastq.gz
COLO829_example,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:002,/path/to/COLO829T.dna.002_R1.fastq.gz;/path/to/COLO829T.dna.002_R2.fastq.gz
COLO829_example,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:003,/path/to/COLO829T.dna.003_R1.fastq.gz;/path/to/COLO829T.dna.003_R2.fastq.gz
COLO829_example,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:004,/path/to/COLO829T.dna.004_R1.fastq.gz;/path/to/COLO829T.dna.004_R2.fastq.gz
COLO829_example,COLO829,COLO829T_RNA,tumor,rna,fastq,library_id:COLO829T_RNA_library;lane:001,/path/to/COLO829T.rna.001_R1.fastq.gz;/path/to/COLO829T.rna.001_R2.fastq.gz
SEQC_example,SEQC,SEQCT,tumor,dna,bam_markdups,,/path/to/SEQCT.markdups.dna.bam
Importantly, we have now added the info column to the samplesheet so that we can provide the required lane and library
data for FASTQ entries, with each field delimited by a semicolon. The forward and reverse FASTQ files are set in the
filepath column and are also separated by a semicolon. They are strictly ordered, with forward reads in position one
and reverse reads in position two.
Note
Only gzip-compressed, non-interleaved, paired-end FASTQs are currently supported
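As a minimal sketch of the FASTQ entry format described above, the following Python helper splits the info and filepath fields and checks the rules stated in this section. The function name parse_fastq_entry is hypothetical, and the .gz suffix check is only an assumed proxy for gzip compression; oncoanalyser performs its own validation.

```python
def parse_fastq_entry(info, filepath):
    """Split the info and filepath fields of a FASTQ samplesheet entry."""
    # info holds key:value fields delimited by semicolons.
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"malformed info field: {item!r}")
        fields[key] = value

    missing = {"library_id", "lane"} - fields.keys()
    if missing:
        raise ValueError(f"missing info fields: {sorted(missing)}")

    # filepath holds exactly two paths: forward reads first, reverse reads second.
    paths = filepath.split(";")
    if len(paths) != 2:
        raise ValueError("expected forward and reverse FASTQ paths separated by ';'")
    for path in paths:
        # Assumption: a .gz suffix is used here as a stand-in for gzip compression.
        if not path.endswith(".gz"):
            raise ValueError(f"FASTQ does not look gzip-compressed: {path}")

    forward, reverse = paths
    return {
        "library_id": fields["library_id"],
        "lane": fields["lane"],
        "forward": forward,
        "reverse": reverse,
    }


entry = parse_fastq_entry(
    "library_id:COLO829T_library;lane:002",
    "/path/to/COLO829T.dna.002_R1.fastq.gz;/path/to/COLO829T.dna.002_R2.fastq.gz",
)
print(entry["library_id"], entry["lane"], entry["forward"], entry["reverse"])
```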
Column | Description |
---|---|
group_id | Group ID for a set of samples and inputs |
subject_id | Subject/patient ID |
sample_id | Sample ID |
sample_type | Sample type: tumor, normal |
sequence_type | Sequence type: dna, rna |
filetype | File type: fastq, bam, bam_markdups, bai, etc |
info | For fastq file types, specify library id and lane, e.g. library_id:COLO829_library;lane:001 |
filepath | Absolute filepath to input file (can be local filepath, URL, S3 URI) |
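For larger cohorts it can be convenient to generate the samplesheet programmatically. The sketch below writes the columns listed above using Python's csv module; the helper name build_samplesheet is hypothetical, and the value checks cover only the sample and sequence types documented in the table.

```python
import csv
import io

COLUMNS = [
    "group_id", "subject_id", "sample_id", "sample_type",
    "sequence_type", "filetype", "info", "filepath",
]


def build_samplesheet(entries):
    """Render a list of entry dicts as samplesheet CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        if entry["sample_type"] not in {"tumor", "normal"}:
            raise ValueError(f"unknown sample_type: {entry['sample_type']}")
        if entry["sequence_type"] not in {"dna", "rna"}:
            raise ValueError(f"unknown sequence_type: {entry['sequence_type']}")
        # info is only needed for FASTQ entries, so it defaults to an empty field.
        writer.writerow({"info": "", **entry})
    return buffer.getvalue()


print(build_samplesheet([
    {
        "group_id": "SEQC_example",
        "subject_id": "SEQC",
        "sample_id": "SEQCT",
        "sample_type": "tumor",
        "sequence_type": "dna",
        "filetype": "bam_markdups",
        "filepath": "/path/to/SEQCT.markdups.dna.bam",
    },
]))
```

Semicolons in the info and filepath fields need no special quoting, since the comma is the only CSV delimiter.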
The identifiers provided in the samplesheet are used to set output file paths:
- group_id: top-level output directory for analysis files e.g. output/COLO829_example/
- tumor sample_id: output prefix for most filenames e.g. COLO829T.purple.sv.vcf.gz
- normal sample_id: output prefix for some filenames e.g. COLO829R.cobalt.ratio.pcf
To launch oncoanalyser you must provide at least the input samplesheet, the reference genome used for read alignment, and the desired workflow. When running the targeted sequencing workflow the applicable panel name is also required.
Note
Setting -revision to use a specific version of oncoanalyser is strongly recommended to improve reproducibility and
stability.
Warning
It is recommended to run oncoanalyser only with Docker, which is done with -profile docker.
nextflow run nf-core/oncoanalyser \
-profile docker \
-revision 0.4.5 \
--mode wgts \
--genome GRCh38_hmf \
--input samplesheet.csv \
--outdir output/
nextflow run nf-core/oncoanalyser \
-profile docker \
-revision 0.4.5 \
--mode targeted \
--panel tso500 \
--genome GRCh38_hmf \
--input samplesheet.csv \
--outdir output/
Argument | Group | Description |
---|---|---|
-profile | Nextflow | Profile name: docker (no other profiles supported at this time) |
-revision | Nextflow | Specific oncoanalyser version to run |
-resume | Nextflow | Use cache from existing run to resume |
--input | oncoanalyser | Samplesheet filepath |
--outdir | oncoanalyser | Output directory path |
--mode | oncoanalyser | Workflow name: wgts, targeted |
--panel | oncoanalyser | Panel name (only applicable with --mode targeted): tso500 |
--genome | oncoanalyser | Reference genome: GRCh37_hmf, GRCh38_hmf |
--max_cpus | oncoanalyser | Enforce an upper limit of CPUs each process can use |
--max_memory | oncoanalyser | Enforce an upper limit of memory available to each process |
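When launching runs from a script, the argument rules above (a panel is required with the targeted workflow and not applicable otherwise) can be checked before Nextflow is invoked. The following is a minimal sketch that only assembles the command line from the arguments in the table; the function name build_command is hypothetical and the command is printed rather than executed.

```python
import shlex


def build_command(samplesheet, outdir, mode, genome, revision, panel=None, resume=False):
    """Assemble an oncoanalyser launch command as a list of arguments."""
    if mode not in {"wgts", "targeted"}:
        raise ValueError(f"unknown mode: {mode}")
    if genome not in {"GRCh37_hmf", "GRCh38_hmf"}:
        raise ValueError(f"unknown genome: {genome}")
    if mode == "targeted" and panel is None:
        raise ValueError("--panel is required with --mode targeted")
    if mode != "targeted" and panel is not None:
        raise ValueError("--panel is only applicable with --mode targeted")

    command = [
        "nextflow", "run", "nf-core/oncoanalyser",
        "-profile", "docker",
        "-revision", revision,
        "--mode", mode,
    ]
    if panel is not None:
        command += ["--panel", panel]
    command += ["--genome", genome, "--input", samplesheet, "--outdir", outdir]
    if resume:
        # -resume reuses the cache from an existing run.
        command.append("-resume")
    return command


print(shlex.join(build_command(
    "samplesheet.csv", "output/", mode="targeted",
    genome="GRCh38_hmf", revision="0.4.5", panel="tso500",
)))
```

The returned list can be passed to subprocess.run if the script should also start the run.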
The selected results files are written to the output directory and arranged into their corresponding groups by
directories named with the respective group_id value from the input samplesheet. Within each group directory, outputs
are further organised by tool.
All intermediate files used by each process are kept in the Nextflow work directory (default: work/). Once an analysis
has completed this directory can be removed.
Report | Path | Description |
---|---|---|
ORANGE | <group_id>/orange/<tumor_sample_id>.orange.pdf | PDF summary report of key findings of the HMF pipeline |
LINX | <group_id>/linx/MDX210176_linx.html | Interactive HTML report of all SV plots |
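To collect reports after a run, the ORANGE path pattern in the table above can be expanded from the samplesheet. This Python sketch assumes the tumor sample ID is taken from the tumor DNA entry of each group; the function name orange_report_paths is hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def orange_report_paths(samplesheet_text, outdir):
    """Map each group_id to its expected ORANGE PDF path under outdir."""
    paths = {}
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        # Assumption: the tumor DNA entry supplies <tumor_sample_id>.
        if row["sample_type"] == "tumor" and row["sequence_type"] == "dna":
            paths[row["group_id"]] = (
                PurePosixPath(outdir)
                / row["group_id"]
                / "orange"
                / f"{row['sample_id']}.orange.pdf"
            )
    return paths


samplesheet = """\
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829_example,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829_example,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
"""

for group_id, path in orange_report_paths(samplesheet, "output/").items():
    print(group_id, path)
```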
Report | Path | Description |
---|---|---|
Execution | pipeline_info/execution_report_*.html | HTML report of execution metrics and details |
Timeline | pipeline_info/execution_timeline_*.html | Timeline diagram showing process execution (start/duration/finish) |
The following improvements are planned for the next few releases:
- longitudinal analysis of patient samples including ctDNA samples
- cloud-specific instructions and optimisations (i.e. for AWS, Azure and GCP)
The oncoanalyser pipeline was written by Stephen Watts at the University of Melbourne Centre for Cancer Research with the support of Oliver Hofmann and the Hartwig Medical Foundation Australia.