Allow local databases to be used for kraken2, centrifuge, and busco #504

Merged
merged 27 commits into from
Oct 9, 2023
Changes from 16 commits
Commits
328740c
handle local kraken2 db
gregorysprenger Aug 31, 2023
7f8909b
handle local centrifuge db
gregorysprenger Aug 31, 2023
118271c
use busco_db and handle local and downloaded tar.gz files
gregorysprenger Sep 5, 2023
d4b6258
update information about kraken, centrifuge, and busco databases
gregorysprenger Sep 5, 2023
4304cf6
update parameter handling
gregorysprenger Sep 15, 2023
190129e
set busco_db input as a path channel
gregorysprenger Sep 15, 2023
1961247
handle busco_db inputs and allow for busco to auto download lineages
gregorysprenger Sep 15, 2023
4c064c0
update centrifuge_db help text
gregorysprenger Sep 28, 2023
9640d64
update kraken2_db help text
gregorysprenger Sep 28, 2023
d9e7cc8
update busco_db help text
gregorysprenger Sep 28, 2023
7f054c6
fix grammar error
gregorysprenger Sep 28, 2023
e09c6a0
space needed for when adding to p var
gregorysprenger Oct 5, 2023
1f5d1a7
fix busco directory handling and have consistency on channel vars
gregorysprenger Oct 5, 2023
42b914d
harshil rule on emit lines
gregorysprenger Oct 5, 2023
3521da0
db_name has to be the same as centrifuge filenames
gregorysprenger Oct 5, 2023
04ceb89
update changelog
gregorysprenger Oct 6, 2023
5cb4f41
fix spelling error
gregorysprenger Oct 6, 2023
e4525bc
use file attribute getBaseName
gregorysprenger Oct 6, 2023
0d84819
revert back to only decompressing tar.gz files
gregorysprenger Oct 6, 2023
10ae039
handle centrifuge and kraken db parsing
gregorysprenger Oct 6, 2023
e77e44d
less verbose way of checking multiple file extensions
gregorysprenger Oct 8, 2023
21fcad2
remove view function from channel
gregorysprenger Oct 8, 2023
55fd640
use getSimpleName file attribute
gregorysprenger Oct 8, 2023
d122b49
update changelog
gregorysprenger Oct 8, 2023
a111ca8
Merge branch 'dev' into add_local_db
gregorysprenger Oct 9, 2023
f0e4ce6
Add depcrecateion and updated changed entries
jfy133 Oct 9, 2023
6190222
[automated] Fix linting with Prettier
nf-core-bot Oct 9, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -16,6 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#481](https://github.com/nf-core/mag/pull/481) - Adds MetaEuk for annotation of eukaryotic MAGs, and MMSeqs2 to enable downloading databases for MetaEuk (by @prototaxites)
- [#437](https://github.com/nf-core/mag/pull/429) - `--gtdb_db` also now supports directory input of a pre-uncompressed GTDB archive directory (reported by @alneberg, fix by @jfy133)
- [#494](https://github.com/nf-core/mag/pull/494) - Adds support for saving the BAM files from Bowtie2 mapping of input reads back to assembly (fix by @jfy133)
- [#504](https://github.com/nf-core/mag/pull/504) - `--busco_db`, `--kraken2_db`, and `--centrifuge_db` now support directory input of a pre-uncompressed database archive directory (by @gregorysprenger).

### `Changed`

2 changes: 1 addition & 1 deletion conf/test.config
@@ -26,7 +26,7 @@ params {
skip_krona = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
- busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
+ busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
busco_clean = true
skip_gtdbtk = true
skip_concoct = true
2 changes: 1 addition & 1 deletion conf/test_adapterremoval.config
@@ -27,7 +27,7 @@ params {
skip_krona = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
- busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
+ busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
skip_gtdbtk = true
clip_tool = 'adapterremoval'
skip_concoct = true
2 changes: 1 addition & 1 deletion conf/test_ancient_dna.config
@@ -26,7 +26,7 @@ params {
skip_krona = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
- busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
+ busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
skip_gtdbtk = true
ancient_dna = true
binning_map_mode = 'own'
2 changes: 1 addition & 1 deletion conf/test_bbnorm.config
@@ -32,7 +32,7 @@ params {
skip_krona = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
- busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
+ busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
busco_clean = true
skip_gtdbtk = true
bbnorm = true
2 changes: 1 addition & 1 deletion conf/test_binrefinement.config
@@ -27,7 +27,7 @@ params {
skip_krona = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
- busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
+ busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
skip_gtdbtk = true
refine_bins_dastool = true
refine_bins_dastool_threshold = 0
2 changes: 1 addition & 1 deletion conf/test_full.config
@@ -28,7 +28,7 @@ params {
spades_fix_cpus = 10
spadeshybrid_fix_cpus = 10
megahit_fix_cpu_1 = true
- // available options to enable reproducibility for BUSCO (--busco_download_path or --busco_reference) not used here
+ // available options to enable reproducibility for BUSCO (--busco_db) not used here
// to allow detection of possible problems in automated lineage selection mode using public databases

// test CAT with official taxonomic ranks only
2 changes: 1 addition & 1 deletion conf/test_host_rm.config
@@ -24,7 +24,7 @@ params {
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/mag/samplesheets/samplesheet.host_rm.csv'
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
- busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
+ busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
skip_gtdbtk = true
skip_concoct = true
}
4 changes: 2 additions & 2 deletions conf/test_hybrid.config
@@ -23,7 +23,7 @@ params {
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/mag/samplesheets/samplesheet.hybrid.csv'
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
- busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
- skip_gtdbtk = true
+ busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
+ skip_gtdbtk = true
skip_concoct = true
}
8 changes: 4 additions & 4 deletions docs/output.md
@@ -484,7 +484,7 @@ For each bin or refined bin the median sequencing depth is computed based on the

#### BUSCO

[BUSCO](https://busco.ezlab.org/) is a tool used to assess the completeness of a genome assembly. It is run on all the genome bins and high quality contigs obtained by the applied binning and/or binning refinement methods (depending on the `--postbinning_input` parameter). By default, BUSCO is run in automated lineage selection mode in which it first tries to select the domain and then a more specific lineage based on phylogenetic placement. If available, result files for both the selected domain lineage and the selected more specific lineage are placed in the output directory. If a lineage dataset is specified already with `--busco_reference`, only results for this specific lineage will be generated.
[BUSCO](https://busco.ezlab.org/) is a tool used to assess the completeness of a genome assembly. It is run on all the genome bins and high quality contigs obtained by the applied binning and/or binning refinement methods (depending on the `--postbinning_input` parameter). By default, BUSCO is run in automated lineage selection mode in which it first tries to select the domain and then a more specific lineage based on phylogenetic placement. If available, result files for both the selected domain lineage and the selected more specific lineage are placed in the output directory. If a lineage dataset is specified already with `--busco_db`, only results for this specific lineage will be generated.

<details markdown="1">
<summary>Output files</summary>
@@ -493,21 +493,21 @@ For each bin or refined bin the median sequencing depth is computed based on the
- `[assembler]-[bin]_busco.log`: Log file containing the standard output of BUSCO.
- `[assembler]-[bin]_busco.err`: File containing potential error messages returned from BUSCO.
- `short_summary.domain.[lineage].[assembler]-[bin].txt`: BUSCO summary of the results for the selected domain when run in automated lineage selection mode. Not available for bins for which a viral lineage was selected.
- `short_summary.specific_lineage.[lineage].[assembler]-[bin].txt`: BUSCO summary of the results in case a more specific lineage than the domain could be selected or for the lineage provided via `--busco_reference`.
- `short_summary.specific_lineage.[lineage].[assembler]-[bin].txt`: BUSCO summary of the results in case a more specific lineage than the domain could be selected or for the lineage provided via `--busco_db`.
- `[assembler]-[bin]_buscos.[lineage].fna.gz`: Nucleotide sequence of all identified BUSCOs for used lineages (domain or specific).
- `[assembler]-[bin]_buscos.[lineage].faa.gz`: Amino acid sequence of all identified BUSCOs for used lineages (domain or specific).
- `[assembler]-[bin]_prodigal.gff`: Genes predicted with Prodigal.

</details>

If the parameter `--save_busco_reference` is set, additionally the used BUSCO lineage datasets are stored in the output directory.
If the parameter `--save_busco_db` is set, additionally the used BUSCO lineage datasets are stored in the output directory.

<details markdown="1">
<summary>Output files</summary>

- `GenomeBinning/QC/BUSCO/`
- `busco_downloads/`: All files and lineage datasets downloaded by BUSCO when run in automated lineage selection mode. (Cannot currently be used to reproduce the analysis; see the [nf-core/mag website documentation](https://nf-co.re/mag/usage#reproducibility) for how to achieve reproducible BUSCO results.)
- `reference/*.tar.gz`: BUSCO reference lineage dataset that was provided via `--busco_reference`.
- `reference/*.tar.gz`: BUSCO reference lineage dataset that was provided via `--busco_db`.

</details>

2 changes: 1 addition & 1 deletion docs/usage.md
@@ -187,7 +187,7 @@ You can fix this by using the prameter `--megahit_fix_cpu_1`. In both cases, do

MetaBAT2 is run by default with a fixed seed within this pipeline, thus producing reproducible results.

To allow also reproducible bin QC with BUSCO, run BUSCO providing already downloaded lineage datasets with `--busco_download_path` (BUSCO will be run using automated lineage selection in offline mode) or provide a specific lineage dataset via `--busco_reference` and use the parameter `--save_busco_reference`. This may be useful since BUSCO datasets are frequently updated and old versions do not always remain (easily) accessible.
To also make bin QC with BUSCO reproducible, either provide already-downloaded lineage datasets (BUSCO will then be run using automated lineage selection in offline mode), or provide a specific lineage dataset via `--busco_db` and set the parameter `--save_busco_db`. This is useful because BUSCO datasets are frequently updated and old versions do not always remain (easily) accessible.
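
As an illustration of the second option, the sketch below assembles a run command that pins a lineage dataset and saves it with the results. The samplesheet, output directory, profile, and dataset path are placeholder assumptions, not values from this PR; only `--busco_db` and `--save_busco_db` come from the text above.

```shell
#!/usr/bin/env bash
# Assumed placeholders: samplesheet, output directory, profile, and the local
# path to the lineage dataset are illustrative only.
busco_db="/path/to/bacteria_odb10"

cmd=(
    nextflow run nf-core/mag
    -profile docker
    --input samplesheet.csv
    --outdir results
    --busco_db "$busco_db"   # specific lineage dataset to pin
    --save_busco_db          # also store the dataset in the output directory
)

# Print the command; run it with "${cmd[@]}" once the paths are real.
printf '%s\n' "${cmd[*]}"
```

Building the command as an array keeps each flag on its own line and avoids quoting problems if the dataset path contains spaces.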

For the taxonomic bin classification with [CAT](https://github.com/dutilh/CAT), when running the pipeline with `--cat_db_generate` the parameter `--save_cat_db` can be used to also save the generated database to allow reproducibility in future runs. Note that when specifying a pre-built database with `--cat_db`, currently the database can not be saved.

13 changes: 2 additions & 11 deletions lib/WorkflowMag.groovy
@@ -102,22 +102,13 @@ class WorkflowMag {
Nextflow.error('Both --skip_binqc and --binqc_tool \'checkm\' are specified! Invalid combination, please specify either --skip_binqc or --binqc_tool.')
}
if (params.skip_binqc) {
- if (params.busco_reference) {
- Nextflow.error('Both --skip_binqc and --busco_reference are specified! Invalid combination, please specify either --skip_binqc or --binqc_tool \'busco\' with --busco_reference.')
- }
- if (params.busco_download_path) {
- Nextflow.error('Both --skip_binqc and --busco_download_path are specified! Invalid combination, please specify either --skip_binqc or --binqc_tool \'busco\' with --busco_download_path.')
+ if (params.busco_db) {

Review comment (Contributor): Outwith the scope of this PR for sure, but I think cases like this where you skip a step but specify a database should probably just print a warning rather than quitting. Makes debugging a little easier if you just need to quickly turn something off.

+ Nextflow.error('Both --skip_binqc and --busco_db are specified! Invalid combination, please specify either --skip_binqc or --binqc_tool \'busco\' with --busco_db.')
}
if (params.busco_auto_lineage_prok) {
Nextflow.error('Both --skip_binqc and --busco_auto_lineage_prok are specified! Invalid combination, please specify either --skip_binqc or --binqc_tool \'busco\' with --busco_auto_lineage_prok.')
}
}
- if (params.busco_reference && params.busco_download_path) {
- Nextflow.error('Both --busco_reference and --busco_download_path are specified! Invalid combination, please specify either --busco_reference or --busco_download_path.')
- }
- if (params.busco_auto_lineage_prok && params.busco_reference) {
- Nextflow.error('Both --busco_auto_lineage_prok and --busco_reference are specified! Invalid combination, please specify either --busco_auto_lineage_prok or --busco_reference.')
- }

if (params.skip_binqc && !params.skip_gtdbtk) {
log.warn '--skip_binqc is specified, but --skip_gtdbtk is explicitly set to run! GTDB-tk will be omitted because GTDB-tk bin classification requires bin filtering based on BUSCO or CheckM QC results to avoid GTDB-tk errors.'
16 changes: 7 additions & 9 deletions modules/local/busco.nf
@@ -8,8 +8,7 @@ process BUSCO {

input:
tuple val(meta), path(bin)
- path(db)
- path(download_folder)
+ tuple val(db_meta), path(db)

output:
tuple val(meta), path("short_summary.domain.*.${bin}.txt") , optional:true , emit: summary_domain
@@ -25,17 +24,16 @@ process BUSCO {

script:
def cp_augustus_config = workflow.profile.toString().indexOf("conda") != -1 ? "N" : "Y"
- def lineage_dataset_provided = params.busco_reference ? "Y" : "N"
+ def lineage_dataset_provided = "${db_meta.lineage}"
  def busco_clean = params.busco_clean ? "Y" : "N"

- def p = "--auto-lineage"
- if (params.busco_reference){
+ def p = params.busco_auto_lineage_prok ? "--auto-lineage-prok" : "--auto-lineage"
+ if ( "${lineage_dataset_provided}" == "Y" ) {
  p = "--lineage_dataset dataset/${db}"
+ } else if ( "${lineage_dataset_provided}" == "N" ) {
+ p += " --offline --download_path ${db}"
  } else {
- if (params.busco_auto_lineage_prok)
- p = "--auto-lineage-prok"
- if (params.busco_download_path)
- p += " --offline --download_path ${download_folder}"
+ lineage_dataset_provided = ""
  }
"""
run_busco.sh "${p}" "${cp_augustus_config}" "${db}" "${bin}" ${task.cpus} "${lineage_dataset_provided}" "${busco_clean}"
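
The three-way branch in this hunk can be sketched in plain shell to make the resulting BUSCO arguments easier to see. `busco_mode_args` is a hypothetical helper written for illustration, not part of the pipeline; it assumes the lineage flag is `Y`, `N`, or empty, mirroring `db_meta.lineage` above.

```shell
#!/usr/bin/env bash
# busco_mode_args LINEAGE DB [PROK]
# Hypothetical helper mirroring the Groovy logic in modules/local/busco.nf.
#   LINEAGE: "Y" (specific lineage dataset), "N" (download folder), or "" (no db)
#   DB:      name of the staged database path
#   PROK:    "true" to restrict automated lineage selection to prokaryotes
busco_mode_args() {
    local lineage="$1" db="$2" prok="${3:-false}" p
    if [[ "$prok" == true ]]; then
        p="--auto-lineage-prok"
    else
        p="--auto-lineage"
    fi
    if [[ "$lineage" == "Y" ]]; then
        # A specific lineage dataset replaces automated lineage selection.
        p="--lineage_dataset dataset/${db}"
    elif [[ "$lineage" == "N" ]]; then
        # A pre-downloaded folder keeps auto-lineage but runs offline.
        p+=" --offline --download_path ${db}"
    fi
    echo "$p"
}

busco_mode_args Y bacteria_odb10
busco_mode_args N busco_downloads true
busco_mode_args "" ""
```

With no database at all, the helper falls through to plain automated lineage selection, which matches the final `else` branch of the hunk.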
5 changes: 2 additions & 3 deletions modules/local/busco_db_preparation.nf
@@ -10,9 +10,8 @@ process BUSCO_DB_PREPARATION {
path database

output:
- path "buscodb/*" , emit: db
- path database , emit: database
- path "versions.yml" , emit: versions
+ tuple val("${database.toString().replace(".tar.gz", "")}"), path("buscodb/*"), emit: db
+ path "versions.yml" , emit: versions

script:
"""
5 changes: 3 additions & 2 deletions modules/local/busco_summary.nf
@@ -15,11 +15,12 @@ process BUSCO_SUMMARY {
path "versions.yml" , emit: versions

script:
- def auto = params.busco_reference ? "" : "-a"
+ def reference = params.busco_db.toString().contains('odb10')
+ def auto = reference ? "" : "-a"
def ss = summaries_specific.sort().size() > 0 ? "-ss ${summaries_specific}" : ""
def sd = summaries_domain.sort().size() > 0 ? "-sd ${summaries_domain}" : ""
def f = ""
- if (!params.busco_reference && failed_bins.sort().size() > 0)
+ if (!reference && failed_bins.sort().size() > 0)
f = "-f ${failed_bins}"
"""
summary_busco.py $auto $ss $sd $f -o busco_summary.tsv
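
The flag selection in this hunk hinges on whether the `--busco_db` value looks like a specific lineage dataset (its string contains `odb10`). A minimal shell sketch of that decision follows; `busco_summary_flags` is a hypothetical name, and the sketch assumes the failed-bins list is passed as a single string that is empty when there are none.

```shell
#!/usr/bin/env bash
# busco_summary_flags BUSCO_DB FAILED_BINS
# Hypothetical helper: prints the -a / -f flags for summary_busco.py.
busco_summary_flags() {
    local busco_db="$1" failed_bins="$2" flags=""
    # Anything without "odb10" in its name is treated as automated lineage mode.
    if [[ "$busco_db" != *odb10* ]]; then
        flags="-a"
        # Failed bins are only reported in automated lineage mode.
        if [[ -n "$failed_bins" ]]; then
            flags+=" -f $failed_bins"
        fi
    fi
    echo "$flags"
}

busco_summary_flags "bacteria_odb10" "failed.tsv"
busco_summary_flags "" "failed.tsv"
```

Matching on the substring `odb10` is a heuristic taken from the hunk; a dataset directory that was renamed would be treated as automated lineage mode.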
2 changes: 1 addition & 1 deletion modules/local/centrifuge.nf
@@ -13,7 +13,7 @@ process CENTRIFUGE {
output:
tuple val("centrifuge"), val(meta), path("results.krona"), emit: results_for_krona
path "report.txt" , emit: report
- tuple val(meta), path("*kreport.txt") , emit: kreport
+ tuple val(meta), path("*kreport.txt") , emit: kreport
path "versions.yml" , emit: versions

script:
10 changes: 7 additions & 3 deletions modules/local/centrifuge_db_preparation.nf
@@ -9,12 +9,16 @@ process CENTRIFUGE_DB_PREPARATION {
path db

output:
- tuple val("${db.toString().replace(".tar.gz", "")}"), path("*.cf"), emit: db
- path "versions.yml" , emit: versions
+ path("*.cf") , emit: db
+ path "versions.yml", emit: versions

script:
"""
- tar -xf "${db}"
+ if [[ -d ${db} ]]; then
+ ln -srf `find ${db}/ -type f -name "*.cf"` \${PWD}
+ else
+ tar -xf "${db}"
+ fi

cat <<-END_VERSIONS > versions.yml
"${task.process}":
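
The directory-or-archive handling added to the Centrifuge preparation step can be tried outside Nextflow with a small standalone function. `stage_cf_db` is a hypothetical helper for illustration; it assumes GNU `ln` (for `-r`) and uses `find -exec` instead of backticks so that file names containing spaces are handled.

```shell
#!/usr/bin/env bash
# stage_cf_db DB [DEST]
# Hypothetical helper: make the Centrifuge *.cf index files available in DEST.
#   DB:   either a directory containing *.cf files or a tar archive
#   DEST: target directory (defaults to the current working directory)
stage_cf_db() {
    local db="$1" dest="${2:-$PWD}"
    if [[ -d "$db" ]]; then
        # Directory input: symlink every *.cf file into DEST (relative links).
        find "$db" -type f -name '*.cf' -exec ln -srf {} "$dest"/ \;
    else
        # Archive input: unpack it into DEST.
        tar -xf "$db" -C "$dest"
    fi
}
```

Symlinking rather than copying avoids duplicating index files that can be many gigabytes in size.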
19 changes: 15 additions & 4 deletions modules/local/kraken2_db_preparation.nf
@@ -14,10 +14,21 @@ process KRAKEN2_DB_PREPARATION {

script:
"""
- mkdir db_tmp
- tar -xf "${db}" -C db_tmp
- mkdir database
- mv `find db_tmp/ -name "*.k2d"` database/
+ if [[ -d ${db} ]]; then
+ if [[ ${db} != database ]]; then
+ ln -sr ${db} database
+ fi
+
+ # Make sure {hash,opts,taxo}.k2d are found in directory input
+ if [[ \$(find database/ -name "*.k2d" | wc -l) -lt 3 ]]; then
+ echo "ERROR: Kraken2 requires '{hash,opts,taxo}.k2d' files." >&2; exit 1
+ fi
+ else
+ mkdir db_tmp
+ tar -xf "${db}" -C db_tmp
+ mkdir database
+ mv `find db_tmp/ -name "*.k2d"` database/
+ fi

cat <<-END_VERSIONS > versions.yml
"${task.process}":
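
The check in this hunk counts `*.k2d` files and requires at least three. A stricter variant, sketched below as an assumption rather than as the PR's method, tests for each of the three named files explicitly and reports which one is missing. `require_k2d` is a hypothetical helper, and unlike the `find` call above it only looks at the top level of the directory.

```shell
#!/usr/bin/env bash
# require_k2d DIR
# Hypothetical helper: succeed only if hash.k2d, opts.k2d and taxo.k2d
# are all present directly inside DIR; name each missing file on stderr.
require_k2d() {
    local dir="$1" f missing=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [[ ! -e "$dir/$f" ]]; then
            echo "ERROR: Kraken2 database is missing '$f' in '$dir'." >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Checking by name avoids a false pass when a directory happens to contain three unrelated `.k2d` files, at the cost of rejecting databases whose files sit in a subdirectory.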
5 changes: 2 additions & 3 deletions nextflow.config
@@ -121,10 +121,9 @@ params {
// Bin QC
skip_binqc = false
binqc_tool = 'busco'
- busco_reference = null
- busco_download_path = null
+ busco_db = null
  busco_auto_lineage_prok = false
- save_busco_reference = false
+ save_busco_db = false
busco_clean = false
checkm_download_url = "https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz"
checkm_db = null