convert_depths for co-assembly taking only average depth and not sample-wise read depths for maxbin2 binning, with solution #663

Open · uel3 opened this issue Sep 5, 2024 · 2 comments
Labels: bug (Something isn't working)

uel3 commented Sep 5, 2024

Description of the bug

I ran into the issue of not generating high-quality bins for a known bacterium present in my samples (0.25 by DASTool scoring) when using nf-core/mag to generate MAGs from a metagenomic co-assembly. As a result, the known bacterium's bins were not included in the final refined DASTool bins under default parameters. I was able to generate a high-quality bin (0.95 by DASTool scoring) of the same bacterium when I ran MaxBin2, MetaBat2, and DASTool in a separate mNGS pipeline with the same parameters, but passing reads_list into MaxBin2 instead of the abund_file as nf-core/mag does. When I looked at the input and output files of the nf-core/mag processes, I noticed far less depth information being used to generate bins with MaxBin2: only the total average depth from the METABAT2_JGISUMMARIZE output was being passed as the -abund file for MaxBin2, rather than the sample-wise (per-sample) read depths for each contig.

I believe the issue lies in line 21 of the CONVERT_DEPTHS process used in the BINNING subworkflow:

bioawk -t '{ { if (NR > 1) { { print \$1, \$3 } } } }' ${depth.toString() - '.gz'} > ${prefix}_mb2_depth.txt

I figured out how to change the process to provide sample-wise depths and generate my missing high-quality bin. Instead of passing the abundance file that comes from the CONVERT_DEPTHS output, the mNGS reads can be passed directly via the -reads or -reads_list flag, as I did in my separate mNGS pipeline. With this approach nf-core/mag generates the high-quality bin for my known pathogen, but it requires more time and resources to do so. My fix is to still use the depth information generated by METABAT2_JGISUMMARIZE, but to keep the sample-wise depth information for all contigs and pass it via -abund_list, which is the solution I offer below.
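For illustration, the difference comes down to how run_MaxBin.pl is invoked. A minimal sketch, assuming default MaxBin2 options; contigs.fa and the -out prefix are placeholders:

# current behaviour: a single file holding only totalAvgDepth per contig
run_MaxBin.pl -contig contigs.fa -abund group-Col_mb2_depth.txt -out group-Col

# proposed behaviour: a list file naming one per-sample abundance file per line
run_MaxBin.pl -contig contigs.fa -abund_list abund_list.txt -out group-Col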

Command used and terminal output

$ nextflow run nf-core/mag --coassemble_group --binning_map_mode 'group' --refine_bins_dastool --postbinning_input 'refined_bins_only'

Relevant files

Command.sh from CONVERT_DEPTHS with my data:

#!/bin/bash -euo pipefail
gunzip -f MEGAHIT-group-Col-depth.txt.gz
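# keep only contigName ($1) and totalAvgDepth ($3); the per-sample depth columns are dropped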
bioawk -t '{ { if (NR > 1) { { print $1, $3 } } } }' MEGAHIT-group-Col-depth.txt > group-Col_mb2_depth.txt
 
cat <<-END_VERSIONS > versions.yml
"NFCORE_UNO:UNO:BINNING:CONVERT_DEPTHS":
    bioawk: $(bioawk --version | cut -f 3 -d ' ' )
END_VERSIONS

The first 10 lines of the files processed by the CONVERT_DEPTHS command.sh, showing the data transformation:
MEGAHIT-group-Col-depth.txt
contigName contigLen totalAvgDepth MEGAHIT-group-Col-Loopy.bam MEGAHIT-group-Col-Loopy.bam-var MEGAHIT-group-Col-Reinvent.bam MEGAHIT-group-Col-Reinvent.bam-var MEGAHIT-group-Col-Dizzy2.bam MEGAHIT-group-Col-Dizzy2.bam-var MEGAHIT-group-Col-Florid.bam MEGAHIT-group-Col-Florid.bam-var MEGAHIT-group-Col-Usual.bam MEGAHIT-group-Col-Usual.bam-var
k127_1462844 244 0 0 0 0 0 0 0 0 0 0 0
k127_3291397 255 0 0 0 0 0 0 0 0 0 0 0
k127_1097133 238 0 0 0 0 0 0 0 0 0 0 0
k127_2925687 323 6 0 0 0 0 2 0 1 0 3 0
k127_1828555 269 0 0 0 0 0 0 0 0 0 0 0
k127_2559976 451 7.08638 4.09302 1.43798 2.15947 0.621155 0 0 0.833887 0.138981 0 0
k127_1462849 222 0 0 0 0 0 0 0 0 0 0 0
k127_2925689 207 0 0 0 0 0 0 0 0 0 0 0
k127_4022816 444 19.7007 2.28231 0.551426 1.47279 0.25011 2.26531 0.632444 5.80952 3.80658 7.87075 5.80579

group-Col_mb2_depth.txt
k127_1462844 0
k127_3291397 0
k127_1097133 0
k127_2925687 6
k127_1828555 0
k127_2559976 7.08638
k127_1462849 0
k127_2925689 0
k127_4022816 19.7007
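
For contrast, the first per-sample depth file produced by my fix below (group-Col_mb2_depth_1.txt, i.e. column 4 of the table above, the MEGAHIT-group-Col-Loopy.bam depths) would look like this:
k127_1462844 0
k127_3291397 0
k127_1097133 0
k127_2925687 0
k127_1828555 0
k127_2559976 4.09302
k127_1462849 0
k127_2925689 0
k127_4022816 2.28231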

My Fix: CONVERT_DEPTHS_ALL with my data:

#!/bin/bash -euo pipefail
gunzip -f MEGAHIT-group-Col-depth.txt.gz

# Determine the number of abundance columns
n_abund=$(awk 'NR==1 {print int((NF-3)/2)}' MEGAHIT-group-Col-depth.txt)

# Generate abundance files for each read set
for i in $(seq 1 $n_abund); do
    col=$((i*2+2))
    bioawk -t '{if (NR > 1) {print $1, $'"$col"'}}' MEGAHIT-group-Col-depth.txt > group-Col_mb2_depth_$i.txt
done

# Create a list of abundance files with full paths, each on a new line
for file in group-Col_mb2_depth_*.txt; do
    echo "$PWD/$file" >> abund_list.txt
done
cat <<-END_VERSIONS > versions.yml
"NFCORE_UNO:UNO:BINNING:CONVERT_DEPTHS_ALL":
    bioawk: $(bioawk --version | cut -f 3 -d ' ' )
END_VERSIONS
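
abund_list.txt then holds one absolute path per sample, one per line (with five samples in this group; the paths below are placeholders for the Nextflow work directory):

/path/to/workdir/group-Col_mb2_depth_1.txt
/path/to/workdir/group-Col_mb2_depth_2.txt
/path/to/workdir/group-Col_mb2_depth_3.txt
/path/to/workdir/group-Col_mb2_depth_4.txt
/path/to/workdir/group-Col_mb2_depth_5.txt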

Attached files:
The MAXBIN2 log files using the current CONVERT_DEPTHS output (maxbin2_CONVERT_DEPTHS.log) and my updated CONVERT_DEPTHS_ALL output (maxbin2_CONVERT_DEPTHS_ALL.log), the output from CONVERT_DEPTHS_ALL (abund_list.txt), and the updated CONVERT_DEPTHS_ALL.nf script (convert_depths_all_reads.txt).

abund_list.txt
convert_depths_all_reads.txt
maxbin2_CONVERT_DEPTHS.log
maxbin2_CONVERT_DEPTHS_ALL.log

System information

nextflow/23.10.0
Run on an HPC, executed locally

uel3 added the bug (Something isn't working) label on Sep 5, 2024
jfy133 (Member) commented Sep 6, 2024

@uel3 thanks for this! Could you also provide the context in which you executed the pipeline? If I remember correctly from Slack, you were doing a co-assembly?

jfy133 (Member) commented Sep 6, 2024
