-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision
#489
Conversation
NFCORE_MAG:MAG:CAT_SUMMARY
input file name collision`NFCORE_MAG:MAG:CAT_SUMMARY input file name collision
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments, but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf
)?
Otherwise LGTM :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
One minor thought - I added params.cat_official_taxonomy
before I really got to grips with how $args
works. Now that CAT has a $args
option, might it make sense to move that into there inside the config?
A few minor comments, but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in
modules.conf
)?
I am not sure that there are clashes in my experience, but I don't think more detailed labelling of files hurts (except in user readability down the line). This might be something to think about more deeply when planning v3.0!
@prototaxites I think it would make sense to move Also, do you think it would be a good idea to standardize how prefixes are defined (currently some prefixes are defined in within the module's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Had some minor comments/ideas though. Thanks for the implementation and Bugfix!
@maxibor (or anyone else) any thoughts on my original feedback:
|
Update: this is actually a more generalized issue with processes that perform a collect: CAT, but also Quast, and GTDB-Tk |
I guess this was correct then??? I wonder why this issue hasn't come up before...?! |
It's a new issue because samples can now appear multiple times in downstream channels for the following reasons:
When they're collected at the final steps (Quast, GTDB-Tk, or CAT), bins that were not renamed by the way they were treated now appear multiple times with same name, and triggers the input file name collision. Which in addition was particularly challenging to debug because of a misreporting of the actual error nextflow-io/nextflow#3828 |
I’ve been thinking about this and I’m not sure we need the refined/unrefined label to fix this - this might specifically just be a bug in the code that assigns bins post-refinement. If we think about the original sources of bins, we have bins coming from the three binners - metabat, maxbin, and concoct. Each of these bins should have a unique numbered file name. If refinement is enabled, these bins are fed into DAS tool and a new set of bins are returned - that are not necessarily the same as the input bins. These should also be uniquely named, differently to the original bins because their binner is listed as DAS tool. If domain classification is enabled with bin refinement, then undefined eukaryotic bins are passed as “refined” at this stage as otherwise they will be lost, unless ‘both’ is set for post-binning input. I think this is the only time that a duplicate bin can slip through. So don’t we just have to fix the code so that eukaryotic bins are only fed to the refined bin channel if ‘both’ is not selected? Otherwise, every bin should be unique and only be able to enter each process downstream exactly once, because they might get split into separate channels, and recombined, but never duplicated? |
A small schematic to illustrate the issue. I've now update the meta's accordingly, and passed them through when needed. |
There are now 4 possible values for the
Note that the |
That's a really helpful diagram @maxibor thank you very much! |
I am currently downloading the latest GTDB database, then I'll do a manual 'full test' and if all looks good then we can merge! |
Self notes:
This can be merged once I've done one test without domain classsifcation, and I'll follow up with a few other things I noticed with a separate PR Footnotes |
I propose merge this, patch release, then I'll make issues and a further patch release of the other more minor things I found (once i've had a chance to look into them) |
I'm happy with this now after some testing! Just need pull upstream from dev and CHANGELOG update :) |
SPeaking privately apparently Maxime found more issues (but is currently on holiday) |
Ok, there are multiple issues but not related directly to problem this PR is solving. We will break this down bit-by-bit so merging this now, and fixes will come in over time 👍 Thanks again @maxibor ! EDIT: once the latest dev branch has been pulled into this PR and changelog updated! |
@nf-core-bot fix linting |
Just need to remove some dumps, but hopefully the |
Long time in the making update: new_channel = channel_name.collectFile(keepHeader: true) {
meta, f ->
[f.getBaseName()+'.tsv', f]
}
WHATEVER_SUMMARY_PROCESS(new_channel.collect()) This pattern takes all the files from the cat a/test.tsv b/test.tsv > test.tsv |
Hm interesting... but do we want to collapse those bins back into a single bin again? I'm pretty sure of the tools rely on the file name to disambiguate between bins so I'm worried we would then not be able to distinguish between eukaryotic vs non-eukaryotic bins in the summary tools. Furthermore, does it support gzipped files? The original error you reported are gzipped output from CAT and I don't see reference to that in the NXF docs on That said, I may be being overly paranoid there. So I suggest that I merge this current PR as a conservative approach (making sure everything is very explicitly independent). But then I will make an issue describing what you're suggesting as an enhancement - as other than my (maybe unfounded) worry above - that would indeed be an elegant solution 👍 |
This PR brings a fix to what was first described in #433
The issue happened when using
--postbinning_input both
in combination with domain annotation, bin_refinement and CAT profiling.This is because bins annotated as eukaryotic are sent to the refined bins channels as is, bypassing dastool, and hence keeping the same name.
This fix adds a "refined" entry to the meta, and includes it in the new prefix for the CAT process.
PR checklist
nf-core lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).