Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compression to MSA modules #4754

Merged
merged 68 commits into from
Feb 15, 2024
Merged

Conversation

lrauschning
Copy link
Contributor

@lrauschning lrauschning commented Jan 17, 2024

Reopened, this time without duplicating commits.
This PR introduces output compression to the FAMSA, MTMalign, MAFFT, CLUSTALO and MUSCLE5 (EDIT: and TCOFFEE) modules.
This is required at 7. in https://nf-co.re/docs/contributing/modules#general
Behaviour of the modules is not otherwise changed.

@lrauschning
Copy link
Contributor Author

Tested locally on conda, let's see if singularity also works.

@lrauschning
Copy link
Contributor Author

For MTMalign, the file is written first to disk, and then compressed on disk, as I wasn't able to cleanly isolate MTMalign's output (it has a hardcoded outfile name for the FASTA output, and pollutes stdout with debug messages).
I think it still makes sense to implement it, so that the modules have a (somewhat) standard interface.

Copy link
Contributor

@adamrtalbot adamrtalbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to update the stubs. You could add tests to check them as well.

Comment on lines +25 to +30
def write_output = compress ? "--force -o >(pigz -cp ${task.cpus} > ${prefix}.aln.gz)" : "> ${prefix}.aln"
// using >() is necessary to preserve the return value,
// so nextflow knows to display an error when it failed
// the --force -o is necessary, as clustalo expands the commandline input,
// causing it to treat the pipe as a parameter and fail
// this way, the command expands to /dev/fd/<id>, and --force allows writing output to an already existing file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of this but I guess most pipeline developers will leave it on true and forget about it, so why not?

Comment on lines 45 to 49
stub:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
"""
touch ${prefix}.aln
touch ${prefix}.aln.gz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should change based on the compress value. Something like this:

    stub:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    def output = compress ? "${prefix}.aln.gz" : "${prefix}.aln"
    """
    touch ${output}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch, didn't change that since introducing the compress input channel. Might also affect some of the other modules, I'll have a look.

$args \\
-t ${task.cpus} \\
${fasta} \\
${prefix}.aln
${prefix}.aln${compress ? '.gz':''}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm coming around to this idea a bit more, it seems to be cleaner and harder to mess up.

Copy link
Contributor Author

@lrauschning lrauschning Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it being the most clean and straightforward to understand/document (edit: compared to the other options we came up with) is I think the main advantage.
Especially for tools like FAMSA which natively support compression its also cleaner than the output format changing based on a parameter passed via ext.args.

@lrauschning lrauschning added this pull request to the merge queue Feb 15, 2024
Merged via the queue into nf-core:master with commit faf557b Feb 15, 2024
39 checks passed
@lrauschning lrauschning deleted the msa-compression branch February 15, 2024 10:31
jch-13 pushed a commit to jch-13/modules that referenced this pull request Mar 19, 2024
* add pigz to clustalo

* add compression to muscle5

* enabled compression flag for famsa

* added compression to mafft

* compression for mtmalign

* set to mulled containers

* more informative test name

* change mtmalign test to search after unzipping

* update mtmalign tests to work with gzip, fix typo

* regenerate test snaps

* muscle5: zip multiple output files, if present

* Change MUSCLE5 tests to the same testcase TCOFFEE is using, also fix it

* add tags requested by nf-core-lint

* add full url to singularity/biocontainers

* fix famsa

* regenerated snapshots with nf-test 0.8.3. Reenabled snapshots for muscle5 and mtmalign

* forgot to regenerate mafft, also mtmalign seems to still be nondeterministic

* update metas

* compression support for tcoffee modules

* added pigz to tools in meta

* fix typo

* regenerate snaps, adjust test to gzip

* added mulled containers for tcoffee

* implement compression switching with channel

* add tags wanted by lint

* regenerate snapshots

* whoops, regenerated using container this time

* update meta.yml

* update glob in meta.yml

* support compressed input in irmsd

* assign more precise type in meta.yml

* add tag flagged by lint to tcoffee/irmsd

* set tcoffee/irmsd to use mulled container

* tcoffee/irmsd: do not compress template file, and correctly uncompress for irmsd

* tcoffee/align: reimplement toggling compression

* tcoffee/align: use new pipe name everywhere

* tcoffee/align: reenable default html output, add comment

* fix escaped line at end of comment...

* tcoffee/align: make tcoffee write to stdout, avoid using fifo

* clustalo/align: add optional compression

* muscle5/super5: add optional compression, also expand tests

* update snapshot

* muscle5/super5: re-add empty config file

* mafft: implement optional output compression, handle compressed input

* muscle5/super5: better parallelization for compressed -perm all

* mtmalign/align: implement optional compression

* mtmalign/align: add pigz to versions.yml

* mtmalign/align: fix

* regenerate snapshot

* famsa/align: implement optional compression

* whoops, fix tests

* clustalo/align: fix

* update snapshots

* generate different snapshots for compressed & uncompressed tests, prettify code

* updated snapshots

* mtmalign/align: update input pattern

* tcoffee/alncompare,irmsd: implement jose's suggestion

* tcoffee/irmsd: additional test for compressed input

* tcoffee/irmsd: add tag required by lint

* Revert "mtmalign/align: update input pattern"

This reverts commit 7a0e78d.

* incorporate adams suggestion, fix stub filename extensions

* apparently this requires regenerating the snapshots?

* try removing test match names, as per sateesh's suggestion

* Revert "try removing test match names, as per sateesh's suggestion"

This reverts commit 706d05f.

* tcoffee/align change snapshot names

* make snapshot names unique for nf-test 0.8.4

---------

Co-authored-by: Leon Rauschning <[email protected]>
jennylsmith pushed a commit to RSC-RP/modules that referenced this pull request Mar 20, 2024
* add pigz to clustalo

* add compression to muscle5

* enabled compression flag for famsa

* added compression to mafft

* compression for mtmalign

* set to mulled containers

* more informative test name

* change mtmalign test to search after unzipping

* update mtmalign tests to work with gzip, fix typo

* regenerate test snaps

* muscle5: zip multiple output files, if present

* Change MUSCLE5 tests to the same testcase TCOFFEE is using, also fix it

* add tags requested by nf-core-lint

* add full url to singularity/biocontainers

* fix famsa

* regenerated snapshots with nf-test 0.8.3. Reenabled snapshots for muscle5 and mtmalign

* forgot to regenerate mafft, also mtmalign seems to still be nondeterministic

* update metas

* compression support for tcoffee modules

* added pigz to tools in meta

* fix typo

* regenerate snaps, adjust test to gzip

* added mulled containers for tcoffee

* implement compression switching with channel

* add tags wanted by lint

* regenerate snapshots

* whoops, regenerated using container this time

* update meta.yml

* update glob in meta.yml

* support compressed input in irmsd

* assign more precise type in meta.yml

* add tag flagged by lint to tcoffee/irmsd

* set tcoffee/irmsd to use mulled container

* tcoffee/irmsd: do not compress template file, and correctly uncompress for irmsd

* tcoffee/align: reimplement toggling compression

* tcoffee/align: use new pipe name everywhere

* tcoffee/align: reenable default html output, add comment

* fix escaped line at end of comment...

* tcoffee/align: make tcoffee write to stdout, avoid using fifo

* clustalo/align: add optional compression

* muscle5/super5: add optional compression, also expand tests

* update snapshot

* muscle5/super5: re-add empty config file

* mafft: implement optional output compression, handle compressed input

* muscle5/super5: better parallelization for compressed -perm all

* mtmalign/align: implement optional compression

* mtmalign/align: add pigz to versions.yml

* mtmalign/align: fix

* regenerate snapshot

* famsa/align: implement optional compression

* whoops, fix tests

* clustalo/align: fix

* update snapshots

* generate different snapshots for compressed & uncompressed tests, prettify code

* updated snapshots

* mtmalign/align: update input pattern

* tcoffee/alncompare,irmsd: implement jose's suggestion

* tcoffee/irmsd: additional test for compressed input

* tcoffee/irmsd: add tag required by lint

* Revert "mtmalign/align: update input pattern"

This reverts commit 7a0e78d.

* incorporate adams suggestion, fix stub filename extensions

* apparently this requires regenerating the snapshots?

* try removing test match names, as per sateesh's suggestion

* Revert "try removing test match names, as per sateesh's suggestion"

This reverts commit 706d05f.

* tcoffee/align change snapshot names

* make snapshot names unique for nf-test 0.8.4

---------

Co-authored-by: Leon Rauschning <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants