Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur merge is slow to read in metadata #1628

Open
2 tasks
tsibley opened this issue Sep 10, 2024 · 0 comments
Open
2 tasks

augur merge is slow to read in metadata #1628

tsibley opened this issue Sep 10, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@tsibley
Copy link
Member

tsibley commented Sep 10, 2024

based on my comment on the initial augur merge PR

augur merge is stupidly slow for tiny datasets, e.g. a couple seconds. That's due to Augur's own slow startup time and having to wait for that 2+n times, where n is the number of metadata tables being joined. On large datasets, this fixed startup time shouldn't matter, but on small datasets it feels really dumb. Cutting out the additional startup times by cutting out the use of augur read-file and augur write-file makes it quite quick, as it should be. However, augur {read,write}-file are important for proper and robust handling of newlines and compression formats and can't be jettisoned without significant additional work. More to the point, we don't have to do that work (and take on the additional complexity) if we make other improvements.

Improvements we can/should make:

@tsibley tsibley added the enhancement New feature or request label Sep 10, 2024
tsibley added a commit to nextstrain/measles that referenced this issue Sep 10, 2024
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the rule now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow start up time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
start up eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields (and some
mangled quotes², e.g. the "institution" column for accession KJ556895).
We really need to sort out our TSV formats³, but that's for a larger
project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>
tsibley added a commit to nextstrain/measles that referenced this issue Sep 10, 2024
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the workflow now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow start up time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
start up eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields (and some
mangled quotes², e.g. the "institution" column for accession KJ556895).
We really need to sort out our TSV formats³, but that's for a larger
project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>
tsibley added a commit to nextstrain/pathogen-repo-guide that referenced this issue Oct 3, 2024
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the workflow now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow start up time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
start up eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields.  We really
need to sort out our TSV formats³, but that's for a larger project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>

Ported-from: <nextstrain/measles@4d73b7f>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant