`augur merge` is slow to read in metadata #1628

tsibley · 2024-09-10T18:37:29Z

Based on my comment on the initial `augur merge` PR.

`augur merge` is stupidly slow for tiny datasets, e.g. a couple of seconds. That's due to Augur's own slow startup time and having to wait for it 2+n times, where n is the number of metadata tables being joined. On large datasets, this fixed startup time shouldn't matter, but on small datasets it feels really dumb. Cutting out the additional startup times by dropping the use of `augur read-file` and `augur write-file` makes it quite quick, as it should be. However, `augur {read,write}-file` are important for proper and robust handling of newlines and compression formats and can't be jettisoned without significant additional work. More to the point, we don't have to do that work (and take on the additional complexity) if we make other improvements.
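To make the fixed cost concrete, here is a minimal sketch of the arithmetic described above. It assumes every Augur process pays the same startup cost. `fixed_startup_overhead` is a hypothetical helper (not part of Augur), and the 0.5 s figure is an illustrative placeholder, not a measurement.

```python
def fixed_startup_overhead(n_tables: int, startup_seconds: float) -> float:
    """Estimate the startup time paid by one `augur merge` run.

    Uses the 2 + n count from the issue text, where n is the number of
    metadata tables being joined. The per-process startup cost is an
    assumed, uniform value supplied by the caller.
    """
    if n_tables < 0:
        raise ValueError("n_tables must be non-negative")
    return (2 + n_tables) * startup_seconds


if __name__ == "__main__":
    # Illustrative only: 0.5 s is a made-up startup time, not a measurement.
    for n in (2, 3, 5):
        print(n, fixed_startup_overhead(n, 0.5))
```

Because the overhead depends only on n and not on the number of rows, it dominates on tiny inputs and fades on large ones, which matches the behaviour described above.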

Improvements we can/should make:

- Improve startup time #472 (prior attempt, but not acceptable)
- merge: Import metadata files concurrently? #1617
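As a rough illustration of why the second item could help, the sketch below starts several child processes at once so that their startup times overlap instead of adding up. It is not how #1617 proposes to do it and not Augur's implementation. `run_concurrently` is a hypothetical helper, and the stand-in command just starts a Python interpreter rather than calling Augur.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(commands):
    """Run each command (an argv list) as its own process, all at once.

    Returns the exit codes in the same order as `commands`.
    """
    if not commands:
        return []
    # One worker thread per command, so every process is launched immediately
    # and the threads simply wait for their child to exit.
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        results = pool.map(lambda argv: subprocess.run(argv).returncode, commands)
        return list(results)


if __name__ == "__main__":
    # Stand-in commands (an assumption): each starts an interpreter and exits.
    stand_in = [sys.executable, "-c", "pass"]
    print(run_concurrently([stand_in] * 3))
```

With sequential launches the wall-clock cost is roughly the sum of the startup times; with concurrent launches it is closer to the slowest single one, assuming enough CPU and I/O headroom.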


This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`.

The limitation on non-seekable streams means the rule now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹

The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields (and some mangled quotes², e.g. the "institution" column for accession KJ556895). We really need to sort out our TSV formats³, but that's for a larger project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>

This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`.

The limitation on non-seekable streams means the workflow now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹

The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields. We really need to sort out our TSV formats³, but that's for a larger project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>

Ported-from: <nextstrain/measles@4d73b7f>
Related-to: <nextstrain/measles#52>
Related-to: <#65>

tsibley added the enhancement New feature or request label Sep 10, 2024
