filter: Improve speed of `--output-strains` and `--output-metadata` #1469

victorlin · 2024-05-18T21:32:16Z

Description of proposed changes

tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).

Related issue(s)

Prompted by Slack discussion

Checklist

Address FIXMEs
Checks pass
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

codecov · 2024-05-18T21:41:54Z

Codecov Report

Attention: Patch coverage is 59.52381%, with 17 lines in your changes missing coverage. Please review.

Project coverage is 68.70%. Comparing base (4923408) to head (a48025b).
Report is 1 commit behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| augur/filter/io.py | 54.05% | 14 Missing and 3 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1469      +/-   ##
==========================================
- Coverage   68.85%   68.70%   -0.16%     
==========================================
  Files          69       69              
  Lines        7607     7624      +17     
  Branches     1861     1867       +6     
==========================================
  Hits         5238     5238              
- Misses       2086     2100      +14     
- Partials      283      286       +3


victorlin · 2024-05-22T19:00:59Z

augur/filter/io.py

-        output_metadata_handle.close()
-    if output_strains:
-        output_strains.close()
+    tsv_join = which("tsv-join")


Using tsv-utils/tsv-join in Augur

@tsibley and I chatted about this yesterday. Two options:

1. Detect tsv-join in the environment and use it if available. Otherwise, fall back to the Python approach. Maintenance and additional testing on both code paths would be necessary in this case. This is effectively the same approach as the current invocation of fasttree/raxml/iqtree/vcftools/etc., except that those are explicitly requested by the user while tsv-join could be detected and used automatically as a faster alternative to the Python approach.

2. We could bundle tsv-join as part of Augur to avoid the downsides of (1). Based on the latest release v2.2.1, I thought tsv-utils only distributed binaries for macOS, but it looks like previous versions distribute binaries for both Linux and macOS (and this is how it's advertised). I think we can get away with using an older version.

> We could bundle tsv-join as part of Augur to avoid the downsides of (1).

Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.

@victorlin This is a clever solution and the speed-up you observe with ncov data suggests it's worth pursuing! Regarding:

> We could bundle tsv-join as part of Augur to avoid the downsides of (1).

This seems like the best way to provide this better experience to the most users and follows the pattern of bundling other third-party tools like you mention above.

At first, I liked the idea of tsv-utils being an implementation detail that users don't have to know about, but I wonder about the user experience for people who don't have tsv-utils installed and don't realize why the same command runs slower than in an environment where tsv-utils is available. What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?

> [bundling] seems like the best way to provide this better experience to the most users

I'm wary of the extra work required to figure out how to properly bundle tsv-join with Augur. I'd argue that the best way to provide this experience is already accomplished by including tsv-join in the managed runtimes.

> [bundling] follows the pattern of bundling other third-party tools like you mention above.

Oh, I meant that we don't bundle any other third-party tools currently so this would be a new approach.

> What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?

Great point - I think this will be the easiest way to push the feature through:

- don't bundle
- use tsv-join if it's available
- use the Python fallback, with a warning to consider installing tsv-join in the environment if experiencing slowness

We can still consider bundling in a future version.
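The plan above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `choose_metadata_writer`, its return values, and the warning text are hypothetical. Only `shutil.which` and the `tsv-join` executable name come from the discussion.

```python
import shutil
import sys


def choose_metadata_writer(which=shutil.which, warn=None):
    """Return "tsv-join" if the executable is on PATH, else "python".

    Hypothetical helper for illustration; the name and return values
    are not taken from Augur.
    """
    if warn is None:
        # Default: print the warning to stderr.
        def warn(message):
            print(message, file=sys.stderr)

    # Fast path: tsv-join was found in the environment.
    if which("tsv-join") is not None:
        return "tsv-join"

    # Fallback path: tell the user why this step may be slow.
    warn(
        "WARNING: tsv-join was not found; using the slower Python "
        "implementation. Consider installing tsv-utils if this step is slow."
    )
    return "python"
```

Passing `which` and `warn` as parameters keeps both code paths testable without requiring tsv-join to be installed.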

> Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.

Cornelius has made this available in conda-forge. Note that bioconda's tsv-utils still does not support osx-arm64.

All bioconda environments always use conda-forge preferentially (if correctly configured) so the migration from bioconda -> conda-forge is not an issue. conda-base uses the conda-forge one seamlessly.

tsv-utils is built from source over at conda-forge, so it's available for more platforms than the pre-built binaries. linux-aarch64 and osx-arm64 don't have pre-built binaries, but conda-forge has them now.

victorlin

This is more fragile than I initially expected.

tsv-join will only be used when all of these conditions are met:
- tsv-join is available
- xzcat/gzcat/zstdcat is available if the input type is compressed
- the output type is uncompressed (due to limitations)

Even with uncompressed output, there are some slight differences in behavior when it comes to handling quoted columns: afb010c

Threads for each point below.
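The three conditions listed above can be read as a single eligibility check. The sketch below is an assumption-laden illustration rather than the PR's implementation: `can_use_tsv_join` and `DECOMPRESSORS` are hypothetical names, and detecting compression from the file suffix is a simplification chosen for the example. The command names are the ones mentioned in the comment.

```python
import shutil
from pathlib import Path

# Assumed mapping from file suffix to decompression command
# (command names as mentioned in the comment above).
DECOMPRESSORS = {".gz": "gzcat", ".xz": "xzcat", ".zst": "zstdcat"}


def can_use_tsv_join(input_path, output_path, which=shutil.which):
    """Return True only if every listed condition holds (hypothetical helper)."""
    # 1. tsv-join itself must be available.
    if which("tsv-join") is None:
        return False

    # 2. A compressed input needs its matching decompressor.
    decompressor = DECOMPRESSORS.get(Path(input_path).suffix)
    if decompressor is not None and which(decompressor) is None:
        return False

    # 3. The output must be uncompressed.
    if Path(output_path).suffix in DECOMPRESSORS:
        return False

    return True
```

Any `False` result would route the call to the existing Python implementation.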

augur/filter/io.py

victorlin · 2024-07-17T22:33:04Z

tests/functional/filter/cram/filter-output-metadata-header.t

@@ -7,6 +7,8 @@ the default quotechar, any column names with that character may be altered.

 Quoted columns containing the tab delimiter are left unchanged.

+# FIXME: tsv-join has different behavior here. Test both?


These differences should be tested more before we use this as default behavior across pathogen workflows (and others start using it too). Maybe we can start by releasing this as an opt-in "beta" e.g. --output-metadata-attempt-tsv-utils.
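An opt-in flag of the kind suggested above might be declared like this. This is only a sketch: the flag name is the example given in the comment, while `build_parser`, the program name, and the help strings are placeholders rather than Augur's actual argument definitions.

```python
import argparse


def build_parser():
    """Build a toy parser with an opt-in tsv-utils flag (hypothetical)."""
    parser = argparse.ArgumentParser(prog="filter-sketch")
    parser.add_argument(
        "--output-metadata",
        help="path for the filtered metadata output",
    )
    # Off by default, so existing workflows keep the current behavior.
    parser.add_argument(
        "--output-metadata-attempt-tsv-utils",
        action="store_true",
        help="beta: try tsv-join for --output-metadata when possible",
    )
    return parser


args = build_parser().parse_args(["--output-metadata", "out.tsv"])
print(args.output_metadata_attempt_tsv_utils)
```

Because the flag defaults to off, the differences in quoted-column handling would only affect users who explicitly opt in.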

augur/filter/io.py

Write the strain list directly instead of going through the metadata. This is much faster on large datasets. The side effect is that --output-strains is sorted alphabetically instead of retaining the order from the original metadata. The 24.2.0 changelog notes that this order is retained, but it is not stated explicitly anywhere else.
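A minimal sketch of writing the strain list directly, assuming the retained strains are already held in a set. `write_strains` is a hypothetical name, and the real code may differ. It shows why the output order becomes alphabetical: a set has no metadata order to preserve, so the names are sorted before writing.

```python
import io


def write_strains(strains, handle):
    """Write one strain name per line, sorted alphabetically (hypothetical helper)."""
    for strain in sorted(strains):
        handle.write(strain + "\n")


# Usage with an in-memory buffer standing in for the output file.
buffer = io.StringIO()
write_strains({"B/2", "A/1", "C/3"}, buffer)
print(buffer.getvalue(), end="")
```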


victorlin self-assigned this May 18, 2024

victorlin commented May 22, 2024


victorlin force-pushed the victorlin/update-filter-outputs branch from a48025b to afb010c Compare July 17, 2024 22:14

victorlin commented Jul 17, 2024


tsibley reviewed Jul 24, 2024


victorlin added 4 commits August 2, 2024 18:43

Add tests for compressed metadata outputs

69dd347

Use tsv-utils for --output-metadata

e9d4e60

tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).

🚧 Note slight difference in behavior

91dafbf

victorlin force-pushed the victorlin/update-filter-outputs branch from 022fcd3 to 91dafbf Compare August 3, 2024 01:44

victorlin mentioned this pull request Aug 9, 2024

Speed up augur filter without replacing Pandas #1573

Open

5 tasks


victorlin May 22, 2024

jameshadfield May 22, 2024

huddlej May 28, 2024

victorlin May 29, 2024 (edited)

jameshadfield Jul 18, 2024

corneliusroemer Jul 19, 2024

corneliusroemer Jul 19, 2024

victorlin left a comment

victorlin Jul 17, 2024
