filter: Improve speed of --output-strains and --output-metadata #1469

Draft: wants to merge 4 commits into master

Commits on Aug 3, 2024

  1. 69dd347
  2. Split strain and metadata outputs

    Write the strain list directly instead of going through the metadata.
    This is much faster on large datasets.
    
    A side effect is that --output-strains is now sorted alphabetically
    instead of retaining the order from the original metadata. The 24.2.0
    changelog noted that this order is retained, but it is not stated
    explicitly anywhere else.
    victorlin committed Aug 3, 2024
    1a3bd3e
  3. Use tsv-utils for --output-metadata

    tsv-join is much faster than the other implementation here (18x faster -
    12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16
    million rows).
    victorlin committed Aug 3, 2024
    e9d4e60
  4. 91dafbf
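
For readers unfamiliar with the change in commit 2, here is a minimal sketch of writing the strain list directly rather than deriving it from the metadata. It is not the code in this PR: `write_strains` and its arguments are hypothetical names, and it assumes the strains that passed filtering are already held in memory as a collection of strings.

```python
import tempfile
from pathlib import Path


def write_strains(strains, path):
    """Write one strain name per line, sorted alphabetically.

    Hypothetical helper for illustration only; not Augur's actual function.
    Sorting gives a deterministic order, but it does not preserve the
    row order of the original metadata file.
    """
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for strain in sorted(strains):
            handle.write(f"{strain}\n")


if __name__ == "__main__":
    # Made-up strain names, deliberately out of order.
    kept = {"B/2", "A/1", "C/3"}
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "strains.txt"
        write_strains(kept, out)
        print(out.read_text(encoding="utf-8"), end="")
```

Because nothing here reads the metadata, the cost depends only on the number of retained strains, which may be why this path is faster on large datasets; the trade-off is the change in output order described in the commit message.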