Group segments by strains #18

Merged
merged 7 commits into from
Oct 11, 2024

Commits on Oct 9, 2024

  1. [ingest] Add segment information to metadata

    Produces a single large metadata file keyed off accession with both
    strain names and segment-level information encoded therein. The
    following commit will group this "long" metadata TSV into a "wide"
    TSV where each row represents a strain.
    jameshadfield committed Oct 9, 2024
    d2b13d1
  2. [ingest] parse nextclade TSV using csvtk

    Allows us to avoid another Python script.
    
    Original implementation by @joverlee521 in
    <#18 (comment)>
    jameshadfield committed Oct 9, 2024
    9bfce1b

Commits on Oct 10, 2024

  1. [ingest] Group segments by strain

    Transforms a metadata file with one row per accession into a file with
    one row per strain. Where a strain contains sequences for multiple segments this
    will group those segments together.
    
    Segment-specific field names (i.e. those not in --common-strain-fields) are modified
    to ensure their suffix is "_{segment}". For instance, "accession" becomes "accession_HA".
    
    Rows are matched on strain name and basic sanity checking is performed when grouping.
    Segments with multiple matches for a given strain are dropped, and strains where all
    segments have either zero or multiple matches are dropped entirely. Manual resolutions
    may be provided via a `--resolutions` YAML which is a list of dictionaries, each with
    keys "strain", "accession" and "segment" informing the program which accession (of
    multiple) to use. The resolutions here are taken both from exploring the data and
    the existing phylo exclude list.
    
    Any disagreement within the "--common-strain-fields" will result in the strain being
    dropped. However, empty values may be filled in from other segments, and ambiguous dates
    may be replaced with specific ones (where appropriate).
    jameshadfield committed Oct 10, 2024
    25d2a75
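
The grouping rules described in the commit message above can be illustrated with a minimal Python sketch. This is not the code of `group_segments.py`: the function names, the `COMMON_FIELDS` list, the segment names and the "X" placeholder convention for ambiguous dates are all assumptions made for illustration, and the `--resolutions` handling is left out.

```python
from collections import defaultdict

# Hypothetical common fields; the real script receives them via --common-strain-fields.
COMMON_FIELDS = ["strain", "date", "country"]


def merge_dates(a, b):
    """Return the more specific of two compatible dates, or None on conflict.

    Assumes ambiguity is written with 'X' placeholders, e.g. '2024-XX-XX'.
    """
    if len(a) != len(b):
        return None
    merged = []
    for ca, cb in zip(a, b):
        if ca == cb or cb == "X":
            merged.append(ca)
        elif ca == "X":
            merged.append(cb)
        else:
            return None
    return "".join(merged)


def merge_common(rows):
    """Merge the common fields of one strain's rows; None signals a disagreement."""
    merged = {}
    for field in COMMON_FIELDS:
        value = ""
        for row in rows:
            new = row.get(field, "")
            if not value or not new:
                value = value or new  # an empty value is filled in by the other
            elif field == "date":
                value = merge_dates(value, new)
                if value is None:
                    return None
            elif value != new:
                return None
        merged[field] = value
    return merged


def group_by_strain(rows, segments=("L", "M", "S")):
    """Turn one-row-per-accession records into one-row-per-strain records."""
    by_strain = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_strain[row["strain"]][row["segment"]].append(row)

    wide = []
    for by_segment in by_strain.values():
        # Keep only segments with exactly one match for this strain.
        kept = {s: by_segment[s][0] for s in segments if len(by_segment.get(s, [])) == 1}
        if not kept:
            continue  # every segment had zero or multiple matches
        common = merge_common(list(kept.values()))
        if common is None:
            continue  # disagreement within the common fields
        record = dict(common)
        for segment, row in kept.items():
            for field, value in row.items():
                if field not in COMMON_FIELDS and field != "segment":
                    record[f"{field}_{segment}"] = value
        wide.append(record)
    return wide
```

In this sketch a strain with an L row dated "2024-XX-XX" and an M row dated "2024-05-12" yields a single wide row carrying the specific date plus "accession_L" and "accession_M", while a strain with two S rows, or with segments that disagree on the date, is dropped.
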
  2. [ingest] fix metadata conflicts across segments

    Mismatched field values across segments (e.g. segments disagree on the
    'date') are now resolved by choosing the most common occurrence with
    the intention they are resolved upstream, as implemented here.
    
    This approach was the third implementation. Initially I resolved
    disagreements within `group_segments.py` via a provided resolutions
    YAML. After discussion with @joverlee521 we decided this could be better
    implemented via `augur curate`. The original implementation here did
    this _after_ the segment grouping; however, this made it impossible to
    distinguish disagreements which will be fixed from those which won't.¹
    
    NOTE: Here we use accession as the ID, however using strain name would
    be better going forward as it would reduce the duplication needed in the
    current format. We can't (currently) do this in oropouche because strain
    names are added _after_ the curate chain runs.
    
    ¹ <#18 (comment)>
    jameshadfield committed Oct 10, 2024
    c8ab87c
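
The "most common occurrence" rule from the commit message above can be sketched as a simple majority vote. The following is an illustrative Python sketch only, not the PR's implementation (which the message says runs through `augur curate` and is keyed on accession); the function names and the tie-breaking choice are assumptions.

```python
from collections import Counter


def most_common_value(values):
    """Return the most frequent non-empty value; ties go to the value seen first."""
    counts = Counter(v for v in values if v)
    if not counts:
        return ""
    return counts.most_common(1)[0][0]


def resolve_conflicts(rows, fields, strain_key="strain"):
    """Overwrite each listed field with the majority value among rows sharing a strain."""
    groups = {}
    for row in rows:
        groups.setdefault(row[strain_key], []).append(row)

    resolved = []
    for group in groups.values():
        consensus = {f: most_common_value(r.get(f, "") for r in group) for f in fields}
        # Each accession-level row keeps its own fields but adopts the consensus values.
        resolved.extend({**row, **consensus} for row in group)
    return resolved
```

Applying the vote before grouping means every row of a strain agrees on the listed fields by the time segments are combined, so any disagreement that survives is one that was never covered by a resolution.
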
  3. Update workflows to use new ingest outputs

    Updates the files which ingest uploads and makes the corresponding
    changes to the phylogenetic workflow. As metadata (and sequences) now
    use "strain" as the unique ID a number of simplifications can be made
    to the workflow.
    
    There is one regression: the "accession" column no longer exists and
    is thus not exported. We'll fix this in a subsequent commit.
    jameshadfield committed Oct 10, 2024
    968e6f2
  4. [phylo] export segment specific metadata

    This adds back in segment-specific metadata to the Auspice JSON. There
    are multiple ways this can be done, each with trade-offs. The approach
    employed here leaves the "_{segment}" suffix on the field names.
    Alternatively we could remap the metadata file for each `export` call so
    that (e.g.) "accession_S" becomes "accession".
    jameshadfield committed Oct 10, 2024
    922551a
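
The alternative mentioned in the commit message above, remapping the metadata for each `export` call so that "accession_S" becomes "accession", could look roughly like the Python sketch below. It is not part of the PR: the function name and the default segment list are assumptions, and it would misclassify any common field whose name happens to end in a segment suffix.

```python
def remap_for_segment(row, segment, segments=("L", "M", "S")):
    """Strip the '_{segment}' suffix for one segment and drop other segments' fields."""
    own = f"_{segment}"
    others = tuple(f"_{s}" for s in segments if s != segment)
    remapped = {}
    for field, value in row.items():
        if field.endswith(own):
            # e.g. 'accession_S' -> 'accession' when exporting the S segment
            remapped[field[: -len(own)]] = value
        elif not field.endswith(others):
            remapped[field] = value  # fields shared by the whole strain
    return remapped
```

The trade-off is one remapped metadata file per segment in exchange for un-suffixed field names in each exported dataset.
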
  5. Update example data (files & rule)

    Updates the example data to use the new metadata format, where we group by strain. Steps to regenerate:
    
    1. Populate 'data/' with metadata & sequences from an ingest run.
    2. Subsample this as example data via:
    
    ```
    augur filter --metadata data/metadata.tsv --group-by country --subsample-max-sequences 30 --output-metadata example_data/metadata.tsv
    augur filter --metadata example_data/metadata.tsv --sequences data/L/sequences.fasta --output-sequences example_data/sequences_L.fasta
    augur filter --metadata example_data/metadata.tsv --sequences data/M/sequences.fasta --output-sequences example_data/sequences_M.fasta
    augur filter --metadata example_data/metadata.tsv --sequences data/S/sequences.fasta --output-sequences example_data/sequences_S.fasta
    ```
    jameshadfield committed Oct 10, 2024
    091dc7f