Group segments by strains #18

Produces a single large metadata file keyed off accession with both strain names and segment-level information encoded therein. The following commit will group this "long" metadata TSV into a "wide" TSV where each row represents a strain.

@joverlee521

Allows us to avoid another Python script. Original implementation by @joverlee521 in <#18 (comment)>.

Transforms a metadata file with one row per accession into a file with one row per strain. Where a strain contains sequences for multiple segments, this will group those segments together. Segment-specific field names (i.e. those not in `--common-strain-fields`) are modified to ensure their suffix is "_{segment}", for instance "accession" → "accession_HA".

Rows are matched on strain name and basic sanity checking is performed when grouping. Segments with multiple matches for a given strain are dropped, and strains where all segments have either zero or multiple matches are dropped entirely.

Manual resolutions may be provided via a `--resolutions` YAML, which is a list of dictionaries, each with keys "strain", "accession" and "segment", informing the program which accession (of multiple) to use. The resolutions here are taken both from exploring the data and from the existing phylo exclude list.

Any disagreement within the `--common-strain-fields` will result in the strain being dropped; however, empty values may be replaced and ambiguous dates may be replaced with specific ones (where appropriate).
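The grouping rules described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the function name `group_segments`, its parameters, and the in-memory list-of-dicts representation are assumptions, and `resolutions` stands in for the already-parsed contents of the resolutions YAML. Empty common-field values are filled from non-empty ones, but the replacement of ambiguous dates with specific ones is not modelled.

```python
from collections import defaultdict


def group_segments(rows, common_fields, segments, resolutions=()):
    """Group one-row-per-accession records into one row per strain (sketch)."""
    # (strain, segment) -> accession picked by a manual resolution
    chosen = {(r["strain"], r["segment"]): r["accession"] for r in resolutions}

    # strain -> segment -> list of candidate rows
    by_strain = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_strain[row["strain"]][row["segment"]].append(row)

    wide = []
    for strain, by_segment in by_strain.items():
        kept = {}
        for segment in segments:
            candidates = by_segment.get(segment, [])
            if (strain, segment) in chosen:
                wanted = chosen[(strain, segment)]
                candidates = [c for c in candidates if c["accession"] == wanted]
            if len(candidates) == 1:
                kept[segment] = candidates[0]
            # zero or multiple matches: this segment is dropped
        if not kept:
            continue  # no segment had exactly one match: drop the strain

        out = {"strain": strain}
        consistent = True
        for field in common_fields:
            # ignore empty values so they can be filled from other segments
            values = {row[field] for row in kept.values() if row.get(field)}
            if len(values) > 1:
                consistent = False  # segments disagree: drop the strain
                break
            out[field] = values.pop() if values else ""
        if not consistent:
            continue

        # everything else is segment-specific and gets a "_{segment}" suffix
        for segment, row in kept.items():
            for field, value in row.items():
                if field in common_fields or field in ("strain", "segment"):
                    continue
                out[f"{field}_{segment}"] = value
        wide.append(out)
    return wide
```

Called with `segments=["L", "M", "S"]`, a strain with single L and M rows and two S rows would come out with `accession_L` and `accession_M` only, unless a resolution names which S accession to keep.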

@joverlee521

Mismatched field values across segments (e.g. segments disagree on the 'date') are now resolved by choosing the most common occurrence, with the intention that they are resolved upstream, as implemented here.

This approach was the third implementation. Initially I resolved disagreements within `group_segments.py` via a provided resolutions YAML. After discussion with @joverlee521 we decided this could be better implemented via `augur curate`. The original implementation here did this _after_ the segment grouping; however, this made it impossible to distinguish disagreements which will be fixed from those which won't.¹

NOTE: Here we use accession as the ID; however, using strain name would be better going forward as it would reduce the duplication needed in the current format. We can't (currently) do this in oropouche because strain names are added _after_ the curate chain runs.

¹ <#18 (comment)>
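As a rough illustration of "choosing the most common occurrence", here is a minimal sketch. The helper name `resolve_field` is hypothetical and is not taken from `group_segments.py`; skipping empty values and breaking ties by first occurrence are assumptions of this sketch rather than documented behaviour of the PR.

```python
from collections import Counter


def resolve_field(values):
    """Return the most common non-empty value, or "" if there is none."""
    counts = Counter(v for v in values if v)
    if not counts:
        return ""
    # most_common() orders equal counts by first occurrence, so ties go to
    # whichever value was seen first (an artefact of this sketch)
    return counts.most_common(1)[0][0]
```

For example, if two segments report a date of "2024-01-02" and the third reports "2024-01-03", this would pick "2024-01-02" for the strain.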

Updates the files which ingest uploads and makes the corresponding changes to the phylogenetic workflow. As metadata (and sequences) now use "strain" as the unique ID, a number of simplifications can be made to the workflow. There is one regression: the "accession" column no longer exists and is thus not exported. We'll fix this in a subsequent commit.

This adds back in segment-specific metadata to the Auspice JSON. There are multiple ways this can be done, each with trade-offs. The approach employed here leaves the "_{segment}" suffix on the field names. Alternatively we could remap the metadata file for each `export` call so that (e.g.) "accession_S" becomes "accession".
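For comparison, the alternative remapping mentioned above might look something like the sketch below. This approach is not the one taken in this PR, and the function name `remap_for_segment` and the default segment list are assumptions made for illustration. It also assumes that no strain-level field name happens to end in "_L", "_M" or "_S".

```python
def remap_for_segment(row, segment, segments=("L", "M", "S")):
    """Strip the "_{segment}" suffix for one segment; omit other segments' fields."""
    out = {}
    for field, value in row.items():
        # which segment (if any) does this field belong to?
        suffix = next((s for s in segments if field.endswith(f"_{s}")), None)
        if suffix is None:
            out[field] = value  # strain-level field, kept as-is
        elif suffix == segment:
            out[field[: -len(suffix) - 1]] = value  # "accession_S" -> "accession"
        # fields for other segments are left out of this segment's metadata
    return out
```

The trade-off is that each `export` call would then need its own remapped copy of the metadata, whereas leaving the suffix in place lets one file serve all segments.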

Regenerates the example data to use the new metadata format, in which rows are grouped by strain.

Steps to regenerate:

1. Populate 'data/' with metadata & sequences from an ingest run.
2. Subsample this as example data via:

```
augur filter --metadata data/metadata.tsv --group-by country --subsample-max-sequences 30 --output-metadata example_data/metadata.tsv
augur filter --metadata example_data/metadata.tsv --sequences data/L/sequences.fasta --output-sequences example_data/sequences_L.fasta
augur filter --metadata example_data/metadata.tsv --sequences data/M/sequences.fasta --output-sequences example_data/sequences_M.fasta
augur filter --metadata example_data/metadata.tsv --sequences data/S/sequences.fasta --output-sequences example_data/sequences_S.fasta
```

Commits on Oct 9, 2024

Commits on Oct 10, 2024