Group segments by strains #18
Commits on Oct 9, 2024
[ingest] Add segment information to metadata
Produces a single large metadata file keyed off accession with both strain names and segment-level information encoded therein. The following commit will group this "long" metadata TSV into a "wide" TSV where each row represents a strain.
Commit d2b13d1
[ingest] parse nextclade TSV using csvtk
Allows us to avoid another Python script. Original implementation by @joverlee521 in <#18 (comment)>
Commit 9bfce1b
Commits on Oct 10, 2024
[ingest] Group segments by strain
Transforms a metadata file with one row per accession into a file with one row per strain. Where a strain has sequences for multiple segments, those segments are grouped together. Segment-specific field names (i.e. those not in --common-strain-fields) are modified so that they end with the suffix "_{segment}", for instance "accession" → "accession_HA".

Rows are matched on strain name and basic sanity checking is performed when grouping. Segments with multiple matches for a given strain are dropped, and strains where all segments have either zero or multiple matches are dropped entirely.

Manual resolutions may be provided via a `--resolutions` YAML, which is a list of dictionaries, each with keys "strain", "accession" and "segment", informing the program which accession (of multiple) to use. The resolutions here are taken both from exploring the data and from the existing phylo exclude list.

Any disagreement within the "--common-strain-fields" will result in the strain being dropped; however, empty values may be replaced and ambiguous dates may be replaced with specific ones (where appropriate).
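The grouping rules described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `group_segments.py`: the function name `group_by_strain`, the constant `COMMON_STRAIN_FIELDS` and the example field names are assumptions, and the sketch leaves out the `--resolutions` handling and the checks for disagreement within the common fields.

```python
from collections import defaultdict

# Hypothetical stand-in for the fields passed via --common-strain-fields
COMMON_STRAIN_FIELDS = ["strain", "date", "country"]


def group_by_strain(rows):
    """Turn one-row-per-accession records into one-row-per-strain records."""
    # strain -> segment -> list of candidate rows for that segment
    candidates = defaultdict(lambda: defaultdict(list))
    for row in rows:
        candidates[row["strain"]][row["segment"]].append(row)

    wide = {}
    for strain, by_segment in candidates.items():
        # A segment with multiple matches for this strain is dropped
        unique = {
            segment: matches[0]
            for segment, matches in by_segment.items()
            if len(matches) == 1
        }
        if not unique:
            # Every segment had zero or multiple matches: drop the strain
            continue

        record = {}
        for segment, row in unique.items():
            for field, value in row.items():
                if field == "segment":
                    continue
                if field in COMMON_STRAIN_FIELDS:
                    # Common fields are stored once per strain (first value seen)
                    record.setdefault(field, value)
                else:
                    # Segment-specific fields gain the "_{segment}" suffix
                    record[f"{field}_{segment}"] = value
        wide[strain] = record
    return wide
```

Calling `group_by_strain` on a list of dictionaries read from the long TSV (for example via `csv.DictReader`) returns a dictionary keyed by strain name, one entry per retained strain.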
Commit 25d2a75
[ingest] fix metadata conflicts across segments
Mismatched field values across segments (e.g. segments disagree on the 'date') are now resolved by choosing the most common occurrence, with the intention that they are resolved upstream, as implemented here.

This approach was the third implementation. Initially I resolved disagreements within `group_segments.py` via a provided resolutions YAML. After discussion with @joverlee521 we decided this could be better implemented via `augur curate`, and the original implementation here did this _after_ the segment grouping; however, this made it impossible to distinguish disagreements which will be fixed from those which won't.¹

NOTE: Here we use accession as the ID; however, using strain name would be better going forward as it would reduce the duplication needed in the current format. We can't (currently) do this in oropouche because strain names are added _after_ the curate chain runs.

¹ <#18 (comment)>
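As an illustration of "choosing the most common occurrence", a minimal sketch is shown below. The helper names `most_common_value` and `resolve_conflicts` are hypothetical and do not come from the repository; ignoring empty values and letting the first-seen value win a tie are assumptions of this sketch, not behaviour confirmed by the commit.

```python
from collections import Counter


def most_common_value(values):
    """Return the most frequent non-empty value, or "" if there is none."""
    # Assumption: empty strings are ignored so that they never outvote real values
    counts = Counter(value for value in values if value)
    if not counts:
        return ""
    # On a tie, Counter.most_common keeps the value that was seen first
    return counts.most_common(1)[0][0]


def resolve_conflicts(rows, fields):
    """Pick one value per field from the rows (one row per segment) of a strain."""
    return {
        field: most_common_value(row.get(field, "") for row in rows)
        for field in fields
    }
```

For a strain whose three segment rows carry the dates "2024-01-01", "2024-01-01" and "2024-01-02", `resolve_conflicts(rows, ["date"])` would keep "2024-01-01".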
Commit c8ab87c
Update workflows to use new ingest outputs
Updates the files which ingest uploads and makes the corresponding changes to the phylogenetic workflow. As metadata (and sequences) now use "strain" as the unique ID, a number of simplifications can be made to the workflow. There is one regression: the "accession" column no longer exists and is thus not exported. We'll fix this in a subsequent commit.
Commit 968e6f2
[phylo] export segment specific metadata
This adds back in segment-specific metadata to the Auspice JSON. There are multiple ways this can be done, each with trade-offs. The approach employed here leaves the "_{segment}" suffix on the field names. Alternatively we could remap the metadata file for each `export` call so that (e.g.) "accession_S" becomes "accession".
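The alternative mentioned above (remapping field names per `export` call) is not what this commit does, but it could look something like the sketch below. The function name `remap_for_segment` is hypothetical, and the sketch assumes each metadata row is available as a dictionary; fields belonging to other segments are left untouched.

```python
def remap_for_segment(record, segment):
    """Strip the "_{segment}" suffix from the fields of one segment."""
    suffix = f"_{segment}"
    remapped = {}
    for field, value in record.items():
        if field.endswith(suffix):
            # e.g. "accession_S" becomes "accession" when segment == "S"
            remapped[field[: -len(suffix)]] = value
        else:
            # Common fields and other segments' fields are kept as they are
            remapped[field] = value
    return remapped
```

One trade-off of this approach is that a stripped name could collide with an existing common field of the same name, which leaving the suffix in place avoids.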
Commit 922551a
Update example data (files & rule)
To use the new metadata format where we group by strain. Steps to regenerate:

1. Populate 'data/' with metadata & sequences from an ingest run.
2. Subsample this as example data via:

```
augur filter --metadata data/metadata.tsv --group-by country --subsample-max-sequences 30 --output-metadata example_data/metadata.tsv
augur filter --metadata example_data/metadata.tsv --sequences data/L/sequences.fasta --output-sequences example_data/sequences_L.fasta
augur filter --metadata example_data/metadata.tsv --sequences data/M/sequences.fasta --output-sequences example_data/sequences_M.fasta
augur filter --metadata example_data/metadata.tsv --sequences data/S/sequences.fasta --output-sequences example_data/sequences_S.fasta
```
Commit 091dc7f