Summarize haplotype coverage by titer references using frequencies per haplotype from all available data #173

huddlej · 2024-06-24T19:21:30Z

Description of proposed changes

Replaces the current table of derived haplotype frequencies and titer references, which is based on a subsampled HA tree, with a table based on all available sequences from the same time period.

With the latest version of Nextclade, we can determine derived haplotype strings per record from a Nextclade annotations file with columns for clade and mutations relative to each clade. We can then calculate haplotype frequencies from all available data instead of a subset of data used to build a tree.

Development checklist

Identify derived HA1 haplotypes from all data with Nextclade
Estimate frequencies per haplotype from all data, including at 4 weeks ago and 8 weeks ago, and report the delta frequency over that period
Include distinct references per haplotype across all titer collections (e.g., cell FRA, cell HI, egg FRA, egg HI, etc.)
Generate table of haplotype frequencies and references with links to haplotype view in the corresponding tree

Related issue(s)

Related to #130
Depends on nextstrain/nextclade#1492

Checklist

Checks pass

Adds a prototype script that produces derived haplotype strings per record from a given Nextclade annotations file with columns for clade and mutations relative to each clade. The derived haplotypes produced with this script could eventually replace the haplotypes we build from the mutation-annotated trees and allow us to calculate haplotype frequencies from all available data instead of a subset of data used to build a tree. Related to #130. Depends on nextstrain/nextclade#1492.
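As a rough illustration of what "derived haplotype strings per record" could look like, here is a minimal sketch. It is not the prototype script itself: the comma-separated `GENE:mutation` input format, the `clade:mutations` output format, and the function name `derived_haplotype` are all assumptions made for this example.

```python
def derived_haplotype(clade, mutations, gene="HA1", delimiter="-"):
    """Build a derived haplotype string from a clade and its clade-relative mutations.

    Assumptions (not confirmed by the PR): `mutations` is a comma-separated
    string such as "HA1:S145N,HA2:V18M", and the haplotype is written as
    "<clade>:<mutation>-<mutation>". Only mutations in `gene` are kept.
    """
    prefix = f"{gene}:"
    kept = [
        mutation[len(prefix):]
        for mutation in mutations.split(",")
        if mutation.startswith(prefix)
    ]

    # A record with no mutations in the gene of interest is just its clade.
    if not kept:
        return clade

    return f"{clade}:{delimiter.join(kept)}"


example = derived_haplotype("J.2", "HA1:S145N,HA2:V18M,HA1:K189R")
print(example)
```

Records without any HA1 mutations relative to their clade fall back to the clade name alone, so every record still gets a haplotype label.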

Replaces a within-script filtering of Nextclade records by QC with a separate workflow rule that produces a new file with only non-bad records. This new file will serve as input to other rules that build on high-quality Nextclade annotations.
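The filtering step itself is small. The sketch below shows the idea in plain Python, assuming the Nextclade TSV carries its overall QC status in the `qc.overallStatus` column; the function name `filter_non_bad` is hypothetical, and the actual workflow rule may use a different tool entirely.

```python
import csv
import io


def filter_non_bad(rows, status_column="qc.overallStatus"):
    """Yield only records whose overall QC status is not 'bad'.

    Records with a missing status are kept in this sketch; the real
    workflow rule may treat them differently.
    """
    for row in rows:
        if (row.get(status_column) or "").strip().lower() != "bad":
            yield row


# Small in-memory example standing in for a Nextclade annotations file.
nextclade_tsv = (
    "seqName\tclade\tqc.overallStatus\n"
    "A/one/2024\tJ.2\tgood\n"
    "A/two/2024\tJ.2\tbad\n"
    "A/three/2024\tJ.1\tmediocre\n"
)
reader = csv.DictReader(io.StringIO(nextclade_tsv), delimiter="\t")
kept_names = [row["seqName"] for row in filter_non_bad(reader)]
print(kept_names)
```

Writing the filtered records to their own file, as this commit does, lets every downstream rule share one definition of "high-quality" instead of repeating the check.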

Adds rules to get derived haplotypes from Nextclade annotations for all data and then join those haplotypes with the metadata. The resulting metadata file has strain name, collection date, and haplotype columns that we need for the next steps of the workflow to estimate haplotype frequencies and annotate haplotypes by available titer references.
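The join can be pictured as a lookup from strain name to haplotype. The following sketch assumes both inputs share a `strain` column and that the metadata has a `date` column; these column names, the function name `join_haplotypes`, and the choice to drop strains without a haplotype are assumptions for illustration only.

```python
def join_haplotypes(metadata_rows, haplotype_rows, id_column="strain"):
    """Return one record per metadata row that has a derived haplotype.

    This is an inner join in this sketch: strains without a haplotype
    (for example, records removed by QC filtering) are dropped.
    """
    haplotype_by_strain = {
        row[id_column]: row["haplotype"] for row in haplotype_rows
    }

    joined = []
    for row in metadata_rows:
        strain = row[id_column]
        if strain in haplotype_by_strain:
            joined.append({
                "strain": strain,
                "date": row["date"],
                "haplotype": haplotype_by_strain[strain],
            })

    return joined


metadata = [
    {"strain": "A/one/2024", "date": "2024-05-01", "region": "Europe"},
    {"strain": "A/two/2024", "date": "2024-05-20", "region": "Asia"},
]
haplotypes = [{"strain": "A/one/2024", "haplotype": "J.2:S145N"}]
joined = join_haplotypes(metadata, haplotypes)
print(joined)
```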

Adds a script and rule to estimate a "tip" frequencies JSON from metadata alone. `augur frequencies` doesn't provide this simple functionality directly, so this commit adds a script that replicates some of that command's internal logic to produce a tip frequencies JSON with the KDE-based method. Since KDE frequency estimates only require a list of dates, we can estimate frequencies for each sequence in the metadata and use those estimates in a subsequent rule to estimate frequencies of derived haplotypes. In this commit, I chose to limit the frequency estimation period to a max date of 4 weeks prior to the current run date and a min date of 16 weeks prior. Initially, these frequencies will only be used to calculate a delta frequency by comparing the most recent value to the value at the preceding timepoint.
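To show why a list of dates is enough, here is a simplified sketch of KDE-style tip frequencies. It is not Augur's implementation: it places a single Gaussian kernel on each collection date (dates as decimal years), evaluates every kernel at each pivot, and normalizes so frequencies sum to 1 per pivot. The one-month bandwidth, the function names, and the single-kernel simplification are assumptions; only the 16-week and 4-week window bounds come from the commit message.

```python
import math
from datetime import date, timedelta


def frequency_window(run_date):
    """Return (min_date, max_date): 16 and 4 weeks before the run date."""
    return run_date - timedelta(weeks=16), run_date - timedelta(weeks=4)


def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution with the given mean and sigma at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def estimate_tip_frequencies(dates_by_strain, pivots, sigma=1 / 12):
    """Estimate per-strain frequencies at each pivot from dates alone.

    `dates_by_strain` maps strain name to a decimal-year collection date.
    `sigma` is an assumed kernel bandwidth in years (about one month).
    """
    densities = {
        strain: [gaussian_pdf(pivot, sample_date, sigma) for pivot in pivots]
        for strain, sample_date in dates_by_strain.items()
    }

    # Total density at each pivot, used to normalize to frequencies.
    totals = [
        sum(density[i] for density in densities.values())
        for i in range(len(pivots))
    ]

    return {
        strain: [
            value / total if total > 0 else 0.0
            for value, total in zip(density, totals)
        ]
        for strain, density in densities.items()
    }


window = frequency_window(date(2024, 7, 5))
tip_frequencies = estimate_tip_frequencies(
    {"A/one/2024": 2024.40, "A/two/2024": 2024.45, "A/three/2024": 2024.20},
    pivots=[2024.30, 2024.45],
)
```

Because each strain contributes its own normalized density, summing these values over any group of strains (such as all strains sharing a haplotype) gives that group's frequency at each pivot.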

Adds rule and script to summarize derived haplotype frequencies from all available data.
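The summary step can be sketched as a group-and-sum over tip frequencies, followed by a delta between the last two timepoints. The data shapes, the output field names, and the function name `summarize_haplotype_frequencies` below are hypothetical, and the sketch assumes at least two pivots per strain.

```python
def summarize_haplotype_frequencies(tip_frequencies, haplotype_by_strain):
    """Sum tip frequencies per haplotype and report the latest change.

    `tip_frequencies` maps strain to a list of frequencies, one per pivot
    (at least two pivots are assumed). Strains without a haplotype are skipped.
    """
    totals = {}
    for strain, frequencies in tip_frequencies.items():
        haplotype = haplotype_by_strain.get(strain)
        if haplotype is None:
            continue

        if haplotype not in totals:
            totals[haplotype] = [0.0] * len(frequencies)

        for i, frequency in enumerate(frequencies):
            totals[haplotype][i] += frequency

    summary = [
        {
            "haplotype": haplotype,
            "current_frequency": frequencies[-1],
            # Change between the most recent pivot and the one before it.
            "delta_frequency": frequencies[-1] - frequencies[-2],
        }
        for haplotype, frequencies in totals.items()
    ]

    # Most common haplotypes first.
    return sorted(summary, key=lambda record: record["current_frequency"], reverse=True)


summary = summarize_haplotype_frequencies(
    {
        "A/one/2024": [0.5, 0.125],
        "A/two/2024": [0.25, 0.125],
        "A/three/2024": [0.25, 0.75],
    },
    {
        "A/one/2024": "J.2",
        "A/two/2024": "J.2",
        "A/three/2024": "J.2:S145N",
    },
)
print(summary)
```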

Updates the script that annotates derived haplotypes for nodes in the tree to use the same style as the haplotypes table: mutations are hyphen-delimited (hyphens work as values in URL parameters, unlike comma-delimited lists) and each mutation includes its ancestral allele. These changes should allow us to link from the haplotype tables to the tree view for the same haplotypes.
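A small sketch of the formatting and linking idea follows. The base URL, the query parameter name `f_haplotype`, and both function names are placeholders chosen for this example, not values taken from the workflow.

```python
from urllib.parse import urlencode


def format_mutations(mutations):
    """Join mutations with hyphens, including each ancestral allele.

    `mutations` is a list of (ancestral, position, derived) tuples,
    e.g. ("S", 145, "N") becomes "S145N".
    """
    return "-".join(
        f"{ancestral}{position}{derived}"
        for ancestral, position, derived in mutations
    )


def tree_link(base_url, haplotype, filter_param="f_haplotype"):
    """Build a link to a tree view filtered to one haplotype.

    `filter_param` is a hypothetical query parameter name.
    """
    return f"{base_url}?{urlencode({filter_param: haplotype})}"


haplotype = format_mutations([("S", 145, "N"), ("K", 189, "R")])
link = tree_link("https://example.org/tree", haplotype)
print(link)
```

Using the same string in the table and in the tree annotation is what makes the link work: the value in the URL has to match the node annotation exactly.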

huddlej self-assigned this Jun 24, 2024

huddlej mentioned this pull request Jun 24, 2024

Automate monthly reports #130

Open

12 tasks

huddlej changed the title ~~Stub script for derived haplotypes from Nextclade~~ Summarize haplotype coverage by titer references using frequencies per haplotype from all available data Jul 5, 2024

huddlej added 11 commits July 5, 2024 14:16

Ignore "tables" output directory

0d5d00e

Remove unsupported pylint config option

99b7c9c

Fix frequencies output format to match Auspice

cf71f12

Start summarizing haplotypes from all data

c706972

Start annotating titers to haplotype summaries

c8a45e3

Refine derived haplotypes table outputs

30143cc

Add override for build date

84e6291

Join mutations by hyphens to avoid breaking markdown tables

fa3e57f

huddlej marked this pull request as ready for review July 8, 2024 02:07

huddlej added 2 commits July 8, 2024 16:25

Simplify haplotype summary table

d783bcd

huddlej merged commit 8849483 into master Jul 25, 2024
3 checks passed

huddlej deleted the add-derived-haplotypes-for-all-sequences branch July 25, 2024 19:15
