Summarize haplotype coverage by titer references using frequencies per haplotype from all available data #173

Merged
merged 14 commits into from
Jul 25, 2024

Conversation

huddlej
Contributor

@huddlej huddlej commented Jun 24, 2024

Description of proposed changes

Replaces the current table of derived haplotype frequencies and titer references that is based on a subsampled HA tree with a table based on all available sequences during the same time period.

With the latest version of Nextclade, we can determine derived haplotype strings per record from a Nextclade annotations file with columns for clade and mutations relative to each clade. We can then calculate haplotype frequencies from all available data instead of a subset of data used to build a tree.
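
As a rough illustration of the idea above, a derived haplotype string can be assembled from a record's clade and its mutations relative to that clade. This is only a sketch: the function name `derive_haplotype`, the `HA1:S145N` mutation format, and the `clade:mutations` label layout are assumptions for illustration, not the column names or format that Nextclade or this workflow actually use.

```python
def derive_haplotype(clade, mutations, gene="HA1", delimiter="-"):
    """Build a haplotype label from a clade and its mutations relative to that clade.

    `mutations` is assumed to be a comma-delimited string such as
    "HA1:S145N,HA2:V18M" (a hypothetical format for this sketch).
    """
    prefix = f"{gene}:"
    # Keep only mutations in the requested gene and drop the gene prefix.
    gene_mutations = [
        mutation[len(prefix):]
        for mutation in mutations.split(",")
        if mutation.startswith(prefix)
    ]

    # A record without gene-specific mutations is represented by its clade alone.
    if not gene_mutations:
        return clade

    return f"{clade}:{delimiter.join(gene_mutations)}"


# Hypothetical record: two HA1 mutations and one HA2 mutation relative to the clade.
label = derive_haplotype("J.2", "HA1:S145N,HA2:V18M,HA1:N122D")
```

Because the label depends only on per-record annotations, it can be computed for every sequence rather than only for the tips of a subsampled tree.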

Development checklist

  • Identify derived HA1 haplotypes for all data from Nextclade annotations
  • Estimate frequencies per haplotype from all data, including at 4 and 8 weeks ago, and report the delta frequency over that period
  • Include distinct references per haplotype across all titer collections (e.g., cell FRA, cell HI, egg FRA, egg HI, etc.)
  • Generate table of haplotype frequencies and references with links to haplotype view in the corresponding tree

Related issue(s)

Related to #130
Depends on nextstrain/nextclade#1492

Checklist

  • Checks pass

Adds a prototype script that produces derived haplotype strings per
record from a given Nextclade annotations file with columns for clade
and mutations relative to each clade. The derived haplotypes produced
with this script could eventually replace the haplotypes we build from
the mutation-annotated trees and allow us to calculate haplotype
frequencies from all available data instead of a subset of data used to
build a tree.

Related to #130
Depends on nextstrain/nextclade#1492
@huddlej huddlej self-assigned this Jun 24, 2024
@huddlej huddlej mentioned this pull request Jun 24, 2024
12 tasks
@huddlej huddlej changed the title from "Stub script for derived haplotypes from Nextclade" to "Summarize haplotype coverage by titer references using frequencies per haplotype from all available data" Jul 5, 2024
huddlej added 11 commits July 5, 2024 14:16
Replaces a within-script filtering of Nextclade records by QC with a
separate workflow rule that produces a new file with only non-bad
records. This new file will serve as input to other rules that build on
high-quality Nextclade annotations.
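
A minimal sketch of such a filtering step, assuming the annotations are a tab-delimited Nextclade file whose overall QC status is stored in a `qc.overallStatus` column. The function name `filter_by_qc` and the example records are hypothetical and do not reflect the rule's actual implementation.

```python
import csv
import io


def filter_by_qc(input_handle, output_handle,
                 status_column="qc.overallStatus", excluded=("bad",)):
    """Copy records whose QC status is not excluded and return how many were kept."""
    reader = csv.DictReader(input_handle, delimiter="\t")
    writer = csv.DictWriter(
        output_handle,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()

    kept = 0
    for record in reader:
        if record[status_column] not in excluded:
            writer.writerow(record)
            kept += 1

    return kept


# Hypothetical annotations with one record per QC status.
annotations = io.StringIO(
    "seqName\tqc.overallStatus\nA/1\tgood\nA/2\tbad\nA/3\tmediocre\n"
)
filtered = io.StringIO()
n_kept = filter_by_qc(annotations, filtered)
```

Writing the filtered records to their own file lets every downstream rule share the same QC decision instead of repeating it.
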
Adds rules to get derived haplotypes from Nextclade annotations for all
data and then join those haplotypes with the metadata. The resulting
metadata file has strain name, collection date, and haplotype columns
that we need for the next steps of the workflow to estimate haplotype
frequencies and annotate haplotypes by available titer references.
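
The join described above might look like the following sketch. The column names `strain`, `date`, and `haplotype` follow the commit message, while the function name, the in-memory dictionaries, and the choice to keep records without a haplotype are assumptions.

```python
def join_haplotypes(metadata_records, haplotype_by_strain, missing=""):
    """Left-join haplotypes onto metadata records by strain name.

    Records without a haplotype keep the `missing` placeholder so that no
    metadata rows are silently dropped (an assumption of this sketch).
    """
    return [
        {
            "strain": record["strain"],
            "date": record["date"],
            "haplotype": haplotype_by_strain.get(record["strain"], missing),
        }
        for record in metadata_records
    ]


joined = join_haplotypes(
    [
        {"strain": "A/1", "date": "2024-05-01", "country": "USA"},
        {"strain": "A/2", "date": "2024-05-20", "country": "Japan"},
    ],
    {"A/1": "J.2:S145N"},
)
```
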
Adds a script and rule to estimate "tip" frequencies JSON from metadata
alone. This simple functionality isn't provided directly through `augur
frequencies`, so this commit adds a script that replicates some of the
internal logic of that Augur script to get a tip frequencies JSON with
the KDE-based method. Since KDE frequency estimates only require a list
of dates, we can estimate frequencies for each sequence in the metadata
and use those estimates in a subsequent rule to estimate frequencies of
derived haplotypes.
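
To show why a list of dates is sufficient, here is a simplified KDE sketch that places one Gaussian kernel on each sequence's collection date and normalizes across sequences at each pivot. It is not Augur's implementation: the function name, the single-kernel form, and the `sigma` default (in years) are assumptions made for illustration.

```python
import math


def kde_tip_frequencies(tip_dates, pivots, sigma=1 / 12):
    """Estimate per-tip frequencies at each pivot from collection dates alone.

    `tip_dates` maps strain name to a numeric date (decimal years) and
    `pivots` is a list of numeric dates at which to report frequencies.
    """
    # Unnormalized Gaussian density of each tip at each pivot.
    densities = {
        strain: [math.exp(-0.5 * ((pivot - date) / sigma) ** 2) for pivot in pivots]
        for strain, date in tip_dates.items()
    }

    # Total density per pivot across all tips.
    totals = [
        sum(values[i] for values in densities.values())
        for i in range(len(pivots))
    ]

    # Normalize so the tip frequencies at each pivot sum to one.
    return {
        strain: [
            value / total if total > 0 else 0.0
            for value, total in zip(values, totals)
        ]
        for strain, values in densities.items()
    }


tip_frequencies = kde_tip_frequencies(
    {"A/1": 2024.40, "A/2": 2024.45, "A/3": 2024.50},
    pivots=[2024.40, 2024.50],
)
```

No tree is needed at any point, so the same calculation scales to every sequence in the metadata.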

In this commit, I chose to limit the frequency estimation period to a
max date of 4 weeks prior to the current run date and a min date of 16
weeks prior. Initially, these frequencies will only be used to compare
the most recent value with the immediately preceding timepoint to
calculate a delta frequency.
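
The date window and the delta calculation can be sketched as follows. The function names are hypothetical, and the run date in the usage example is only an illustrative value.

```python
from datetime import date, timedelta


def estimation_window(run_date, min_weeks=16, max_weeks=4):
    """Return the (min_date, max_date) bounds relative to the run date."""
    return (
        run_date - timedelta(weeks=min_weeks),
        run_date - timedelta(weeks=max_weeks),
    )


def delta_frequency(frequencies):
    """Difference between the most recent frequency and the one just before it."""
    return frequencies[-1] - frequencies[-2]


min_date, max_date = estimation_window(date(2024, 7, 5))
delta = delta_frequency([0.10, 0.25, 0.40])
```
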
Adds rule and script to summarize derived haplotype frequencies from all
available data.
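
One plausible way to summarize is to sum the per-sequence frequencies of all sequences that share a haplotype at each pivot. The sketch below assumes that approach; the function name and the inputs are hypothetical and may differ from the script added in this commit.

```python
def haplotype_frequencies(tip_frequencies, haplotype_by_strain):
    """Sum per-tip frequencies by haplotype at each pivot.

    `tip_frequencies` maps strain name to a list of frequencies (one per
    pivot) and `haplotype_by_strain` maps strain name to haplotype label.
    """
    totals = {}
    for strain, frequencies in tip_frequencies.items():
        haplotype = haplotype_by_strain.get(strain)
        # Tips without a haplotype do not contribute to any total.
        if haplotype is None:
            continue

        if haplotype not in totals:
            totals[haplotype] = [0.0] * len(frequencies)

        for i, frequency in enumerate(frequencies):
            totals[haplotype][i] += frequency

    return totals


by_haplotype = haplotype_frequencies(
    {"A/1": [0.5, 0.2], "A/2": [0.25, 0.3], "A/3": [0.25, 0.5]},
    {"A/1": "J.2", "A/2": "J.2", "A/3": "J.2:S145N"},
)
```
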
@huddlej huddlej marked this pull request as ready for review July 8, 2024 02:07
Updates the script that annotates derived haplotypes for nodes in the
tree to use the same style as the haplotypes table with hyphen-delimited
mutations (which work as values in URL parameters unlike comma-delimited
lists) and with the ancestral allele included for each mutation. These
changes should allow us to link from the haplotype tables to the tree
view for the same haplotypes.
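
A small sketch of the hyphen-delimited style with ancestral alleles. The tuple representation, the ordering by position, and the function name are assumptions. The `quote` calls only illustrate one difference between the delimiters: the hyphen is an unreserved URL character that passes through unchanged, while the comma is a reserved character that gets percent-encoded and that applications may treat as a separator.

```python
from urllib.parse import quote


def format_haplotype(mutations, delimiter="-"):
    """Join (ancestral, position, derived) tuples into a haplotype string."""
    ordered = sorted(mutations, key=lambda mutation: mutation[1])
    return delimiter.join(
        f"{ancestral}{position}{derived}"
        for ancestral, position, derived in ordered
    )


# Hypothetical mutations for illustration.
mutations = [("S", 145, "N"), ("N", 122, "D")]
hyphenated = format_haplotype(mutations)
comma_delimited = format_haplotype(mutations, delimiter=",")

# The hyphenated string is unchanged by URL quoting; the comma is encoded.
quoted_hyphenated = quote(hyphenated)
quoted_comma = quote(comma_delimited)
```

Using one format for both the table and the tree annotations means a table entry can be passed as a URL parameter value that matches the tree's haplotype labels exactly.
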
@huddlej huddlej merged commit 8849483 into master Jul 25, 2024
3 checks passed
@huddlej huddlej deleted the add-derived-haplotypes-for-all-sequences branch July 25, 2024 19:15

1 participant