Restrict contextual samples to 1 year back in 1m/2m/6m builds #1129

Closed
wants to merge 3 commits into from

Conversation

trvrb
Member

@trvrb trvrb commented Jul 25, 2024

Description of proposed changes

Currently, we focus on a recent time window in many of our ncov analyses. For example, ncov/gisaid/north-america/6m does more intensive sampling of the previous 6 months (aiming for ~4000 "recent" samples and ~1000 "early" samples). However, as the past continues to recede (as it does), we're getting more and more early context that isn't so relevant for understanding circulating diversity. Here's the live 6m tree, for example:

Screenshot 2024-07-25 at 11 52 12 AM

Once selective sweeps occur, we can largely forget about past evolution when looking at current diversity. This had previously prompted us to make the "21L" builds that root to clade 21L / lineage BA.2.

At this point, we're getting closer and closer to wanting the same thing with a clade 24A / lineage JN.1 rooting. However, this strategy is clearly not sustainable. In 4 more years we don't want to have 4 different rootings, all of which require updating, with it unclear to users which one they should be looking at.

This PR addresses the issue in a simple fashion, basically making ncov work more like seasonal-flu or avian-flu, where there are recent focal samples and older contextual samples, but the contextual samples only go back a year rather than many.

Here's the resulting global 6m tree:

Screenshot 2024-07-25 at 12 24 56 PM

Results from running this PR can be seen at:

This is using a 10:1 ratio of recent to early samples and doing a +1 year back for early samples. For the 6m analysis, it's 0m to 6m back as recent focal samples and 6m to 18m back as early contextual samples.
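
To make the window arithmetic concrete, here is a minimal Python sketch of how the focal and contextual date ranges and sample targets could be derived. It is not the actual ncov subsampling config: the function names, the `focal_target=4000` default (borrowed from the ~4000 "recent" figure above), and the calendar-month arithmetic are all assumptions for illustration.

```python
from datetime import date

def months_back(d: date, months: int) -> date:
    """Return the date `months` calendar months before `d`, clamping the day."""
    total = d.year * 12 + (d.month - 1) - months
    year, month = divmod(total, 12)
    month += 1
    # Clamp the day so that e.g. Aug 31 minus 6 months lands on Feb 28/29
    if month == 12:
        next_month_start = date(year + 1, 1, 1)
    else:
        next_month_start = date(year, month + 1, 1)
    last_day = (next_month_start - date(year, month, 1)).days
    return date(year, month, min(d.day, last_day))

def sampling_windows(today, focal_months, context_months=12,
                     focal_target=4000, ratio=10):
    """Hypothetical helper: focal/context date ranges and sample targets."""
    focal_start = months_back(today, focal_months)
    context_start = months_back(today, focal_months + context_months)
    return {
        "focal": (focal_start, today),            # e.g. 0m to 6m back
        "context": (context_start, focal_start),  # e.g. 6m to 18m back
        "focal_target": focal_target,
        "context_target": focal_target // ratio,  # 10:1 focal to context
    }

# The 6m analysis as of the day this PR was opened
windows = sampling_windows(date(2024, 7, 25), focal_months=6)
```

Passing `context_months=24` would give the +2 year variant for comparison.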

I had tried a +2 year context as well, but it didn't seem to add much understanding while taking up additional screen real estate and additional color ramp. You can compare here however: global/6m

The biggest worry I see here is that people currently landing at ncov/gisaid/global/6m can see what's currently circulating and get all the context that they may need going back to the beginning of the pandemic (with well known VOCs, etc...). With this change they would lose that deeper context.

If we did merge this, we should make two showcase cards on the splash page to direct to 6m vs all-time to (partially) address this. Also, if we did merge this, I'd imagine deprecating the 21L builds, where we'd remove them from the automated GitHub Actions rebuild, remove them from the manifest and add redirects to go from ncov/gisaid/21L/global/6m to ncov/gisaid/global/6m.
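
As a rough illustration of the redirect rule, the path rewrite could look like the Python sketch below. The function name is hypothetical and this is not how redirects are actually configured for these datasets; it only shows the intended mapping.

```python
def redirect_21L(path: str) -> str:
    """Map a 21L-rooted dataset path to its non-21L equivalent.

    Paths that are not of the form ncov/<source>/21L/... pass through unchanged.
    """
    parts = path.strip("/").split("/")
    # Only rewrite paths where the third segment is the 21L rooting
    if len(parts) >= 3 and parts[0] == "ncov" and parts[2] == "21L":
        parts = parts[:2] + parts[3:]
    return "/".join(parts)

target = redirect_21L("ncov/gisaid/21L/global/6m")
```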

The other approach to this same issue would be to take older clades (and conceivably Pango lineages) and make these gray while keeping the color ramp only for more recent clades. These two strategies are not mutually exclusive, and we could do both or neither. I'll try to put together a separate PR for the clade colors idea. Though even if the colors update fixes things enough for the time being, I do think we'll eventually want to do something like this strategy. But it's possible this is a couple years down the road.
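
A minimal Python sketch of the gray-out idea follows, assuming bare Nextstrain clade names (two-digit year plus letter, so string order matches chronological order). The palette, the function name, and the choice of 23A as cutoff are hypothetical and not taken from any actual implementation.

```python
GRAY = "#BBBBBB"
# Hypothetical color ramp, ordered from oldest to newest
RAMP = ["#4C90C0", "#75B681", "#C1BA47", "#E68033", "#DB2823"]

def clade_colors(clades, oldest_colored):
    """Gray out clades older than `oldest_colored`; ramp the rest."""
    colors = {}
    recent = sorted(c for c in clades if c >= oldest_colored)
    for clade in clades:
        if clade < oldest_colored:
            colors[clade] = GRAY
    for i, clade in enumerate(recent):
        # Spread the recent clades across the ramp, oldest to newest
        idx = round(i * (len(RAMP) - 1) / max(len(recent) - 1, 1))
        colors[clade] = RAMP[idx]
    return colors

colors = clade_colors(["21L", "22B", "22D", "23A", "24A"], oldest_colored="23A")
```

With this cutoff, 21L, 22B and 22D all share the gray while 23A and 24A take the two ends of the ramp.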

In addition to code review, I'd appreciate 👍 / 👎 feedback on whether you prefer this to the current sampling strategy.

Testing

Tested locally and via GitHub Action trial builds.

Release checklist

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

When subsampling in the Nextstrain GISAID profile, rather than treating early contextual samples as spanning from the origin of the pandemic to the beginning of the focal window (eg for the 6m analysis, from 2020 to 6m ago), instead use a consistent 24m of additional context. So, for 6m, this is context of 30m ago to 6m ago and focal of 6m ago to present. Additionally, reduce the amount of contextual sequences included from a 4:1 ratio of focal to context to a 10:1 ratio of focal to context.
Drop forced inclusion of Wuhan/1 root in the Nextstrain GISAID profile and swap rooting to use "best", ie temporally optimal rooting. This allows the root to be the common ancestor of the subsampled sequences. This makes it so that with the changes to time-based subsampling in the previous commit, the "6m" analysis includes samples from the previous 30m and the TMRCA is in ~2021.

This setup should be significantly more future proof than needing to continually make new clade-specific (eg /21L/) roots as selective sweeps occur.
@trvrb trvrb added the proposal Proposals that warrant further discussion label Jul 25, 2024
@trvrb trvrb requested a review from a team July 25, 2024 19:41
@trvrb trvrb self-assigned this Jul 25, 2024
@jameshadfield
Member

As an experiment I cut up the global/6m tree to split out the recombinant XBB clade and drop the 22B & 22D clades which had no samples from the 6 month focus window. It wasn't as much of an improvement as I had hoped, but still better I think. Haven't sketched out how hard it would be to automate such a cut, and if we want to pursue it we can of course do it afterwards.

image

@trvrb trvrb mentioned this pull request Jul 27, 2024
1 task
@trvrb
Member Author

trvrb commented Jul 27, 2024

The other approach to this same issue would be to take older clades (and conceivably Pango lineages) and make these gray while keeping the color ramp only for more recent clades.

After working through the coloring option in PR #1132, I'm definitely more of a fan of the color update. Unless there are conflicting preferences, I'll plan to just close this PR.

@trvrb trvrb closed this Jul 31, 2024
@victorlin victorlin changed the title Update time based sampling Limit contextual samples to 1 year back in 1m/2m/6m builds Oct 4, 2024
@victorlin victorlin changed the title Limit contextual samples to 1 year back in 1m/2m/6m builds Restrict contextual samples to 1 year back in 1m/2m/6m builds Oct 4, 2024
@victorlin victorlin deleted the time-based-sampling branch October 4, 2024 18:47
@trvrb trvrb added the revisit sometime Useful to address but no bandwidth at the moment label Oct 8, 2024
@trvrb
Member Author

trvrb commented Oct 8, 2024

I just added the "revisit sometime" label. As time accrues, leaving the older context all the way back to Wuhan will get increasingly clunky. In perhaps 6 months or a year we should implement something like this PR and include a 2y temporal window as well.
