Add H3N2 HA emerging clades #228

Draft
wants to merge 3 commits into
base: master

Conversation

huddlej
Contributor

@huddlej huddlej commented Sep 11, 2024

Updates the Auspice JSON tree to include emerging clades for H3N2 HA including J.1.1, J.2.1, and J.2.2.

Related to nextstrain/seasonal-flu#181 which updates the Nextclade dataset workflow to produce these new annotations.

preview: https://master.clades.nextstrain.org/?dataset-server=gh:@add-h3n2-ha-emerging-clades@

@huddlej huddlej deployed to refs/pull/228/merge September 11, 2024 22:05 — with GitHub Actions Active
Comment on lines -1109 to +1113

    -   "clades": 30,
    +   "clades": 28,
        "customClades": {
    -     "subclade": 36,
    -     "short-clade": 30
    +     "subclade": 34,
    +     "short-clade": 28,
    +     "emerging_subclade": 37

Member

@ivan-aksamentov ivan-aksamentov Sep 11, 2024

I noticed that the number of "big" clades, subclades and short clades (as counted on the tree nodes) all decreased by 2. Not sure if that's something expected or not.

Contributor Author

Good eye. The old tree has clades 3C and 3C.2a1b, which are missing from the new tree. 3C had only one sample in the old tree and has no samples in the new tree. 3C.2a1b has no samples in either tree, but it was annotated in the old tree and not in the new tree. I suspect that the workflow dropped these clades during subsampling, because we now sample more of the newer sequences.

image
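
The comparison described above (which clade labels appear on nodes of the old tree but not the new one) can be sketched in a few lines of Python. This is a minimal sketch, assuming the Auspice v2 layout in which a node stores its clade under `node_attrs.clade_membership.value`; the helper name `clades_in_tree` and both toy trees are hypothetical stand-ins, not the real dataset.

```python
def clades_in_tree(node, attr="clade_membership"):
    """Collect the distinct values of `attr` over every node (tips and internal)."""
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        value = current.get("node_attrs", {}).get(attr, {}).get("value")
        if value is not None:
            found.add(value)
        stack.extend(current.get("children", []))
    return found


# Toy stand-ins for the "tree" entries of the old and new Auspice JSON files.
old_tree = {
    "name": "root",
    "node_attrs": {"clade_membership": {"value": "3C"}},
    "children": [
        {
            "name": "internal_1",
            "node_attrs": {"clade_membership": {"value": "3C.2a1b"}},
            "children": [
                {"name": "tip_a", "node_attrs": {"clade_membership": {"value": "J.2.1"}}},
            ],
        },
    ],
}
new_tree = {
    "name": "root",
    "node_attrs": {},
    "children": [
        {"name": "tip_a", "node_attrs": {"clade_membership": {"value": "J.2.1"}}},
    ],
}

# Labels annotated anywhere in the old tree but nowhere in the new tree.
missing = sorted(clades_in_tree(old_tree) - clades_in_tree(new_tree))
print(missing)
```

Because internal nodes are visited too, a clade that is annotated on a branch but has no sampled tips still counts, which matches the 3C.2a1b case described in the comment.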

Contributor Author

(Which is to say that for the "recent H3N2 HA" dataset, those missing clades are not a blocking issue for this PR.)

Member

In SC2 and mpox workflows, I force-include at least one representative sequence for each clade I want to include in a build so that all clades I want are represented. Maybe you could adopt some strategy like this to have less randomness involved?
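
One way to picture the force-include idea is the sketch below. It is only an illustration under stated assumptions: the `(strain, clade)` records, the strain names, and the helper `pick_representatives` are all hypothetical, and this is not the code used in the SC2 or mpox workflows.

```python
def pick_representatives(records, required_clades):
    """Return one strain per required clade, taking the first one seen.

    `records` is an iterable of (strain, clade) pairs. Clades with no
    candidate strain are returned separately so the gap is visible.
    """
    wanted = set(required_clades)
    chosen = {}
    for strain, clade in records:
        if clade in wanted and clade not in chosen:
            chosen[clade] = strain
    missing = sorted(wanted - set(chosen))
    return chosen, missing


# Hypothetical metadata rows: (strain name, clade label).
records = [
    ("strain_001", "3C"),
    ("strain_002", "J.2.1"),
    ("strain_003", "J.2.1"),
    ("strain_004", "J.2.2"),
]
required = ["3C", "3C.2a1b", "J.2.1", "J.2.2"]

chosen, missing = pick_representatives(records, required)

# One strain name per line, in the order the clades were requested.
include_lines = [chosen[clade] for clade in required if clade in chosen]
print("\n".join(include_lines))
print("no candidate for:", missing)
```

If the workflow subsamples with `augur filter`, a list like `include_lines` could be written to a text file and passed through its `--include` option, so that these strains survive subsampling regardless of the random draw.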


## 2024-08-08T05:08:21Z

Fix numbering of RBD sites it the `pathogen.json`. The relevant positions were indexed 1-based, when they should have been indexed 0-based.
Member

Suggested change

    - Fix numbering of RBD sites it the `pathogen.json`. The relevant positions were indexed 1-based, when they should have been indexed 0-based.
    + Fix numbering of RBD sites in the `pathogen.json`. The relevant positions were indexed 1-based, when they should have been indexed 0-based.
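
For readers unfamiliar with the indexing issue the changelog line refers to, here is a minimal sketch of the 1-based to 0-based conversion. The sequence and site numbers are made up for illustration, and nothing here assumes the actual schema of `pathogen.json`.

```python
def to_zero_based(positions_one_based):
    """Convert 1-based site numbers to 0-based indices."""
    if any(p < 1 for p in positions_one_based):
        raise ValueError("1-based positions must be >= 1")
    return [p - 1 for p in positions_one_based]


sequence = "MKTIIALSYI"       # toy amino-acid string, not a real reference
sites_one_based = [1, 4, 10]  # hypothetical site numbers as a human would count them

sites_zero_based = to_zero_based(sites_one_based)
residues = [sequence[i] for i in sites_zero_based]
print(sites_zero_based, residues)
```

Using 1-based numbers directly as 0-based indices would shift every site one residue to the right, which is the kind of off-by-one error the changelog entry describes.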

Member

@corneliusroemer corneliusroemer left a comment

This has been lying around for a month - was it waiting for anything in particular other than a merge from us? I've reviewed and have a few comments - not necessarily blocking but might be nice to address anyways.

I can't seem to sort by emerging subclade, why is that @ivan-aksamentov?

image

@huddlej what distinguishes an emerging subclade from a subclade? Why have that extra column? Does emerging mean provisional and hence what is meant by J.2.1 might change in the future? Otherwise why not just designate as a proper new clade?

Maybe the display name shouldn't be the underscored `emerging_subclade` but rather `Emerging subclade`. A short description would also be nice for the tooltip. Right now it's empty:
Brave Browser 2024-10-16 16 35 14

Something like this is possible:
Brave Browser 2024-10-16 16 35 16

Lastly, it would be nice to maybe add some new example sequences that are part of these new emerging clades.

Here's the tree with coloring by emerging clades:

Brave Browser 2024-10-16 16 38 28

Member

tsibley commented Oct 16, 2024

> I can't seem to sort by emerging subclade, why is that @ivan-aksamentov?

It's literally because the text doesn't wrap, which forces the sort asc/desc icons out of view. If I make the text wrap, you can see/use them.

image

It's conventional to allow clicking the column name/text itself to toggle thru sort state (asc, desc, none), which would at least restore functionality if not the indicators.

Member

ivan-aksamentov commented Oct 16, 2024

Yep, if the text does not fit, it pushes the arrows out of view. This is a CSS bug. (As a funny workaround, you can scroll them back in if you select the text and drag the selection all the way to the right.) The easiest fix is to pick names that are short words or even abbreviations/acronyms, space-separated instead of underscore-separated. The explanation can be tucked into the details in the tooltip.

But it's true that I need to return to the table at some point; it is one of the oldest components and can def use some love.

There is some more discussion in nextstrain/nextclade#1537.

Contributor Author

huddlej commented Oct 16, 2024

> This has been lying around for a month - was it waiting for anything in particular other than a merge from us?

@corneliusroemer I shared some initial context in a related issue that may be helpful background for this PR.

This PR is waiting on two things:

  1. a synchronous discussion between at least @rneher and me about whether this is the right solution to the issue of emerging subclades. I'm not as convinced now that I've used it for a month. I think I'd prefer a way to keep using the same "subclade" field but define new subclades in a prerelease Nextclade dataset that we could use in our reporting and users could opt into through the website. I was hoping to use an upcoming Nextstrain biweekly meeting to chat about this general issue.
  2. inclusion of representative sequences from older clades to avoid loss of those clades in the main H3N2 HA dataset (the approach you described above is what I was planning to do)

@corneliusroemer
Member

Thanks @huddlej for the response. A PR here is enough to have a "prerelease" dataset that's available through, for example, https://master.clades.nextstrain.org/?dataset-server=gh:@add-h3n2-ha-emerging-clades@ (and through an equivalent invocation of `nextclade dataset get` with a command-line argument specifying the server).

I'll convert this PR to draft state then as it's not actually ready to be reviewed/merged at this point in time.

@corneliusroemer corneliusroemer marked this pull request as draft October 16, 2024 21:21
Contributor Author

huddlej commented Oct 16, 2024

> a PR here is enough to have a "prerelease" dataset that's available through for example

I was hoping for something a little more visible to users of the web UI like a H3N2 HA dataset with both "official" and "experimental" labels on the production website. This would allow folks to use emerging annotations ahead of the various WHO meetings but before they've been released officially. But I'm happy to discuss any potential solutions.

Member

ivan-aksamentov commented Oct 16, 2024

A few thoughts:

  1. If this change provides a sufficiently different approach compared to what most users will use, then perhaps it could be a separate dataset? Think of it as a "fork". e.g. we could have

    nextstrain/flu/h3n2/ha/EPI1857216/default
    nextstrain/flu/h3n2/ha/EPI1857216/experimental
    

    or

    nextstrain/flu/h3n2/default/ha/EPI1857216
    nextstrain/flu/h3n2/experimental/ha/EPI1857216
    

    or whatever the paths/names/flavors you think make sense. The old paths need to be added to the shortcuts for backward compat.

    A disadvantage is that both datasets will have to be maintained in sync. You might update default but forget to update experimental - resulting in default being ahead of experimental.

    This approach can also be used if no consensus is found on the team - John could just create a community/huddlej/ sub-directory and add his stuff there :)

    Or perhaps a new collection nextstrain-experimental/ which will also be considered as "official"?

  2. I just realized that the separate column may not be a very bad idea - the new column is like a "beta" version of clades, and "beta" clades periodically "graduate" to the clade column proper - this way the 2 nomenclatures are always in sync. But that's up to science folks to decide of course - there are considerations way beyond just paths and JSONs.

  3. Improve software: introduce dataset pre-releases and allow users to pick dataset versions - either pre-release/release or even concrete versions. In the CLI, tags can already be selected. However, to implement pre-releases we will also need some kind of flag for each tag, so that pre-releases are not used as the default when no tag is specified - this will likely be a breaking change for the CLI. In the web app we can do whatever we want - we don't have to maintain a stable interface there.
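
The per-tag pre-release flag from the third point could look roughly like the sketch below. It is a hypothetical illustration only: `DatasetVersion`, `resolve_version`, the `prerelease` field, and the tag names are invented for this example and do not reflect Nextclade's actual dataset index or CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    tag: str
    prerelease: bool = False  # hypothetical per-tag flag


def resolve_version(versions, tag=None, allow_prerelease=False):
    """Pick a dataset version from a list ordered oldest to newest.

    An explicit tag always wins. Without a tag, the newest version is
    chosen, skipping pre-releases unless the caller opts in.
    """
    if tag is not None:
        for version in versions:
            if version.tag == tag:
                return version
        raise KeyError(f"unknown tag: {tag}")
    candidates = [v for v in versions if allow_prerelease or not v.prerelease]
    if not candidates:
        raise LookupError("no matching dataset version")
    return candidates[-1]


versions = [
    DatasetVersion("v1"),
    DatasetVersion("v2"),
    DatasetVersion("v3-beta", prerelease=True),
]

print(resolve_version(versions).tag)                         # default: newest release
print(resolve_version(versions, allow_prerelease=True).tag)  # opt in to pre-releases
print(resolve_version(versions, tag="v3-beta").tag)          # explicit tag
```

The design point is that the default path never returns a pre-release, so existing users who do not specify a tag keep getting released datasets, while an explicit tag or an opt-in still reaches the experimental annotations.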

5 participants