WIP: New clades subcommand that works like traits, using labeled tips rather than clades.tsv #1329

corneliusroemer · 2023-10-26T17:04:44Z

Currently, to assign clades at internal nodes, we require a clades.tsv listing the defining mutations for each clade. This isn't a natural way to define clades in many cases: often people define clades through representative strains.

This draft PR is an attempt to offer an alternative clades command that does clade assignment more like trait inference/ancestral reconstruction: from a set of labeled tips to internal nodes and unlabeled tips.

Such a command will be particularly useful for bootstrapping reference trees for Nextclade datasets. However, there are many more use cases, e.g. getting lineages onto internal nodes in ncov.

Currently, the implementation is a stripped-down version of traits with confidence/entropy/model output removed. However, this is an implementation detail that will probably change, so it's best not to focus on that part.

It would be great to get feedback in general, but in particular on the following point: how should this command be included? Should it be a new subcommand (the way it's done right now, with the placeholder name clades2 eventually to be replaced by a better name), or should we put the functionality inside augur clades and gate it behind a --mode?

My gut preference is to make a new subcommand, as the input files to the command are quite different: it takes a metadata.tsv and a metadata column name instead of a nuc_mutations.json and a clades.tsv. However, we might also want to avoid a proliferation of new subcommands.
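To make the "metadata.tsv plus a column name" input concrete, here is a minimal sketch of turning such a file into a tip-to-clade mapping. The function name `read_tip_labels`, the `strain` ID column, and the toy rows are assumptions made for illustration; this is not code from the PR.

```python
import csv
import io


def read_tip_labels(handle, clade_column, id_column="strain"):
    """Return {tip name: clade} for rows that have a non-empty clade value.

    `id_column="strain"` is an assumed default, not taken from the PR.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    labels = {}
    for row in reader:
        clade = (row.get(clade_column) or "").strip()
        if clade:  # tips without a label are left for inference
            labels[row[id_column]] = clade
    return labels


# Toy metadata: tip2 has no clade and is therefore treated as unlabeled.
example = io.StringIO("strain\tclade\ntip1\tA.1.1\ntip2\t\ntip3\tA\n")
labels = read_tip_labels(example, clade_column="clade")
print(labels)
```

Only the labeled tips end up in the mapping; everything else is what the command would have to infer.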

Some limitations of the current implementation:

- clades can be non-monophyletic
- hierarchy information is not taken into account (i.e. the internal node at a junction of A, A.1.1 and A.1.2 will not be A.1 but one of the other three). In theory, hierarchy could be taken into account, but this can be added later
- the current implementation cannot deal with more than 300 clades. This limitation is easily removed by using ancestral reconstruction, e.g. parsimony, instead of the current mugration model
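The parsimony alternative and the hierarchy limitation can be illustrated with a small Fitch-style sketch. This is not the PR's implementation (which uses the mugration model) and not TreeTime code. The toy tree, the node names, and the `fitch` function are hypothetical, and the sketch assumes a bifurcating tree stored as a plain dict.

```python
# Hypothetical toy tree: parent name -> list of child names.
CHILDREN = {
    "root": ["n1", "tipA"],
    "n1": ["n2", "tip2"],
    "n2": ["tip1", "tipX"],
}
# Labeled tips only; tipX is deliberately left unlabeled.
TIP_LABELS = {"tipA": "A", "tip1": "A.1.1", "tip2": "A.1.2"}


def fitch(children, tip_labels, root):
    """Assign one parsimonious label per node (Fitch-style, two passes)."""
    candidates = {}

    def up(node):
        # Post-order pass: collect the candidate label set for each node.
        kids = children.get(node, [])
        if not kids:
            label = tip_labels.get(node)
            candidates[node] = {label} if label is not None else set()
            return candidates[node]
        # Subtrees without any labeled tip contribute nothing.
        sets = [s for s in [up(k) for k in kids] if s]
        if not sets:
            candidates[node] = set()
        else:
            common = set.intersection(*sets)
            candidates[node] = common if common else set.union(*sets)
        return candidates[node]

    assigned = {}

    def down(node, parent_label):
        # Pre-order pass: keep the parent's label where it is a candidate.
        options = candidates[node]
        if parent_label in options:
            assigned[node] = parent_label
        elif options:
            assigned[node] = sorted(options)[0]  # arbitrary, deterministic tie-break
        else:
            assigned[node] = parent_label  # unlabeled subtree inherits
        for kid in children.get(node, []):
            down(kid, assigned[node])

    up(root)
    down(root, None)
    return assigned


result = fitch(CHILDREN, TIP_LABELS, "root")
print(result)
```

In this toy case the node joining A.1.1 and A.1.2 can only receive one of those two labels, never A.1, because A.1 is not present on any tip. That is the hierarchy limitation described in the list above. The unlabeled tip simply inherits the label of its parent.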

Currently, to assign clades at internal nodes, we require a `clades.tsv`, which isn't a natural way to define clades in most cases. Rather, one often wants to infer clades from a set of labeled tips.

This PR adds a new subcommand `clades2` to infer clades at all tips and internal nodes from a subset of labeled tips.

Currently, the implementation is a stripped-down version of `traits` with confidence/entropy/model output removed. But this is an implementation detail that might change in the future.

Some limitations of the current approach:

- clades can be non-monophyletic
- hierarchy information is not taken into account (i.e. the internal node at a junction of A, A.1.1 and A.1.2 will not be A.1 but one of the other three)

In theory, hierarchy could be taken into account, but this may not be required for this command to be useful already.

The name is provisional; I couldn't immediately think of a good one, so I went for `clades2`.

victorlin · 2023-10-26T17:47:03Z

augur/clades2.py

(Starting a thread for command placement/naming)

I'm not a user of clades so I don't think my opinion matters much, but my inclination would be to make it a part of augur clades because the output file is the same. I don't think --mode is necessary since it can be inferred by the inputs. The argparse parser can be written to require either --mutations+--clades or --metadata+--clade-column. Example:

# "old" clades
augur clades \
  --tree tree.nwk \
  --mutations aa_muts.json nt_muts_small.json \
  --clades clades.tsv \
  --output-node-data clades.json

# "new" clades
augur clades \
  --tree tree.nwk \
  --metadata metadata.tsv \
  --clade-column clade \
  --output-node-data clades.json

That seems reasonable, @victorlin. We'd probably want separate argument groups to clearly show which arguments to use for which "mode".
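As a rough sketch of what "infer the mode from the inputs" with separate argument groups could look like in argparse: the flag names follow the example command in this thread, where `--metadata` and `--clade-column` are proposals rather than existing `augur clades` arguments. `build_parser`, `resolve_mode`, and the group titles are hypothetical names, and the parser is standalone rather than wired into augur.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="augur clades")
    parser.add_argument("--tree", required=True)
    parser.add_argument("--output-node-data", required=True)

    # One group per "mode" so --help shows which arguments belong together.
    sig = parser.add_argument_group("clades from mutation signatures")
    sig.add_argument("--mutations", nargs="+")
    sig.add_argument("--clades")

    meta = parser.add_argument_group("clades from labeled tips (proposed)")
    meta.add_argument("--metadata")
    meta.add_argument("--clade-column")
    return parser


def resolve_mode(parser, args):
    """Infer the mode from which inputs were given; exit on a mixed or partial set."""
    sig = [args.mutations is not None, args.clades is not None]
    meta = [args.metadata is not None, args.clade_column is not None]
    if all(sig) and not any(meta):
        return "signatures"
    if all(meta) and not any(sig):
        return "metadata"
    parser.error(
        "provide either --mutations and --clades, "
        "or --metadata and --clade-column"
    )


parser = build_parser()
args = parser.parse_args([
    "--tree", "tree.nwk",
    "--metadata", "metadata.tsv",
    "--clade-column", "clade",
    "--output-node-data", "clades.json",
])
mode = resolve_mode(parser, args)
print(mode)
```

argparse has no built-in way to say "exactly one of these two groups, and all of its members", so the check happens after parsing. `parser.error` prints the usage message and exits with status 2.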

Good catch that no mode is necessary and one can infer it from the input files. One disadvantage of lumping the two modes together in one command is that the command becomes more complicated for end users to understand. There still are two modes: they share some arguments and capabilities but not all, and it is not obvious which are shared and which aren't.

I'm not sure what the rationale is for lumping if there isn't much shared code/logic. What's the advantage of having it be under one command? I'm open either way, just not fully convinced yet of lumping.

I don't think shared code/logic is necessary to bundle under one command. In trying to imagine myself as a user, my thinking was that this would be an expansion of augur clades's objective from

Assign clades to nodes in a tree based on amino-acid or nucleotide signatures

to

Assign clades to nodes in a tree based on either amino-acid/nucleotide signatures or clades from metadata

In other words, there is a shared objective.

The point on few shared arguments might matter more if there are clades-specific customizations that are available in one and not the other. This doesn't include args like --metadata-id-columns, which is unrelated to clades. As far as I can tell, besides the input arguments, I don't think there are any clades-specific arguments to either command.

Bundling also makes it more likely that any feature additions to one are added to both.

huddlej · 2023-10-26T18:08:57Z

@corneliusroemer Can you say a little more about why the current augur traits command isn't a good solution to the problem? From this PR, it looks like some of the major differences between the proposed new command, clades, and traits are:

- clades2 provides an option to define the output attribute name, which traits lacks and which clades provides through --membership-name
- neither clades2 nor traits provides the branch attribute annotations that clades provides on the first internal node for each distinct clade
- neither clades2 nor traits provides an argument to set clade labels, which clades provides through the --label-name argument

Maybe another question is how important it is for the proposed new interface to exactly match the functionality provided by clades or traits. I could see value in providing confidence values for clade assignments in the same way that traits provides. I also see value in providing branch attributes in Auspice, so users get human-readable branch labels.
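For illustration, branch labels can be derived from per-node clade assignments by labeling each internal node whose clade differs from its parent's. This is a sketch only. The function `branch_labels`, the toy tree, and the exact `nodes`/`branches` layout of the returned dict are assumptions made for this example rather than code taken from augur.

```python
def branch_labels(children, clade_of, root, label_name="clade"):
    """Build a node-data-like dict with a label where a clade first appears."""
    nodes = {name: {"clade_membership": clade} for name, clade in clade_of.items()}
    branches = {}
    stack = [(root, None)]
    while stack:
        node, parent_clade = stack.pop()
        clade = clade_of.get(node)
        kids = children.get(node, [])
        # Label internal nodes only, where the clade changes along the branch.
        if kids and clade is not None and clade != parent_clade:
            branches[node] = {"labels": {label_name: clade}}
        for kid in kids:
            stack.append((kid, clade))
    return {"nodes": nodes, "branches": branches}


# Hypothetical inputs: a toy tree plus clades already assigned to every node.
children = {"root": ["n1", "tipA"], "n1": ["tip1", "tip2"]}
clade_of = {"root": "A", "n1": "A.1.1", "tip1": "A.1.1", "tip2": "A.1.2", "tipA": "A"}
node_data = branch_labels(children, clade_of, "root")
print(node_data["branches"])
```

Because clades inferred from tips can be non-monophyletic, a rule like this can emit the same label on several branches, and a clade that appears only on a single tip gets no label at all. Both cases would need a decision before this matches what users expect from augur clades.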

corneliusroemer · 2023-10-26T18:45:02Z

Thanks for the good questions @huddlej!

It's true that one could use traits to do what the command does in the current state of the PR, but we will eventually want to use a different treetime function under the hood (ancestral reconstruction rather than mugration) and that means we can't use traits anymore - unless one were to make clades inference essentially a separate traits mode altogether.

Adding branch label functionality makes sense, but it's not essential for the main use cases I've thought of. The same goes for confidence: it would be nice to have if possible, but depending on the algorithm used for inference, we might not get it (e.g. Fitch/parsimony won't give you confidence).

huddlej · 2023-12-01T21:25:08Z

Thanks, @corneliusroemer. I see now how different this logic needs to be from augur traits. We recently discussed using Nextclade to assign clades in the seasonal flu workflows, too, which would require this kind of functionality you've proposed.

I agree with @victorlin's assessment in the comments above that placing this new functionality in augur clades conveys the shared objective of the command regardless of the different input modes. Using the same subcommand name also suggests that the command outputs will behave consistently across different modes. For example, annotating branch labels is a key feature of the current augur clades and we will need that feature for flu builds. On the other hand, "confidence" does not exist as a clades output that people depend on, so I like the idea of not including that in outputs unless it is necessary.

Do you think @victorlin's example interface would meet your needs as a user, @corneliusroemer? Is there anything you'd change from the UI perspective? We could chat about this synchronously any time you'd like, too, if that's easier than GitHub comments...

corneliusroemer requested a review from a team October 26, 2023 17:05

victorlin reviewed Oct 26, 2023

huddlej mentioned this pull request Nov 28, 2023

Use Nextclade to assign clade labels in main phylogenetic workflow nextstrain/seasonal-flu#131

Open
