Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: New clades subcommand that works like traits, using labeled tips rather than clades.tsv #1329

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

corneliusroemer
Copy link
Member

Currently, to assign clades at internal nodes, we require a clades.tsv listing defining mutations for each clade which isn't a natural way to define clades in many cases: often people define clades through representative strains.

This draft PR is an attempt to offer an alternative clades command that does clade assignment more like trait inference/ancestral reconstruction: from a set of labeled tips to internal nodes and unlabeled tips.

Such a command will be particularly useful for bootstrapping reference trees for Nextclade datasets. However, there are many more use cases, e.g. getting lineages onto internal nodes in ncov.

Currently, the implementation is a stripped down version of traits with confidence/entropy/model output removed. However this is an implementation detail that will probably change - so it's best not to focus on that part.

It would be great to get feedback in general, but in particular on the following points: How should this command be included? Should it be a new subcommand - the way it's done right now with the place holder name clades2 eventually replaced by a better name, or should we put the functionality inside augur clades and gate it behind a --mode?

My gut preference is to make a new subcommand as the input files to the command are quite different: taking a metadata.tsv and a metadata column name instead of a nuc_mutations.json and a clades.tsv - but we might also want to avoid proliferation of new subcommands.

Some limitations of the current implementation:

  • clades can be non-monophyletic
  • hierarchy information is not taken into account (i.e. the internal node at a junction of A, A.1.1 and A.1.2 will not be A.1 but one of the other three). In theory, hierarchy could be taken into account but this can be added later
  • the current implementation cannot deal with more than 300 clades, this limitation is easily removed by using ancestral reconstruction, e.g. parsimony, instead of the current mugration model

Currently, to assign clades at internal nodes, we require a `clades.tsv`
which isn't a natural way to define clades in most cases

Rather, one often wants to infer clades from a set of labeled tips

This PR adds a new subcommand `clades2` to infer clades at all tips and internal nodes
from a subset of labeled tips

Currently, the implementation is a stripped down version of `traits`
with confidence/entropy/model output removed
But this is an implementation detail that might change in the future

Some limitations of the current approach:
- clades can be non-monophyletic
- hierarchy information is not taken into account (i.e. the internal node at a junction of A, A.1.1 and A.1.2 will not be A.1 but one of the other three)

In theory, hierarchy could be taken into account but this may not be required for this command to be useful already.

The name is provisional, I couldn't immediately think of a good one so I went for `clades2`.
@corneliusroemer corneliusroemer requested a review from a team October 26, 2023 17:05
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Starting a thread for command placement/naming)

Copy link
Member

@victorlin victorlin Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a user of clades so I don't think my opinion matters much, but my inclination would be to make it a part of augur clades because the output file is the same. I don't think --mode is necessary since it can be inferred by the inputs. The argparse parser can be written to require either --mutations+--clades or --metadata+--clade-column. Example:

# "old" clades
augur clades \
  --tree tree.nwk \
  --mutations aa_muts.json nt_muts_small.json \
  --clades clades.tsv \
  --output-node-data clades.json

# "new" clades
augur clades \
  --tree tree.nwk \
  --metadata metadata.tsv \
  --clade-column clade \
  --output-node-data clades.json

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable, @victorlin. We'd probably want separate argument groups to clearly show which arguments to use for which "mode".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch that no mode is necessary and one can infer from input files. One disadvantage of lumping the two modes together in one command is that the command becomes more complicated to understand for end users. There stilll are two modes, they just share a few arguments and capabilities but not all and which are shared and which aren't is not obvious.

I'm not sure what the rationale is for lumping if there isn't much shared code/logic. What's the advantage of having it be under one command? I'm open either way, just not fully convinced yet of lumping.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think shared code/logic is necessary to bundle under one command. In trying to imagine myself as a user, my thinking was that this would be an expansion of augur clades's objective from

Assign clades to nodes in a tree based on amino-acid or nucleotide signatures

to

Assign clades to nodes in a tree based on either amino-acid/nucleotide signatures or clades from metadata

In other words, there is a shared objective.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point on few shared arguments might matter more if there are clades-specific customizations that are available in one and not the other. This doesn't include args like --metadata-id-columns, which is unrelated to clades. As far as I can tell, besides the input arguments, I don't think there are any clades-specific arguments to either command.

Bundling also make it more likely that any feature additions added to one are added to both.

@huddlej
Copy link
Contributor

huddlej commented Oct 26, 2023

@corneliusroemer Can you say a little more about why the current augur traits command isn't a good solution to the problem? From this PR, it looks like some of the major differences between the proposed new command, clades, and traits are:

  • clades2 provides an option to define the output attribute name that traits lacks and that clades provides through --membership-name
  • neither clades2 nor traits provide branch attribute annotations that clades provides on the first internal node for each distinct clade
  • neither clades2 nor traits provide an argument to set clade labels that clades provides through the --label-name argument

Maybe another question is how important it is for the proposed new interface to exactly match the functionality provided by clades or traits. I could see value in providing confidence values for clade assignments in the same way that traits provides. I also see value in providing branch attributes in Auspice, so users get human-readable branch labels.

@corneliusroemer
Copy link
Member Author

corneliusroemer commented Oct 26, 2023

Thanks for the good questions @huddlej!

It's true that one could use traits to do what the command does in the current state of the PR, but we will eventually want to use a different treetime function under the hood (ancestral reconstruction rather than mugration) and that means we can't use traits anymore - unless one were to make clades inference essentially a separate traits mode altogether.

Adding branch label functionality makes sense, but it's not essential for the main use cases I've thought of. Same for confidence, if possible nice to have but depending on the algorithm used for inference, we might not get confidence (e.g. Fitch/parsimony won't give you confidence).

@huddlej
Copy link
Contributor

huddlej commented Dec 1, 2023

Thanks, @corneliusroemer. I see now how different this logic needs to be from augur traits. We recently discussed using Nextclade to assign clades in the seasonal flu workflows, too, which would require this kind of functionality you've proposed.

I agree with @victorlin's assessment in the comments above that placing this new functionality in augur clades conveys the shared objective of the command regardless of the different input modes. Using the same subcommand name also suggests that the command outputs will behave consistently across different modes. For example, annotating branch labels is a key feature of the current augur clades and we will need that feature for flu builds. On the other hand, "confidence" does not exists as a clades output that people depend on, so I like the idea of not including that in outputs unless it is necessary.

Do you think @victorlin's example interface would meet your needs as a user, @corneliusroemer? Is there anything you'd change from the UI perspective? We could chat about this synchronously any time you'd like, too, if that's easier than GitHub comments...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants