pangolin not outputting an expected lineage in simulation experiment #546
Comments
If it helps, even Nextclade (at https://clades.nextstrain.org) doesn't classify your sequence as 21E but as 20B. I don't know why that would be the case since you do seem to have put the 21E clade's variants into the sequence, but this then becomes a question to ask at https://github.com/nextstrain/ncov.
Yes, that's right. It rather seems to be a problem related to Nextstrain. Thanks for your insight!
Hi @huzuner, thanks for sharing your example 21E. I think the issue, probably for both pangolin and nextclade, is that your 21E sequence has only a subset of the many mutations found in most 21E sequences. It looks like your sequence has these mutations (relative to the NC_045512.2 reference):
However, in the UShER tree, these mutations are found on the path to the branch with 21E, i.e. they are present in most 21E sequences:
So your sequence is mostly reference, with a small subset of the mutations actually found in 21E. The clades.tsv file (https://github.com/nextstrain/ncov/blob/master/defaults/clades.tsv) generally has only enough mutations to distinguish the Nextstrain clades from each other, and Nextstrain's augur pipeline identifies clades on a tree using an algorithm that expects only a subset of the mutations as input. But an actual 21E sequence really should have many more mutations than just the ones in clades.tsv. I believe nextclade does not use clades.tsv when assigning clades to sequences; it uses a tree, similar to pangolin/usher.
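To make the "small subset" point concrete, here is a minimal sketch of how one might measure how many of a lineage's expected mutations a simulated sequence actually carries. Everything in it is a toy assumption: the 10-base reference, the sample, and the expected mutation set are made up for illustration and are not real 21E or UShER data, and `list_substitutions` and `coverage` are hypothetical helper names, not part of pangolin, nextclade, or UShER.

```python
def list_substitutions(reference, sample):
    """Return substitutions such as 'C2T' (1-based) between two aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return {
        f"{ref_base}{pos}{alt_base}"
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base and alt_base in "ACGT"  # skip N and gap characters
    }


def coverage(sample_mutations, expected_mutations):
    """Fraction of the expected mutations that are present in the sample."""
    if not expected_mutations:
        return 0.0
    return len(sample_mutations & expected_mutations) / len(expected_mutations)


# Toy data only: NOT real SARS-CoV-2 positions.
toy_reference = "ACGTACGTAC"
toy_sample = "ATGTACGTAG"
toy_expected = {"C2T", "G3A", "A5G", "T8C", "C10G"}  # hypothetical path to a lineage

sample_mutations = list_substitutions(toy_reference, toy_sample)
missing = sorted(toy_expected - sample_mutations)
fraction = coverage(sample_mutations, toy_expected)
print(sorted(sample_mutations), missing, fraction)
```

A low fraction here would correspond to the situation described above: the sequence is mostly reference, so a tree-based placement has little reason to put it on the intended branch.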
Hi @AngieHinrichs, thank you so much for your detailed insight into my problem. I really appreciate it. That explains why my simulation experiments kept failing with both Pangolin and Nextclade predictions. Thanks again!
@corneliusroemer maintains a set of Pango lineage consensus sequences that should work much better with nextclade and pangolin: https://github.com/corneliusroemer/pango-sequences
Is there a resource like this for Nextclade/Nextstrain clades?
Hello,
I have been using Pangolin to evaluate my tool on a simulated dataset.
However, I cannot get the lineage that I expect from pangolin. I would be glad if you could have a look at why that could be the case.
I have reference sequences for each Nextclade clade that I generated myself by substituting, into the reference sequence, the nucleotides at the clade-defining positions listed in clades.tsv (https://github.com/nextstrain/ncov/blob/master/defaults/clades.tsv).
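For readers following along, the generation step described above can be sketched roughly as follows. This is not the script used in this issue: `apply_substitutions` is a hypothetical helper, the reference and substitutions are toy values rather than real clades.tsv entries, and it assumes each substitution is given as a 1-based nucleotide site plus the alternate base.

```python
def apply_substitutions(reference, substitutions):
    """Return a copy of `reference` with each (site, alt) substitution applied.

    `site` is 1-based, matching the usual genome coordinate convention.
    """
    bases = list(reference)
    for site, alt in substitutions:
        if not 1 <= site <= len(bases):
            raise ValueError(f"site {site} is outside the reference (length {len(bases)})")
        bases[site - 1] = alt
    return "".join(bases)


# Toy data only: NOT real clade-defining mutations.
toy_reference = "ACGTACGTAC"
toy_substitutions = [(2, "T"), (10, "G")]

simulated = apply_substitutions(toy_reference, toy_substitutions)
print(f">toy_clade\n{simulated}")
```

A sequence built this way differs from the reference only at the listed sites, which is worth keeping in mind when interpreting how tree-based tools place it.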
Below is an example for 21E:
I then basically use e.g. 21E as input to pangolin with default parameters:
pangolin 21E.fasta --outfile 21E.tsv
Here is the resulting output:
taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,scorpio_notes,version,pangolin_version,scorpio_version,constellation_version,is_designated,qc_status,qc_notes,note
21E ,B.1.1.28,0.0,,,,,,PUSHER-v1.27,4.3,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.02,Usher placements: B.1.1.28(1/1)
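When checking many simulated sequences, it can help to read this output programmatically. The sketch below parses the result shown above with Python's standard `csv` module; note that the content is comma-separated even though the file was named 21E.tsv. The `read_calls` helper is a hypothetical name, and the output is embedded as a string here only to keep the example self-contained.

```python
import csv
import io

# The pangolin result from this issue, embedded so the example runs on its own.
PANGOLIN_OUTPUT = (
    "taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,"
    "scorpio_conflict,scorpio_notes,version,pangolin_version,scorpio_version,"
    "constellation_version,is_designated,qc_status,qc_notes,note\n"
    "21E ,B.1.1.28,0.0,,,,,,PUSHER-v1.27,4.3,0.3.19,v0.1.12,False,pass,"
    "Ambiguous_content:0.02,Usher placements: B.1.1.28(1/1)\n"
)


def read_calls(handle):
    """Map each taxon (whitespace-stripped) to its (lineage, note) pair."""
    calls = {}
    for row in csv.DictReader(handle):
        calls[row["taxon"].strip()] = (row["lineage"], row["note"])
    return calls


calls = read_calls(io.StringIO(PANGOLIN_OUTPUT))
for taxon, (lineage, note) in calls.items():
    print(f"{taxon}: {lineage} ({note})")
```

To read a real file instead, one could pass `open("21E.tsv", newline="")` to `read_calls` in place of the `io.StringIO` object.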
According to the resource for mapping Pango lineages to Nextclade nomenclature, B.1.1.28 corresponds to 20B: https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/
Do you have any clue what could be the reason that Pangolin does not output a Pango lineage that corresponds to 21E?
Thank you in advance for your insights!
Best,
huzuner