Releases: nextstrain/nextclade_data

2022-07-12

22 Jul 22:02

SARS-CoV-2

New dataset version (tag 2022-07-12T12:00:00Z)

  • Fix: In the previous release, BA.2.75 lacked its characteristic S:R493Q reversion; this is now fixed. This is the only change; otherwise this dataset is identical to 2022-07-11T12:00:00Z.

2022-07-11

12 Jul 14:49

SARS-CoV-2

New dataset version (tag 2022-07-11T12:00:00Z)

  • Pango lineages: In this release, Nextclade can assign Pango lineages up to BA.2.75.
  • Alignment params: The retry reverse complement flag is now set to true, so that the reverse complement of a sequence is tried if seed matching fails.
  • Fixes: Some synthetic Pango lineage sequences had wrong mutations; this is now fixed through a manually curated override file.
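
The retry behaviour described in the alignment bullet above can be illustrated with a minimal sketch. This is not Nextclade's implementation: the toy seed matcher, the function names and the parameters `seed_len`, `step` and `min_hits` are all assumptions made for illustration only.

```python
# Illustrative sketch only: a toy seed matcher, not Nextclade's algorithm.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_seed_hits(query: str, reference: str, seed_len: int = 8, step: int = 4) -> int:
    """Count seeds (short exact substrings of the query) that occur in the reference."""
    starts = range(0, len(query) - seed_len + 1, step)
    return sum(1 for i in starts if query[i:i + seed_len] in reference)


def orient_query(query: str, reference: str, min_hits: int = 2,
                 retry_reverse_complement: bool = True):
    """Return (sequence, was_reversed), or None if seed matching fails.

    With retry_reverse_complement=True, a query that fails seed matching
    is tried once more in the opposite orientation.
    """
    if count_seed_hits(query, reference) >= min_hits:
        return query, False
    if retry_reverse_complement:
        flipped = reverse_complement(query)
        if count_seed_hits(flipped, reference) >= min_hits:
            return flipped, True
    return None
```

In this sketch, a sequence submitted in the wrong orientation is rescued by the retry, whereas with the flag off it is simply reported as unmatched.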

2022-06-29

MPXV B.1

New dataset version (tag 2022-06-29T12:00:00Z)

  • Increased number of B.1 samples from ~100 to ~200 to improve phylogenetic placement of analyzed 2022 outbreak sequences

2022-06-14

16 Jun 16:51
ba3688a

3 Monkeypox (MPXV) datasets introduced

Three MPXV datasets are added, each at a different zoom level:

  • MPXV (All clades)
  • hMPXV-1 (part of clade 3, source of 2017/2018/2022 outbreaks)
  • hMPXV-1 B.1 (2022 outbreak lineage)

All three use the coordinate system of the recently designated NCBI Monkeypox reference sequence NC_063383 (MPXV-M5312_HM12_Rivers).

However, SNPs from two different reference sequences are added to the "all clades" and B.1 datasets to reduce the total number of mutations.

The B.1 dataset uses SNPs of ON563414.3 (MPXV_USA_2022_MA001) on top of a NC_063383 backbone.

The "all clades" build uses the SNPs of a reconstructed ancestral MPXV sequence that is the inferred most recent common ancestor of clades 1, 2 and 3, rooted with a Cowpox outgroup.
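
To illustrate why placing SNPs on a fixed backbone keeps coordinates stable while reducing the number of reported mutations, here is a minimal sketch. The helper names and the toy sequences are hypothetical; the actual dataset build may work differently.

```python
def apply_snps(backbone: str, snps: dict[int, str]) -> str:
    """Apply substitutions (1-based position -> new base) to a backbone.

    Only substitutions are applied, never insertions or deletions, so the
    result keeps the length and coordinate system of the backbone.
    """
    seq = list(backbone)
    for pos, base in snps.items():
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position {pos} is outside the backbone (length {len(seq)})")
        seq[pos - 1] = base
    return "".join(seq)


def count_differences(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Toy data (hypothetical, for illustration only).
backbone = "ACGTACGTAC"              # stands in for the coordinate reference
lineage_snps = {3: "A", 8: "C"}      # SNPs that define the lineage of interest
adapted = apply_snps(backbone, lineage_snps)
sample = "ACATACGCAT"                # a sample from that lineage, one extra SNP

mutations_vs_backbone = count_differences(backbone, sample)
mutations_vs_adapted = count_differences(adapted, sample)
```

Because the adapted reference already carries the lineage-defining SNPs, the sample differs from it at fewer sites than from the plain backbone, while every position still refers to the same backbone coordinate.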

Only the MPXV (All clades) dataset can assign all three clades (1, 2 and 3).
The hMPXV-1 dataset can be used if all viruses are from hMPXV-1.
The B.1 dataset is useful for 2022 outbreak sequences but will not be able to assign anything but B.1 lineages.

Gene annotations follow the annotation used by NC_063383 and are of the form OPG001 (for OrthoPox Gene 001).
Since the alignment reference is always in NC_063383 coordinates, nucleotide and protein mutation positions should usually be identical in alignments done with all three datasets.

Quality control parameters are subject to change, especially since "known" frame shifts and stop codons have not been annotated. For example, clade 1 sequences will always show around 7 frame shifts, yet these do not indicate quality problems.

New dataset version (tag 2022-06-14T12:00:00Z)

SARS-CoV-2

  • Pango lineages: New lineages added up to pango-designation release v1.9 and beyond are now included, among others BA.5.1-BA.5.3, BA.2.35-BA.2.48 and XV-XY.

2022-04-28

28 Apr 20:10

New dataset version (tag 2022-04-28T12:00:00Z)

SARS-CoV-2 (with and without recombinants)

  • Pango lineages: New lineages added up to pango-designation release v1.8 are now included, among others BA.3.1, BA.2.14-BA.2.34 and XT-XU (in the default build; recombinants are excluded from the special "without recombinants" dataset).
  • Clades: New Nextstrain clades included. BA.4 is 22A (Omicron), BA.5 is 22B (Omicron) and BA.2.12.1 is 22C (Omicron).

2022-04-08

12 Apr 13:06

New dataset version (tag 2022-04-08T12:00:00Z)

SARS-CoV-2 (with and without recombinants)

  • Pango lineages: New lineages added up to pango-designation release v1.4 are now included, among others BA.4-5, BA.2.9-BA.2.13 and XM-XS (in the default build; recombinants are excluded from the special "without recombinants" dataset). For now, BA.4-5 are included in Nextstrain clade 21L, together with BA.2, which is the most similar Omicron.
  • Reference tree: The first 100 and last 200 sites (relative to the Wuhan reference) are now masked in the reference tree to reduce noise from sites such as 21 that were artifactually polymorphic.
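
As a rough illustration of the masking described in the reference tree bullet above, the sketch below replaces terminal sites of a reference-aligned sequence with N. The function name and the use of N as the mask character are assumptions; the actual tree-building workflow may mask sites in a different way.

```python
def mask_terminal_sites(aligned: str, n_start: int = 100, n_end: int = 200,
                        mask_char: str = "N") -> str:
    """Mask the first n_start and last n_end sites of a reference-aligned sequence.

    The sequence must already be in reference coordinates, so that index i
    corresponds to reference site i + 1.
    """
    if n_start < 0 or n_end < 0:
        raise ValueError("n_start and n_end must be non-negative")
    if n_start + n_end > len(aligned):
        raise ValueError("masked regions are longer than the sequence")
    middle = aligned[n_start:len(aligned) - n_end]
    return mask_char * n_start + middle + mask_char * n_end
```

With the defaults, site 21 (index 20) falls inside the masked start region, so a spurious difference there can no longer influence placement in this toy setting.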

2022-03-31

31 Mar 14:33

New dataset version (tag 2022-03-31T12:00:00Z)

SARS-CoV-2 (with and without recombinants)

  • Pango lineages: New lineages added up to pango-designation release v1.2.137 are now included, among others BA.1.18-19, BA.2.4-BA.2.8 and XG-XK (in the default build; recombinants are excluded from the special "without recombinants" dataset).
  • Dataset: The sampling of sequences has changed slightly. Previously, every Nextstrain clade got around 30 random sequences belonging to that clade, which caused quite a bit of movement between releases. This is no longer the case, and the tree is thus slightly smaller. The change is most noticeable for small Nextstrain clades like 20F.

2022-03-24

24 Mar 23:35

New dataset version (tag 2022-03-24T12:00:00Z)

SARS-CoV-2

  • Recombinants: Recombinant Pango lineages are now included in the reference tree. Each recombinant is attached to the root node so as not to spawn false internal nodes in the tree that would attract bad sequences. As long as recombinants do not qualify for a Nextstrain clade, they receive the placeholder clade name recombinant. Pango lineages are provided if present. Beware that new, unnamed recombinants with similar donors but slightly different breakpoints will attach to existing recombinants in the reference tree and thus get a wrong Pango lineage. A number of reversions and labeled mutations is a sign that you may have a similar but different recombinant.
  • Pango lineages: In this release, Nextclade can assign Pango lineages up to pango-designation release v1.2.133, featuring Omicron recombinants like XD, XE and XF.
  • QC: qc.json was updated with the most common stop codons and frameshifts that appear to be real and not artefacts (in ORFs 3a, 6, 7a, 7b, 8 and 9b).
  • QC: virus_properties.json was updated and now contains more mutations that are common in 21K, which should help identify recombinants.

SARS-CoV-2 without recombinants

  • New dataset: Now that recombinants are included in the default SARS-CoV-2 tree, it is no longer easy to identify breakpoints and donors of new recombinants if they attach to existing recombinants on the tree. To facilitate the analysis of potential new recombinants, we have added a dataset named "SARS-CoV-2 without recombinants". It does not include recombinants and can thus be used for recombinant analysis in the same way as before recombinants were included. This dataset should only be used for recombinant analysis; it will receive less attention than the main (default) SARS-CoV-2 dataset.
  • Pango lineages: In this release, Nextclade can assign Pango lineages up to pango-designation release v1.2.133, except recombinants (lineages starting with X).

2022-03-14

24 Mar 23:34

New dataset version (tag 2022-03-14T12:00:00Z)

SARS-CoV-2

  • Pango lineages: Nextclade now assigns each sequence a Pango lineage, similar to how clades are assigned. The result is shown in both the web interface and the tsv/csv output (column Nextclade_pango). The classifier is about 98% accurate for sequences from the past 12 months. Older lineages are deprioritised, and accuracy is thus worse for them. Read more about the method and validation against pangoLEARN and UShER in this report: Nextclade as pango lineage classifier: Methods and Validation.
  • Pango lineages: In this release, Nextclade can assign Pango lineages up to pango-designation release v1.2.132, featuring lineages like BA.2.3, BA.1.17 and BA.1.1.16.
  • Reference tree: Every Pango lineage that is sampled into the tree gets a synthetic sequence chosen to represent a hypothetical common ancestor of the lineage, according to the sequences listed as members in the pango-designation repo.
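
For readers who want to use the new Nextclade_pango column programmatically, the following minimal sketch tallies lineages from tsv output. It assumes a tab-separated file with a header row; the helper name and the sample rows are hypothetical, and only the column name Nextclade_pango is taken from the notes above.

```python
import csv
import io
from collections import Counter


def count_pango_lineages(tsv_text: str, column: str = "Nextclade_pango") -> Counter:
    """Count how often each lineage appears in the given column of TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in TSV header")
    # Skip rows where the lineage is empty or missing.
    return Counter(row[column] for row in reader if row[column])


# Hypothetical sample rows, for illustration only.
sample_tsv = (
    "seqName\tclade\tNextclade_pango\n"
    "sample1\t21L (Omicron)\tBA.2\n"
    "sample2\t21K (Omicron)\tBA.1.17\n"
    "sample3\t21L (Omicron)\tBA.2\n"
    "sample4\t21L (Omicron)\t\n"
)
lineage_counts = count_pango_lineages(sample_tsv)
```

To process a real output file, read its contents into a string first and pass that to the function; rows without a lineage are ignored rather than counted as an empty name.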

2022-02-07

07 Feb 13:08

New dataset version (tag 2022-02-07T12:00:00Z)

SARS-CoV-2

  • Reference tree: Updated with new data. A new algorithm for choosing how many sequences of each Pango lineage to include improves coverage of common and recent lineages. Every Pango lineage that is included gets one relatively basal (early) sequence to keep the number of false positive reversions down.

2022-01-18

24 Jan 21:32

New dataset version (tag 2022-01-18T12:00:00Z)

  • Backwards incompatibility(!): New datasets no longer work with Nextclade versions before 1.10.0. To use new datasets, you must update Nextclade.
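
A script that fetches new datasets may want to guard against an outdated Nextclade. The sketch below compares version strings numerically; the function names are hypothetical, and it assumes plain MAJOR.MINOR.PATCH versions without pre-release suffixes.

```python
MIN_VERSION = (1, 10, 0)  # first Nextclade version that works with new datasets


def parse_version(version: str) -> tuple[int, ...]:
    """Parse 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into a tuple of ints."""
    parts = [int(part) for part in version.strip().lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '1.10'
    return tuple(parts)


def supports_new_datasets(version: str) -> bool:
    """Return True if this Nextclade version can use datasets from this release on."""
    return parse_version(version) >= MIN_VERSION
```

Comparing integer tuples rather than raw strings matters here: as text, "1.9.1" sorts after "1.10.0", which would wrongly accept an incompatible version.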

SARS-CoV-2

  • Files: added virus_properties.json containing common mutations per clade
  • QC: higher penalty for private mutations that are reversions or common in other clades
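
The idea behind the higher penalty can be sketched as a weighted count of private mutations. Everything below is an assumption made for illustration: the weight values, the class and function names, and the way mutations are classified are hypothetical and are not taken from qc.json or from Nextclade's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateMutationWeights:
    """Hypothetical weights; the real values live in the dataset's QC config."""
    reversion: float = 6.0              # private mutation back to the reference state
    common_in_other_clade: float = 4.0  # private mutation typical of another clade
    other: float = 1.0                  # any other private mutation


def weighted_private_mutations(private_mutations, reversions, common_in_other_clades,
                               weights: PrivateMutationWeights = PrivateMutationWeights()) -> float:
    """Sum private mutations, weighting suspicious categories more heavily."""
    total = 0.0
    for mutation in private_mutations:
        if mutation in reversions:
            total += weights.reversion
        elif mutation in common_in_other_clades:
            total += weights.common_in_other_clade
        else:
            total += weights.other
    return total
```

Under such a scheme, a sequence with a few reversions or mutations typical of another clade scores worse than one with the same number of ordinary private mutations, which is the pattern expected from contamination or recombination.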

Influenza

  • Files: A stub virus_properties.json was added for compatibility with the new Nextclade version 1.10.0