Skip to content

5. Workflow for Salamzade et al. 2021

Rauf Salamzade edited this page Jul 16, 2021 · 1 revision

This page documents the technical details behind the workflow to find geographic signatures, some of which contained carbapenemases, as described in Salamzade et al 2021.

1. Gather Set of Samples and Associated Sequencing Data

For the study, we included 608 Enterobacteriales samples collected from four hospitals (three in Boston, MA and one in Orange, CA) until the end of 2016. The majority of assemblies for these samples were constructed using short-read sequencing technology alone but prepared with both frag and jumping libraries to achieve high quality draft genomes. A subset of the samples were previously part of the study described in Cerqueira et al 2017, including 8 samples which had PacBio assemblies available. From preliminary analysis, we also developed an interest in some additional samples, and using Nanopore sequencing in combination with Illumina, constructed hybrid genomic assemblies for 4 of the samples.

2. Predict Scaffolds Representative of Plasmids in Assemblies

We predicted which scaffolds in our sample assemblies represented plasmids using MOB-suite, a recently developed tool by Robertson and Nash 2018. All scaffolds which were identified as belonging to an established plasmid group or categorized as potentially representing a novel plasmid group, provided they were shorter than 500 kb in length, were included in our analysis as predicted plasmid scaffolds. The upper size restriction was relevant only for novel plasmid scaffolds, as MOB-suite did not confidently assign a scaffold larger than 500 kb to an established plasmid group.

Further, because we were interested in identifying conserved segments larger than 10 kb in length, we retained only plasmids scaffolds larger than 10 kb. In total we searched for conserved segments across 1,380 scaffolds.

3. Determine if any of the Plasmid Scaffolds Represent Completed Circular Plasmids

While Enterobacteriales have been observed to carry linear plasmids, the clear majority of their plasmids are circular. However, their representation in FASTA files requires them to be linearized, even when they have been sequenced to completion. This complicates searching for conserved segments which span the linearization break; however, we have developed ConSequences with the ability to overcome such challenges.

Thus, to leverage these features in the ConSequences suite, we first needed to identify which of the predicted plasmid scaffolds across sample assemblies represented completed circular plasmids. To do this, we developed a program inspired and borrowing from methodolgies developed by Jorgensen et al. 2014 and Kothari et al. 2018. The program findCircularScaffolds.py implements the methods described in these papers for predicting whether assembly scaffolds represent circular molecules or not.

We ultimately predicted whether each of 1,380 >= 10 kb plasmids were circular or likely to be circular using two criteria:

  1. Five or more paired frag reads bridging scaffold ends. One read aligns to scaffold start while the other aligns to scaffold end.
  2. Whether scaffold ends align by BLASTn analysis with HSP having E-value <= 1e-5.

The 95 scaffolds which met both criteria were classified with high-confidence as circular and trimmed on one end to account for criteria (2). Nearly 23% of the 1,380 plasmids met criteria (1) and were predicted to be likely circular. Because some of our assemblies might have been previously processed to trim scaffold ends, our estimate of circular scaffolds was conservative to avoid false positive detection of conserved segments.

4. Comprehensive Identification of Conserved Segments using Every Plasmid Scaffold as a Reference

We created a comprehensive multi-FASTA with all 1,380 plasmid scaffolds across all samples (trimmed where appropriate for scaffold end overlap). The format of headers in the multi-FASTA included both the sample and scaffold identifier separated by a | character (e.g. >Sample_1|Scaffold_3).

We then used delineateSegmentsOnReference.py to identify conserved segments across each of the plasmid scaffolds, providing the comprehensive multi-FASTA as the database of alternate scaffolds. Afterwards, the Segment_Results.txt file produced for each reference scaffold was concatenated into a single file simply using cat. Sequences for all conserved segments identified across all samples were extracted from the references they were delineated upon using extractAllSegmentSequences.py (provided in ConSequences/other/).

5. Remove Redundant Segments found on Different Reference Scaffolds

Because we found segments on each reference, there was a lot of redundancy. To remove analogous conserved segments, we used CD-HIT by Li and Godzik 2006:

cd-hit-est -T 16 -i all_segments.fasta -o collapsed -c 0.99 -G 1 -g 1 -d 0 -n 10 -M 20000 -aL 0.95 -aS 0.95

Next, we used selectRepresentativeSegments.py (provided in ConSequences/other/) to parse the CD-HIT results and select representative segments from clusters of nearly equivalent segments. We ended up with 4,605 conserved segments.

We also used CD-HIT (specifically cd-hit-est-2d) to align all representative segments against each other and infer whether they exhibited substantial overlap (>= 10 kb) with or nesting within each other.

6. Filtering to Identify Multi-Species Geographic Signatures

We first applied filtering to retain only segments identified across multiple species, which provided the length and high sequence conservation, suggests they are representative of recent horizontal transfer. Of the 4,605 conserved segments, 2,249 segments met this criteria.

Next, we started a more intricate filtering to identify segments which are strongly associated with the cities our isolates were collected in, either Boston, MA or Orange, CA. We began by filtering for segments which were found exclusively in isolates from a single city in the context of our study. Next, we screened the sequences of passing segments against NCBI’s nt database (downloaded in December 2019), using thresholds of 98% identity and 95% query coverage. Hits matching sequences in NCBI from samples sourced in the same city were retained as potential geographic signatures. The script gatherGeographicInformationForSamples.py (located in Consequences/other/), utilized Biopython's Entrez functionalities and geopy, was used to query the collection information of NCBI hits for segments. After applying this filter we retained only 86 potentially geographic signatures.

Because of the modest number of segments retained at this stage, we took the opportunity to perform a computationally intensive but extremely informative test to assess how reliable the assembly of each segment is. This test was focused on testing whether there was evidence that segments could contain tandem repeats, which are difficult to assess the proper copy count of. These segments could be easily mis-assembled when working with draft Illumina assemblies and thus lead to the faulty identification of geographic signatures. Therefore, we checked whether any of the potential geographic signatures contained such tandem repeats using Pilon by Walker et al. 2014. Fragment library sequencing reads of each isolate found to harbor a signature were aligned to the signature’s reference sequence using bwa mem (version 0.7.17 with default settings) by Li and Durbin 2010. Pilon (version 1.23) was then run with the options: --vcf --fix all,breaks --mindepth 5 to check if the TandemRepeat flag was set off, which would indicate that the segment likely contained a tandem repeat motif and should be removed. This led to the removal of 17 segments, leaving a set of 69 geographic signature sequences present in high-quality regions of our assemblies.

Finally, we next performed an extremely intensive comprehensive assessment of uniqueness for each of the signatures using BIGSI by Bradley et al. 2019. We used the tool to search against all raw sequencing read sets in a snapshot of the ENA/SRA database taken on December 2016, aligning with the collection date of the most recently isolates samples included in this analysis. The SRA/ENA snapshot provided a broader database (455,632 read sets) compared to our original screening against NCBI’s nt database, which included only a subset of assemblies available in NCBI’s Nucleotide database and did not account for bacterial samples with sequencing data but no assembly. Because BIGSI was not intended for query sequences longer than 10 kb, to perform this search, we used a sliding window (2 kb window; 1 kb step) across each signature to identify readsets containing at least 99% of all k-mers for each window. Matching readsets were downloaded from EBI’s ENA database and further searched using a k-mer based methodology (described in the next subsection) to more stringently assess whether they harbored any of the geographic signature sequences. After this more stringent analysis, 181 sequencing readsets from 96 Biosamples isolates were predicted to harbor one or more of our 69 geographic signatures. 30 (31.3%) of these isolates corresponded to those from our own study. Another six isolates were from Brigham and Women’s Hospital as part of other studies, and could thus represent additional instances of the same BWH-specific geographic signatures. The remaining 60 samples, corresponding to 65 readsets, were from different or undisclosed geographic locations. The 22 signatures represented within these read sets were removed from our analysis because they were not geographically specific, resulting in a final set of 47 geographic signatures that passed our most stringent set of filters for strong geographic association, never being observed in sequencing data collected from alternate locations.

7. Search and Vet Additional Instances of Signatures in Sample Reads

Because we used draft assemblies, it was possible that we missed instances of geographic signatures because of plasmid assembly fragmentation. We thus searched for additional instances of the 47 multi-species geographic signatures identified in the raw sequencing readsets for the 608 isolates which were part of our study. To do so, we first checked for the presence of the signature in an isolate’s sequencing data using the generateReferenceMSA.py and querySegmentInRawReads.py programs in ConSeqeuences. By doing so, we identified 52 potential new instances of the motif within the readsets of our samples.

To understand why these instances were not initially picked up, we further validated the 52 additional signature instances identified by using BLASTn to align each to the draft assembly of the sample the representative sequence was delineated upon. Here is the breakdown of what we observed:

  • 34 (65.4%) instances aligned to the end of assembly scaffolds.
  • 6 (11.5%) instances were found in chromosomal scaffolds at >98% identity and >95% coverage.
  • 11 (21.2%) instances had parts fully embedded within a scaffold (identity > 98%, 25% < coverage < 95%) or had no significant alignment (identity > 98%, signature coverage < 25%) to a sample’s assembly.

To prevent potentially including false positive instances into our analysis, we discarded the 11 instances listed last.

After adding new instances found from searching the sample reads directly, we checked if smaller signatures nested within larger signatures were found in the same set of isolates now had identical conservation profiles. If this was the case, we kept the larger of the two signatures. In the end, we found a final set of 44 multi-species geographic signatures. Of interest to our study, six of these signatures carried the carbapenemase blaKPC.