Skip to content

4. Basic Tutorial

rsalamza edited this page Dec 3, 2020 · 3 revisions

On this page we describe the application of the ConSequences suite on finding conserved segments between 26 complete/circular plasmids from the hybrid illumina+nanopore assemblies of the 12 isolates noted to carry the blaKPC harboring geographic signature Sig5.1-CP in Pironti and Salamzade et al 2020. The tutorial should take slightly over half-an-hour from start to finish, much of which will be hands off-time.

The input files for this tutorial can be found in tutorial_data/ and the expected results can be found in the subdirectory results/ within it. While we recommend running ConSequences on all reference scaffolds and afterwards collapsing redundant segments found across multiple references using some tool like CD-HIT, we just use a single reference scaffold here for simplicity: The 417 kb plasmid from Enterobacter hormaechei MGH7 (mob-suite group 840).

Predicting Conserved Segments on Scaffold-2 from E. hormaechei MGH7

Run delineateSegmentsOnReference.py to predict segments along the reference scaffold:

python /path/to/ConSequences/bin/delineateSegmentsOnReference.py \
       --ref_fasta=/path/to/ConSequences/tutorial_data/MGH7_scaffold2.fasta \
       --query_multifasta=/path/to/ConSequences/tutorial_data/All_Plasmids_from_Twelve_Isolates.fasta \
       --ref_circular \
       --query_circular=/path/to/ConSequences/tutorial_data/Circuar_Plasmid_Scaffolds.txt \
       --scan_window_size=10000 \
       --scan_slide_step=100 \
       --matches_percentage=99 \
       --outdir=/path/to/Segments_on_MGH7_Scaffold2/

Using the time command in UNIX we find that the runtime is roughly ~28 minutes. Future updates will improve speed-ups to this step via incorporation of C++ code:

real    27m32.638s
user    24m21.528s
sys     3m10.910s

The resulting output directory should have the following files in it after running to completion:

Progress.log
Segment_Results.txt
Clustered_Windows.txt
Segment_Windows.txt
SWA_Results.txt
BLAST_results.txt
blast.err
blast.log
Processed_Input_FASTA/
Parameter_Inputs.txt

The main result file is called Segment_Results.txt:

Ente_cloa_MGH7|MGH7_2_v3|417086  193601  212800  Citr_freu_MGH279|MGH279_2_v3|322189, Citr_freu_MGH281c|MGH281c_2_v3|408045, Citr_freu_MGH281|MGH281_2_v3|322189, Citr_sp._MGH103|MGH103_2_v3|327514, Citr_sp._MGH105|MGH105_2_v3|322189, Ente_cloa_MGH7|MGH7_2_v3|417086, Kleb_pneu_MGH83|MGH83_2_v3|324410
Ente_cloa_MGH7|MGH7_2_v3|417086  212701  230900  Citr_freu_MGH279|MGH279_2_v3|322189, Citr_freu_MGH281c|MGH281c_2_v3|408045, Citr_freu_MGH281|MGH281_2_v3|322189, Citr_sp._MGH103|MGH103_2_v3|327514, Citr_sp._MGH105|MGH105_2_v3|322189, Ente_cloa_MGH7|MGH7_2_v3|417086, Kleb_pneu_MGH83|MGH83_2_v3|324410
Ente_cloa_MGH7|MGH7_2_v3|417086  230801  296300  Citr_freu_MGH279|MGH279_2_v3|322189, Citr_freu_MGH281c|MGH281c_2_v3|408045, Citr_freu_MGH281|MGH281_2_v3|322189, Citr_sp._MGH103|MGH103_2_v3|327514, Citr_sp._MGH105|MGH105_2_v3|322189, Ente_cloa_MGH7|MGH7_2_v3|417086, Kleb_pneu_MGH83|MGH83_2_v3|324410
.
.
.

Each line in the file corresponds to a distinct segment and consists of four columns (tab-delimited):

  1. Scaffold name
  2. Start position on segment on scaffold
  3. Stop position on segment on scaffold
  4. Comma separated list of alternate scaffolds which are predicted to have segment

Generate Reference Multiple Sequence Alignments for Segment 1

Run generateReferenceMSA.py to produce reference segment MSA:

python /path/to/ConSequences/bin/generateReferenceMSA.py \
       --ref_fasta=/path/to/ConSequences/tutorial_data/MGH7_scaffold2.fasta \
       --start_coord=193601 \
       --stop_coord=212800 \
       --mapping_scaffs="Citr_freu_MGH279|MGH279_2_v3|322189, Citr_freu_MGH281c|MGH281c_2_v3|408045, Citr_freu_MGH281|MGH281_2_v3|322189, Citr_sp._MGH103|MGH103_2_v3|327514, Citr_sp._MGH105|MGH105_2_v3|322189, Ente_cloa_MGH7|MGH7_2_v3|417086, Kleb_pneu_MGH83|MGH83_2_v3|324410" \
       --sliding_window_results=/path/to/Segments_on_MGH7_Scaffold2/SWA_Results.txt
       --msa_output=/path/to/Segment_1.msa

The output is just a reference-based MSA with variant information filled in from the BLAST analysis performed by delineateSegmentsOnReference.py.

This step also runs very quickly < 1 minute:

real    0m0.528s
user    0m0.338s
sys     0m0.056s

Check if Readset for K. pneumoniae MGH83 has Segment 1

Download illumina reads from SRA for Klebsiella pneumoniae MGH83 (one of the samples predicted to carry Segment 1 by delineateSegmentsOnReference.py):

# Note, this requires SRAToolkit - which is not a ConSequences dependency
# If you have the ConSequences Conda environment active however, you can simply install it using the following command
# conda install -c bioconda sra-tools
 
fastq-dump --outdir fastq --gzip --skip-technical  --readids --read-filter pass --dumpbase --split-3 --clip SRR2132036

Run querySegmentInRawReads.py to check if MGH83's reads show indication of having Segment 1:

python /path/to/ConSequences/bin/querySegmentInRawReads.py \
       --ref_msas=/path/to/Segment_1.msa \
       --references="Ente_cloa_MGH7|MGH7_2_v3|417086" \
       --illumina_reads /path/to/fastq/SRR2132036_pass_1.fastq.gz /path/to/fastq/SRR2132036_pass_2.fastq.gz \
       --output_prefix=MGH83_Segment1 \
       --min_depth=5 \
       --kmer_length=31

The final result file (tab-delimited) will be called MGH83_Segment1_result.txt and comprises of the following:

Reference MSA	Total k-mers missed	Total k-mers missed (ignoring starting/ending 100 bp of MSA)	Number of k-mers assessed
/path/to/Segments_on_MGH7_Scaffold2/Segment_1.msa	0	0	18972

A value of 0 for the "Total k-mers missed" column indicates that MGH83 likely has an assembly path in its reads for this segment which aligns with our expectations.

This last step is also relatively fast, taking < 3 minutes:

real    2m14.966s
user    2m1.846s
sys     0m10.931s