-
Notifications
You must be signed in to change notification settings - Fork 1
4. Basic Tutorial
On this page we describe the application of the ConSequences suite on finding conserved segments between 26 complete/circular plasmids from the hybrid illumina+nanopore assemblies of the 12 isolates noted to carry the blaKPC harboring geographic signature Sig5.1-CP in Salamzade et al 2021. The tutorial should take slightly over half-an-hour from start to finish, much of which will be hands off-time.
The input files for this tutorial can be found in tutorial_data/
and the expected results can be found in the subdirectory results/
within it. While we recommend running ConSequences on all reference scaffolds and afterwards collapsing redundant segments found across multiple references using some tool like CD-HIT, we just use a single reference scaffold here for simplicity: The 417 kb plasmid from Enterobacter hormaechei MGH7 (mob-suite group 840).
Run delineateSegmentsOnReference.py
to predict segments along the reference scaffold:
python /path/to/ConSequences/bin/delineateSegmentsOnReference.py \
--ref_fasta=/path/to/ConSequences/tutorial_data/MGH7_scaffold2.fasta \
--query_multifasta=/path/to/ConSequences/tutorial_data/All_Plasmids_from_Twelve_Isolates.fasta \
--ref_circular \
--query_circular=/path/to/ConSequences/tutorial_data/Circuar_Plasmid_Scaffolds.txt \
--scan_window_size=10000 \
--scan_slide_step=100 \
--matches_percentage=99 \
--outdir=/path/to/Segments_on_MGH7_Scaffold2/
Using the time
command in UNIX we find that the runtime is roughly ~28 minutes. Future updates will improve speed-ups to this step via incorporation of C++ code:
real 27m32.638s
user 24m21.528s
sys 3m10.910s
The resulting output directory should have the following files in it after running to completion:
Progress.log
Segment_Results.txt
Clustered_Windows.txt
Segment_Windows.txt
SWA_Results.txt
BLAST_results.txt
blast.err
blast.log
Processed_Input_FASTA/
Parameter_Inputs.txt
The main result file is called Segment_Results.txt
:
Ente_cloa_MGH7|MGH7_2_v3|417086 193601 212800 Citr_freu_MGH279|MGH279_2_v3|322189, Citr_freu_MGH281c|MGH281c_2_v3|408045, Citr_freu_MGH281|MGH281_2_v3|322189, Citr_sp._MGH103|MGH103_2_v3|327514, Citr_sp._MGH105|MGH105_2_v3|322189, Ente_cloa_MGH7|MGH7_2_v3|417086, Kleb_pneu_MGH83|MGH83_2_v3|324410
Ente_cloa_MGH7|MGH7_2_v3|417086 212701 230900 Citr_freu_MGH279|MGH279_2_v3|322189, Citr_freu_MGH281c|MGH281c_2_v3|408045, Citr_freu_MGH281|MGH281_2_v3|322189, Citr_sp._MGH103|MGH103_2_v3|327514, Citr_sp._MGH105|MGH105_2_v3|322189, Ente_cloa_MGH7|MGH7_2_v3|417086, Kleb_pneu_MGH83|MGH83_2_v3|324410
Ente_cloa_MGH7|MGH7_2_v3|417086 230801 296300 Citr_freu_MGH279|MGH279_2_v3|322189, Citr_freu_MGH281c|MGH281c_2_v3|408045, Citr_freu_MGH281|MGH281_2_v3|322189, Citr_sp._MGH103|MGH103_2_v3|327514, Citr_sp._MGH105|MGH105_2_v3|322189, Ente_cloa_MGH7|MGH7_2_v3|417086, Kleb_pneu_MGH83|MGH83_2_v3|324410
.
.
.
Each line in the file corresponds to a distinct segment and consists of four columns (tab-delimited):
- Scaffold name
- Start position on segment on scaffold
- Stop position on segment on scaffold
- Comma separated list of alternate scaffolds which are predicted to have segment
Run generateReferenceMSA.py
to produce reference segment MSA:
python /path/to/ConSequences/bin/generateReferenceMSA.py \
--ref_fasta=/path/to/ConSequences/tutorial_data/MGH7_scaffold2.fasta \
--start_coord=193601 \
--end_coord=212800 \
--mapping_scaffs="Citr_freu_MGH279|MGH279_2_v3|322189, Citr_freu_MGH281c|MGH281c_2_v3|408045, Citr_freu_MGH281|MGH281_2_v3|322189, Citr_sp._MGH103|MGH103_2_v3|327514, Citr_sp._MGH105|MGH105_2_v3|322189, Ente_cloa_MGH7|MGH7_2_v3|417086, Kleb_pneu_MGH83|MGH83_2_v3|324410" \
--sliding_window_results=/path/to/Segments_on_MGH7_Scaffold2/SWA_Results.txt
--msa_output=/path/to/Segment_1.msa
The output is just a reference-based MSA with variant information filled in from the BLAST analysis performed by delineateSegmentsOnReference.py
.
This step also runs very quickly < 1 minute:
real 0m0.528s
user 0m0.338s
sys 0m0.056s
Download illumina reads from SRA for Klebsiella pneumoniae MGH83 (one of the samples predicted to carry Segment 1 by delineateSegmentsOnReference.py
):
# Note, this requires SRAToolkit - which is not a ConSequences dependency
# If you have the ConSequences Conda environment active however, you can simply install it using the following command
# conda install -c bioconda sra-tools
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip SRR2132036
Run querySegmentInRawReads.py
to check if MGH83's reads show indication of having Segment 1:
python /path/to/ConSequences/bin/querySegmentInRawReads.py \
--ref_msas=/path/to/Segment_1.msa \
--references="Ente_cloa_MGH7|MGH7_2_v3|417086" \
--illumina_reads /path/to/fastq/SRR2132036_pass_1.fastq.gz /path/to/fastq/SRR2132036_pass_2.fastq.gz \
--output_prefix=MGH83_Segment1 \
--min_depth=5 \
--kmer_length=31
The final result file (tab-delimited) will be called MGH83_Segment1_result.txt
and comprises of the following:
Reference MSA Total k-mers missed Total k-mers missed (ignoring starting/ending 100 bp of MSA) Number of k-mers assessed
/path/to/Segments_on_MGH7_Scaffold2/Segment_1.msa 0 0 18972
A value of 0 for the "Total k-mers missed" column indicates that MGH83 likely has an assembly path in its reads for this segment which aligns with our expectations.
This last step is also relatively fast, taking < 3 minutes:
real 2m14.966s
user 2m1.846s
sys 0m10.931s