
2. Rapid Identification of New Instances of High Interest Segments

rsalamza edited this page Dec 3, 2020 · 1 revision

The other two main programs in ConSequences are generateReferenceMSA.py and querySegmentInRawReads.py. Together they enable quick prediction of whether a sample carries a segment of interest directly from short-read sequencing data.

Description of Method

If a conserved segment of interest is identified through analysis with delineateSegmentsOnReference.py, the result files generated by that program can be provided as input to generateReferenceMSA.py to construct a reference-based multiple sequence alignment (MSA) for the segment.

Afterwards, querySegmentInRawReads.py can be used to predict whether the defining/core components of the MSA are present in the raw reads of a sample (provided as FASTQ files) using a sliding k-mer analysis of one or multiple segment MSAs. Because slight variations can exist between instances of a signature in the MSA, a sample only needs to possess one of the possible k-mers (31-mers by default).
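
The sliding k-mer idea described above can be illustrated with a small, self-contained Python sketch. This is not the code used by querySegmentInRawReads.py, which relies on JellyFish for k-mer counting. The function names, the use of canonical (strand-independent) k-mers, and the choice to skip windows containing gaps or ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter

# Illustrative sketch only; not the ConSequences implementation.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (assumed convention)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_read_kmers(reads, k):
    """Count canonical k-mers across all reads, ignoring k-mers with non-ACGT characters."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _BASES:
                counts[canonical(kmer)] += 1
    return counts


def window_variants(msa, k):
    """For each alignment window, collect the distinct k-mers seen across MSA rows.

    Rows whose window contains a gap or ambiguous base are skipped (an assumption).
    """
    windows = []
    for start in range(len(msa[0]) - k + 1):
        variants = set()
        for row in msa:
            kmer = row[start:start + k].upper()
            if set(kmer) <= _BASES:
                variants.add(canonical(kmer))
        if variants:
            windows.append(variants)
    return windows


def fraction_windows_found(msa, reads, min_depth, k=31):
    """Fraction of windows for which at least one variant k-mer occurs >= min_depth times."""
    counts = count_read_kmers(reads, k)
    windows = window_variants(msa, k)
    if not windows:
        return 0.0
    hits = sum(
        1 for variants in windows
        if any(counts[v] >= min_depth for v in variants)
    )
    return hits / len(windows)
```

A window counts as present if any one of its variant k-mers is seen at sufficient depth, which mirrors the statement that a sample only needs to possess one of the possible k-mers. The depth threshold plays the role of --min_depth, filtering out k-mers that likely arise from sequencing errors.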

Usage for generateReferenceMSA.py

usage: generateReferenceMSA.py [-h] -r REF_FASTA -s START_COORD -e END_COORD
                               -m MAPPING_SCAFFS -w SLIDING_WINDOW_RESULTS -o
                               MSA_OUTPUT [-l LOG_FILE]

	Program: generateReferenceMSA.py
	Author: Rauf Salamzade
	The Broad Institute of MIT and Harvard
	Earl Lab / Bacterial Genomics Group

	This program will generate a . If facing difficulties, please raise 
        issues on the github page: https://github.com/broadinstitute/consequences	

optional arguments:
  -h, --help            show this help message and exit
  -r REF_FASTA, --ref_fasta REF_FASTA
                        FASTA for reference scaffold upon which 
                        segment lies.
  -s START_COORD, --start_coord START_COORD
                        Starting coordinate of segment.
  -e END_COORD, --end_coord END_COORD
                        Ending coordinate of segment.
  -m MAPPING_SCAFFS, --mapping_scaffs MAPPING_SCAFFS
                        List of scaffolds with segment. One per line.
  -w SLIDING_WINDOW_RESULTS, --sliding_window_results SLIDING_WINDOW_RESULTS
                        Sliding window results file which contains 
                        variant information.
  -o MSA_OUTPUT, --msa_output MSA_OUTPUT
                        Multiple-sequence-alignment to be used for rapid 
                        identification of signature sequences.
  -l LOG_FILE, --log_file LOG_FILE
                        Path to logging output file
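
As a rough illustration of how the required arguments fit together, the following Python sketch assembles a generateReferenceMSA.py command line. The flags are taken from the usage message above. The helper name, all file names, the coordinates, and the assumption that the script is launched with `python` from its own directory are hypothetical placeholders.

```python
import shlex


def build_generate_msa_command(ref_fasta, start_coord, end_coord,
                               mapping_scaffs, sliding_window_results,
                               msa_output, log_file=None):
    """Assemble an argument list for generateReferenceMSA.py (hypothetical helper)."""
    cmd = [
        # Assumption: the script is run with `python` from its own directory.
        "python", "generateReferenceMSA.py",
        "-r", ref_fasta,
        "-s", str(start_coord),
        "-e", str(end_coord),
        "-m", mapping_scaffs,
        "-w", sliding_window_results,
        "-o", msa_output,
    ]
    # -l is the only optional flag besides -h.
    if log_file is not None:
        cmd += ["-l", log_file]
    return cmd


# All paths and coordinates below are placeholders, not real ConSequences outputs.
example_cmd = build_generate_msa_command(
    ref_fasta="reference_scaffold.fasta",
    start_coord=1000,
    end_coord=5000,
    mapping_scaffs="scaffolds_with_segment.txt",
    sliding_window_results="sliding_window_results.txt",
    msa_output="segment_msa.fasta",
)
print(shlex.join(example_cmd))
```

The resulting list can be passed to `subprocess.run` as-is, or printed and pasted into a shell.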

Usage for querySegmentInRawReads.py

usage: querySegmentInRawReads.py [-h] -m REF_MSAS [REF_MSAS ...] -r REFERENCES
                                 [REFERENCES ...] -i ILLUMINA_READS
                                 [ILLUMINA_READS ...] -o OUTPUT_PREFIX
                                 [-d MIN_DEPTH] [-k KMER_LENGTH] [-c CORES]

	Program: generateReferenceMSA.py
	Author: Rauf Salamzade
	The Broad Institute of MIT and Harvard
	Earl Lab / Bacterial Genomics Group

	This program will generate a . If facing difficulties, please 
        raise issues on the github page: https://github.com/broadinstitute/consequences
	

optional arguments:
  -h, --help            show this help message and exit
  -m REF_MSAS [REF_MSAS ...], --ref_msas REF_MSAS [REF_MSAS ...]
                        Multi-FASTA reference-based multiple sequence alignment(s) 
                        for segment(s) of interest.
  -r REFERENCES [REFERENCES ...], --references REFERENCES [REFERENCES ...]
                        Reference sample. Should be provided in same respective 
                        order as --ref_msas.
  -i ILLUMINA_READS [ILLUMINA_READS ...], --illumina_reads ILLUMINA_READS [ILLUMINA_READS ...]
                        Illumina or any high-accuracy sequencing data in FASTQ format.
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Multiple-sequence-alignment to be used for rapid 
                        identification of signature sequences.
  -d MIN_DEPTH, --min_depth MIN_DEPTH
                        Minimum number of times k-mer has to occur in sample 
                        read's to avoid inclusion of sequencing errors.
  -k KMER_LENGTH, --kmer_length KMER_LENGTH
                        Size of k-mer to use for searching. Default is 31.
  -c CORES, --cores CORES
                        Number of cores to provide JellyFish. Default is 1.
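
Because --ref_msas and --references must be given in the same respective order, it can help to pair them explicitly before building the command. The Python sketch below does this for querySegmentInRawReads.py using only the flags listed above. The helper name, file names, and reference values are hypothetical placeholders. The help text does not say whether a reference is given as a sample name or a file path, so the values shown are stand-ins.

```python
import shlex


def build_query_command(ref_msas, references, illumina_reads, output_prefix,
                        min_depth=None, kmer_length=None, cores=None):
    """Assemble an argument list for querySegmentInRawReads.py (hypothetical helper)."""
    # Each MSA needs its matching reference at the same list position.
    if len(ref_msas) != len(references):
        raise ValueError("provide exactly one reference per MSA, in the same order")
    # Assumption: the script is run with `python` from its own directory.
    cmd = ["python", "querySegmentInRawReads.py"]
    cmd += ["-m", *ref_msas]
    cmd += ["-r", *references]
    cmd += ["-i", *illumina_reads]
    cmd += ["-o", output_prefix]
    # Optional flags are only added when set, so the program's defaults apply otherwise.
    for flag, value in (("-d", min_depth), ("-k", kmer_length), ("-c", cores)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Placeholder inputs: two segment MSAs queried against one paired-end sample.
query_cmd = build_query_command(
    ref_msas=["segmentA_msa.fasta", "segmentB_msa.fasta"],
    references=["referenceA", "referenceB"],
    illumina_reads=["sample_R1.fastq", "sample_R2.fastq"],
    output_prefix="sample_query",
    cores=4,
)
print(shlex.join(query_cmd))
```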