GitHub - tbmorrison/ssqc: snaq seq QC for COVID-19

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
NT_IS_LOOKUP_TABLE_v0.3.3_seperate.txt		NT_IS_LOOKUP_TABLE_v0.3.3_seperate.txt
NT_IS_LOOKUP_TABLE_v0.4.2_seperate.txt		NT_IS_LOOKUP_TABLE_v0.4.2_seperate.txt
README		README
SARS-CoV_ISv0.3.1_seperate.fasta		SARS-CoV_ISv0.3.1_seperate.fasta
SARS-CoV_ISv0.3.1_seperate.fasta.amb		SARS-CoV_ISv0.3.1_seperate.fasta.amb
SARS-CoV_ISv0.3.1_seperate.fasta.ann		SARS-CoV_ISv0.3.1_seperate.fasta.ann
SARS-CoV_ISv0.3.1_seperate.fasta.bwt		SARS-CoV_ISv0.3.1_seperate.fasta.bwt
SARS-CoV_ISv0.3.1_seperate.fasta.fai		SARS-CoV_ISv0.3.1_seperate.fasta.fai
SARS-CoV_ISv0.3.1_seperate.fasta.pac		SARS-CoV_ISv0.3.1_seperate.fasta.pac
SARS-CoV_ISv0.3.1_seperate.fasta.sa		SARS-CoV_ISv0.3.1_seperate.fasta.sa
SARS-CoV_ISv0.4.1_seperate.fasta		SARS-CoV_ISv0.4.1_seperate.fasta
SARS-CoV_ISv0.4.1_seperate.fasta.amb		SARS-CoV_ISv0.4.1_seperate.fasta.amb
SARS-CoV_ISv0.4.1_seperate.fasta.ann		SARS-CoV_ISv0.4.1_seperate.fasta.ann
SARS-CoV_ISv0.4.1_seperate.fasta.bwt		SARS-CoV_ISv0.4.1_seperate.fasta.bwt
SARS-CoV_ISv0.4.1_seperate.fasta.fai		SARS-CoV_ISv0.4.1_seperate.fasta.fai
SARS-CoV_ISv0.4.1_seperate.fasta.pac		SARS-CoV_ISv0.4.1_seperate.fasta.pac
SARS-CoV_ISv0.4.1_seperate.fasta.sa		SARS-CoV_ISv0.4.1_seperate.fasta.sa
VarDict-amplicon.v1-TF.bed		VarDict-amplicon.v1-TF.bed
VarDict-amplicon.v1.bed		VarDict-amplicon.v1.bed
VarDict-amplicon.v2.1.bed		VarDict-amplicon.v2.1.bed
catMultLane		catMultLane
covTable		covTable
coverage		coverage
readCount		readCount
remRecombo		remRecombo
ssqc		ssqc
ssqc.parm		ssqc.parm

Repository files navigation

ssqc <path/to/parm.file>
Script reads parameter file (see below) for a file indicating which sequences to process.

parm.file:
paired-end true or false if sequence is paired end (note: collaps samples that cross lanes e.g., cat <name-L00[1-4]-R1.fastq.gz > name-R1.fastq.gz)
fastq-files </path/to/file> contains a list of fastq.gz files, paired ends alternate with naming convension name-R1.fastq.gz and name-R2.fastq.gz
is_input <integer> copies of internal standard added to sample.
cc_expected <integer> determined after several runs..expected count of unique CC sequences when no significant NT competition.
nt2is </path/to/file> contains the mapping of NT to IS, base changes in lowercase
ref </path/to/files> path to the reference genome
seqSplit <true/false> determine if the NT and IS reads are to divided into separate bam and fastq
multiple-lanes true or false Not implemented. Script does not handle illumina 4 lane fastq. Use catMultLane to concatenate lanes
qScore Sorting reads into NT, IS, REC, CC, & ukn bins uses base change position, to use position for splitting qScore must be at or above (zero ignores)
goodBaseChange Number of base change positions required for sorting, otherwise ends up in ukn bin
calcCov true or false generate coverage table to be used for viral load, otherwise use NT and IS read counts
covBed table used by samtools bedcov for coverage analysis

Program flow: read parameters; if needed--align reads; call remRecombinants to count CC and remove recombinants; calculate NT and IS reads; create runQC table; calculate viral load
viral load calculated from total amplicon reads NT / IS * IS_INPUT
runQC uses the NT and IS CC-reads to adjust expected complexity capture (cc_expected) for competition. Unique CC sequences calculated from -CC-counts.txt
Coverage calculated using bamstats amplicon counting method

Key files produced:
viralLoad.txt contains the estimated viral genomes present when IS was mixed with sample
runQC.txt indicates how far the complexity capture deviated from expected.

The Other files:
<prefix>-NT-R1.fastq or <prefix>-NT-R2.fastq are the viral reads, and should be compatible with your viral assembly pipeline
-IS-R1.fastq or -IS-R2.fastq are the IS reads
-bad-R1.fastq or -bad-R2.fastq are the recombinant reads
-IS.coverage, -cc.coverage, -NT.coverage, or -bad.coverage a table listing the reads for each amplicon, used for viral load.
-tallies.txt a table indicating occurrences and position of recombinants
-CC.txt table the source/quality of a samples CC sequences.
-CC-counts.txt a summary table created from the CC.txt file.
Bam and sam intermediate files used to create fastq and coverage.

The code has room for optimization but we should settle on the final deliverable before moving to optimization. Also, I’m a hacker, so I’m happy to hear of any suggestions for code improvements.

Dependencies:

bwa
Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60

samtools
https://github.com/samtools/samtools

Part of ssqc package: remRecombo (count CC & sort sequences), coverage or baseCount (used for viral load calculation), catMultLane & covTable are accessory scripts to concatentate multilane fastq and create coverage data table of all run samples. Various COVID reference files: reference genomes, coverage.bed & look up table.
Tom Morrison, AccuGenomics. [email protected]

Version control
9/22/20 - added ability to process ubam and added undeduped CC reads to runQC.txt.
11/27/20 - Analysis options controled by ssqc.parm file.