Code for Microbiology PhD Thesis - bioinformatic whole genome assembly, RNAseq data processing and analysis, 16S analysis, figures in R
Scripts are divided into the following sections:
These scripts were used to assemble whole genomes for individual bacterial isolates which were sequenced with both Illumina paired-end short reads and Oxford Nanopore Technologies long reads. The scripts are designed to run in the order specified below and are constructed to loop through multiple samples with minimal intervention required. Quality checks are included throughout.
This script was run on a computing cluster with GPUs to basecall Oxford Nanopore MinION long reads using the SUP (super-accuracy) model.
This script processes multiple raw .fastq files from the sequencers all the way through to assembly and annotation, and includes the following programs:
- Barcodes are removed with Porechop
- Reads are trimmed with Filtlong
- Read quality metrics are assessed with Nanopack (DOI:10.1093/bioinformatics/bty149)
- Read subsets generated with seqtk
- Assemblies by Flye (DOI:10.1038/s41587-019-0072-8)
- Assemblies by Raven (DOI:10.1038/s43588-021-00073-4)
- Assemblies by Canu (DOI:10.1101/gr.215087.116)
- Assemblies by Redbean/wtdbg2 (DOI:10.1038/s41592-019-0669-3)
- Assemblies by Miniasm/minipolish (DOI:10.1093/bioinformatics/btw152)
Prior to running this script, all fastq files must be concatenated into one .fastq.gz per barcode and placed in the working directory. If a sample's read depth is below the depth required for Trycycler (at least 50X), the script does not subsample it and assembles it with Flye.
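The depth rule above can be sketched as follows. This is a minimal illustration, not the logic of WGA_readqc_assemblies.sh itself: it assumes depth is estimated as total bases divided by an expected genome size, and the function names (`fastq_total_bases`, `estimate_depth`, `choose_route`), the route labels, and the demo genome size are all hypothetical.

```shell
# Hypothetical helpers; names, route labels and genome size are illustrative assumptions.

# Sum the sequence lengths in a FASTQ stream (4-line records, sequence on line 2).
fastq_total_bases() {
  awk 'NR % 4 == 2 { total += length($0) } END { printf "%.0f\n", total }'
}

# Estimated depth = total bases / genome size, truncated to a whole number.
# $1 = total bases, $2 = expected genome size in bp
estimate_depth() {
  echo $(( $1 / $2 ))
}

# Decide which route a sample takes.
# $1 = estimated depth, $2 = minimum depth (defaults to 50, as for Trycycler above)
choose_route() {
  if [ "$1" -ge "${2:-50}" ]; then
    echo "subsample_for_trycycler"
  else
    echo "flye_only"
  fi
}

# Demo on a tiny in-line FASTQ: two reads, 8 + 4 = 12 bases.
demo_bases=$(printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' | fastq_total_bases)
demo_depth=$(estimate_depth "$demo_bases" 4)   # assumed 4 bp "genome" for the demo only
demo_route=$(choose_route "$demo_depth")
echo "bases=$demo_bases depth=${demo_depth}X route=$demo_route"
```

For a real sample, the same helper could read a compressed file with `gzip -cd barcode01.fastq.gz | fastq_total_bases`, where `barcode01.fastq.gz` is a placeholder file name.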
This script uses Trycycler (DOI:10.1186/s13059-021-02483-z) to find the consensus sequence between different long-read assemblies for samples whose read depth exceeded the target depth. The assemblies are generated by the WGA_readqc_assemblies.sh script. Manual intervention and decision-making are required when running Trycycler.
This script polishes the Oxford Nanopore long-read assemblies from Trycycler with long reads using Medaka on CPUs.
This script processes multiple raw paired-end Illumina read files, and includes the following programs:
- Barcodes are removed with BBDuk
- Reads are trimmed with Trimmomatic (DOI:10.1093/bioinformatics/btu170)
- Read quality metrics are assessed with FastQC
This script generates hybrid assemblies using Illumina paired-end short reads and Oxford Nanopore long reads with Unicycler (DOI:10.1371/journal.pcbi.1005595).
This script polishes long-read assemblies with short reads using Polypolish (DOI:10.1371/journal.pcbi.1009802). It runs on assemblies that have already been polished by Medaka using WGA_medaka.sh.
This script polishes long-read assemblies with short reads using POLCA (DOI:10.1371/journal.pcbi.1007981). It runs on assemblies that have already been polished by Polypolish using WGA_polypolish.sh.
After Medaka long-read polishing, an alternative to Polypolish & POLCA for polishing assemblies with Illumina reads is Pilon (DOI:10.1371/journal.pone.0112963). The script makes use of the insertsizeI.py and insertsizeX.py scripts for determining the minimum and maximum insert size between paired-end reads for the bowtie2 call.
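As a rough illustration of the insert-size step, the sketch below pulls the smallest and largest positive template length (TLEN, column 9) out of SAM records. It is not the logic of insertsizeI.py or insertsizeX.py, which is not shown here; the function name `insert_size_range` and the demo records are hypothetical.

```shell
# Hypothetical helper: print "min max" of the positive template lengths (SAM column 9).
insert_size_range() {
  awk -F '\t' '
    /^@/ { next }                          # skip SAM header lines
    $9 + 0 > 0 {                           # positive TLEN only, so each pair counts once
      tlen = $9 + 0
      if (n == 0 || tlen < min) min = tlen
      if (n == 0 || tlen > max) max = tlen
      n++
    }
    END { if (n > 0) print min, max }
  '
}

# Demo on made-up SAM records: one header line and two read pairs.
demo_range=$(
  {
    printf '@HD\tVN:1.6\n'
    printf 'r1\t99\tctg\t1\t42\t4M\t=\t301\t304\tACGT\tIIII\n'
    printf 'r1\t147\tctg\t301\t42\t4M\t=\t1\t-304\tACGT\tIIII\n'
    printf 'r2\t99\tctg\t10\t42\t4M\t=\t160\t154\tACGT\tIIII\n'
  } | insert_size_range
)
echo "$demo_range"
```

The two values could then be passed to bowtie2 through its `-I` (`--minins`) and `-X` (`--maxins`) options, for example `bowtie2 -I 154 -X 304 -x index -1 R1.fastq.gz -2 R2.fastq.gz -S out.sam`, where the index and file names are placeholders. Raw minimum and maximum values are sensitive to outlier pairs, so a percentile-based range may be a more robust choice in practice.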
After Medaka long-read polishing, an additional long-read polishing step can be done with Racon (DOI:10.1101/gr.214270.116).
This script takes the completed and assembled genome and does some initial annotation and investigation:
- Annotates using Prokka (DOI:10.1093/bioinformatics/btu153)
- Plasmid detection by MOB-suite (DOI:10.1099/mgen.0.000206)
- Identification of the following through Abricate:
  - Plasmid replicon detection by PlasmidFinder (DOI:10.1128/AAC.02412-14)
  - Antimicrobial resistance genes in CARD (DOI:10.1093/nar/gkw1004)
  - Antimicrobial resistance genes in ResFinder (DOI:10.1093/jac/dkaa345)
- Prediction of viral contigs using DeepVirFinder (DOI:10.1007/s40484-019-0187-4)
This folder contains scripts for processing and quantifying transcriptomes of a mixed-species bacterial culture generated from Illumina unpaired read sequencing data.
This script runs the analysis pipeline READemption (DOI:10.1093/bioinformatics/btu533) for quantifying RNAseq transcripts and assumes that the create command has already been run and that the input files are in the folder structure specified by the READemption documentation.
In addition, this script shows how to make use of the cross-align clean option to eliminate reads that map to multiple replicons during the align step.
This script runs the analysis pipeline for quantifying RNAseq transcripts using the following programs:
- Reads aligned to reference with Bowtie2 (DOI:10.1038/nmeth.1923)
- .sam files generated with Samtools (DOI:10.1093/bioinformatics/btp352)
- TPMs and counts quantified with Salmon (DOI:10.1038/nmeth.4197)
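To make the TPM values in the last step concrete, the sketch below recomputes TPM from read counts and effective lengths in a table laid out like Salmon's quant.sf (Name, Length, EffectiveLength, TPM, NumReads). Salmon reports these values itself, so this is only an illustration of the formula; the function name `tpm_from_quant` and the demo genes are hypothetical.

```shell
# Hypothetical helper: recompute TPM from a Salmon-style quant table on stdin.
# Expected tab-separated columns, with a header row:
#   Name  Length  EffectiveLength  TPM  NumReads
tpm_from_quant() {
  awk -F '\t' '
    NR == 1 { next }                        # skip the header row
    $3 + 0 > 0 {
      name[NR] = $1
      rate[NR] = $5 / $3                    # reads per effective base
      total += rate[NR]
    }
    END {
      if (total == 0) exit                  # nothing to normalise
      for (i = 2; i <= NR; i++)
        if (i in name)
          printf "%s\t%.1f\n", name[i], rate[i] / total * 1e6
    }
  '
}

# Demo: two made-up genes with equal read counts but different effective lengths.
demo_tpm=$(
  {
    printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n'
    printf 'geneA\t1200\t1000\t0\t100\n'
    printf 'geneB\t700\t500\t0\t100\n'
  } | tpm_from_quant
)
echo "$demo_tpm"
```

In the demo, geneB receives twice the TPM of geneA because the same number of reads falls on half the effective length, and the two values sum to one million.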
This script runs the analysis for differentially expressed genes using DESeq2 (DOI:10.18129/B9.bioc.DESeq2) and produces diagnostic and results plots.
This folder contains scripts to assemble and investigate metagenomic DNA samples on a high performance computing cluster.
This script assembles metagenomic Illumina paired-end reads using MEGAHIT (DOI:10.1093/bioinformatics/btv033) with preset meta-large.
This script assembles metagenomic Illumina paired-end reads using SPAdes (DOI:10.1093/bioinformatics/btw493) with arguments --meta and --plasmid.
This script assembles metagenomic Illumina paired-end reads using IDBA-UD (DOI:10.1093/bioinformatics/bts174).
This script downloads the database and classifies reads into taxonomic groups using Kaiju (DOI:10.1038/ncomms11257).
This script downloads the database and classifies reads into taxonomic groups using Kraken2 (DOI:10.1186/s13059-019-1891-0).
This script identifies the proportion of reads with small subunit and large subunit rRNA sequences using METAXA2 (DOI:10.1111/1755-0998.12399).
These scripts include data manipulation and cleaning for 16S rRNA community profiling with ASVs, as well as statistical analyses for diversity and making taxonomic plots.
This script processes Illumina paired-end sequencing data for ASV analysis using the R package DADA2 (DOI:10.1038/nmeth.3869) and performs quality control checks for removing contaminating sequences in the negative control.
This script creates a phyloseq (DOI:10.1371/journal.pone.0061217) object and evaluates alpha diversity and beta diversity with vegan, and plots taxonomic heatmaps with ggplot2.
Many of the shell scripts included here are dependent on conda environments run on a Linux Ubuntu x86_64 machine. The packages and versions for each environment used in the respective folders can be installed using the .yml files.
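One way to install the environments from the .yml files is sketched below. The function name `create_envs`, the `DRY_RUN` switch, and the demo file name are hypothetical; `conda env create -f` is the standard conda command for building an environment from a file. The sketch prints the commands by default so they can be checked before anything is installed.

```shell
# Hypothetical helper: create one conda environment per .yml file in a folder.
# Dry-run by default: prints the conda commands instead of running them.
# Set DRY_RUN=0 to actually create the environments (requires conda on PATH).
create_envs() {
  dir=${1:-.}
  for yml in "$dir"/*.yml; do
    [ -e "$yml" ] || continue              # no .yml files in this folder
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "conda env create -f $yml"
    else
      conda env create -f "$yml"
    fi
  done
}

# Demo in a temporary folder with a placeholder environment file.
demo_dir=$(mktemp -d)
touch "$demo_dir/example_env.yml"
demo_cmds=$(create_envs "$demo_dir")
echo "$demo_cmds"
rm -rf "$demo_dir"
```

Running `DRY_RUN=0 create_envs path/to/folder` would perform the installs, where `path/to/folder` is a placeholder for one of the repository folders.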