lerminin/microbiology-phd-code

microbiology-phd-code

Code for Microbiology PhD Thesis - bioinformatic whole genome assembly, RNAseq data processing and analysis, 16S analysis, figures in R

Scripts are divided into the following sections:

  1. Whole genome assembly (WGA)
  2. RNAseq
  3. Metagenomics
  4. Community profiling (16S rRNA)
  5. Conda environments

Whole genome assembly (WGA) folder

These scripts were used to assemble whole genomes for individual bacterial isolates which were sequenced with both Illumina paired-end short reads and Oxford Nanopore Technologies long reads. The scripts are designed to run in the order specified below and are constructed to loop through multiple samples with minimal intervention required. Quality checks are included throughout.
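The looping pattern described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in the repository: the function names (`sample_name`, `process_sample`, `run_all`), the `results` output directory and the "non-empty file" check are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: all function names and the output layout are hypothetical.

# Derive a sample name from a reads file, e.g. /data/barcode01.fastq.gz -> barcode01
sample_name() {
    local file
    file=$(basename "$1")
    file=${file%.gz}
    file=${file%.fastq}
    printf '%s\n' "$file"
}

# Run one (placeholder) step for a single sample, with a basic quality check first.
process_sample() {
    local reads=$1 outdir=${2:-results}
    local sample
    sample=$(sample_name "$reads")
    if [[ ! -s "$reads" ]]; then
        # Report the problem and move on so one bad sample does not stop the loop.
        echo "skipping ${sample}: reads file is missing or empty" >&2
        return 0
    fi
    mkdir -p "${outdir}/${sample}"
    echo "processing ${sample}"
    # The real per-sample commands (QC, assembly, ...) would go here.
}

# Loop over every .fastq.gz in a directory.
run_all() {
    local dir=${1:-.} outdir=${2:-results}
    local reads
    for reads in "$dir"/*.fastq.gz; do
        [[ -e "$reads" ]] || continue   # the glob stays literal when nothing matches
        process_sample "$reads" "$outdir"
    done
}
```

Calling `run_all . results` from the working directory would then visit each sample in turn and skip any empty input.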

guppy_basecalling.sh

This script was run on a computing cluster to basecall Oxford Nanopore long reads from MinION data on GPUs, using Guppy's SUP (super-accuracy) model.

WGA_readqc_assemblies.sh

This script processes multiple raw .fastq files from the sequencers all the way through to assembly and annotation, and includes the following programs:

Before running this script, concatenate all fastq files into one .fastq.gz per barcode and place them in the working directory. If a sample's read depth is below the depth required for Trycycler (at least 50X), the script does not subsample it and assembles it with Flye.
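The depth check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic in WGA_readqc_assemblies.sh itself: the function names are hypothetical, depth is estimated simply as total bases divided by a user-supplied genome size, and a sample at exactly 50X is treated as meeting the threshold.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: function names and the depth estimate are assumptions.

# Count total bases in a gzipped FASTQ (the sequence is line 2 of each 4-line record).
count_bases() {
    gzip -cd "$1" | awk 'NR % 4 == 2 { total += length($0) } END { print total + 0 }'
}

# Integer read depth = total bases / expected genome size (both in bases).
read_depth() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%d\n", bases / genome }'
}

# Decide the route for a sample: enough depth for Trycycler, otherwise Flye only.
choose_route() {
    local depth=$1 min_depth=${2:-50}
    if (( depth >= min_depth )); then
        echo "trycycler"
    else
        echo "flye_only"
    fi
}
```

For example, `choose_route "$(read_depth "$(count_bases barcode01.fastq.gz)" 5000000)"` would route a sample assuming a 5 Mb genome; the genome size here is a placeholder value.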

WGA_trycycler.sh

This script uses Trycycler (DOI:10.1186/s13059-021-02483-z) to find the consensus sequence between different long-read assemblies for samples whose read depth was greater than the target depth. The assemblies are generated by the WGA_readqc_assemblies.sh script. Running Trycycler requires manual intervention and decision making.

WGA_medaka.sh

This script polishes the Oxford Nanopore long-read assemblies from Trycycler with long reads, using Medaka on CPUs.

WGA_illumina.sh

This script processes multiple sets of raw paired-end Illumina reads, and includes the following programs:

WGA_unicycler.sh

This script generates hybrid assemblies using Illumina paired-end short reads and Oxford Nanopore long reads with Unicycler (DOI:10.1371/journal.pcbi.1005595).

WGA_polypolish.sh

This script polishes long-read assemblies with short reads using Polypolish (DOI:10.1371/journal.pcbi.1009802). It runs on assemblies that have already been polished by Medaka using WGA_medaka.sh.

WGA_polca.sh

This script polishes long-read assemblies with short reads using POLCA (DOI:10.1371/journal.pcbi.1007981). It runs on assemblies that have already been polished by Polypolish using WGA_polypolish.sh.

WGA_pilon.sh

After Medaka long-read polishing, Pilon (DOI:10.1371/journal.pone.0112963) is an alternative to Polypolish and POLCA for polishing assemblies with Illumina reads. The script uses the insertsizeI.py and insertsizeX.py scripts to determine the minimum and maximum insert size between paired-end reads for the bowtie2 call.
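How insertsizeI.py and insertsizeX.py derive their values is not shown here, so the following is only a simplified sketch of the general idea. It assumes a stream of observed insert sizes, one per line (for example, positive TLEN values from column 9 of a SAM file), and reports the smallest and largest. The function name `insert_range` is hypothetical, and taking the raw extremes is a simplification chosen for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: not the logic of insertsizeI.py / insertsizeX.py.

# Read insert sizes (one per line) on stdin and print "<min> <max>".
# Non-positive values are ignored; nothing is printed if no value qualifies.
insert_range() {
    awk '{
             v = $1 + 0
             if (v <= 0) next
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             n++
         }
         END { if (n > 0) print min, max }'
}
```

The two numbers could then be passed to bowtie2 through its `-I` (minimum) and `-X` (maximum) insert size options.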

WGA_racon.sh

After Medaka long-read polishing, an additional long-read polishing step can be done with Racon (DOI:10.1101/gr.214270.116).

WGA_prokka.sh

This script takes the completed and assembled genome and does some initial annotation and investigation:

RNAseq folder

This folder contains scripts for processing and quantifying transcriptomes of a mixed-species bacterial culture generated from Illumina unpaired read sequencing data.

READemption_analysis.sh

This script runs the analysis pipeline READemption (DOI:10.1093/bioinformatics/btu533) for quantifying RNAseq transcripts and assumes that the create command has already been run and that the input files are in the folder structure specified by the READemption documentation.

In addition, this script shows how to make use of the cross-align clean option to eliminate reads that map to multiple replicons during the align step.

polymicrobial_RNAseqanalysis.sh

This script runs the analysis pipeline for quantifying RNAseq transcripts using the following programs:

DESeq2_analysis.R

This script runs the analysis for differentially expressed genes using DESeq2 (DOI:10.18129/B9.bioc.DESeq2) and produces diagnostic and results plots.

Metagenomics folder

This folder contains scripts to assemble and investigate metagenomic DNA samples on a high performance computing cluster.

megahit.sh

This script assembles metagenomic Illumina paired-end reads using MEGAHIT (DOI:10.1093/bioinformatics/btv033) with the meta-large preset.

pspades.sh

This script assembles metagenomic Illumina paired-end reads using SPAdes (DOI:10.1093/bioinformatics/btw493) with the arguments --meta and --plasmid.

idba.sh

This script assembles metagenomic Illumina paired-end reads using IDBA-UD (DOI:10.1093/bioinformatics/bts174).

kaiju_db_download.sh and kaiju_classify.sh

These scripts download the database and classify reads into taxonomic groups using Kaiju (DOI:10.1038/ncomms11257).

kraken_classify.sh and kraken_db_download.sh

These scripts download the database and classify reads into taxonomic groups using Kraken2 (DOI:10.1186/s13059-019-1891-0).

metaxa2.sh

This script identifies the proportion of reads with small subunit and large subunit rRNA sequences using Metaxa2 (DOI:10.1111/1755-0998.12399).

Community profiling (16S rRNA) scripts

These scripts include data manipulation and cleaning for 16S rRNA community profiling with ASVs, as well as statistical analyses for diversity and making taxonomic plots.

16illumina_dada2.RMD

This script processes Illumina paired-end sequencing data for ASV analysis using the R package DADA2 (DOI:10.1038/nmeth.3869) and performs quality control checks for removing contaminating sequences in the negative control.

16Sillumina_stats.R

This script creates a phyloseq (DOI:10.1371/journal.pone.0061217) object and evaluates alpha diversity and beta diversity with vegan, and plots taxonomic heatmaps with ggplot2.

Conda environments (conda_envs) folder

Many of the shell scripts included here depend on conda environments, which were run on an Ubuntu Linux x86_64 machine. The packages and versions for each environment used in the respective folders can be installed from the .yml files.
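As a rough sketch of how the environments might be recreated, the helper below prints one `conda env create -f` command per .yml file so the commands can be reviewed before anything is installed. The function name `list_env_commands` is hypothetical, and the default directory name `conda_envs` is taken from the folder name above.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: the helper name is hypothetical.

# Print one "conda env create" command for each .yml file in a directory.
list_env_commands() {
    local dir=${1:-conda_envs}
    local yml
    for yml in "$dir"/*.yml; do
        [[ -e "$yml" ]] || continue   # the glob stays literal when nothing matches
        printf 'conda env create -f %q\n' "$yml"
    done
}
```

Running `list_env_commands conda_envs` shows the commands; piping the output to `bash` would execute them, assuming conda is installed and on the PATH.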
