Skip to content
Rauf Salamzade edited this page Aug 12, 2024 · 25 revisions

Welcome to the lsaBGC-Pan wiki!

lsaBGC-Pan puts together some programs from the original suite (e.g. lsaBGC-Cluster, lsaBGC-See, lsaBGC-ComprehenSeeIve, and GSeeF) and some new programs (e.g. zol, lsaBGC-Reconcile [an updated version of lsaBGC-Divergence], and lsaBGC-Sociate). lsaBGC-Pan works on both bacterial and fungal genomes and can jointly process both antiSMASH and GECCO BGC predictions. In summary, lsaBGC-Pan is a newer and recommended workflow to the previously available lsaBGC-Easy.py and lsaBGC-EukEasy.py workflows. We plan to continue development of this workflow but maintain support for lsaBGC-Easy.py and lsaBGC-EukEasy.py.

Contents

  1. overview of lsaBGC-Pan
  2. comparison of workflows: what's different about lsaBGC-Pan compared to lsaBGC-(Euk)Easy.py
  3. installation guide
  4. tutorial on GECCO based analysis of BGCs from Cutibacterium avidum and Cutibacterium acnes
  5. tutorial on joint AntiSMASH & GECCO analysis of BGCs from Streptomyces olivaceus
  6. tutorial on AntiSMASH analysis of BGCs from Aspergillus flavus
  7. guide to parameter selections during mid-way break
  8. explanation of final spreadsheet and visual reports
  9. details on new modules: lsaBGC-Reconcile and lsaBGC-Sociate

Example Commands:

Perform analysis using a directory of AntiSMASH results as input:

lsaBGC-Pan -a AntiSMASH_Results/ -o Pan_Results/ -c 10

Provide a directory of AntiSMASH results as input and incorporate GECCO BGC predictions as well:

lsaBGC-Pan -a AntiSMASH_Results/ -o Pan_Results/ -c 10 -rg

Provide a directory of genomes in FASTA format for GECCO-based BGC predictions and analysis (only works for bacteria):

lsaBGC-Pan -g Directory_of_Genomes_in_FASTA/ -o Pan_Results/ -c 10

Usage

usage: lsaBGC-Pan [-h] [-a ANTISMASH_RESULTS_DIRECTORY] [-g GENOMES [GENOMES ...]] -o OUTPUT_DIRECTORY [-f] [-k] [-rg] [-up] [-omc] [-ohq] [-hqp] [-cs CORE_PROPORTION] [-mcs MAX_CORE_GENES]
                  [-pic POPULATION_IDENTITY_CUTOFF] [-mpf MANUAL_POPULATIONS_FILE] [-ci CLUSTER_INFLATION] [-cj CLUSTER_JACCARD] [-cr CLUSTER_SYNTENIC_CORRELATION] [-cc CLUSTER_CONTAINMENT] [-nb]
                  [-zp ZOL_PARAMETERS] [-zhq] [-zl] [-ed EDGE_DISTANCE] [-hqr] [-rsh] [-at ALIGNMENT_TIMEOUT] [-py] [-c THREADS]

       __              ___   _____  _____
      / /  ___ ___ _  / _ ) / ___/ / ___/
     / /  (_-</ _ `/ / _  |/ (_ / / /__  
    /_/  /___/\_,_/ /____/ \___/  \___/  
    **********************************************************************************************************
    Program: lsaBGC-Pan
    Author: Rauf Salamzade
    Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
	
    QUICK DESCRIPTION:
    Workflow to run a pan-genome analysis of biosynthetic gene clusters for a single genus or species. 
    https://github.com/Kalan-Lab/lsaBGC-Pan/
		
    **********************************************************************************************************
    NOTES & CONSIDERATIONS:	

    * Check out the documentation at: https://github.com/Kalan-Lab/lsaBGC-Pan/wiki 	
    * Either provide an antiSMASH results directory or a directory with genomes (not both!).
    * lsaBGC-Pan does not have all the  functionalities of the original lsaBGC suite to make things simpler 
      and more straightforward. In particular, lsaBGC-(Auto)Expansion, which helped with scalability is 
      discluded because its usage generally requires more careful consideration and manual curation. Thus, to 
      make sure your analysis runs in a relatively timely manner, the program is restricted to handling data 
      from 4-200 genomes.
          - Consider using tools for dereplication/strain-clustering (such as skDER/CiDDER, dRep, PopPunk, 
            or treemmer) to prune down your set of genomes if you have more than 200.
    * Panaroo should only be used if working with genomes belonging to a single bacterial species. If working 
      with multiple bacterial species or fungal genomes, please use OrthoFinder instead.
    * Specifying fungal mode makes sure OrthoFinder is being used and that antiSMASH results were provided. 
      GECCO is not designed for fungal genomes. In addition, customized processing designed for bacterial 
      genomics is not performed and a more direct extraction of hierarchical orthogroups is applied.
    **********************************************************************************************************
    OVERVIEW OF WORKFLOW STEPS:	
	
    PART 1
    ----------------------------------------------------------------------------------------------------------
        - Step 1: Assess inputs provided
        - Step 1a: If genomes are provided, perform gene-calling with p(y)rodigal and annotate BGCs with GECCO
        - Step 1b: If antiSMASH results are provided, extracted genes from full genome GenBank files. If GECCO 
                   is requested, then it will also be run and overlapping GECCO and antiSMASH BGC regions will 
                   be consolidated by taking
                   the larger region. 
        - Step 2: Run OrthoFinder/Panaroo for orthology inference.
        - Step 3: Create species tree/phyogeny from (near-) core ortholog groups.
        - Step 4: Infer populations (will do so at multiple "core AAI" cutoffs) [no checkpoint, always rerun].
        - Step 5: Run lsaBGC-Cluster.py to determine evolutionary GCFs (by default in testing mode unless 
                  --auto-cluster specified) [no checkpoint, always rerun].
		
    BREAK (optional - but recommended - can be skipped by issuing --no-break)
    ----------------------------------------------------------------------------------------------------------
        - Step 6a: Manually examine which parameters make the most sense for evolutionary clustering of GCFs. 
                   Restart the workflow after with parameters for gene cluster clustering adapted.
        - Step 6b: Manually assess how population designations structure along the species tree with different 
                   core AAI cutoffs and, if desired, adjust population designations.							  

    PART 2
    ----------------------------------------------------------------------------------------------------------
        - Step 7: Parallel running of zol and cgc per GCF.
        - Step 8: Parallel running of GSeeF, lsaBGC-See, and lsaBGC-ComprehenSeeIve per GCF.
        - Step 9: Parallel running of lsaBGC-MIBiGMapper.
        - Step 10: Run lsaBGC-Reconcile.
		- Step 11: Run lsaBGC-Sociate.
        - Step 12: Create consolidated report of zol, lsaBGC-MIBiGMapper, lsaBGC-Reconcile, and lsaBGC-Sociate 
                   results [no checkpoint, always rerun].
    **********************************************************************************************************
	

options:
  -h, --help            show this help message and exit
  -a ANTISMASH_RESULTS_DIRECTORY, --antismash-results-directory ANTISMASH_RESULTS_DIRECTORY
                        A directory with subdirectories corresponding to antiSMASH results
                        per sample/genome [Optional].
  -g GENOMES [GENOMES ...], --genomes GENOMES [GENOMES ...]
                        Paths to genomes or directories with genomes in FASTA format.
                        Will run GECCO for BGC-predictions [Optional].
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        Parent output/workspace directory.
  -f, --fungal          Specify if input are from fungal genomes. Only possible
                        if antiSMASH results are provided.
  -k, --keep-locus-tags
                        Keep original locus tags in antiSMASH GenBank files.
  -rg, --run-gecco      If antiSMASH results are provided also run GECCO for
                        annotation of BGCs.
  -up, --use-panaroo    Use Panaroo instead of OrthoFinder for orthology inference.
                        Recommended if investigating a single bacterial species.
  -omc, --run-coarse-orthofinder
                        Use coarse clustering for orthogroups in OrthoFinder instead
                        of the more resolute hierarchical determined homolog groups.
                        There are some advantages to coarse OGs, including their
                        construction being deterministic.
  -ohq, --run-msa-orthofinder
                        Run OrthoFinder using multiple sequence alignments instead of
                        DendroBlast to determine hierarchical ortholog groups.
  -hqp, --high-quality-phylogeny
                        Prioritize quality over speed for phylogeny construction.
  -cs CORE_PROPORTION, --core-proportion CORE_PROPORTION
                        What proportion of genomes single-copy orthogroups need to be
                        found in to be used for species tree construction [Default is 0.9].
  -mcs MAX_CORE_GENES, --max-core-genes MAX_CORE_GENES
                        The maximum number of single copy (near-)core orthogroups to
                        use [Default is 500].
  -pic POPULATION_IDENTITY_CUTOFF, --population-identity-cutoff POPULATION_IDENTITY_CUTOFF
                        The core-genome identity cutoff used to define pairs of genomes as
                        belonging to the same group/population [Default is 99.0].
  -mpf MANUAL_POPULATIONS_FILE, --manual-populations-file MANUAL_POPULATIONS_FILE
                        Tab delimited file for manual mapping of samples to different
                        populations/clades.
  -ci CLUSTER_INFLATION, --cluster-inflation CLUSTER_INFLATION
                        The MCL inflation parameter for clustering BGCs into GCFs [Default
                        is 0.8].
  -cj CLUSTER_JACCARD, --cluster-jaccard CLUSTER_JACCARD
                        Cutoff for Jaccard similarity of homolog groups shared between two
                        BGCs [Default is 50.0].
  -cr CLUSTER_SYNTENIC_CORRELATION, --cluster-syntenic-correlation CLUSTER_SYNTENIC_CORRELATION
                        The minimal correlation coefficient needed between for considering them
                        as a pair prior to MCL [Default is 0.4].
  -cc CLUSTER_CONTAINMENT, --cluster-containment CLUSTER_CONTAINMENT
                        Cutoff for percentage of OGs for a gene cluster near a contig edge
                        to be found within the comparing gene cluster for the pair to be
                        considered in MCL (a minimum of 3 OGs shared are still required) [Default
                        is 70.0]
  -nb, --no-break       No break after step 5 to assess GCF clustering and population
                        stratification and adapt parameters.
  -zp ZOL_PARAMETERS, --zol-parameters ZOL_PARAMETERS
                        The parameters to run zol analyses with - please surround by quotes
                        [Defaut is ""].
  -zhq, --zol-high-quality-preset
                        Use preset of options for performing high-quality and comprehensive zol
                        analyses instead of prioritizing speed.
  -zl, --zol-keep-multi-copy
                        Include all GCF instances in zol analysis, not just the most
                        representative instance from each sample/genome.
  -ed EDGE_DISTANCE, --edge-distance EDGE_DISTANCE
                        Distance in bp to scaffold/contig edge to be considered potentially
                        fragmented. Used in GCF clustering (related to --cluster-containment
                        parameter) and zol conservation computations [Default is 5000].
  -hqr, --high-quality-reconcile
                        Perform high-quality alignment for reconcile analysis.
  -rsh, --report-sociate-hits
                        Report signficant pyseer hits with 'notes' - usually
                        indicate some general issue - should be examined with caution:
                        https://pyseer.readthedocs.io/en/master/usage.html#notes-field.
  -at ALIGNMENT_TIMEOUT, --alignment-timeout ALIGNMENT_TIMEOUT
                        The timeout in seconds for constructing proteins alignments using
                        MUSCLE during lsaBGC-Reconcile and lsaBGC-Sociate - to prevent long
                        runs due to stragglers/abnormally large orthogroups [Default is
                        1800 (30 minutes)].
  -py, --use-prodigal   Use prodigal instead of pyrodigal (only if genomes are provided - not
                        relevant if antiSMASH results provided).
  -c THREADS, --threads THREADS
                        Total number of threads/processes to use. Recommend inreasing as much
                        as possible. [Default is 4].