8. explanation of final spreadsheet and visual reports

Rauf Salamzade edited this page Aug 29, 2024 · 20 revisions

We will use results from the tutorial on lsaBGC-Pan analysis of the pan-BGC-ome of Streptomyces olivaceus. The Final_Results/ subdirectory for the analysis can be found in this Google Drive folder.

Explanation of Consolidated Spreadsheet Tables

Spreadsheet location: Pan_Results/Final_Results/Consolidated_Spreadsheet.xlsx

Example spreadsheet

Tab 1: "Explanation of Results"

A link to this wiki page to explain the contents of the other sheets.

Tab 2: "BGC Overview"

This tab gives an overview of BGCs across all genomes, including the population each sample/genome is assigned to and the GCF each BGC is grouped into.

Column Descriptions:

| Column | Description |
| --- | --- |
| sample | Sample/genome identifier. |
| population | The population/clade that sample was assigned to. |
| method | The BGC prediction method (either antiSMASH or GECCO). |
| genome_path | The path to the full genome in GenBank format. |
| bgc_id | The BGC identifier. |
| bgc_path | The path to the BGC in GenBank format. |
| gcf_id | The GCF identifier the BGC belongs to. |
| scaffold | The scaffold identifier the BGC is found on. |
| start | The start coordinate for the BGC. |
| end | The end coordinate for the BGC. |
| bgc_length | The length of the BGC in bp. |
| dist_to_edge | The minimal distance of the BGC to the start/end of the scaffold/contig it is on. |
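
To make the dist_to_edge definition concrete, here is a minimal sketch of how such a value could be derived and used to flag BGCs close to a contig edge. It assumes 1-based inclusive coordinates, and both the `scaffold_length` field and the `EDGE_CUTOFF` value are hypothetical: neither is a column of this tab nor an lsaBGC-Pan default.

```python
def dist_to_edge(start: int, end: int, scaffold_length: int) -> int:
    """Smallest gap between a BGC and either end of its scaffold.

    Assumes 1-based inclusive coordinates with start <= end <= scaffold_length;
    the convention lsaBGC-Pan itself uses is not stated on this page.
    """
    return min(start - 1, scaffold_length - end)


# Toy rows reusing "BGC Overview" column names; all values are made up,
# and scaffold_length is a hypothetical extra field.
rows = [
    {"bgc_id": "bgc_1", "scaffold": "s1", "start": 1, "end": 20000, "scaffold_length": 500000},
    {"bgc_id": "bgc_2", "scaffold": "s2", "start": 150000, "end": 190000, "scaffold_length": 400000},
]

EDGE_CUTOFF = 5000  # hypothetical threshold in bp, not an lsaBGC-Pan default

for row in rows:
    row["dist_to_edge"] = dist_to_edge(row["start"], row["end"], row["scaffold_length"])
    row["near_edge"] = row["dist_to_edge"] < EDGE_CUTOFF
```

In this toy data bgc_1 starts at the first base of its scaffold and is flagged, while bgc_2 sits well inside its scaffold and is not.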

Tab 3: "zol Results"

This tab provides gene-resolution information on the conservation, evolutionary trends, and functional annotation of orthogroups across GCFs. It uses zol to compute the information and replaces lsaBGC-PopGene.

Column Descriptions:

| Column | Description |
| --- | --- |
| GCF ID | The GCF identifier. |
| Ortholog Group (OG) ID | The orthogroup identifier. |
| OG is Single Copy? | Whether the orthogroup is single copy. |
| Proportion of Total Gene Cluster Instances with OG | The proportion of total GCF instances that feature the orthogroup. Note that, by default, non-representative paralogous BGC instances are still filtered out (when two or more BGC instances are found in the same sample). See option --zol-keep-multi-copy in lsaBGC-Pan. |
| Proportion of Complete Gene Cluster Instances with OG | The proportion of complete GCF instances (not near contig edges) that feature the orthogroup. Again, paralogous instances are filtered out by default. |
| Columns F onwards | Descriptions of columns F onwards can be found on this zol wiki page. Note that these data reflect comprehensive analysis, not just complete instances. |
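
As an illustration of how these columns might be combined, the following sketch picks out orthogroups that are single copy and present in most complete instances of a GCF. The column names come from this tab; the row values, the assumption that cells arrive as booleans and floats, and the `min_proportion` cut-off are all made up for the example.

```python
# Column names copied from the "zol Results" tab; the row values are invented.
GCF_ID = "GCF ID"
OG_ID = "Ortholog Group (OG) ID"
SINGLE_COPY = "OG is Single Copy?"
PROP_COMPLETE = "Proportion of Complete Gene Cluster Instances with OG"

rows = [
    {GCF_ID: "GCF_1", OG_ID: "OG_10", SINGLE_COPY: True, PROP_COMPLETE: 1.0},
    {GCF_ID: "GCF_1", OG_ID: "OG_11", SINGLE_COPY: False, PROP_COMPLETE: 1.0},
    {GCF_ID: "GCF_1", OG_ID: "OG_12", SINGLE_COPY: True, PROP_COMPLETE: 0.4},
]


def conserved_single_copy(rows, gcf_id, min_proportion=0.9):
    """Return OG IDs of one GCF that are single copy and widely conserved.

    min_proportion is a hypothetical cut-off chosen for illustration.
    """
    return [
        row[OG_ID]
        for row in rows
        if row[GCF_ID] == gcf_id
        and row[SINGLE_COPY]
        and row[PROP_COMPLETE] >= min_proportion
    ]


core_ogs = conserved_single_copy(rows, "GCF_1")
```

When reading the real spreadsheet, the cell types may differ (for example, text rather than booleans), so the comparisons would need adjusting accordingly.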

Tab 4: "lsaBGC-MIBiGMapper Results"

This tab shows mapping information of GCFs to reference/characterized BGCs in the ever-so-useful MIBiG database. By default, lsaBGC-MIBiGMapper requires 5 proteins from the focal GCF to map to proteins from a single reference MIBiG BGC at >=80% identity and >=70% coverage of the reference BGC.

If you would like to have these options accessible in lsaBGC-Pan - open up a GitHub ticket and just give us a nudge to do it!
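
The default matching rule described above can be sketched roughly as follows. This is not the lsaBGC-MIBiGMapper implementation: the hit tuple layout is hypothetical, and the sketch assumes the identity and coverage thresholds are applied per protein hit.

```python
from collections import defaultdict

MIN_PROTEINS = 5      # default number of supporting proteins stated above
MIN_IDENTITY = 80.0   # percent identity threshold stated above
MIN_COVERAGE = 70.0   # percent coverage threshold stated above


def mibig_matches(hits):
    """Return MIBiG BGC IDs supported by enough distinct GCF proteins.

    hits: iterable of (gcf_protein, mibig_bgc, identity_pct, coverage_pct);
    this layout is a hypothetical stand-in for real alignment output.
    """
    supporting = defaultdict(set)
    for gcf_protein, mibig_bgc, identity, coverage in hits:
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            supporting[mibig_bgc].add(gcf_protein)
    return sorted(bgc for bgc, proteins in supporting.items() if len(proteins) >= MIN_PROTEINS)


# Toy hits: BGC0000001 has five passing proteins; BGC0000002 has only four,
# because its fifth hit falls below the identity threshold.
hits = (
    [(f"prot_{i}", "BGC0000001", 90.0, 85.0) for i in range(5)]
    + [(f"prot_{i}", "BGC0000002", 90.0, 85.0) for i in range(4)]
    + [("prot_4", "BGC0000002", 75.0, 85.0)]
)

matched = mibig_matches(hits)
```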

Column Descriptions:

| Column | Description |
| --- | --- |
| GCF ID | GCF identifier. |
| MIBiG BGC ID | The matching MIBiG reference BGC identifier. |
| GCF OG ID | The GCF orthogroup ID. |
| MIBiG Protein Matching | The matching protein in the MIBiG reference BGC. |
| MIBiG Compound(s) | The compounds associated with the MIBiG reference BGC. |

Tab 5: "lsaBGC-Reconcile Results"

This tab depicts an overview of BGC associated orthogroups and metrics to help identify those that might have been horizontally transferred.

Column Descriptions:

| Column | Description |
| --- | --- |
| orthogroup | Orthogroup identifier. |
| GCF count | The number of distinct GCFs the orthogroup is found within. |
| found in non-BGC context | Whether the orthogroup is found in a non-BGC context. |
| population count total | The number of distinct populations the orthogroup is found within. |
| population count in BGC context | The number of distinct populations in which the orthogroup is found specifically within a BGC context. |
| GCFs | The list of GCFs the orthogroup is found in. |
| conservation total | The proportion of genomes the orthogroup is found within. |
| conservation in BGC context | The proportion of genomes in which the orthogroup is found within a BGC context. |
| norm_max_bd | max_bd / mean_bd |
| mean_bd | The average phylogenetic branch distance ratio between leaves in the gene tree and species tree. |
| max_bd | The maximum phylogenetic branch distance ratio between leaves in the gene tree and species tree. |
| population/clade specific conservation metrics... | The proportion of a single population/clade's genomes the orthogroup has been found in. |
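
The relationship between the three branch-distance columns can be shown with a small worked sketch. The values below are invented, and sorting by norm_max_bd is only one possible way to prioritize orthogroups for inspection; this page does not prescribe how the metric should be interpreted.

```python
def norm_max_bd(mean_bd, max_bd):
    """Compute norm_max_bd = max_bd / mean_bd, as defined in the table above.

    Returns None when mean_bd is zero to avoid a division error; how the
    real pipeline treats that case is not stated on this page.
    """
    if mean_bd == 0:
        return None
    return max_bd / mean_bd


# Toy rows reusing the tab's column names; the numbers are made up.
rows = [
    {"orthogroup": "OG_1", "mean_bd": 1.0, "max_bd": 1.2},
    {"orthogroup": "OG_2", "mean_bd": 1.5, "max_bd": 3.0},
]

for row in rows:
    row["norm_max_bd"] = norm_max_bd(row["mean_bd"], row["max_bd"])

# Put the most extreme ratios first so they can be inspected alongside the plots.
ranked = sorted(rows, key=lambda row: row["norm_max_bd"], reverse=True)
```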

Tab 6: "BGC OG by Sample Matrix"

A two-header matrix in which the first row gives the genome/sample identifiers and the second row gives the populations they belong to. Columns are sorted primarily by population identifier. The rows after these two header rows give the copy count of BGC-associated orthogroups across the different samples/genomes.
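
A minimal sketch of working with this layout is shown below. It assumes the first cell of each data row holds the orthogroup identifier, which is not stated explicitly here, and it uses an invented in-memory matrix rather than the real sheet.

```python
from collections import defaultdict

# Toy matrix: row 0 = sample IDs, row 1 = populations, remaining rows = copy counts.
# The leading empty cells and the orthogroup ID in column 0 are assumptions.
matrix = [
    ["", "sample_A", "sample_B", "sample_C"],
    ["", "pop_1", "pop_1", "pop_2"],
    ["OG_1", 1, 0, 2],
    ["OG_2", 0, 0, 1],
]


def presence_by_population(matrix):
    """Count, per orthogroup, how many samples in each population carry it."""
    populations = matrix[1][1:]
    result = {}
    for row in matrix[2:]:
        orthogroup, counts = row[0], row[1:]
        per_population = defaultdict(int)
        for population, copy_count in zip(populations, counts):
            if copy_count > 0:
                per_population[population] += 1
        result[orthogroup] = dict(per_population)
    return result


summary = presence_by_population(matrix)
```

Here OG_1 is present in one sample from each population, whereas OG_2 is found only in pop_2.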

Tab 7: "lsaBGC-Sociate Results"

This tab shows results from performing genome-wide association testing (GWAS) to identify orthogroups and alternate GCFs associated or de-associated with focal GCFs.

⚠️ GWAS generally benefits heavily from including more samples and from traits being interspersed phylogenetically. Although we use the "lmm" model in pyseer to adjust p-values for the phylogenetic dispersion of associated orthogroups/GCFs with focal GCFs, and apply Bonferroni multiple testing correction, false positives are still possible when working with a small number of samples. Do not assess these results if you have fewer than 20 samples, and ideally include at least 100 samples if this module is your primary interest.
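
For readers unfamiliar with Bonferroni correction, the idea can be sketched generically: each p-value is multiplied by the number of tests and capped at 1. This is a textbook illustration with invented p-values, not the code lsaBGC-Pan or pyseer runs.

```python
def bonferroni(pvalues):
    """Return Bonferroni-adjusted p-values: p * number_of_tests, capped at 1.0."""
    n_tests = len(pvalues)
    return [min(1.0, p * n_tests) for p in pvalues]


# Invented p-values for three tested features.
raw = [0.001, 0.02, 0.5]
adjusted = bonferroni(raw)

# Features that remain below a conventional 0.05 level after adjustment.
significant = [i for i, p in enumerate(adjusted) if p < 0.05]
```

With only three tests the penalty is mild; with thousands of orthogroups tested, only very small raw p-values survive, which is part of why larger sample sizes help.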

Annotations only require an E-value < 1e-5; when several pass, the best annotation for the consensus sequence of an orthogroup is selected by score (HMMER3) or bitscore (DIAMOND).
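
The selection rule above can be sketched as a two-step filter: discard hits that fail the E-value cut-off, then keep the hit with the highest score. The tuple layout and the example hits are hypothetical and do not reflect the actual HMMER3 or DIAMOND output parsing.

```python
E_VALUE_CUTOFF = 1e-5  # cut-off stated above


def best_annotation(hits):
    """Pick the highest-scoring hit among those with E-value < 1e-5.

    hits: list of (annotation, evalue, score) tuples; this layout is a
    hypothetical simplification of real search output.
    Returns None if no hit passes the E-value cut-off.
    """
    passing = [hit for hit in hits if hit[1] < E_VALUE_CUTOFF]
    if not passing:
        return None
    return max(passing, key=lambda hit: hit[2])


# Invented hits for one orthogroup consensus sequence.
hits = [
    ("K00001", 1e-30, 250.0),
    ("K00002", 1e-20, 180.0),
    ("K00003", 1e-3, 15.0),  # fails the E-value cut-off
]

best = best_annotation(hits)
```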

Column Descriptions:

| Column | Description |
| --- | --- |
| focal GCF | The focal GCF for which we are looking for co-occurrence (de-)associations. |
| associated GCF/OG | The identifier of the GCF or orthogroup associated with the focal GCF. |
| allele frequency | The allele frequency of the associated GCF or orthogroup. |
| pvalue | The unadjusted p-value. |
| phylogenetically corrected pvalue | The phylogenetically corrected p-value based on the lmm model. |
| beta | The effect size/slope of the associated feature. |
| beta-std-err | "the standard error of the fit on beta" - pyseer documentation. |
| variant_h2 | "the variance in phenotype [focal GCF presence] explained by the variant" - pyseer documentation. |
| notes | "Notes about the fit" from the pyseer run. |
| KO Annotation (E-value) | Best KEGG ortholog annotation(s) (the HMMER3 E-value associated with the best score). |
| PGAP Annotation (E-value) | Best PGAP annotation(s) (the HMMER3 E-value associated with the best score). |
| PaperBLAST Annotation (E-value) | Best PaperBLAST annotation(s) (the DIAMOND E-value associated with the best bitscore). To find associated papers, search the consensus sequence or the ID on the PaperBLAST webpage. |
| CARD Annotation (E-value) | Best CARD annotation(s) of antimicrobial resistance genes (the DIAMOND E-value associated with the best bitscore). |
| IS Finder (E-value) | Best ISFinder annotation(s) of IS elements / transposons (the DIAMOND E-value associated with the best bitscore). |
| MIBiG Annotation (E-value) | Best MIBiG annotation(s) for genes in characterized BGCs (the DIAMOND E-value associated with the best bitscore). |
| VOG Annotation (E-value) | Best VOG annotation(s) for viral/phage ortholog groups (the HMMER3 E-value associated with the best score). |
| Pfam Domains | Pfam domains with E-value < 1e-5 and meeting the "trusted" score thresholds. |

Visual Results

Every visual produced by lsaBGC-Pan is accompanied by the Rscript that created it, stored near the plot. Users familiar with R can adjust these scripts, for example to change scaling (such as the size of PDFs) so that figures are better suited for publication, and then re-run them (e.g. Rscript some_rscript.R) to recreate the figures. For more details, see the tutorial for analysis of the pan-BGC-ome of two Cutibacterium species.
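
If many scripts were edited, re-running them one by one can be tedious. The sketch below is one hypothetical way to batch this from Python: it assumes every plotting script ends in `.R` and that `Rscript` is on the PATH, and it defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path


def rscript_commands(visualizations_dir):
    """Build one `Rscript <script>` command per .R file under the directory.

    Assumes all plotting scripts use the .R extension (an assumption based
    on the script names listed on this page).
    """
    scripts = sorted(path.resolve() for path in Path(visualizations_dir).rglob("*.R"))
    return [["Rscript", str(script)] for script in scripts]


def rerun(visualizations_dir, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    for command in rscript_commands(visualizations_dir):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first script that fails.
            subprocess.run(command, check=True)
```

For example, `rerun("Pan_Results/Final_Results/Visualizations")` would list the commands without running anything, and passing `dry_run=False` would execute them.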

GSeeF - Comprehensive Visual of the Presence of BGCs across a Species Tree

GSeeF produces a plot showing the presence of GCFs across the species phylogeny.

  • Script location: Pan_Results/Final_Results/Visualizations/GseeF_Results/gseef_rscript.R
  • Plot location: Pan_Results/Final_Results/Visualizations/GSeeF_Results/Final_Results/Phylogenetic_Heatmap.png
  • Legend location: Pan_Results/Final_Results/Visualizations/GSeeF_Results/Final_Results/Annotation_Legend.png

Example from Streptomyces olivaceus tutorial:

lsaBGC-See Plots - BGC schematics across a species phylogeny

These plots are made for each individual GCF. Each shows a schematic of the BGCs belonging to a GCF across a species tree. Additional information on how lsaBGC-See works can be found on its original lsaBGC wiki page.

  • Script location: Pan_Results/Final_Results/Visualizations/lsaBGC_See_Results/GCF_X/plot_with_species_phylo.R
  • Plot location: Pan_Results/Final_Results/Visualizations/lsaBGC_See_Results/GCF_X/BGC_Visualization.species_phylogeny.pdf

Example from Streptomyces olivaceus tutorial:

lsaBGC-ComprehenSeeIve Plots - genome-wide orthogroup presence/absence for BGCs across a species phylogeny

These plots are made for each individual GCF. Each shows a heatmap of the presence of orthogroups associated with the GCF across the entire genome of each sample. Additional information on how lsaBGC-ComprehenSeeIve works can be found on its original lsaBGC wiki page.

  • Script location: Pan_Results/Final_Results/Visualizations/lsaBGC_ComprehenSeeIve_Results/GCF_X/plot_with_species_phylo.R
  • Plot location: Pan_Results/Final_Results/Visualizations/lsaBGC_ComprehenSeeIve_Results/GCF_X/BGC_Visualization.species_phylogeny.pdf

Example from Streptomyces olivaceus tutorial:

lsaBGC-Reconcile Plots

These plots are made for each individual orthogroup that is found within a BGC context. They show a gene phylogeny of the orthogroup, constructed using MUSCLE super5 alignment and FastTree2, alongside tracks indicating the context each gene instance is found in (which GCF, or whether it is outside a GCF context) and the population that the genome carrying the gene belongs to. The name "reconcile" reflects that overlaying population information on the gene tree lets users see indications of horizontal transfer.

  • Script locations: Pan_Results/Final_Results/Visualizations/lsaBGC_Reconcile_Results/BGC_OG_PhyloViz_Scripts/
  • Plot locations: Pan_Results/Final_Results/Visualizations/lsaBGC_Reconcile_Results/BGC_OG_Phylogenetic_Visualizations/
  • Population legend location: Pan_Results/Final_Results/Visualizations/population_coloring.pdf
  • Context legend location: Pan_Results/Final_Results/Visualizations/gcf_coloring.pdf

Example from Streptomyces olivaceus tutorial:

Note: in the GCF track (the first column), gray specifically means non-BGC context, whereas in the population track gray is just another clade color.

cgc Plots

These plots are made for each GCF. cgc is a program in the zol suite (a dependency of lsaBGC-Pan) which visualizes zol results. More information on cgc can be found on this zol wiki page. Note, it is probably easier to rerun cgc rather than update the Rscript associated with a GCF.

  • Script locations: Pan_Results/Final_Results/Visualizations/cgc_Results/GCF_X/cgc_script.R
  • Plot locations: Pan_Results/Final_Results/Visualizations/cgc_Results/GCF_X/cgc_plot.png

Example from Streptomyces olivaceus tutorial:

lsaBGC-Sociate Plots

These plots are made for each GCF that has associated/de-associated features (orthogroups or other GCFs) across the pangenome. The figure is a phylogenetic heatmap in which the first track (in black) shows the presence of the focal GCF. The following tracks show the associated features (orthogroups or other GCFs), ordered from lowest phylogenetically corrected p-value (left) to highest (right); red indicates a negative effect size and blue a positive effect size.

  • Script locations: Pan_Results/Final_Results/Visualizations/lsaBGC_Sociate_Visual_Results/Rscripts/
  • Plot locations: Pan_Results/Final_Results/Visualizations/lsaBGC_Sociate_Visual_Results/Plots/
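
The track ordering and coloring described for these plots can be mimicked from the "lsaBGC-Sociate Results" tab with a short sketch. The rows below are invented, and this is only an illustration of the stated convention (sorted by corrected p-value, red for negative and blue for positive effect size), not the plotting code itself.

```python
PVALUE = "phylogenetically corrected pvalue"
FEATURE = "associated GCF/OG"


def order_tracks(associations):
    """Return (feature, color) pairs, left to right, for one focal GCF."""
    ordered = sorted(associations, key=lambda row: row[PVALUE])
    tracks = []
    for row in ordered:
        if row["beta"] > 0:
            color = "blue"   # positive effect size
        elif row["beta"] < 0:
            color = "red"    # negative effect size
        else:
            color = None     # zero effect size is not covered by the description
        tracks.append((row[FEATURE], color))
    return tracks


# Invented associations for a single focal GCF.
associations = [
    {FEATURE: "OG_7", PVALUE: 1e-4, "beta": -0.8},
    {FEATURE: "GCF_3", PVALUE: 1e-6, "beta": 1.2},
]

tracks = order_tracks(associations)
```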

Example from Streptomyces olivaceus tutorial: