MRG: add more text (#8)
* add more text

* add extra exercise

* add citation

* add more text
ctb authored Apr 29, 2024
1 parent 2aa846d commit 7160204
Showing 4 changed files with 88 additions and 15 deletions.
25 changes: 23 additions & 2 deletions docs/amr.md
@@ -63,8 +63,9 @@ And, finally, run AMRfinder on the proteins:
```
amrfinder -p CD136.assembly.faa -t 16 -o CD136.amrfinder.tsv --plus
```
(This will take under a minute.)

AMRfinder will produce a spreadsheet named `CD136.amrfinder.tsv` that
contains a number of columns - you can see the list like so, using
`csvtk headers`:

@@ -79,5 +80,25 @@ Run:
csvtk -t cut -f "% Coverage of reference sequence","HMM description" CD136.amrfinder.tsv
```

and you will see:
```
% Coverage of reference sequence HMM description
89.41 CfxA family broad-spectrum class A beta-lactamase
87.59 23S ribosomal RNA methyltransferase Erm
52.84 NA
100.00 macrolide efflux MFS transporter Mef(En2)
100.00 lincosamide nucleotidyltransferase Lnu(AN2)
100.00 CepA family extended-spectrum class A beta-lactamase
```

The first column here is the percentage of the known (reference) sequence
that is present in the metagenome, and the second is the description of
the match.
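
As a rough illustration of how these two columns might be used downstream, here is a minimal Python sketch that keeps only the hits covering most of the reference sequence. The rows are copied from the output above; the 90% cutoff and the `well_covered` helper are assumptions made up for this example, not AMRfinder recommendations.

```python
import csv
import io

COVERAGE = "% Coverage of reference sequence"

# Rows copied from the csvtk output above, re-joined with tabs.
TSV = (
    f"{COVERAGE}\tHMM description\n"
    "89.41\tCfxA family broad-spectrum class A beta-lactamase\n"
    "87.59\t23S ribosomal RNA methyltransferase Erm\n"
    "52.84\tNA\n"
    "100.00\tmacrolide efflux MFS transporter Mef(En2)\n"
    "100.00\tlincosamide nucleotidyltransferase Lnu(AN2)\n"
    "100.00\tCepA family extended-spectrum class A beta-lactamase\n"
)


def well_covered(handle, min_coverage=90.0):
    """Return rows whose reference coverage is at least min_coverage.

    min_coverage is an arbitrary cutoff chosen for illustration only.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[COVERAGE]) >= min_coverage]


hits = well_covered(io.StringIO(TSV))
for row in hits:
    print(row[COVERAGE], row["HMM description"], sep="\t")
```

To run this on the real file, you would pass `open("CD136.amrfinder.tsv", newline="")` instead of the `io.StringIO` object.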

Note: If you wanted to get the abundance of these genes in the metagenome,
you would have to find the DNA contig that carries the relevant gene,
using the column "Protein identifier", and then map the metagenome
reads to that contig. This is because assembly collapses abundance
information: each contig appears once in the output no matter how many
reads support it, so you have to recover the abundance through other means.
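
Once reads have been mapped back to a contig, one common way to summarize abundance is mean coverage depth: total mapped bases divided by contig length. The sketch below only shows that arithmetic; the `mean_depth` helper and all of the numbers are hypothetical, and the mapping step itself (done with whatever read mapper you prefer) is not shown.

```python
def mean_depth(mapped_reads, read_length, contig_length):
    """Approximate mean coverage depth of a contig.

    Assumes every mapped read contributes its full length to the contig,
    which is a simplification (reads hanging off contig ends are ignored).
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return mapped_reads * read_length / contig_length


# Hypothetical numbers, for illustration only: 400 reads of 150 bp
# mapping to a 12,000 bp contig.
depth = mean_depth(mapped_reads=400, read_length=150, contig_length=12_000)
print(f"mean depth: {depth:.1f}x")
```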

54 changes: 44 additions & 10 deletions docs/comparing-metagenomes.md
@@ -1,8 +1,17 @@
# Comparing metagenomes

This tutorial uses [sourmash](https://sourmash.readthedocs.io/) to compare
multiple metagenomes based on their weighted (with abundance) and
unweighted (without abundance) k-mer content.

In this tutorial, you will learn how to create distance matrices and
ordination plots from metagenome content. Importantly, this tutorial
is *reference* and *annotation* free - it will work equally well on
any metagenome.
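
To make the weighted/unweighted distinction concrete, here is a toy Python sketch. It is not how sourmash works internally (sourmash compares subsampled sketches of canonical k-mers, typically with k=31); the two "genomes", the `community` helper, and k=4 are assumptions chosen to keep the example small. The unweighted comparison uses Jaccard similarity on k-mer presence; the weighted one uses a cosine-based angular similarity on k-mer counts.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer in seq (no reverse complements, for simplicity)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def community(genome_copies, k=4):
    """Build k-mer counts for a toy metagenome: {genome sequence: copies}."""
    total = Counter()
    for genome, copies in genome_copies.items():
        for kmer, n in kmer_counts(genome, k).items():
            total[kmer] += n * copies
    return total


def jaccard(a, b):
    """Unweighted similarity: shared k-mers / all k-mers, ignoring counts."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def angular_similarity(a, b):
    """Weighted similarity: 1 - 2*angle/pi between the two count vectors."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    if norm == 0:
        return 0.0
    cosine = min(1.0, dot / norm)  # guard against floating-point overshoot
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


# Two made-up genomes present in both samples, at very different abundances.
genome_x, genome_y = "ACGTTGCA", "GGATCCTA"
sample_1 = community({genome_x: 9, genome_y: 1})
sample_2 = community({genome_x: 1, genome_y: 9})

unweighted = jaccard(sample_1, sample_2)
weighted = angular_similarity(sample_1, sample_2)
print(f"unweighted (Jaccard): {unweighted:.3f}")
print(f"weighted (angular):   {weighted:.3f}")
```

Both toy samples contain exactly the same k-mers, so the unweighted comparison sees them as identical, while the weighted comparison picks up the shift in abundance.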

## First, create a conda software environment and a working directory.

To install the necessary software, run:
```
mamba create -n smash -y sourmash scikit-learn
conda activate smash
@@ -14,14 +23,12 @@ mkdir ~/compare-metag
cd ~/compare-metag
```


## Comparing based on content

Here we are going to use the
[`sourmash compare`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-compare-compare-many-signatures) and
[`sourmash plot`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-plot-cluster-and-visualize-comparisons-of-many-signatures)
commands to compare and cluster many metagenomes based on their content.

As with the [single metagenome analysis](single-metagenomes-taxonomy.md), we have two options here: with, or without abundance information.

@@ -114,19 +121,46 @@ If you plot this via MDS, you'll see a clear separation:
Points to discuss:

* what does this all mean, in ~microbial terms? Hint: ask Mani to
  revisit how the test data sets were generated! Alternatively,
  go on to the next section!

## Extra: examining taxonomy

If we quickly run our [taxonomy analysis](single-metagenomes-taxonomy.md) on
one of the other samples, we can maybe start to see some of the reasons for
the differences in diversity but not richness:

```
mamba activate tax
sourmash scripts fastgather ../data/tutorial_other/CD240.sig.zip \
../databases/gtdb-rs214-k31.zip -o CD240.x.gtdb-rs214.fastgather.csv -c 16
sourmash gather ../data/tutorial_other/CD240.sig.zip \
../databases/gtdb-rs214-k31.zip -o CD240.x.gtdb-rs214.gather.csv \
--picklist CD240.x.gtdb-rs214.fastgather.csv:match_name:ident
sourmash tax metagenome -g CD240.x.gtdb-rs214.gather.csv \
-t ../single-metag/gtdb-rs214.lineages.sqldb -F human
```

You should see:
```
sample name proportion cANI lineage
----------- ---------- ---- -------
CD240 42.2% 94.0% d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis
CD240 19.5% 94.5% d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides fragilis
CD240 12.6% 94.1% d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Tannerellaceae;g__Parabacteroides;s__Parabacteroides distasonis
CD240 11.7% 91.2% d__Bacteria;p__Bacillota_A;c__Clostridia;o__Oscillospirales;f__Acutalibacteraceae;g__Ruminococcus_E;s__Ruminococcus_E bromii_B
CD240 11.4% - unclassified
CD240 2.6% 91.4% d__Bacteria;p__Bacillota_A;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Faecalibacterium;s__Faecalibacterium prausnitzii_D
```

That's right - both samples have similar species, but the abundances of those
species are quite different.

Note that in this case that's not an accident: the dataset was created
specifically to contain only five species ;).
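
The "diversity but not richness" distinction can be sketched numerically. Below, richness is simply the number of species detected, and diversity is measured with the Shannon index (one common choice; an assumption here, since the tutorial does not name a specific index). The CD240 proportions are taken from the table above, leaving out the unclassified fraction; the perfectly even community is hypothetical and only there for contrast.

```python
import math

# Classified species proportions (%) for CD240, from the table above.
cd240 = {
    "Bacteroides uniformis": 42.2,
    "Bacteroides fragilis": 19.5,
    "Parabacteroides distasonis": 12.6,
    "Ruminococcus_E bromii_B": 11.7,
    "Faecalibacterium prausnitzii_D": 2.6,
}

# Hypothetical comparison: the same five species at equal abundance.
even = {name: 1.0 for name in cd240}


def richness(abundances):
    """Number of species with non-zero abundance."""
    return sum(1 for value in abundances.values() if value > 0)


def shannon(abundances):
    """Shannon index H = -sum(p * ln p), after normalizing to proportions."""
    total = sum(abundances.values())
    return -sum(
        (value / total) * math.log(value / total)
        for value in abundances.values()
        if value > 0
    )


for label, sample in (("CD240", cd240), ("even (hypothetical)", even)):
    print(f"{label}: richness={richness(sample)}, shannon={shannon(sample):.2f}")
```

Both communities have the same richness, but the skewed abundances in CD240 give it a lower Shannon index than the even community.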

---

9 changes: 8 additions & 1 deletion docs/index.md
@@ -1,6 +1,7 @@
# Introduction

These are tutorials for the PIG-PARADIGM workshop on metagenomics,
Apr 29th, 2024, given at Wageningen.

Tutorials:

@@ -12,3 +13,9 @@ Tutorials:

Data originally from
[the MIntO tutorial data](https://zenodo.org/records/6369313).

## More information

Authors: Anneliek ter Horst and C. Titus Brown

See the GitHub repo at [ngs-docs/2024-pig-paradigm-workshop](https://github.com/ngs-docs/2024-pig-paradigm-workshop).
15 changes: 13 additions & 2 deletions docs/single-metagenomes-taxonomy.md
@@ -1,5 +1,18 @@
# Analyzing a single metagenome for taxonomy

This tutorial uses [sourmash](https://sourmash.readthedocs.io/) to run
various k-mer based analyses of Illumina shotgun metagenome content.

In this tutorial, you will learn:

* how to look at what genomes share content with a metagenome;
* how to examine the abundance of metagenome content without a reference;
* how to summarize the taxonomic content of a metagenome.

We will be using the taxonomic classification system as benchmarked in
[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0),
which is both very *sensitive* and quite *specific*.

## Creating a working directory

Run:
@@ -90,8 +103,6 @@ Points to discuss:
content is present in the reference database. Some of this is
probably erroneous data or host contamination.

### K-mer abundance histogram

Let's examine this data set further. First, let's take a look at the