Skip to content

Commit

Permalink
merge changes
Browse files Browse the repository at this point in the history
Merge branch 'devel' of github.com:bcbio/bcbioR into devel

# Conflicts:
#	inst/templates/rnaseq/README.md
  • Loading branch information
lpantano committed Oct 4, 2024
2 parents 98ac197 + eda970d commit d3a8047
Show file tree
Hide file tree
Showing 7 changed files with 178 additions and 19 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,5 @@ inst/rmarkdown/templates/rnaseq/skeleton/DE/Multiplicative_DGE_Analysis.Rmd
.quarto
inst/templates/chipseq/QC/QC.html
*.html
*.csv
*.qs
10 changes: 10 additions & 0 deletions inst/templates/chipseq/QC/params_qc.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# info params


coldata_fn='/path/to/nf-core/samplesheet.csv'
# This folder is in the nf-core output directory inside multiqc folder
multiqc_data_dir='/path/to/nf-core/output/multiqc/narrowPeak/multiqc_data/'
# This folder is in the nf-core output directory, maybe is broadPeak instead of narrowPeak
peaks_dir = '/path/to/nf-core/output/bowtie2/mergedLibrary/macs2/narrowPeak/'
# This folder is in the nf-core output directory, maybe is broadPeak instead of narrowPeak, also includes antibody name
counts_fn = '/path/to/nf-core/output/bowtie2/mergedLibrary/macs2/narrowPeak/consensus/antibody/deseq2/antibody.consensus_peaks.rds'
126 changes: 115 additions & 11 deletions inst/templates/chipseq/diffbind/diffbind.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -23,12 +23,15 @@ params:
project_file: ../information.R
functions_file: ../libs/load_data.R
diffbind_counts_file: diffbind_counts.qs
factor_of_interest: genotype
condition_of_interest: genotype
numerator: cKO
denominator: WT
# species = mouse or human
species: mouse
counts_fn: diffbind_counts.csv
results_sig_anno_fn: diffbind_results_sig_anno.csv
---
Template developed with materials from https://hbctraining.github.io/main/.

```{r, cache = FALSE, message = FALSE, warning=FALSE}
# This set up the working directory to this file so all files can be found
Expand All @@ -38,7 +41,7 @@ setwd(fs::path_dir(getSourceEditorContext()$path))


```{r source_params, cache = FALSE, message = FALSE, warning=FALSE}
# 1. set up factor_of_interest parameter from parameter above or manually
# 1. set up condition_of_interest parameter from parameter above or manually
# this is used to color plots, it needs to be part of the metadata
# 2. Set input files in this file
source(params$params_file)
Expand All @@ -56,6 +59,22 @@ source(params$functions_file)
- Analyst: `r analyst`
- Experiment: `r experiment`

# Methodology

[DiffBind](https://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf)
is an R Bioconductor package which provides functions for processing
DNA data enriched for genomic loci, including ChIPseq data enriched for sites
where specific protein/DNA binding occurs or histone marks are enriched.

DiffBind is mainly used for identifying sites that are differentially enriched
between two or more sample groups. It works primarily with sets of peak calls
('peaksets'), which are sets of genomic intervals representing candidate protein
binding sites for each sample. It includes functions that support the processing
of peaksets, including overlapping and merging peak sets across an entire dataset,
counting sequencing reads in overlapping intervals in peak sets, and identifying
statistically significantly differentially bound sites based on evidence of
binding affinity (measured by differences in read densities). To this end it uses
statistical routines developed in an RNA-Seq context (primarily the Bioconductor packages [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) and [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)).

```{r load_libraries, cache = FALSE, message = FALSE, warning=FALSE}
library(tidyverse)
Expand Down Expand Up @@ -125,9 +144,9 @@ coldata <- load_coldata(coldata_fn)
# make_diffbind_samplesheet is a function provided by bcbioR to help assemble DiffBind's samplesheet
# using the nf-core samplesheet and output. In the resulting DiffBind counts object, it
# encodes your factor of interest as "Factor" and the antibody as "Condition"
# encodes your condition of interest as "Condition" and the antibody as "Factor"
samplesheet <- make_diffbind_samplesheet(coldata, bam_dir, peaks_dir, params$factor_of_interest)
samplesheet <- make_diffbind_samplesheet(coldata, bam_dir, peaks_dir, params$condition_of_interest)
# if necessary, one additional covariate of interest can be encoded as "Tissue"
Expand All @@ -137,6 +156,27 @@ samplesheet <- make_diffbind_samplesheet(coldata, bam_dir, peaks_dir, params$fac
samplesheet %>% dplyr::select(SampleID, Replicate, Condition, Factor, ControlID) %>% sanitize_datatable()
```

# Calculate counts matrix

The first step is to read in a set of peaksets and associated metadata.
This is done using the DiffBind sample sheet. Once the peaksets are read in,
a merging function finds all overlapping peaks and derives a single set of unique
genomic intervals covering all the supplied peaks (a consensus peakset for
the experiment). A region is considered for the consensus set if it appears in
more than two of the samples. This consensus set represents the overall set of
candidate binding sites to be used in further analysis.

The next step is to take the alignment files and compute count information for
each of the peaks/regions in the consensus set. In this step, for each of the
consensus regions, DiffBind uses the number of aligned reads in the ChIP sample
and the input sample to compute a normalized read count for each sample at every
potential binding site. The peaks in the consensus peakset may be re-centered and
trimmed based on calculating their summits (point of greatest read overlap) in
order to provide more standardized peak intervals.

We then normalize the count matrix to adjust for varying
library size, and we use the normalized counts for further analysis including PCA.

```{r create diffbind counts object, eval = !file.exists(params$diffbind_counts_file)}
diffbind_obj <- dba(sampleSheet = samplesheet, scoreCol = 5)
Expand All @@ -150,6 +190,12 @@ qsave(diffbind_counts, params$diffbind_counts_file)
```

# PCA

Principal Component Analysis (PCA) is a statistical technique used to simplify
high-dimensional data by identifying patterns and reducing the number of variables.
In the context of ChIPseq, PCA helps analyze large datasets containing information
about thousands of binding locations across different samples (e.g., tissues, cells).

```{r PCA}
diffbind_counts <- qread(params$diffbind_counts_file)
Expand All @@ -162,34 +208,70 @@ norm_counts <- dba.peakset(diffbind_norm, bRetrieve=TRUE, DataType=DBA_DATA_FRAM
rownames(norm_counts) <- norm_counts$peak
norm_counts <- norm_counts %>% dplyr::select(-peak) %>% as.matrix()
norm_counts_log <- log2(norm_counts + 1)
norm_counts_log_df <- norm_counts_log %>% as.data.frame() %>%
rownames_to_column('peak')
write_csv(norm_counts_log_df, params$counts_fn)
coldata_for_pca <- coldata[colnames(norm_counts), ]
stopifnot(all(colnames(norm_counts) == rownames(coldata_for_pca)))
degPCA(norm_counts_log, coldata_for_pca, condition = params$factor_of_interest) +
degPCA(norm_counts_log, coldata_for_pca, condition = params$condition_of_interest) +
scale_color_cb_friendly()
```


# Differentially Bound Peaks

## Table
```{r DE analysis}
diffbind_norm <- dba.contrast(diffbind_norm, contrast = c('Factor', params$numerator, params$denominator))
A standardized differential analysis is performed using DiffBind and the DESeq2 package,
including estimation of size factors and dispersions, fitting and testing the
model, evaluating the supplied contrast, and shrinking the LFCs. A p-value and FDR
is assigned to each candidate binding site indicating confidence that they are differentially bound.

```{r DB analysis}
diffbind_norm <- dba.contrast(diffbind_norm, contrast = c('Condition', params$numerator, params$denominator))
results_obj <- dba.analyze(diffbind_norm, bGreylist = F)
results_report <- dba.report(results_obj, th = 1)
results_report_sig <- dba.report(results_obj)
results <- results_report %>% as.data.frame()
results_sig <- results_report_sig %>% as.data.frame()
```

## MA plot

This plot can help to:
- Identify Differential Binding: Sites that show a significant log-fold change (M value away from 0) indicate changes in binding between conditions.
- Assess Data Quality: The plot can help in identifying biases or systematic errors in the data. Ideally, most points should scatter around the M=0 line, indicating that there is no significant systematic difference between the conditions.
- Visualize data dispersion: The distribution of points along the A-axis gives a sense of the spread of binding levels and any patterns or anomalies in the dataset.

```{r MA plot}
results_for_ma <- results%>%
mutate(peak = paste(seqnames, start, end, sep = '_')) %>%
mutate(t = 0) %>%
dplyr::select(peak, AveExpr = Conc, logFC = Fold, P.Value = p.value, adj.P.Val = FDR, t)
rownames(results_for_ma) <- results_for_ma$peak
degMA(as.DEGSet(results_for_ma, contrast = paste(params$numerator, params$denominator, sep = ' vs. ')))
```

## Table of differentially bound peaks

```{r DB table}
results_sig %>% sanitize_datatable()
```


## Volcano plot

This volcano plot shows the binding sites that are significantly up- and down-regulated as a result of the analysis comparison. The points highlighted in purple are sites that have padj < 0.05 and a log2-fold change magnitude > 0.5. Points in blue have a padj > 0.05 and a log2-fold change magnitude > 0.5. Grey points are non-significant. The dashed lines correspond to the cutoff values of log2-fold change and padj that we have chosen.

```{r volcano, fig.height = 8}
results_mod <- results %>%
mutate(Fold = replace(Fold, Fold < -5, -5)) %>%
Expand All @@ -203,14 +285,17 @@ EnhancedVolcano(results_mod,
FCcutoff = 0.5,
x = 'Fold',
y = 'FDR',
title = paste(params$factor_of_interest, ':', params$numerator, 'vs', params$denominator),
title = paste(params$condition_of_interest, ':', params$numerator, 'vs', params$denominator),
col=as.vector(colors[c("dark_grey", "light_blue",
"purple", "purple")]),
subtitle = "", drawConnectors = T, max.overlaps = Inf)
```

## Plot top peaks

We visualize the log2 normalized read counts at a few of the most differentially
bound sites.
```{r plot top peaks, fig.width = 8, fig.height = 6}
norm_counts_log_long <- norm_counts_log %>% as.data.frame() %>%
rownames_to_column('peak') %>%
Expand All @@ -219,12 +304,19 @@ norm_counts_log_long <- norm_counts_log %>% as.data.frame() %>%
norm_counts_log_long_top <- norm_counts_log_long %>% filter(peak %in% show$peak)
ggplot(norm_counts_log_long_top, aes(x = .data[[params$factor_of_interest]], y = norm_counts_log2)) +
ggplot(norm_counts_log_long_top, aes(x = .data[[params$condition_of_interest]], y = norm_counts_log2)) +
facet_wrap(~peak, scale = 'free_y') + geom_boxplot()
```

## Annotate DB peaks

We use the [ChIPseeker](https://www.bioconductor.org/packages/release/bioc/html/ChIPseeker.html)
package to determine the genomic context of the differentially bound peaks and
visualize these annotations. We consider the promoter region to be within 2000 bp in either direction of the TSS.

We also use [ChIPpeakAnno](https://bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html)
to identify any gene features within 1000 bp of a differentially bound site.

```{r annotate, echo = F}
results_sig_anno <- annotatePeak(results_report_sig,
Expand All @@ -245,4 +337,16 @@ results_sig_anno_batch <- annotatePeakInBatch(results_report_sig,
maxgap = 1000)
results_sig_anno_batch_df <- results_sig_anno_batch %>% as.data.frame()
```
write_csv(results_sig_anno_batch_df, params$results_sig_anno_fn)
```

# Functional Enrichment

# R session

List and version of tools used for the report generation.

```{r}
sessionInfo()
```
9 changes: 9 additions & 0 deletions inst/templates/chipseq/diffbind/params_diffbind.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# info params

coldata_fn='/path/to/nf-core/samplesheet.csv'

# This folder is in the nf-core output directory, maybe is broadPeak instead of narrowPeak
peaks_dir = '/path/to/nf-core/output/bowtie2/mergedLibrary/macs2/narrowPeak/'

# This folder is in the nf-core output directory
bam_dir = '/path/to/nf-core/output/bowtie2/mergedLibrary/'
4 changes: 2 additions & 2 deletions inst/templates/chipseq/libs/load_data.R
Original file line number Diff line number Diff line change
Expand Up @@ -103,9 +103,9 @@ make_diffbind_samplesheet <- function(coldata, bam_dir, peaks_dir, column = NULL

coldata_for_diffbind <- coldata %>%
filter(!is.na(control) & control != '') %>%
dplyr::rename(ControlID = control, SampleID = sample, Condition = antibody) %>%
dplyr::rename(ControlID = control, SampleID = sample, Factor = antibody) %>%
separate(SampleID, into = c('sample', 'Replicate'), remove = F, sep = '_REP')
coldata_for_diffbind$Factor <- coldata_for_diffbind[[column]]
coldata_for_diffbind$Condition <- coldata_for_diffbind[[column]]

samplesheet <- coldata_for_diffbind %>%
left_join(bam_files %>% dplyr::select(SampleID = sample, bamReads = bam), by = 'SampleID') %>%
Expand Down
42 changes: 38 additions & 4 deletions inst/templates/chipseq/readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,45 @@

Make sure there is a valid project name, and modify `information.R` with the right information for your project. You can use this file with any other Rmd to include the project/analysis information.

## Run data with nf-core rnaseq

- Make sure you have access to our [Seqera WorkSpace](https://cloud.seqera.io/orgs/HBC/workspaces/core_production/launchpad)
- Transfer data to HCBC S3: Ask Alex/Lorena. Files will be at our S3 bucket `input` folder
- Prepare the CSV file according these [instructions](https://nf-co.re/chipseq/2.0.0/docs/usage/). File should look like this:

```csv
sample,fastq_1,fastq_2,antibody,control
WT_BCATENIN_IP_REP1,BLA203A1_S27_L006_R1_001.fastq.gz,,BCATENIN,WT_INPUT
WT_BCATENIN_IP_REP2,BLA203A25_S16_L001_R1_001.fastq.gz,,BCATENIN,WT_INPUT
WT_BCATENIN_IP_REP2,BLA203A25_S16_L002_R1_001.fastq.gz,,BCATENIN,WT_INPUT
WT_BCATENIN_IP_REP2,BLA203A25_S16_L003_R1_001.fastq.gz,,BCATENIN,WT_INPUT
WT_BCATENIN_IP_REP3,BLA203A49_S40_L001_R1_001.fastq.gz,,BCATENIN,WT_INPUT
WT_INPUT_REP1,BLA203A6_S32_L006_R1_001.fastq.gz,,,
WT_INPUT_REP2,BLA203A30_S21_L001_R1_001.fastq.gz,,,
WT_INPUT_REP2,BLA203A30_S21_L002_R1_001.fastq.gz,,,
WT_INPUT_REP3,BLA203A31_S21_L003_R1_001.fastq.gz,,,
```

You can add more columns to this file with more metadata, and use this file as the `coldata` file in the templates.

- Upload file to our `Datasets` in Seqera using the name of the project but starting with `chipseq-pi_lastname-hbc_code`
- Go to `Launchpad`, select `nf-core_chipseq` pipeline, and select the previous created `Datasets` in the `input` parameter after clicking in `Browser`
- Select an output directory with the same name used for the `Dataset` inside the `results` folder in S3
- When pipeline is done, data will be copied to our on-premise HPC in the scratch system under `scratch/groups/hsph/hbc/bcbio/` folder

## QC

`QC/QC.Rmd` is a template for QC metrics. It includes basic read-level statistics, peak quality information, sample correlation analysis, and PCA.
`QC/QC.Rmd` is a template for QC metrics. It includes basic read-level statistics, peak quality information, sample correlation analysis, and PCA that it produces using the above samplesheet and output from the nf-core pipeline. Use `params_qc.R` to provide the required input files.

## DiffBind

`diffbind/diffbind.Rmd` is a template for comparing peak binding betweeen two groups. Use `params_diffbind.R` to provide the required input files.

## DropBox
On the YAML header file of the Rmd you can specify some parameters including the conditions to be compared, the genome used, and the desired output file names. This template has examples of:
* calculating a peak counts matrix
* PCA
* differential binding analyiss
* peak annotation
* functional analysis (coming soon)

- In `reports/QC`
- [ ] copy QC `Rmd/R/html/figures`
This template writes to CSV a log2 normalized counts matrix of peaks x samples as well as the annotated significant results of the differential binding analysis.
4 changes: 2 additions & 2 deletions inst/templates/rnaseq/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ We recommend to use the samplesheet.csv used with nf-core as metadata file, wher

## Downstream analysis

Please, modify `information.R` with the right information. You can use this file with any other Rmd to include the project/analysis information.
Modify `information.R` with the right information. You can use this file with any other Rmd to include the project/analysis information.

### QC

Expand All @@ -20,7 +20,7 @@ Read instruction in the R and Rmd scripts to render it.

### DE

`DE/DEG.Rmd` is a template for two groups comparison. `params_de.R` has the information of the input files to load. You can point to `bcbio` or `nf-core/rnaseq` output files.
`DE/DEG.Rmd` is a template for comparison between two groups. `params_de.R` has the information for the input files to load. You can point to `bcbio` or `nf-core/rnaseq` output files.

On the `YAML` header file of the `Rmd` you can specify some parameters or just set them up in the first chunk of code of the template. This template has examples of:

Expand Down

0 comments on commit d3a8047

Please sign in to comment.