tags |
---|
ggg, ggg2023, ggg201b |
[toc]
Today we're going to further generalize our Snakefile so that we can re-use the same rules for different files. And we're going to work with larger (higher coverage) data sets!
We suggest using RStudio to run the below commands as it will give you an editor and a terminal. Follow these instructions or see the spoiler below for a quickstart.
Note - I've changed the srun
instructions to use more resources:
srun -p high2 --time=3:00:00 --nodes=1 \
--cpus-per-task 4 --mem 10GB --pty /bin/bash
:::spoiler Logging into farm and running RStudio Server
Log into farm with ssh/MobaXterm - link.
Then run:
srun -p high2 --time=3:00:00 --nodes=1 \
--cpus-per-task 4 --mem 10GB --pty /bin/bash
and wait for a new prompt. Then run:
module load spack/R/4.1.1
module load rstudio-server/2022.07.1
rserver-farm
Then,
- in a new shell, run the ssh tunneling command output by
rserver-farm
above; - connect to the
localhost
URL output byrserver-farm
; - Login with your username and the one-time password output by
rserver-farm
; - start a Terminal. :::
When you are all set, your prompt should look like this:
(base) datalab-01@farm:~$
If it doesn’t, please alert me!
Next, run:
mamba create -n vc -y snakemake-minimal \
bcftools=1.8 samtools=1.16.1 minimap2=2.24
You may get an error if it already exists - that's ok!
Now activate this mamba environment to use the installed collection of software:
mamba activate vc
and you're now ready to go!
Addendum for lab 5: let's also update samtools so we can use the coverage
command:
mamba install -n vc -y samtools=1.16.1
Run:
cd
git clone https://github.com/ngs-docs/2023-ggg-201b-variant-calling \
lab5 -b lab5
cd ~/lab5/
:::danger
You'll need to run two commands if you log out and log back in.
First, activate the vc
conda environment:
mamba activate vc
Second, change to the ~/lab5/
directory:
cd ~/lab5/
:::
Reminder: now that we've connected all the rules with input/output, we can run everything up to the rule call_variants
by just specifying call_variants
:
snakemake -p -j 1 call_variants
Try the following commands:
samtools flagstat SRR2584857_1.x.ecoli-rel606.bam
and
samtools coverage -m SRR2584857_1.x.ecoli-rel606.bam.sorted
You don't actually need to use the rule names (this will be important later on!). You can just tell snakemake what file you want produced, and run that.
So:
snakemake -p -j 1 SRR2584857_1.x.ecoli-rel606.vcf
will also work to run the rule call_variants
, but you don't have to remember the rule name.
(Later on we'll see why this is important for other things.)
By default, if you don't specify a rule snakemake runs the first rule in the file. (This is actually the only case where the order of rules in the Snakefile matters!)
So it's conventional to define a rule named all
at the top; try it!
rule all:
input:
"SRR2584857_1.x.ecoli-rel606.vcf"
Question: why specify the input for this rule, and no output or shell command!?
(There's no good reason. it's just a cute trick that matches the way snakemake thinks. 🤷)
Once you're done creating that sorted bam file, you can also run
samtools index SRR2584857_1.x.ecoli-rel606.bam.sorted
samtools tview -p ecoli:4314717 --reference ecoli-rel606.fa SRR2584857_1.x.ecoli-rel606.bam.sorted
to actually look at the aligned reads.
Let's split the last rule up into two different rules:
The first rule will create ecoli-ref.fa
by gunzipping ecoli-ref.fa.gz
.
The second rule will call variants and create the vcf file.
FIRST, create a new rule uncompress_ref
.
THEN, move the relevant input, output, and shell command stuff out of call_variants
into the new rule.
Here's the result:
rule uncompress_ref:
input:
ref="ecoli-rel606.fa.gz",
output:
ref="ecoli-rel606.fa",
shell: "gunzip -k {input.ref}"
rule call_variants:
input:
bam="SRR2584857_1.x.ecoli-rel606.bam.sorted",
ref="ecoli-rel606.fa",
output:
pileup="SRR2584857_1.x.ecoli-rel606.pileup",
bcf="SRR2584857_1.x.ecoli-rel606.bcf",
vcf="SRR2584857_1.x.ecoli-rel606.vcf",
shell: """
bcftools mpileup -Ou -f {input.ref} {input.bam} > {output.pileup}
bcftools call -mv -Ob {output.pileup} -o {output.bcf}
bcftools view {output.bcf} > {output.vcf}
"""
search/replace in RStudio:
SRR2584857_1
=>{sample}
ecoli-rel606
=>{genome}
Does the resulting Snakefile work?
What's going on here?
:::info CTB note to self: put full Snakefile here! :::
So far we've been working with a cut-down file; let's work with more data, and more files!
First, copy four files over from my account on farm:
cp ~ctbrown/data/ggg201b/SRR258*_1.fastq.gz ./
Next, remove the download_data
rule, and rerun the pipeline:
snakemake -j 1 -p
rule all:
input:
"SRR2584857_1.x.ecoli-rel606.vcf"
rule download_genome:
output: "{genome}.fa.gz"
shell:
"curl -JLO https://osf.io/8sm92/download -o {output}"
rule map_reads:
input:
reads="{sample}.fastq.gz",
ref="{genome}.fa.gz"
output: "{sample}.x.{genome}.sam"
shell: """
minimap2 -ax sr {input.ref} {input.reads} > {output}
"""
rule sam_to_bam:
input: "{sample}.x.{genome}.sam",
output: "{sample}.x.{genome}.bam",
shell: """
samtools view -b -F 4 {input} > {output}
"""
rule sort_bam:
input: "{sample}.x.{genome}.bam"
output: "{sample}.x.{genome}.bam.sorted"
shell: """
samtools sort {input} > {output}
"""
rule uncompress_ref:
input:
ref="{genome}.fa.gz",
output:
ref="{genome}.fa",
shell: "gunzip -k {input.ref}"
rule call_variants:
input:
bam="{sample}.x.{genome}.bam.sorted",
ref="{genome}.fa",
output:
pileup="{sample}.x.{genome}.pileup",
bcf="{sample}.x.{genome}.bcf",
vcf="{sample}.x.{genome}.vcf",
shell: """
bcftools mpileup -Ou -f {input.ref} {input.bam} > {output.pileup}
bcftools call -mv -Ob {output.pileup} -o {output.bcf}
bcftools view {output.bcf} > {output.vcf}
"""
This will take longer... why?
Let's take a look at the VCF! How has it changed?
rule all:
input:
"SRR2584857_1.x.ecoli-rel606.vcf",
"SRR2584403_1.x.ecoli-rel606.vcf"
Let's run the BAM diagnostics again.
Look at flagstat:
samtools flagstat SRR2584857_1.x.ecoli-rel606.bam
Look at coverage:
samtools coverage -m SRR2584857_1.x.ecoli-rel606.bam.sorted
We can also look at the reads again:
samtools index SRR2584857_1.x.ecoli-rel606.bam.sorted
samtools tview -p ecoli:920514 --reference ecoli-rel606.fa SRR2584857_1.x.ecoli-rel606.bam.sorted
(use q
to exit the tview)
Adjust the bcftools view command to look like this instead:
bcftools filter -Ov -e 'QUAL<40 || DP<10 || GT!="1/1"' {input} > {output}
and then remove the VCF file and rerun:
rm *.vcf
snakemake -j 1 -p