You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A while ago we realized that QUALIMAP assigns more reads to exonic+intronic+intergenic than total reads in the sample. We subsampled the paired-end SRA FASTQ to exactly 1M fragments by head -n 4000000 file_{1,2}.fastq > subsampled_{1,2}.fastq.
Then, the pipeline runs the samples through STAR, then Picard markduplicates, and then qualimap (broadly speaking).
The issue is that although the samples have exactly 1M reads, and the initial FastQC correctly reports that there are indeed exactly 1M reads, Qualimap consistently assigns more than 1M reads to genomic features, such as ~1.5M reads mapping to exons, and then a bit more to introns and intergenic regions. The STAR logs also report exactly 1M reads to map. The reads assigned to genomic features by qualimap, when added up, do not add up to 2M, so Qualimap is not quantifying reads instead of fragments, which could explain the discrepancy.
The way qualimap runs when the variables are translated to arguments is correct (our samples are paired-end and the strandedness is reverse):
We also made sure that the counting algorithm is uniquely-mapped-reads, as is shown in this screenshot from the Qualimap report. It is my understanding that using this counting algorithm means that chimeric and secondary alignments are not counted, which could explain the higher number if I did not use this counting algorithm:
We know that this is not a nf-core/rnaseq issue but we also reported the same issue a while ago on Qualimap's BitBucket repo. The link to the issue is here but it can't be accessed because the issue has not been approved yet. It was submitted on the 22nd of January 2024, so 8 weeks have passed and there are no responses or updates.
Maybe there is an explanation that we are missing but we have tried everything and still can't explain how Qualimap assigns more reads to genomic features than total reads in the sample. Personally, we have decided to drop the Qualimap results and compute those statistics with an in-house script, and wanted to open the discussion here.
Command used and terminal output
No response
Relevant files
No response
System information
No response
The text was updated successfully, but these errors were encountered:
Description of the bug
A while ago we realized that
QUALIMAP
assigns more reads to exonic+intronic+intergenic than total reads in the sample. We subsampled the paired-end SRA FASTQ to exactly 1M fragments byhead -n 4000000 file_{1,2}.fastq > subsampled_{1,2}.fastq
.Then, the pipeline runs the samples through STAR, then Picard markduplicates, and then qualimap (broadly speaking).
The issue is that although the samples have exactly 1M reads, and the initial FastQC correctly reports that there are indeed exactly 1M reads, Qualimap consistently assigns more than 1M reads to genomic features, such as ~1.5M reads mapping to exons, and then a bit more to introns and intergenic regions. The STAR logs also report exactly 1M reads to map. The reads assigned to genomic features by qualimap, when added up, do not add up to 2M, so Qualimap is not quantifying reads instead of fragments, which could explain the discrepancy.
The way qualimap runs when the variables are translated to arguments is correct (our samples are paired-end and the strandedness is reverse):
We also made sure that the counting algorithm is
uniquely-mapped-reads
, as is shown in this screenshot from the Qualimap report. It is my understanding that using this counting algorithm means that chimeric and secondary alignments are not counted, which could explain the higher number if I did not use this counting algorithm:We know that this is not a
nf-core/rnaseq
issue but we also reported the same issue a while ago on Qualimap's BitBucket repo. The link to the issue is here but it can't be accessed because the issue has not been approved yet. It was submitted on the 22nd of January 2024, so 8 weeks have passed and there are no responses or updates.Maybe there is an explanation that we are missing but we have tried everything and still can't explain how Qualimap assigns more reads to genomic features than total reads in the sample. Personally, we have decided to drop the Qualimap results and compute those statistics with an in-house script, and wanted to open the discussion here.
Command used and terminal output
No response
Relevant files
No response
System information
No response
The text was updated successfully, but these errors were encountered: