Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significant difference between IDV and AD when calling certain RNA-seq indels with mpileup #2277

Open
luyh-xp opened this issue Sep 12, 2024 · 8 comments · May be fixed by #2281 or #2282
Open

Significant difference between IDV and AD when calling certain RNA-seq indels with mpileup #2277

luyh-xp opened this issue Sep 12, 2024 · 8 comments · May be fixed by #2281 or #2282

Comments

@luyh-xp
Copy link

luyh-xp commented Sep 12, 2024

I'm using mpileup to compute the mutation frequencies of a list of known variant sites in RNA-seq. For a small number of indels the AD numbers in the output are much lower than what I'm seeing in IGV. For instance, when I run the following command:
bcftools mpileup -r 11:69587265 -f <GRCm38 fasta> --annotate FORMAT/AD,FORMAT/DP,INFO/AD -F 0.001 --max-depth 10000 --max-idepth 10000 -Q20 -x -A --no-BAQ --tandem-qual 10000 <RNA-seq bam> | grep -v "^#"

Here is the output I'm getting:
11 69587265 . CA CAA 0 . INDEL;IDV=77;IMF=0.616;DP=124;AD=47,2;I16=24,23,1,1,1880,75200,80,3200,872,17236,40,800,1057,25169,46,1066;QS=0.908467,0.0915332;VDB=0.56;SGB=-0.453602;RPBZ=-0.571191;MQBZ=2.56432;MQSBZ=0;BQBZ=0;SCBZ=-1.87886;MQ0F=0 PL:DP:AD 0,106,55:49:47,2

And here is how this site looks like in IGV:
296fe156-fbce-4e9d-b37c-38660d014890

I am aware that some low quality reads are filtered during variant calling and the reported AD number will be lower than the raw IDV. However upon manual examination most of the reads seem to be high-quality and correctly aligned. Such a huge drop from 77 to 2 seems counter-intuitive to me and I wonder if this is the expected behavior of mpileup.

(One thing I notice is that most of the read pairs containing the insertion are spliced except 2, which happens to be the same as the AD reported)

@jkbonfield
Copy link
Contributor

What version of bcftools is this? The latest one has a revised indel caller with --indels-cns which may do a better job.

@luyh-xp
Copy link
Author

luyh-xp commented Sep 12, 2024

@jkbonfield Thank you for the suggestion. I'm using the latest bcftools 1.20. I tried rerunning the command with --indels-cns but unfortunately it doesn't resolve this issue. Somehow the REF AD number is also off this time:

11 69587265 . CA CAA 0 . INDEL;IDV=2;IMF=0.016129;DP=124;AD=0,2;I16=0,0,1,1,0,0,80,3200,0,0,40,800,0,0,46,1066;QS=0,1;VDB=0.56;SGB=-0.453602;RPBZ=-0.571191;MQBZ=2.56432;MQSBZ=0;BQBZ=0;SCBZ=-1.87886;MQ0F=0 PL:DP:AD 40,6,0:2:0,2

@jkbonfield
Copy link
Contributor

Is there test data available?

Also sorry I don't use IGV and don't understand the display. What are the shaded boxes showing? Why don't all alignments have shading? Is it soft-clipping for example? (In which case the reads are tiny.) Seeing the SAM file would be helpful.

@luyh-xp
Copy link
Author

luyh-xp commented Sep 13, 2024

Indel_AD_GRCm38.sam.txt
@jkbonfield I have created an example sam file containing only reads overlapping with this exon. Can confirm that same command as above produces the same result i.e. IDV=77 but ALT AD = 2. Thanks a lot for looking into this!

@jkbonfield
Copy link
Contributor

So the --indels-cns code has an explicit bit of code to filter out reads with skips in the CIGAR string.

https://github.com/samtools/bcftools/blob/develop/bam2bcf_edlib.c#L1567-L1573

Commenting out those lines gives me this:

11	69587265	.	CA	CAA	0	.	INDEL;IDV=77;IMF=0.631148;DP=122;AD=43,77;I16=21,22,38,39,1720,68800,3080,123200,826,16418,1540,30800,969,23179,1824,44106;QS=0.349662,0.650338;VDB=0.0434579;SGB=-0.693147;RPBZ=-0.826049;MQBZ=1.83725;MQSBZ=0;BQBZ=0;SCBZ=-1.9937;MQ0F=0	PL:DP:AD	139,0,96:120:43,77

Putting it through bcftools call gives me QUAL of 106 and GT 0/1, matching the most likely PL information from above.

My question to myself now is why I ever added those lines to filter out ref skips?

@jkbonfield
Copy link
Contributor

In answer to my own question - it wasn't me who added those. They were duplicated (by request) from bam2bcf_indel.c, which itself was moved from Samtools.

They appeared here in 2011 with no real explanation other than to fix a bug. What bug I have no idea.

The same code exists in bam2bcf_indel.c, which is what you get if you don't use --indels-cns.

It's clearly code that is long overdue for reevaluation. It's quite possible whatever it was that broke bcftools calling when reads had ref-skips in them is no longer affected. It certainly seemed to give us the correct answer anyway.

jkbonfield added a commit to jkbonfield/bcftools that referenced this issue Sep 17, 2024
Mpileup removes alignments using the cigar ref skip operator ("N").
This was originally added in 2011 in samtools/samtools#d1643d6 with
the commit message of "fixed a bug in indel calling related to
unmapped and refskip reads".

Unfortunately I don't know what that bug was, but removing the code
shows it still works (at least for some data!).  We need better
understanding of what's going on and why it was added, so perhaps we
should add a command line option to control this instead?

Fixes samtools#2277
@jkbonfield jkbonfield linked a pull request Sep 17, 2024 that will close this issue
jkbonfield added a commit to jkbonfield/bcftools that referenced this issue Sep 17, 2024
Mpileup removes alignments using the cigar ref skip operator ("N").
This was originally added in 2011 in samtools/samtools#d1643d6 with
the commit message of "fixed a bug in indel calling related to
unmapped and refskip reads".

Unfortunately I don't know what that bug was, but removing the code
shows it still works (at least for some data!).  We need better
understanding of what's going on and why it was added, but this PR
makes it optional, keeping the default as before.  Note there appears
to be no filtering of BAM_CREF_SKIP in indels-2.0 so the option would
be a nop there.

This is an alternative PR to samtools#2281.  I've leave it to the project
maintainer as to what is preferable: removing the (no longer needed?)
filtering, or keeping the default behaviour identical and adding a new
option instead (which is safer, but possibly leads to accidental bad
calls due to not noticing a new option has appeared).

Fixes samtools#2277
@jkbonfield jkbonfield linked a pull request Sep 17, 2024 that will close this issue
@luyh-xp
Copy link
Author

luyh-xp commented Sep 18, 2024

Thank you so much for figuring this out! I agree that reads with Ns in CIGARs shouldn't be simply filtered out like that since they are abundant in RNA-seq libraries.

So for now the solution is to comment out bam2bcf_edlib.c:1567-1573 and bam2bcf_indel.c:853-859 and re-compile bcftools?

@jkbonfield
Copy link
Contributor

That works, or check out one of the two PRs above (the second makes it optional so is probably the one Petr will go with) and you don't need to edit the code, but the changes are minimal obviously.

You only need to change one bit too. bam2bcf_edlib.c is for the --indels-cns mode while bam2bcf_indel.c is the original and default mpileup code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants