Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mpileup2] How does mpileup choose the reads for later calling, and is the -d flag behaving strangely? #2234

Open
goeckeritz opened this issue Jul 25, 2024 · 1 comment
Labels
D1: Difficult enhancement htslib-dependent Cannot be fixed until htslib is fixed

Comments

@goeckeritz
Copy link

goeckeritz commented Jul 25, 2024

Hi bcftools team,

I have a conceptual question about mpileup and I think the -d flag is either being weird or I'm misunderstanding how it works. I am using bcftools v. 1.18, and the .bam file in question contains a single sample.

I have reads piling up at some problematic genomic regions (we're talking 10-20X the average depth), and I don't really want to trust the variants in those areas. For example, there will be 95% of reads at a site supporting the REF allele, and 5% supporting an ALT allele, so even if all or the majority of reads were included, I'd expect the site to be called homozygous REF. I should note that almost all of the reads at this site will be Primary, too, with a MAPQ > 30.

However, even when the -d flag is set to 250, I'll have only 30 or so reads out of ~600 be chosen for future variant calling and contributing to DP, so this variant site doesn't get filtered out with my max depth filter later on in my pipeline. Moreover, those 30 reads will include almost ALL of those reads from the ALT allele; this results in this site being called heterozygous REF/ALT.

So my first question is: How does mpileup choose which reads to include for later calling -- i.e., how does it choose how many and which reads contribute to DP? It definitely doesn't appear to be random, since in the example above it grabbed almost all of the ALT supporting reads, despite them being 5% of the high quality mappings at that site.

My second question is: Why is the DP at this site not 250, if that's my maximum number of reads to give to mpileup? When I set -d to 1000 or 2000, mpileup grabs 166 reads for this site (DP=166). It's not until I set -d to 10000 does it grab most of the reads at that site for future variant calling. It looks like this issue was touched on in #1694 -- but in the opposite direction and I'm not sure what was meant by "Hence the filtering isn't as strict as it may sound, an is perhaps closer to an average maximum depth threshold rather than something to take as a literal limit." Also, I'm not necessarily trying to filter on the -d flag either. I just want more of the data represented in genotype likelihood assessments, especially when they're relatively high quality -- and therefore I'll know that this site is a problematic region with lots of reads piling up so I CAN filter it out later.

Since -d or --max-depth is defined as 'At a position, read maximally INT reads per input file', this behavior doesn't seem intuitive to me.. But it sounds like there is more to the -d flag that I'm not comprehending, and that it isn't necessarily a bug, right?

Thanks in advance for any time you might have for answering these questions. If it's easier to just point to some reading material, I'm happy to do some more legwork to make your life easier!

Kindly,
Charity

@pd3
Copy link
Member

pd3 commented Aug 5, 2024

There are two read subsetting steps.

First, the mpileup -d option controls the maximum number of reads in the pileup. Unfortunately, this is not random, the buffer fills up to the maximum depth and it starts to throw away spurious reads. This is to be addressed in the new mpileup2 code #1842. As to your second question, he comment #1694 (comment) answers it rather well: as more reads come at a later position, at some point it reaches the maximum of 250 reads. However, because many reads don't start at the same position, the coverage will be lower to the left of the maximum.

Then there is subsetting of reads at the genotype likelihood calculation step. This is invisible to the user and cannot be controlled by any option: the pre-computed auxiliary tables use maximum 255 reads. This is sufficient for germline variant calling in practice.

I'll mark this issue as enhancement, the future mpileup2 code will address it.

@pd3 pd3 added enhancement D1: Difficult htslib-dependent Cannot be fixed until htslib is fixed labels Aug 5, 2024
@pd3 pd3 changed the title How does mpileup choose the reads for later calling, and is the -d flag behaving strangely? [mpileup2] How does mpileup choose the reads for later calling, and is the -d flag behaving strangely? Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
D1: Difficult enhancement htslib-dependent Cannot be fixed until htslib is fixed
Projects
None yet
Development

No branches or pull requests

2 participants