Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools norm -m +any replaces reference allele GTs with missing for repeated sites #2247

Open
rickymagner opened this issue Aug 7, 2024 · 0 comments

Comments

@rickymagner
Copy link

This is a bit of a niche case, but came up when I was trying to create some consensus VCFs using trios. Here's a minimal example:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG002   HG003
chr1    219783256       .       AAC     ACAC,A  .       .       SYNC=219783257;BASE=FN_CA;CALL=FP_CA;AN=4;AC=1,2        GT      2/1     ./.
chr1    219783256       .       A       ACACACACACACAC  .       .       SYNC=219783257;CALL=FP_CA;AN=2;AC=1     GT      ./.     0/1

Here we have a two-sample VCF, with two sites with the same POS. The key point is the second entry shows HG003 having GT 0/1, wheras the first as missing ./.. The meanings are different: the first signifies some amount of confidence in seeing both reference and alt supporting reads, whereas the latter reflects uncertainty about how those alts relate to the sample. When running bcftools norm -m +any on this, we get the following:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG002   HG003
chr1    219783256       .       AAC     ACAC,A,ACACACACACACACAC .       .       SYNC=219783257;BASE=FN_CA;CALL=FP_CA;AN=4;AC=1,2,1      GT      2/1     ./3

Now the HG003 GT at this one site is ./3 (corresponding to ./1 with the previous order). In other words, the confidence about seeing a reference haplotype here is "lost" in the normalization. I would consider this a bug, since the tool is deciding to throw away confidence about matching the reference and replacing with some uncertainty. This is actually an important distinction, as this example shows that setting HG003 to have GT 0/3 here (matching the original) is a definitive mendelian violation compared with the 2/1 GT in the son, whereas ./3 leaves open the possibility that an MV hasn't occurred. Some tools are sensitive to this, and I'd expect the behavior involving 0/3 if I didn't catch this example. It's also worth noting if you throw away the other sample (make it a single sample VCF), the behavior is as I expected: the site collapses into a 0/1 site.

Can anyone confirm if this is expected behavior (and why), or if this is a bug? This was done with bcftools v1.20.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant