Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What if the ref allele is a common SNP? #215

Open
johannesreiter opened this issue Mar 6, 2017 · 4 comments
Open

What if the ref allele is a common SNP? #215

johannesreiter opened this issue Mar 6, 2017 · 4 comments

Comments

@johannesreiter
Copy link

I was wondering how SNPs should be handled. If one uses the germline reference (a common SNP in this case) as ref, the variant effect prediction runs into the below error. How is one supposed to encode such a somatic variant? Using two distinct variants, one for the SNP and one for the somatic variant?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-a2c9a6cebb7d> in <module>()
----> 1 var.effects()

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/varcode/variant.py in effects(self, raise_on_error)
    360     def effects(self, raise_on_error=True):
    361         return predict_variant_effects(
--> 362             variant=self, raise_on_error=raise_on_error)
    363 
    364     def effect_on_transcript(self, transcript):

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/varcode/effects/effect_prediction.py in predict_variant_effects(variant, raise_on_error)
     84                         effect = predict_variant_effect_on_transcript(
     85                             variant=variant,
---> 86                             transcript=transcript)
     87                     else:
     88                         effect = predict_variant_effect_on_transcript_or_failure(

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/varcode/effects/effect_prediction.py in predict_variant_effect_on_transcript(variant, transcript)
    202 
    203         exonic_effect_annotation = exonic_transcript_effect(
--> 204             variant, exon, exon_number, transcript)
    205 
    206         # simple case: both start and end are in the same

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/varcode/effects/effect_prediction.py in exonic_transcript_effect(variant, exon, exon_number, transcript)
    340                  genome_end,
    341                  variant,
--> 342                  cdna_ref))
    343 
    344     utr5_length = min(transcript.start_codon_spliced_offsets)

ValueError: Found ref nucleotides 'A' in sequence of Transcript(transcript_id=ENST00000296589, name=SLC45A2-001, gene_id=ENSG00000164175, gene_name=SLC45A2, biotype=protein_coding, location=5:33944721-33984835) at offset 1133 (chromosome positions 33954511:33954511) but variant Variant(contig='5', start=33954511, ref='C', alt='A', reference_name='GRCh37') has 'G'
@joaoe
Copy link

joaoe commented Mar 15, 2017

What distinguishes a somatic from germline variant in the GT field. A variant with both should look something like:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample_normal	Sample_cancer
# homozygous germline and somatic change AA -> CC and GG
chr1	123123123	rsXXX A	C,G	3455	PASS	.	GT	1/1	2/2
# or in case of heterozygous changes. AA -> AC and AG
chr1	123123123	rsXXX A	C,G	3455	PASS	.	GT	0/1	0/2

The REF has to always match the reference used to do the variant calling, nor the germline.
Some discussion https://www.biostars.org/p/12964/

The GT field is actually very important to differentiate between more than one sample, so it should be added to the Variant class and not just carried in the metadata object.

@iskandr
Copy link
Contributor

iskandr commented Mar 15, 2017

We tried to only put required/non-optional fields on the Variant class itself. I could add a helper property called either GT or genotype which either returns the metadata value for "GT" or None. Would that be helpful?

@iskandr
Copy link
Contributor

iskandr commented Mar 15, 2017

Oh, @joaoe, I might have misunderstood. Do you want Varcode to parse alts such as "A,T" and then use the GT to determine each samples value for the alt field?

@joaoe
Copy link

joaoe commented Mar 17, 2017

Do you want Varcode to parse alts such as "A,T" and then use the GT to determine each samples value for the alt field?

I don't know how that would be implemented since current varcode does not provide a way to know which sample the variant applies to, in case the VCF file has multiple samples. Varcode just splits the ALT field and returns the same number of variants as ALT alleles, with the info from CHROM, POS, REF and ALT.

VCF files produced with a germline and cancer sample will typically contain both samples, so if you want varcode to be useful for these cases, there should be a way to tell to which sample the variant applies to. So, for each column after FORMAT, varcode needs to produce different Variant objects.

Having the GT is actually not directly useful as it still has to be parsed.
I'd vote for something simple like a sample_index integer which would refer to the proper column right of FORMAT.

Example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample_1 sample_2
chr1 123123 . A C,G . . . GT 1/2 0/0
# variant only in normal sample, produces
# Variant(..., ref=A, alt=C, sample_index=0)
# Variant(..., ref=A, alt=G, sample_index=0)
chr1 123123 . A C,G . . . GT 1/1 1/2
# variant in both samples
# Variant(..., ref=A, alt=C, sample_index=0)
# Variant(..., ref=A, alt=C, sample_index=1)
# Variant(..., ref=A, alt=G, sample_index=1)
chr1 123123 . A C . . . GT 0/0 0/0
# variant does not apply to any of the samples.

If the GT field is not there, then parse as before, sample_index can be None.

This would change the behavior of the API so it should be behind an option to load_vcf.

It can get complicated further. If the VCF file was 'phased', then the GT field is split by a pipe |, not a back slash. In this case, the position of each number in GT then matters, because the position will tell how to group close-by variants originating from the same group of reads, used to disambiguate heterozygous variants. So, including a gt_allele_index perhaps.

PS: A example containing multiple variants grouped by mouse strain. ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/VCF/genotype/SC_MOUSE_GENOMES.genotype.vcf.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants