ensemble caller #14

Open
ramaniak opened this issue Dec 9, 2016 · 7 comments

@ramaniak

ramaniak commented Dec 9, 2016

Hello Brad,
I have a question/request about the output of the ensemble variant caller, specifically the FORMAT field. At the moment the ensemble VCF file reports the FORMAT field from the first of the 'n' input files in which the variant appears. For example, if I provide VCF files from mutect2, strelka, vardict, and muse as my input callers and the variant in question appears in mutect2 and strelka, the FORMAT field is reported from mutect2. So the FORMAT fields could come from mutect2, strelka, or vardict (but not muse, if I require the variant to be in at least 2 callers). As a result, the FORMAT field is no longer uniform across records. Is there any way to fix this so that a specific set of FORMAT fields is reported irrespective of the input VCF files? This is probably not the easiest thing to do, but I thought I'd ask you anyway.
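To illustrate the kind of uniformity meant here, below is a minimal sketch that projects each record's FORMAT column onto a fixed key set. The function name `project_format` and the chosen keys (`GT`, `AD`, `DP`) are assumptions for illustration, not part of the ensemble caller. Keys a caller never emitted are filled with the VCF missing value `.`; nothing is recomputed.

```python
def project_format(format_col, sample_cols, wanted=("GT", "AD", "DP")):
    """Rewrite a FORMAT column and its sample columns to a fixed key order.

    Hypothetical helper: keys the source caller did not emit become "."
    (the VCF missing value). Values are copied, never recalculated.
    """
    keys = format_col.split(":")
    new_samples = []
    for sample in sample_cols:
        # zip() also copes with samples that drop trailing fields.
        values = dict(zip(keys, sample.split(":")))
        new_samples.append(":".join(values.get(k, ".") for k in wanted))
    return ":".join(wanted), new_samples


# A record whose FORMAT already carries the wanted keys plus an extra one.
fmt, samples = project_format("GT:AD:AF:DP", ["0/1:10,5:0.33:15", "0/0:20,0:0.0:20"])
print(fmt, samples)

# A record from a caller that reports only depth-style keys.
fmt2, samples2 = project_format("DP:FDP", ["30:0"])
print(fmt2, samples2)
```

This only makes the columns line up; a key that one caller never reports stays missing, so downstream tools still have to tolerate `.` values.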

thanks
Arun

@chapmanb
Member

Arun;
Thanks for starting this discussion. Unfortunately it is quite difficult to normalize these to a single set of fields, hence the current approach, which is the best we can reasonably manage. Most of the FORMAT field values are calculated internally by the callers, so normalizing them would require re-calling or otherwise interfacing directly with each variant caller. The ensemble method here is meant to be more lightweight than that, so it takes the imperfect simplified approach instead. Sorry not to have a good solution, but I hope this helps explain the current implementation.

@ramaniak
Author

Thanks, I completely understand.
Before writing a script to do this, I searched the web so as not to reinvent the wheel and came across this:
https://github.com/tjparnell/HCI-Scripts/blob/master/SomaticVariants/update_somaticVCF_attributes.pl

It looks like a good start.

Arun

@chapmanb
Member

Arun;
Thanks for the pointer. We'd definitely be interested in pointing at normalization scripts if you build something based on that starting point. The tricky part is handling all the callers and special cases, which it does make a good start on. Thanks again.

@ramaniak
Author

I agree. Will keep you posted on any updates.

thanks

ramaniak reopened this Dec 15, 2016
@ramaniak
Author

Sorry for closing and re-opening. I just realized there is another issue, which might not be relevant to ensemble calling per se.

Currently, I am using the ensemble caller for somatic calls based on mutect2, muse, vardict, strelka and caveman. I am asking the caller to report any calls in at least 2 variant callers.

The issue I noticed concerns the order in which each caller reports the tumour and normal sample columns. Here is the column header line from each of these callers.

**CAVEMAN**
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOUR

**Mutect2**
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

**Vardict**
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR

**muse** 
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

**strelka**
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR 

As the header lines show, mutect2 and muse report NORMAL in the last column, whereas Caveman, Vardict and Strelka report TUMOR (TUMOUR for Caveman) in the last column.

The ensemble caller also reports the TUMOR field in the last column, but keeps each record's sample values in the order used by the source caller. Therefore, when a variant is taken from a caller that uses the opposite order, the normal and tumour format fields get switched.
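A quick way to check for this before running the ensemble step is to read the sample order off each `#CHROM` line. The sketch below uses the five header lines quoted above; the helper names `sample_order` and `is_tumour` are made up for illustration, and it assumes sample names contain no whitespace.

```python
def sample_order(header_line):
    """Return the sample names that follow the FORMAT column of a #CHROM line."""
    fields = header_line.strip().split()
    return fields[fields.index("FORMAT") + 1:]


def is_tumour(name):
    """Treat both spellings (TUMOR, TUMOUR) as the tumour sample."""
    return name.upper() in ("TUMOR", "TUMOUR")


# Header lines as quoted in this issue.
headers = {
    "caveman": "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOUR",
    "mutect2": "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL",
    "vardict": "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR",
    "muse": "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL",
    "strelka": "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR",
}

orders = {caller: sample_order(line) for caller, line in headers.items()}
tumour_last = sorted(c for c, names in orders.items() if is_tumour(names[-1]))
normal_last = sorted(c for c, names in orders.items() if not is_tumour(names[-1]))

print("tumour in last column:", tumour_last)
print("normal in last column:", normal_last)
```

If both lists are non-empty, the inputs disagree on sample order and need to be normalized first.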

I am not quite sure how to deal with this just yet. I am looking into whether I can change the default order of the TUMOUR and NORMAL columns in the variant callers.

Thanks
Arun

@chapmanb
Member

Arun;
You will need to ensure the input files have consistent sample ordering before feeding them into ensemble calling. Thank you for highlighting this requirement. Somatic callers do differ in sample ordering and naming, so this requires some upstream work to normalize. This is done automatically in bcbio (https://github.com/chapmanb/bcbio-nextgen) pipelines, so it is not part of the more standalone ensemble calling here. Hope this helps.
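For anyone doing this upstream normalization by hand, here is a minimal sketch of reordering the sample columns of an uncompressed, tab-separated VCF so that every input uses the same order. It is not bcbio's implementation; the function name `reorder_vcf_lines` is hypothetical, and it assumes the sample columns are literally named NORMAL and TUMOR or TUMOUR, as in the headers quoted in this thread.

```python
TUMOUR_NAMES = {"TUMOR", "TUMOUR"}


def reorder_vcf_lines(lines, target=("NORMAL", "TUMOR")):
    """Yield VCF lines with sample columns renamed and reordered to `target`.

    Assumption: samples are named NORMAL and TUMOR/TUMOUR; real sample
    names would need their own mapping.
    """
    index = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed (CHROM..FORMAT); the rest are samples.
            names = ["TUMOR" if n.upper() in TUMOUR_NAMES else n.upper()
                     for n in fields[9:]]
            index = [names.index(t) for t in target]
            yield "\t".join(fields[:9] + list(target))
        else:
            if index is None:
                raise ValueError("data line found before the #CHROM header")
            samples = fields[9:]
            yield "\t".join(fields[:9] + [samples[i] for i in index])


# Example: a TUMOR-first file is rewritten to NORMAL-first.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR\tNORMAL",
    "1\t100\t.\tA\tT\t.\tPASS\t.\tGT\t0/1\t0/0",
]
fixed = list(reorder_vcf_lines(vcf))
print("\n".join(fixed))
```

Running every caller's output through the same function with the same `target` gives the ensemble step inputs that agree on column order.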

@ramaniak
Author

ramaniak commented Dec 16, 2016 via email
