-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow custom ploidy regions in HaplotypeCaller #8464
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. Overall this approach is a very light touch in terms of code which is good. Some comments and one efficiency concern that should be easy to correct
src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCaller.java
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java
Outdated
Show resolved
Hide resolved
@@ -946,6 +946,29 @@ public void testHaplotypeCallerRemoveAltAlleleBasedOnHaptypeScores() throws IOEx | |||
} | |||
} | |||
|
|||
@Test(dataProvider="HaplotypeCallerTestInputs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some tests where where the ployidy regions are invalid for various reasons (check some in that are on the wrong reference, some that don't have the right names etc...) and assert that you are getting user exceptions in all of those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some tests to check for strings instead of integers for ploidy values, and checking for negative values for ploidy. I'm not sure how to check against a reference mismatch since this supports bed files, which don't have any sort of sequence dictionary. Do you have any ideas? I suppose during construction I could check against the VCF header, but that only protects against differently named reference sequences, while the lengths could be different. (E.g. a hg38 <-> T2T mixup wouldn't be detectable with just a bed file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if the references mismatch subtly then we are hosed... but if somebody uses an HG19 bedfile + an HG38 reference this should at least warn the user because it should be easy to detect.... theoretically you could also check against the sequence dictionary that the intervals actually fall within the contig as well... Just asserting the names mismatch is probably strong enough protection
.../java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java
Outdated
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java
Show resolved
Hide resolved
* @param region Current region of interest, e.g. AssemblyRegion | ||
* @return Genotyping engine with correct ploidy setting for given region | ||
*/ | ||
protected HaplotypeCallerGenotypingEngine getLocalGenotypingEngine(Locatable region) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a somewhat ignorable comment but you could extract out a method here so these two are necessarily in lock-step and there is no risk of somebody breaking one but not the other...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'd like them to be covered by just one method, but unfortunately the return different types. I tried returning just the parent class and using a HashMap input that maps to the parent class, but Java doesn't seem to like this. Do you have any other suggestions, or do you think it's fine to leave as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pull out a method "getPloyidyToUseAtTHISSite()" or something similar and have both of these methods call that one that has your adjudication logic and then return whatever map value is needed.
.../java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java
Outdated
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java
Outdated
Show resolved
Hide resolved
// Private constructor for use in copyWithNewPloidy | ||
// Allows you to make new instances of hcArgs with synced fields except for genotyper ploidy | ||
private HaplotypeCallerArgumentCollection(int ploidy) { | ||
this.standardArgs.genotypeArgs.samplePloidy = ploidy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works? This looks terrifying...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I swear I found this idea somewhere online and I can't find where anymore... Do you have any suggestions on doing this in a better way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that both of us are weary of this option... if you keep this confusing copy constructor (that i'm really not sure works at all this is horrifying...) you are going to need to check in some unit tests where you call this some times on a on-default argument collection (probably also non-default sub-argument collections) and demonstrate that everything is identical after calling this except for ployidy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the whole point was so they would be in-sync if something changes in one? that is easily demonstrated with a test...
...oadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerArgumentCollection.java
Outdated
Show resolved
Hide resolved
...oadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerArgumentCollection.java
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java
Outdated
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java
Outdated
Show resolved
Hide resolved
Thanks @jamesemery and @davidbenjamin for your helpful comments. I think I either addressed most of them or had some follow-up questions about ideas for getting the last few changes in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you need to rebase this to get the tests working. A few medium sized comments... mostly this looks good.
.../java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java
Outdated
Show resolved
Hide resolved
|
||
// Should fail due to negative integer values in name column of bed | ||
@Test(dataProvider="HaplotypeCallerTestInputs", expectedExceptions = UserException.class) | ||
public void testNonPositiveCustomPloidyRegions(final String inputFileName, final String referenceFileName) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in keeping with my comment above... add a test here where you mis-map a bedfile with the reference as that seems detectable at the overlap detector construction phase.
src/main/java/org/broadinstitute/hellbender/tools/copynumber/formats/records/SimpleCount.java
Outdated
Show resolved
Hide resolved
// Private constructor for use in copyWithNewPloidy | ||
// Allows you to make new instances of hcArgs with synced fields except for genotyper ploidy | ||
private HaplotypeCallerArgumentCollection(int ploidy) { | ||
this.standardArgs.genotypeArgs.samplePloidy = ploidy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that both of us are weary of this option... if you keep this confusing copy constructor (that i'm really not sure works at all this is horrifying...) you are going to need to check in some unit tests where you call this some times on a on-default argument collection (probably also non-default sub-argument collections) and demonstrate that everything is identical after calling this except for ployidy.
// Private constructor for use in copyWithNewPloidy | ||
// Allows you to make new instances of hcArgs with synced fields except for genotyper ploidy | ||
private HaplotypeCallerArgumentCollection(int ploidy) { | ||
this.standardArgs.genotypeArgs.samplePloidy = ploidy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the whole point was so they would be in-sync if something changes in one? that is easily demonstrated with a test...
defaultGenotypingEngine.setAnnotationEngine(annotationEngine); | ||
|
||
// Create other custom genotyping engines if user provided ploidyRegions | ||
for (final int ploidy : this.allCustomPloidies) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so currently this code maintains two separate genotypers for ployidy 2 if it gets specified?
* @param region Current region of interest, e.g. AssemblyRegion | ||
* @return Genotyping engine with correct ploidy setting for given region | ||
*/ | ||
protected HaplotypeCallerGenotypingEngine getLocalGenotypingEngine(Locatable region) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pull out a method "getPloyidyToUseAtTHISSite()" or something similar and have both of these methods call that one that has your adjudication logic and then return whatever map value is needed.
@@ -0,0 +1,3 @@ | |||
20 10000000 10010000 2 | |||
20 10020000 10030000 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put into the test file a demonstration of the rule "overlapping ployidy defaults to the highest option?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a method for getting current ploidy to trim down that part of the code. Still need to add the test
...oadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerArgumentCollection.java
Outdated
Show resolved
Hide resolved
Just pushed some small changes based on our conversation. I think the last things to add now are 3 tests:
I also need to rebase to get the right htsjdk version, and other changes that might've been made to the code, which are hopefully easy to resolve... |
7b70607
to
b5486cc
Compare
Finish changes to HaplotypeCaller to add ploidy-regions flag allowing variable ploidy Change IntervalFileFeature interface to NamedFeature for htsjdk dependency Address comments for draft PR Make a few changes and refactors based on comments
b5486cc
to
4852f49
Compare
Github actions tests reported job failures from actions build 7063736836
|
* Introduced a new class MulitPloidyGenotyperCache which keeps track of multiple GenotypingEngines that have different ploidy values. * minor refactoring in related classes * removed an unused type parameter from GenotypingEngine
Github actions tests reported job failures from actions build 7131188447
|
* fix npe * update comment and re-enable test
Github actions tests reported job failures from actions build 7131774248
|
Github actions tests reported job failures from actions build 7132143535
|
Github actions tests reported job failures from actions build 7143620989
|
The changes here were merged in #8609 |
Summary
This PR adds a new flag to
HaplotypeCaller
called--ploidy-regions
which allows the user to input a .bed or .interval_list with "name" column equal to a positive integer for the ploidy to use when calling variants in that region. The main use case is for calling haploid variants outside the PAR for XY individuals as required by the VCF spec, but this provides a much more flexible interface for other similar niche applications, like genotyping individuals with other known aneuploidies. The global-ploidy
flag will still provide the background default (or the built-in ploidy of 2 for humans), but the user input value will supersede these in overlapping regions. Note that the overlap is checked against the active region, meaning variants near the boundary of the--ploidy-regions
file may end up with GT fields having ploidy slightly differently than expected, for example if your custom region overlaps a given active region but the variant ends up being written to a location outside that interval. In this case the ploidy from the user input would be used rather than any other default.Implementation Details
The key idea is to allow
HaplotypeCallerEngine
to initialize multiple genotyping engines based on the--ploidy-regions
input. The intervals are first parsed to check for positive integer ploidy values, and then used to create hashmaps of ploidy -> genotyper. The engine uses two types of genotypers: one for active region determination and one for doing the actual genotyping. Both admit a ploidy paramter passed viahcArgs
. This PR modifies theHaplotypeCallerArgumentCollection
class to include a method for creating copies of this object with differing ploidy amounts. These then get fed to the constructors of the appropriate genotyper classes, which are organized into two hashmaps. In every situation where one of these genotypers is used, we instead begin the scope by calling a "get local genotyper" method that performs the logic of checking whether the region of interest overlaps any of the user-provided regions, and then selects the appropriatelocalEngine
genotyper for the task, ensuring the user-provided ploidy supersedes any other defaults.A Note on Dependency
The flexibility of using either .bed or .interval_list files to specify this information depends on this PR in htsjdk being made into a full release, and then bumping the dependency of GATK. The code in this PR would not compile until this happens.