-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use GVCFs for genotyping - run time/CPU hours substantially reduced (0.52) #158
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The results do look promising! We'll need to test it out with additional modes to get some information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! I think it's worth testing on a highly mutated metastasis sample as well. It likely works well with relatively quiet tumors but this joint genotyping process with tumors may introduce bias in the germline calls of the paired normal samples when applied to more progressed tumor samples.
gatk_version = "4.2.4.1" | ||
gatk_version = "4.5.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the test performed before or after this update? There are numerous changes since 4.2.4.1. I would suggest creating a separate PR for this or does this pipeline change require the latest version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, the update was required to get the other branch to work. That branch uses GenomicsDBImport
instead of CombineGVCFs
. The initial comparisons between this branch and the original method were both done with gatk version 4.2.4.1. The later comparisons between GenomicsDBImport and
CombineGVCFs` were both done with gatk version 4.5.0.0.
This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental. So it's a separate question whether it should be joint calling with the normal sample. I wonder if the benefits of having a second sample from an independent library outweigh any cons of somatic mutations influencing the normal calls. I assume even very progressed tumor samples have orders of magnitude less somatic variants than germline? |
I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented |
Or does this update run |
It runs HaplotypeCaller on each sample
Current main runs:
This update runs:
They yield essentially the same results, but the current method repeats the computationally heavy step. |
Yup, this was done to generate inputs for https://github.com/uclahs-cds/pipeline-regenotype-gSNP |
Really exciting! May be worth benchmarking on the Precision FDA dataset: https://precision.fda.gov/challenges/truth/results as well |
@sorelfitzgibbon @pboutros @yashpatel6 Alternatively, it seems like we can genotype the gVCFs without combining them, which will make the process simpler and faster. Then, we can use https://github.com/uclahs-cds/pipeline-regenotype-gSNP for cohort level genotyping. |
I don't think this is possible. My understanding is they have to be combined with either |
This branch (
DetailsBenchmark HG001 variants and recommended benchmarking regions dowloaded from:
to:
1.
|
Type | Branch | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SNP | main | 3254352 | 3025399 | 228953 | 3308188 | 2735 | 279297 | 275 | 149 | 0.9296469999999999 | 0.9990969999999999 | 0.084426 | 0.9631219999999999 | 2.110715142410159 | 2.0595541024351274 | 1.526276146858805 | 1.468632011740415 |
SNP | combine | 3254352 | 2994915 | 259437 | 3258372 | 2974 | 259731 | 265 | 163 | 0.9202799999999999 | 0.999008 | 0.079712 | 0.9580290000000001 | 2.110715142410159 | 2.064118390928878 | 1.526276146858805 | 1.614994284283063 |
INDEL | main | 467684 | 452975 | 14709 | 842223 | 5003 | 367279 | 892 | 3076 | 0.968549 | 0.989466 | 0.436083 | 0.9788959999999999 | 1.4564622691647133 | 1.71186535561161 | ||
INDEL | combine | 467684 | 453032 | 14652 | 842327 | 5005 | 367343 | 892 | 3075 | 0.968671 | 0.9894629999999999 | 0.436105 | 0.978957 | 1.4564622691647133 | 1.7121678625645012 |
Call-gSNP
main
and combine-gvcfs
branches compared directly against each other
Type | Query | Truth | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
INDEL | combine | main | 458311 | 457695 | 616 | 879053 | 845 | 404068 | 0 | 773 | 0.998656 | 0.9982209999999999 | 0.459663 | 0.9984379999999999 | 1.4146944562280916 | 1.710720187710512 | ||
SNP | combine | main | 3027873 | 2930405 | 97468 | 3364587 | 67341 | 365945 | 0 | 9154 | 0.9678100000000001 | 0.977543 | 0.10876400000000001 | 0.9726520000000001 | 2.1212151194249054 | 2.0610247841269524 | 1.473022797923228 | 1.6170602116421053 |
pipeline-call-gSNP
output:
main
branch:
/hot/user/sfitzgibbon/giab-downloading/HG001/NA12878_HiSeq_downsampled30X_GRCh37_10262015/pipeline-runs/main-RMNISTHS_30xdownsample-gSNP
combine-gvcfs
branch:
/hot/user/sfitzgibbon/giab-downloading/HG001/NA12878_HiSeq_downsampled30X_GRCh37_10262015/pipeline-runs/sfitz-combine-gvcfs-RMNISTHS_30xdownsample-gSNP
2. Two HG001
Exomes, NIST7035 and NIST7036
- FASTQs downloaded from
ftp.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome
- Reads aligned, recalibrated, then
call-gSNP
. - Results:
Call-gSNP
main
branch variants compared with combine-gvcfs
branch variants
- I'm not sure yet how to interpret these exome results, but probably the methods were not adequate. I did not have the exome regions bed file and I used a benchmark regions bed file for WGS.
Type | Exome | Query | Truth | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SNP | NIST7035 | main | combine | 165578 | 162858 | 2720 | 230841 | 42792 | 25176 | 67 | 356 | 0.9835729999999999 | 0.791933 | 0.10906199999999999 | 0.877411 | 2.1412257905419505 | 2.0782943772583633 | 0.5538806656592298 | 0.46946168783624614 |
SNP | NIST7035 | combine | main | 205629 | 162857 | 42772 | 185934 | 2733 | 20329 | 64 | 360 | 0.791994 | 0.983497 | 0.10933399999999999 | 0.877418 | 2.143839909498112 | 2.070116062141949 | 0.46188905163464683 | 0.5439527582598296 |
SNP | NIST7086 | main | combine | 180913 | 177914 | 2999 | 244897 | 40331 | 26637 | 60 | 381 | 0.9834229999999999 | 0.8152159999999999 | 0.108768 | 0.8914540000000001 | 2.143892267593397 | 2.075800577671732 | 0.524525503152075 | 0.45642178643017317 |
SNP | NIST7086 | combine | main | 218225 | 177914 | 40311 | 202985 | 3014 | 22043 | 57 | 383 | 0.815278 | 0.9833430000000001 | 0.10859400000000001 | 0.891458 | 2.1425362496220246 | 2.0719527880759627 | 0.4484251053801985 | 0.5160760174505469 |
INDEL | NIST7035 | main | combine | 30083 | 29695 | 388 | 51990 | 4886 | 17163 | 10 | 475 | 0.987102 | 0.859707 | 0.330121 | 0.91901 | 0.49863636363636366 | 0.5439859053989489 | ||
INDEL | NIST7035 | combine | main | 34505 | 29698 | 4807 | 44652 | 401 | 14309 | 8 | 205 | 0.8606870000000001 | 0.986784 | 0.320456 | 0.9194319999999999 | 0.47614913841746603 | 0.5612105022670556 | ||
INDEL | NIST7086 | main | combine | 32208 | 31846 | 362 | 54688 | 4526 | 18037 | 14 | 385 | 0.9887610000000001 | 0.876511 | 0.329816 | 0.929258 | 0.4915246278818613 | 0.5381309204015269 | ||
INDEL | NIST7086 | combine | main | 36318 | 31847 | 4471 | 47990 | 377 | 15493 | 9 | 166 | 0.876893 | 0.988399 | 0.322838 | 0.929313 | 0.46822207643103164 | 0.5585738539898133 |
3. PCAWG-63 donor DO35350
. combine-gvcfs
branch ran successfully on four samples, two tumor, two normal.
main run log: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/sfitz-combine-gvcfs_DO35350_gSNP.log
output: /hot/data/unregistered/PCAWG-63/pipeline-runs/call-gSNP/sfitz-combine-gvcfs-DO35350-gSNP
Interestingly, this set of samples reliably (tested 6+ times) causes errors indicating the detachment of scratch (which has been a somewhat infrequent random error across varied pipeline runs). This was true for both main
and combine-gvcfs
branch runs and was solved by changing the workDir to /hot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments for now
The previous exome comparisons had accidentally been done with different GATK versions. Now repeated, both with
|
Final test after removing |
Panel sample comparison of sites (keep in mind this was done with the main branch on GATK v4.2.4.1 and this branch on GATK 4.5.0.0):
Generally speaking, the numbers are in alignment with what we've seen before with GATK version updates. Are we pending any other tests/comparisons? The results seem generally agreeable with the previous method Anything else to add @tyamaguchi-ucla ? |
I'm curious to understand the difference between the original method, which has been deprecated, and this updated method in the context of calling germline SNPs using normal and highly mutated tumor samples (e.g. mets). (Let me know if the tested samples above are highly mutated) Could we ask Jieun to test out this branch and maybe process an average sample and the most mutated one from the SU2C dataset? I think she already has gSNP calls for the older version. Also, regarding "Finally comes single-sample variant calling to produce GVCFs. Aside from entirely new local assembly code, even minor changes and bug fixes from version to version could potentially yield slightly different alignments and/or representations of variants, so we advise using the same version of GATK HaplotypeCaller whenever possible." |
The methods aren't really deprecated; this is more of an update to the steps taken without any change to the way calling is done in the context of single-sample vs joint. The current calling is being done on a per-patient level and it's the same with the updated approach here. The only difference being performing the full calling directly with HC from BAM(s) versus going to intermediate GVCF per sample and then doing the genotyping together. That being said, the results with the GIAB sample (with a given truth set) plus the exome samples would suggest this branch should be functional and follows the recommended practices; I'm not sure what additional information running a highly mutated sample would add for this particular PR as we don't really have a truth set for the SU2C samples to compare against. If the intent is to have some understanding of what the differences might be between the two workflows, that can be done separately from the PR itself here, starting with a release candidate with these changes implemented, rather than hold up this PR for that purpose?
Totally agree, users should be made aware of trying to keep the versions the same between HC and regenotyping. We can add it as a note in the regenotype-gSNP pipeline. The update made here will also be a major version change with proper notification of the changes made and things to be aware of for on-going processing runs. |
Here the conventional comments are useful. I should've added suggestion/thought/question: (non-blocking).
Right, I meant the best practices but I'm curious to understand to what extent https://gatk.broadinstitute.org/hc/en-us/articles/360035890431-The-logic-of-joint-calling-for-germline-short-variants (Thanks for the link, @sorelfitzgibbon) As joint genotyping is performed using one or more tumor samples with normal to predict germline variants in this pipeline, I'm curious to understand how much contamination (as in FPs and TNs) might be introduced based on the purity, PGA, CNAs, LOH, and # somatic variants. At the same time, tumor samples generally have higher coverage, which should positively influence joint genotyping. Also, it's recommended to use > 50 samples to perform Even if we encounter such cases, since we keep the single-sample normal gVCFs for regenotype-gSNP, we can use the regenotyped VCF instead of the VCFs from this pipeline. In any case, this update will generally benefit both us and end users. Thus, it's non-blocking. |
@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result. |
@sorelfitzgibbon Yeah, I guess we can only speculate. To my knowledge, GATK hasn't been really systematically benchmarked in this context. This also applies to Indel Realignment but we haven't done much systematic benchmarking (IR has been already dropped from GATK4). As we're getting more datasets to look into spatial heterogeneity (e.g. a GBM patient with 1 normal and 20 tumors) and datasets including organoids, I've been curious about this issue. Perhaps this could be a project for a summer student. |
I like that idea! Would twenty highly mutated tumors cause more harm than good. But how would the closest to truth be determined? |
For dropping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally, the changes look good! I'm good with merging this and getting a release candidate out.
"The results of our investigation favored the co-local realignment of tumor and matched-normal BAMs rather than the local realignment of tumor and matched-normal BAMs separately (data not shown)." https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1029-6 We believe this generally applies to other somatic variant detection algorithms and that's why we keep the IR (data not shown).
We can take several approaches but can create GitHub discussions internally. |
Description
The pipeline has been creating GVCFs for the input samples (1 normal and 1 or more tumor samples for a patient), but then starting over to run joint genotyping on the set of samples using
HaplotypeCaller
. This unnecessarily repeats the heavy computation steps. This PR instead combines the GVCFs and joint genotypes usingGenotypeGVCFs
. This runs 1.8 x faster, (0.52 of CPU hours).Performance
Original method
Duration : 1d 5h 36m 53s
CPU hours : 348.3
Each run of 50 (scattered)
HaplotypeCallerVCF
takes over an hour on average, < 2 CPU, ~ 3 GBCombined GVCFs
Duration : 16h 4m 13s
CPU hours : 181.2
Each run of 50
CombineGVCFs
processes takes < 4 minutes, < 2 CPU, ~ 1 GBEach run of 50
GenotypeGVCFs
processes takes < 3 minutes, < 2 CPU, ~ 3 GBThe final output is much the same. Not identical, but the GenotypeGVCF method is the newer best practices method and is likely to be the better of the two. Differences seem to be mostly (exclusively?) in lower quality calls.
Genotype Concordances:
Details:
/hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/orig-cgvcf.cmp
Closes #156 #108
Testing Results
Checklist
I have read the code review guidelines and the code review best practice on GitHub check-list.
I have reviewed the Nextflow pipeline standards.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the github standards before opening this pull request.
I have added my name to the contributors listings in the
manifest
block in thenextflow.config
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
I have added the changes included in this pull request to the
CHANGELOG.md
under the next release version or unreleased, and updated the date.I have updated the version number in the
metadata.yaml
andmanifest
block of thenextflow.config
file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)I have tested the pipeline on at least one A-mini sample.