Use GVCFs for genotyping - run time/CPU hours substantially reduced (0.52) #158

sorelfitzgibbon · 2024-04-01T16:31:13Z

Description

The pipeline has been creating GVCFs for the input samples (1 normal and 1 or more tumor samples for a patient), but then starting over to run joint genotyping on the set of samples using HaplotypeCaller. This unnecessarily repeats the heavy computation steps. This PR instead combines the GVCFs and joint genotypes using GenotypeGVCFs. This runs 1.8 x faster, (0.52 of CPU hours).

Performance

Original method

Duration : 1d 5h 36m 53s
CPU hours : 348.3

Each run of 50 (scattered) HaplotypeCallerVCF takes over an hour on average, < 2 CPU, ~ 3 GB

Combined GVCFs

Duration : 16h 4m 13s
CPU hours : 181.2

Each run of 50 CombineGVCFs processes takes < 4 minutes, < 2 CPU, ~ 1 GB
Each run of 50 GenotypeGVCFs processes takes < 3 minutes, < 2 CPU, ~ 3 GB

The final output is much the same. Not identical, but the GenotypeGVCF method is the newer best practices method and is likely to be the better of the two. Differences seem to be mostly (exclusively?) in lower quality calls.

Genotype Concordances:
Details: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/orig-cgvcf.cmp

Closes #156 #108

Testing Results

N-T paired WGS
- samples: DO40167 from PCAWG-63, Normal: 153 GB, Tumor: 178 GB
- input csv: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/test/DO40167.yaml
- config: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/test/PCAWG-63.config
- original method output: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/main-DO40167-PCAWG-63-orig
- combine GVCFs output: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/sfitz-combine-gvcfs-DO40167-PCAWG-63-cgvcf

Checklist

I have read the code review guidelines and the code review best practice on GitHub check-list.
I have reviewed the Nextflow pipeline standards.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the github standards before opening this pull request.
I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)
I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)
I have tested the pipeline on at least one A-mini sample.

yashpatel6

The results do look promising! We'll need to test it out with additional modes to get some information

module/combine-gvcfs.nf

tyamaguchi-ucla

Interesting! I think it's worth testing on a highly mutated metastasis sample as well. It likely works well with relatively quiet tumors but this joint genotyping process with tumors may introduce bias in the germline calls of the paired normal samples when applied to more progressed tumor samples.

tyamaguchi-ucla · 2024-04-05T20:02:33Z

config/default.config

-    gatk_version = "4.2.4.1"
+    gatk_version = "4.5.0.0"


Was the test performed before or after this update? There are numerous changes since 4.2.4.1. I would suggest creating a separate PR for this or does this pipeline change require the latest version?

Ah yes, the update was required to get the other branch to work. That branch uses GenomicsDBImport instead of CombineGVCFs. The initial comparisons between this branch and the original method were both done with gatk version 4.2.4.1. The later comparisons between GenomicsDBImport andCombineGVCFs` were both done with gatk version 4.5.0.0.

sorelfitzgibbon · 2024-04-05T20:39:58Z

Interesting! I think it's worth testing on a highly mutated metastasis sample as well. It likely works well with relatively quiet tumors but this joint genotyping process with tumors may introduce bias in the germline calls of the paired normal samples when applied to more progressed tumor samples.

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

So it's a separate question whether it should be joint calling with the normal sample. I wonder if the benefits of having a second sample from an independent library outweigh any cons of somatic mutations influencing the normal calls. I assume even very progressed tumor samples have orders of magnitude less somatic variants than germline?

tyamaguchi-ucla · 2024-04-05T20:54:52Z

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented GenotypeGVCFs in call-gSNP prior to this PR? I think this pipeline does HaplotypeCaller + VQSR.

tyamaguchi-ucla · 2024-04-05T20:56:49Z

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented GenotypeGVCFs in call-gSNP prior to this PR? I think this pipeline does HaplotypeCaller + VQSR.

Or does this update run GenotypeGVCFs on each sample?

sorelfitzgibbon · 2024-04-05T22:04:42Z

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented GenotypeGVCFs in call-gSNP prior to this PR? I think this pipeline does HaplotypeCaller + VQSR.

Or does this update run GenotypeGVCFs on each sample?

It runs HaplotypeCaller on each sample

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented GenotypeGVCFs in call-gSNP prior to this PR? I think this pipeline does HaplotypeCaller + VQSR.

Or does this update run GenotypeGVCFs on each sample?

Current main runs:

HaplotypeCaller in emit-ref-confidence GVCF mode separately on each sample/BAM
HaplotypeCaller jointly on all of the input BAMs for the patient

This update runs:

HaplotypeCaller in emit-ref-confidence GVCF mode separately on each sample/BAM
GenotypeGVCFs jointly on all of the GVCFs for the patient

They yield essentially the same results, but the current method repeats the computationally heavy step.

tyamaguchi-ucla · 2024-04-05T23:27:06Z

HaplotypeCaller in emit-ref-confidence GVCF mode separately on each sample/BAM

Yup, this was done to generate inputs for https://github.com/uclahs-cds/pipeline-regenotype-gSNP

pboutros · 2024-04-10T04:18:41Z

Really exciting! May be worth benchmarking on the Precision FDA dataset: https://precision.fda.gov/challenges/truth/results as well

tyamaguchi-ucla · 2024-04-22T20:36:40Z

@sorelfitzgibbon @pboutros @yashpatel6 Alternatively, it seems like we can genotype the gVCFs without combining them, which will make the process simpler and faster. Then, we can use https://github.com/uclahs-cds/pipeline-regenotype-gSNP for cohort level genotyping.

sorelfitzgibbon · 2024-04-22T20:55:50Z

@sorelfitzgibbon @pboutros @yashpatel6 Alternatively, it seems like we can genotype the gVCFs without combining them, which will make the process simpler and faster. Then, we can use https://github.com/uclahs-cds/pipeline-regenotype-gSNP for cohort level genotyping.

I don't think this is possible. My understanding is they have to be combined with either CombineGVCFs or GenomicsDBImport before genotyping.

sorelfitzgibbon · 2024-05-11T01:22:17Z

This branch (combine-gvcfs) has now been tested in the following ways (details below):

main branch and combine-gvcfs branch run on HG001 30x downsampled with each result compared to GIAB HG001 benchmark VCF (part of precision FDA), as suggested by @pboutros.
main vs combine-gvcfs on HG001 exomes
combine-gvcfs branch on four samples from a PCAWG-63 donor, 2 tumor and 2 normal.

Details

Benchmark HG001 variants and recommended benchmarking regions dowloaded from:

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38

to:

/hot/user/sfitzgibbon/giab-downloading/HG001/HG001_GRCh38_1_22_v4.2.1_benchmark.bed
/hot/user/sfitzgibbon/giab-downloading/HG001/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz

1. `HG001` 30x downsampled

BAM downloaded from

ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/RMNISTHS_30xdownsample.bam

Reads realigned, recalibrated, then call-gSNP.
Results:

`Call-gSNP` `main` and `combine-gvcfs` variants compared with benchmark variants (using `hap.py`: standardize regions to haplotypes)

There are slightly more "false positive" and slightly more "false negative" SNVs in the combine-gvcfs branch compared to the main branch.
- I suspect this is due to the benchmark variants and the main branch variants both being called by the old HaplotypeCaller method, and those discrepant variants are low quality.
- Supporting this: The TiTv ratio is slightly higher for the combine-gvcfs branch.
There are slightly more "false positive" and slightly less "false negative" INDELs in the combine-gvcfs branch

Type	Branch	TRUTH.TOTAL	TRUTH.TP	TRUTH.FN	QUERY.TOTAL	QUERY.FP	QUERY.UNK	FP.gt	FP.al	METRIC.Recall	METRIC.Precision	METRIC.Frac_NA	METRIC.F1_Score	TRUTH.TOTAL.TiTv_ratio	QUERY.TOTAL.TiTv_ratio	TRUTH.TOTAL.het_hom_ratio	QUERY.TOTAL.het_hom_ratio
SNP	main	3254352	3025399	228953	3308188	2735	279297	275	149	0.9296469999999999	0.9990969999999999	0.084426	0.9631219999999999	2.110715142410159	2.0595541024351274	1.526276146858805	1.468632011740415
SNP	combine	3254352	2994915	259437	3258372	2974	259731	265	163	0.9202799999999999	0.999008	0.079712	0.9580290000000001	2.110715142410159	2.064118390928878	1.526276146858805	1.614994284283063
INDEL	main	467684	452975	14709	842223	5003	367279	892	3076	0.968549	0.989466	0.436083	0.9788959999999999			1.4564622691647133	1.71186535561161
INDEL	combine	467684	453032	14652	842327	5005	367343	892	3075	0.968671	0.9894629999999999	0.436105	0.978957			1.4564622691647133	1.7121678625645012

`Call-gSNP` `main` and `combine-gvcfs` branches compared directly against each other

Type	Query	Truth	TRUTH.TOTAL	TRUTH.TP	TRUTH.FN	QUERY.TOTAL	QUERY.FP	QUERY.UNK	FP.gt	FP.al	METRIC.Recall	METRIC.Precision	METRIC.Frac_NA	METRIC.F1_Score	TRUTH.TOTAL.TiTv_ratio	QUERY.TOTAL.TiTv_ratio	TRUTH.TOTAL.het_hom_ratio	QUERY.TOTAL.het_hom_ratio
INDEL	combine	main	458311	457695	616	879053	845	404068	0	773	0.998656	0.9982209999999999	0.459663	0.9984379999999999			1.4146944562280916	1.710720187710512
SNP	combine	main	3027873	2930405	97468	3364587	67341	365945	0	9154	0.9678100000000001	0.977543	0.10876400000000001	0.9726520000000001	2.1212151194249054	2.0610247841269524	1.473022797923228	1.6170602116421053

`pipeline-call-gSNP` output:

main branch:

/hot/user/sfitzgibbon/giab-downloading/HG001/NA12878_HiSeq_downsampled30X_GRCh37_10262015/pipeline-runs/main-RMNISTHS_30xdownsample-gSNP

combine-gvcfs branch:

/hot/user/sfitzgibbon/giab-downloading/HG001/NA12878_HiSeq_downsampled30X_GRCh37_10262015/pipeline-runs/sfitz-combine-gvcfs-RMNISTHS_30xdownsample-gSNP

2. Two `HG001` Exomes, NIST7035 and NIST7036

FASTQs downloaded from

ftp.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome

Reads aligned, recalibrated, then call-gSNP.
Results:

`Call-gSNP` `main` branch variants compared with `combine-gvcfs` branch variants

I'm not sure yet how to interpret these exome results, but probably the methods were not adequate. I did not have the exome regions bed file and I used a benchmark regions bed file for WGS.

Type	Exome	Query	Truth	TRUTH.TOTAL	TRUTH.TP	TRUTH.FN	QUERY.TOTAL	QUERY.FP	QUERY.UNK	FP.gt	FP.al	METRIC.Recall	METRIC.Precision	METRIC.Frac_NA	METRIC.F1_Score	TRUTH.TOTAL.TiTv_ratio	QUERY.TOTAL.TiTv_ratio	TRUTH.TOTAL.het_hom_ratio	QUERY.TOTAL.het_hom_ratio
SNP	NIST7035	main	combine	165578	162858	2720	230841	42792	25176	67	356	0.9835729999999999	0.791933	0.10906199999999999	0.877411	2.1412257905419505	2.0782943772583633	0.5538806656592298	0.46946168783624614
SNP	NIST7035	combine	main	205629	162857	42772	185934	2733	20329	64	360	0.791994	0.983497	0.10933399999999999	0.877418	2.143839909498112	2.070116062141949	0.46188905163464683	0.5439527582598296
SNP	NIST7086	main	combine	180913	177914	2999	244897	40331	26637	60	381	0.9834229999999999	0.8152159999999999	0.108768	0.8914540000000001	2.143892267593397	2.075800577671732	0.524525503152075	0.45642178643017317
SNP	NIST7086	combine	main	218225	177914	40311	202985	3014	22043	57	383	0.815278	0.9833430000000001	0.10859400000000001	0.891458	2.1425362496220246	2.0719527880759627	0.4484251053801985	0.5160760174505469
INDEL	NIST7035	main	combine	30083	29695	388	51990	4886	17163	10	475	0.987102	0.859707	0.330121	0.91901			0.49863636363636366	0.5439859053989489
INDEL	NIST7035	combine	main	34505	29698	4807	44652	401	14309	8	205	0.8606870000000001	0.986784	0.320456	0.9194319999999999			0.47614913841746603	0.5612105022670556
INDEL	NIST7086	main	combine	32208	31846	362	54688	4526	18037	14	385	0.9887610000000001	0.876511	0.329816	0.929258			0.4915246278818613	0.5381309204015269
INDEL	NIST7086	combine	main	36318	31847	4471	47990	377	15493	9	166	0.876893	0.988399	0.322838	0.929313			0.46822207643103164	0.5585738539898133

3. PCAWG-63 donor `DO35350`. `combine-gvcfs` branch ran successfully on four samples, two tumor, two normal.

main run log: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/sfitz-combine-gvcfs_DO35350_gSNP.log
output: /hot/data/unregistered/PCAWG-63/pipeline-runs/call-gSNP/sfitz-combine-gvcfs-DO35350-gSNP

Interestingly, this set of samples reliably (tested 6+ times) causes errors indicating the detachment of scratch (which has been a somewhat infrequent random error across varied pipeline runs). This was true for both main and combine-gvcfs branch runs and was solved by changing the workDir to /hot.

yashpatel6

A couple of comments for now

main.nf

module/haplotypecaller.nf

sorelfitzgibbon · 2024-05-31T22:23:48Z

The previous exome comparisons had accidentally been done with different GATK versions. Now repeated, both with GATK-4.5.0.0, yielding somewhat better numbers. I was not able to get comparisons to the "truth" set working and feel like I've spent too much time on this already.

Type	Exome	Query	Truth	TRUTH.TOTAL	TRUTH.TP	TRUTH.FN	QUERY.TOTAL	QUERY.FP	QUERY.UNK	FP.gt	FP.al	METRIC.Recall	METRIC.Precision	METRIC.Frac_NA	METRIC.F1_Score	TRUTH.TOTAL.TiTv_ratio	QUERY.TOTAL.TiTv_ratio	TRUTH.TOTAL.het_hom_ratio	QUERY.TOTAL.het_hom_ratio
SNP	NIST7035	main	combine	3301	3264	37	8961	498	5198	1	5	0.988791	0.8676590000000001	0.580069	0.924273	2.3488843813387423	2.2156440617151056	0.875	0.7894947074096266
SNP	NIST7035	combine	main	3761	3264	497	7919	37	4617	0	2	0.8678540000000001	0.9887950000000001	0.583028	0.924386	2.326259946949602	2.2313341493268055	0.742354031510658	0.925115487478726
INDEL	NIST7035	main	combine	529	523	6	1882	70	1282	0	10	0.988658	0.883333	0.68119	0.933033			0.8434163701067615	0.9778247096092925
INDEL	NIST7035	combine	main	589	522	67	1663	8	1126	0	7	0.8862479999999999	0.985102	0.67709	0.9330639999999999			0.7753846153846153	1.0357142857142858

sorelfitzgibbon · 2024-06-03T13:27:25Z

Final test after removing HaplotypeCallerVCF successful: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/gSNP_sfitz-combine-gvcfs_DO36881_PCAWG-63.log

yashpatel6 · 2024-06-05T03:18:32Z

Panel sample comparison of sites (keep in mind this was done with the main branch on GATK v4.2.4.1 and this branch on GATK 4.5.0.0):

	main only	combine-gvcf-only	overlap
Pre-filter	12	5	34476
Filtered SNV	666	82	21043
Filtered INDEL	35	22	5723

Generally speaking, the numbers are in alignment with what we've seen before with GATK version updates. Are we pending any other tests/comparisons? The results seem generally agreeable with the previous method

Anything else to add @tyamaguchi-ucla ?

tyamaguchi-ucla · 2024-06-06T01:53:12Z

Anything else to add @tyamaguchi-ucla ?

I'm curious to understand the difference between the original method, which has been deprecated, and this updated method in the context of calling germline SNPs using normal and highly mutated tumor samples (e.g. mets). (Let me know if the tested samples above are highly mutated) Could we ask Jieun to test out this branch and maybe process an average sample and the most mutated one from the SU2C dataset? I think she already has gSNP calls for the older version.

Also, regarding regenotype-gSNP, GATK has advised using the same version of GATK HC whenever possible. The exome benchmarking looks fine but end users should at least be aware of the changes/updates and the possible consequences of using the updated version especially if they are currently processing large datasets.

https://gatk.broadinstitute.org/hc/en-us/articles/16957867036315-Introducing-GATK-Biggest-Practices-for-Joint-Calling-Supersized-Cohorts

"Finally comes single-sample variant calling to produce GVCFs. Aside from entirely new local assembly code, even minor changes and bug fixes from version to version could potentially yield slightly different alignments and/or representations of variants, so we advise using the same version of GATK HaplotypeCaller whenever possible."

yashpatel6 · 2024-06-06T23:11:06Z

Anything else to add @tyamaguchi-ucla ?

I'm curious to understand the difference between the original method, which has been deprecated, and this updated method in the context of calling germline SNPs using normal and highly mutated tumor samples (e.g. mets). (Let me know if the tested samples above are highly mutated) Could we ask Jieun to test out this branch and maybe process an average sample and the most mutated one from the SU2C dataset? I think she already has gSNP calls for the older version.

The methods aren't really deprecated; this is more of an update to the steps taken without any change to the way calling is done in the context of single-sample vs joint. The current calling is being done on a per-patient level and it's the same with the updated approach here. The only difference being performing the full calling directly with HC from BAM(s) versus going to intermediate GVCF per sample and then doing the genotyping together.

That being said, the results with the GIAB sample (with a given truth set) plus the exome samples would suggest this branch should be functional and follows the recommended practices; I'm not sure what additional information running a highly mutated sample would add for this particular PR as we don't really have a truth set for the SU2C samples to compare against. If the intent is to have some understanding of what the differences might be between the two workflows, that can be done separately from the PR itself here, starting with a release candidate with these changes implemented, rather than hold up this PR for that purpose?

Also, regarding regenotype-gSNP, GATK has advised using the same version of GATK HC whenever possible. The exome benchmarking looks fine but end users should at least be aware of the changes/updates and the possible consequences of using the updated version especially if they are currently processing large datasets.

https://gatk.broadinstitute.org/hc/en-us/articles/16957867036315-Introducing-GATK-Biggest-Practices-for-Joint-Calling-Supersized-Cohorts

"Finally comes single-sample variant calling to produce GVCFs. Aside from entirely new local assembly code, even minor changes and bug fixes from version to version could potentially yield slightly different alignments and/or representations of variants, so we advise using the same version of GATK HaplotypeCaller whenever possible."

Totally agree, users should be made aware of trying to keep the versions the same between HC and regenotyping. We can add it as a note in the regenotype-gSNP pipeline. The update made here will also be a major version change with proper notification of the changes made and things to be aware of for on-going processing runs.

tyamaguchi-ucla · 2024-06-07T00:59:54Z

@yashpatel6

rather than hold up this PR for that purpose?

Here the conventional comments are useful. I should've added suggestion/thought/question: (non-blocking).

The methods aren't really deprecated; this is more of an update to the steps taken without any change to the way calling is done in the context of single-sample vs joint. The current calling is being done on a per-patient level and it's the same with the updated approach here. The only difference being performing the full calling directly with HC from BAM(s) versus going to intermediate GVCF per sample and then doing the genotyping together.

That being said, the results with the GIAB sample (with a given truth set) plus the exome samples would suggest this branch should be functional and follows the recommended practices; I'm not sure what additional information running a highly mutated sample would add for this particular PR as we don't really have a truth set for the SU2C samples to compare against. If the intent is to have some understanding of what the differences might be between the two workflows, that can be done separately from the PR itself here, starting with a release candidate with these changes implemented, rather than hold up this PR for that purpose?

Right, I meant the best practices but I'm curious to understand to what extent GenotypeGVCFs efficiently rescues germline SNPs compared to the current method. It seems like the classic HC method that we use does some rescuing but is not as effective as GenotypeGVCFs. (computationally for sure though I'm not sure about the performance of SNP detection. Let me know if GenotypeGVCFs and the direct HC method are using the same function for rescuing variants.)

https://gatk.broadinstitute.org/hc/en-us/articles/360035890431-The-logic-of-joint-calling-for-germline-short-variants (Thanks for the link, @sorelfitzgibbon)

As joint genotyping is performed using one or more tumor samples with normal to predict germline variants in this pipeline, I'm curious to understand how much contamination (as in FPs and TNs) might be introduced based on the purity, PGA, CNAs, LOH, and # somatic variants. At the same time, tumor samples generally have higher coverage, which should positively influence joint genotyping. Also, it's recommended to use > 50 samples to perform GenotypeGVCFs anyway and there might not be much rescuing for a few samples.

Even if we encounter such cases, since we keep the single-sample normal gVCFs for regenotype-gSNP, we can use the regenotyped VCF instead of the VCFs from this pipeline. In any case, this update will generally benefit both us and end users. Thus, it's non-blocking.

sorelfitzgibbon · 2024-06-07T03:55:11Z

@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result.

tyamaguchi-ucla · 2024-06-07T04:15:43Z

@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result.

@sorelfitzgibbon Yeah, I guess we can only speculate. To my knowledge, GATK hasn't been really systematically benchmarked in this context. This also applies to Indel Realignment but we haven't done much systematic benchmarking (IR has been already dropped from GATK4). As we're getting more datasets to look into spatial heterogeneity (e.g. a GBM patient with 1 normal and 20 tumors) and datasets including organoids, I've been curious about this issue. Perhaps this could be a project for a summer student.

sorelfitzgibbon · 2024-06-07T04:49:39Z

@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result.

@sorelfitzgibbon Yeah, I guess we can only speculate. To my knowledge, GATK hasn't been really systematically benchmarked in this context. This also applies to Indel Realignment but we haven't done much systematic benchmarking (IR has been already dropped from GATK4). As we're getting more datasets to look into spatial heterogeneity (e.g. a GBM patient with 1 normal and 20 tumors) and datasets including organoids, I've been curious about this issue. Perhaps this could be a project for a summer student.

I like that idea! Would twenty highly mutated tumors cause more harm than good. But how would the closest to truth be determined?

sorelfitzgibbon · 2024-06-07T04:55:07Z

@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result.

@sorelfitzgibbon Yeah, I guess we can only speculate. To my knowledge, GATK hasn't been really systematically benchmarked in this context. This also applies to Indel Realignment but we haven't done much systematic benchmarking (IR has been already dropped from GATK4). As we're getting more datasets to look into spatial heterogeneity (e.g. a GBM patient with 1 normal and 20 tumors) and datasets including organoids, I've been curious about this issue. Perhaps this could be a project for a summer student.

I like that idea! Would twenty highly mutated tumors cause more harm than good. But how would the closest to truth be determined?

For dropping Indel Realignment there might be a problem in that MuSE recommends it. As far as I've been able to tell, all other tools within metapipeline-DNA don't need it.

yashpatel6

Functionally, the changes look good! I'm good with merging this and getting a release candidate out.

tyamaguchi-ucla · 2024-06-07T23:28:22Z

For dropping Indel Realignment there might be a problem in that MuSE recommends it. As far as I've been able to tell, all other tools within metapipeline-DNA don't need it.

"The results of our investigation favored the co-local realignment of tumor and matched-normal BAMs rather than the local realignment of tumor and matched-normal BAMs separately (data not shown)."

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1029-6

We believe this generally applies to other somatic variant detection algorithms and that's why we keep the IR (data not shown).

I like that idea! Would twenty highly mutated tumors cause more harm than good. But how would the closest to truth be determined?

We can take several approaches but can create GitHub discussions internally.

sorelfitzgibbon added 8 commits March 29, 2024 12:46

use F72 config for F16

1a6f75d

add genotype from gvcfs

2c7b3eb

tmp F16 changed to F72 values

255beb4

add combine and genotype gvcfs

460fcbd

tmp F72 to F16 config

4b130c5

tidy up scripts and replace input validation

5b5d892

replace original F16.config

8407e29

update changelog

694b6d2

sorelfitzgibbon requested a review from a team as a code owner April 1, 2024 16:31

yashpatel6 self-assigned this Apr 1, 2024

yashpatel6 reviewed Apr 1, 2024

View reviewed changes

module/combine-gvcfs.nf Show resolved Hide resolved

sorelfitzgibbon added 5 commits April 2, 2024 09:16

update gatk

9c3f981

update resources

46c9555

removed unused genomicsdb-import

edf5010

fix intermediate output

9cdd5ff

adjust resources

1297686

tyamaguchi-ucla self-assigned this Apr 5, 2024

tyamaguchi-ucla reviewed Apr 5, 2024

View reviewed changes

update changelog

18c2738

yashpatel6 self-requested a review May 15, 2024 17:22

yashpatel6 reviewed May 20, 2024

View reviewed changes

main.nf Outdated Show resolved Hide resolved

module/haplotypecaller.nf Outdated Show resolved Hide resolved

remove intervals from combine-gvcfs

e883688

remove unused haplotypecallervcf

e929f14

yashpatel6 self-requested a review June 7, 2024 20:49

yashpatel6 approved these changes Jun 7, 2024

View reviewed changes

sorelfitzgibbon merged commit 3de3df6 into main Jun 7, 2024
5 checks passed

sorelfitzgibbon deleted the sfitz-combine-gvcfs branch June 7, 2024 23:14

tyamaguchi-ucla linked an issue Jun 8, 2024 that may be closed by this pull request

Update GATK to 4.5.0.0 #108

Closed

tyamaguchi-ucla mentioned this pull request Jun 8, 2024

Update GATK to 4.5.0.0 #108

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use GVCFs for genotyping - run time/CPU hours substantially reduced (0.52) #158

Use GVCFs for genotyping - run time/CPU hours substantially reduced (0.52) #158

sorelfitzgibbon commented Apr 1, 2024 •

edited by tyamaguchi-ucla

Loading

yashpatel6 left a comment

tyamaguchi-ucla left a comment

tyamaguchi-ucla Apr 5, 2024

sorelfitzgibbon Apr 5, 2024 •

edited

Loading

sorelfitzgibbon commented Apr 5, 2024

tyamaguchi-ucla commented Apr 5, 2024

tyamaguchi-ucla commented Apr 5, 2024

sorelfitzgibbon commented Apr 5, 2024

tyamaguchi-ucla commented Apr 5, 2024

pboutros commented Apr 10, 2024

tyamaguchi-ucla commented Apr 22, 2024

sorelfitzgibbon commented Apr 22, 2024

sorelfitzgibbon commented May 11, 2024

yashpatel6 left a comment

sorelfitzgibbon commented May 31, 2024

sorelfitzgibbon commented Jun 3, 2024

yashpatel6 commented Jun 5, 2024

tyamaguchi-ucla commented Jun 6, 2024

yashpatel6 commented Jun 6, 2024

tyamaguchi-ucla commented Jun 7, 2024

sorelfitzgibbon commented Jun 7, 2024 •

edited

Loading

tyamaguchi-ucla commented Jun 7, 2024 •

edited

Loading

sorelfitzgibbon commented Jun 7, 2024

sorelfitzgibbon commented Jun 7, 2024

yashpatel6 left a comment

tyamaguchi-ucla commented Jun 7, 2024

Use GVCFs for genotyping - run time/CPU hours substantially reduced (0.52) #158

Use GVCFs for genotyping - run time/CPU hours substantially reduced (0.52) #158

Conversation

sorelfitzgibbon commented Apr 1, 2024 • edited by tyamaguchi-ucla Loading

Description

Performance

Original method

Combined GVCFs

Closes #156 #108

Testing Results

Checklist

yashpatel6 left a comment

Choose a reason for hiding this comment

tyamaguchi-ucla left a comment

Choose a reason for hiding this comment

tyamaguchi-ucla Apr 5, 2024

Choose a reason for hiding this comment

sorelfitzgibbon Apr 5, 2024 • edited Loading

Choose a reason for hiding this comment

sorelfitzgibbon commented Apr 5, 2024

tyamaguchi-ucla commented Apr 5, 2024

tyamaguchi-ucla commented Apr 5, 2024

sorelfitzgibbon commented Apr 5, 2024

tyamaguchi-ucla commented Apr 5, 2024

pboutros commented Apr 10, 2024

tyamaguchi-ucla commented Apr 22, 2024

sorelfitzgibbon commented Apr 22, 2024

sorelfitzgibbon commented May 11, 2024

Details

1. HG001 30x downsampled

Call-gSNP main and combine-gvcfs variants compared with benchmark variants (using hap.py: standardize regions to haplotypes)

Call-gSNP main and combine-gvcfs branches compared directly against each other

pipeline-call-gSNP output:

2. Two HG001 Exomes, NIST7035 and NIST7036

Call-gSNP main branch variants compared with combine-gvcfs branch variants

3. PCAWG-63 donor DO35350. combine-gvcfs branch ran successfully on four samples, two tumor, two normal.

yashpatel6 left a comment

Choose a reason for hiding this comment

sorelfitzgibbon commented May 31, 2024

sorelfitzgibbon commented Jun 3, 2024

yashpatel6 commented Jun 5, 2024

tyamaguchi-ucla commented Jun 6, 2024

yashpatel6 commented Jun 6, 2024

tyamaguchi-ucla commented Jun 7, 2024

sorelfitzgibbon commented Jun 7, 2024 • edited Loading

tyamaguchi-ucla commented Jun 7, 2024 • edited Loading

sorelfitzgibbon commented Jun 7, 2024

sorelfitzgibbon commented Jun 7, 2024

yashpatel6 left a comment

Choose a reason for hiding this comment

tyamaguchi-ucla commented Jun 7, 2024

sorelfitzgibbon commented Apr 1, 2024 •

edited by tyamaguchi-ucla

Loading

sorelfitzgibbon Apr 5, 2024 •

edited

Loading

1. `HG001` 30x downsampled

`Call-gSNP` `main` and `combine-gvcfs` variants compared with benchmark variants (using `hap.py`: standardize regions to haplotypes)

`Call-gSNP` `main` and `combine-gvcfs` branches compared directly against each other

`pipeline-call-gSNP` output:

2. Two `HG001` Exomes, NIST7035 and NIST7036

`Call-gSNP` `main` branch variants compared with `combine-gvcfs` branch variants

3. PCAWG-63 donor `DO35350`. `combine-gvcfs` branch ran successfully on four samples, two tumor, two normal.

sorelfitzgibbon commented Jun 7, 2024 •

edited

Loading

tyamaguchi-ucla commented Jun 7, 2024 •

edited

Loading