Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use GVCFs for genotyping - run time/CPU hours substantially reduced (0.52) #158

Merged
merged 16 commits into from
Jun 7, 2024

Conversation

sorelfitzgibbon
Copy link
Contributor

@sorelfitzgibbon sorelfitzgibbon commented Apr 1, 2024

Description

The pipeline has been creating GVCFs for the input samples (1 normal and 1 or more tumor samples for a patient), but then starting over to run joint genotyping on the set of samples using HaplotypeCaller. This unnecessarily repeats the heavy computation steps. This PR instead combines the GVCFs and joint genotypes using GenotypeGVCFs. This runs 1.8 x faster, (0.52 of CPU hours).

Performance

Original method

Duration : 1d 5h 36m 53s
CPU hours : 348.3

Each run of 50 (scattered) HaplotypeCallerVCF takes over an hour on average, < 2 CPU, ~ 3 GB

Combined GVCFs

Duration : 16h 4m 13s
CPU hours : 181.2

Each run of 50 CombineGVCFs processes takes < 4 minutes, < 2 CPU, ~ 1 GB
Each run of 50 GenotypeGVCFs processes takes < 3 minutes, < 2 CPU, ~ 3 GB

The final output is much the same. Not identical, but the GenotypeGVCF method is the newer best practices method and is likely to be the better of the two. Differences seem to be mostly (exclusively?) in lower quality calls.

Genotype Concordances:
Details: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/orig-cgvcf.cmp
plots-af
plots-dp-ndr

Closes #156 #108

Testing Results

  • N-T paired WGS
    • samples: DO40167 from PCAWG-63, Normal: 153 GB, Tumor: 178 GB
    • input csv: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/test/DO40167.yaml
    • config: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/test/PCAWG-63.config
    • original method output: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/main-DO40167-PCAWG-63-orig
    • combine GVCFs output: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/sfitz-combine-gvcfs-DO40167-PCAWG-63-cgvcf

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample.

@sorelfitzgibbon sorelfitzgibbon requested a review from a team as a code owner April 1, 2024 16:31
@yashpatel6 yashpatel6 self-assigned this Apr 1, 2024
Copy link
Collaborator

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results do look promising! We'll need to test it out with additional modes to get some information

module/combine-gvcfs.nf Show resolved Hide resolved
@tyamaguchi-ucla tyamaguchi-ucla self-assigned this Apr 5, 2024
Copy link

@tyamaguchi-ucla tyamaguchi-ucla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! I think it's worth testing on a highly mutated metastasis sample as well. It likely works well with relatively quiet tumors but this joint genotyping process with tumors may introduce bias in the germline calls of the paired normal samples when applied to more progressed tumor samples.

gatk_version = "4.2.4.1"
gatk_version = "4.5.0.0"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the test performed before or after this update? There are numerous changes since 4.2.4.1. I would suggest creating a separate PR for this or does this pipeline change require the latest version?

Copy link
Contributor Author

@sorelfitzgibbon sorelfitzgibbon Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, the update was required to get the other branch to work. That branch uses GenomicsDBImport instead of CombineGVCFs. The initial comparisons between this branch and the original method were both done with gatk version 4.2.4.1. The later comparisons between GenomicsDBImport andCombineGVCFs` were both done with gatk version 4.5.0.0.

@sorelfitzgibbon
Copy link
Contributor Author

Interesting! I think it's worth testing on a highly mutated metastasis sample as well. It likely works well with relatively quiet tumors but this joint genotyping process with tumors may introduce bias in the germline calls of the paired normal samples when applied to more progressed tumor samples.

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

So it's a separate question whether it should be joint calling with the normal sample. I wonder if the benefits of having a second sample from an independent library outweigh any cons of somatic mutations influencing the normal calls. I assume even very progressed tumor samples have orders of magnitude less somatic variants than germline?

@tyamaguchi-ucla
Copy link

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented GenotypeGVCFs in call-gSNP prior to this PR? I think this pipeline does HaplotypeCaller + VQSR.

@tyamaguchi-ucla
Copy link

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented GenotypeGVCFs in call-gSNP prior to this PR? I think this pipeline does HaplotypeCaller + VQSR.

Or does this update run GenotypeGVCFs on each sample?

@sorelfitzgibbon
Copy link
Contributor Author

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented GenotypeGVCFs in call-gSNP prior to this PR? I think this pipeline does HaplotypeCaller + VQSR.

Or does this update run GenotypeGVCFs on each sample?

It runs HaplotypeCaller on each sample

This pipeline was doing joint calling with the normal sample prior to this PR. I just made the same process more efficient, I didn't change anything fundamental.

I think we might be mixing up joint calling and joint genotyping? If not, I'm not aware of this. Have we already implemented GenotypeGVCFs in call-gSNP prior to this PR? I think this pipeline does HaplotypeCaller + VQSR.

Or does this update run GenotypeGVCFs on each sample?

Current main runs:

  • HaplotypeCaller in emit-ref-confidence GVCF mode separately on each sample/BAM
  • HaplotypeCaller jointly on all of the input BAMs for the patient

This update runs:

  • HaplotypeCaller in emit-ref-confidence GVCF mode separately on each sample/BAM
  • GenotypeGVCFs jointly on all of the GVCFs for the patient

They yield essentially the same results, but the current method repeats the computationally heavy step.

@tyamaguchi-ucla
Copy link

HaplotypeCaller in emit-ref-confidence GVCF mode separately on each sample/BAM

Yup, this was done to generate inputs for https://github.com/uclahs-cds/pipeline-regenotype-gSNP

@pboutros
Copy link

Really exciting! May be worth benchmarking on the Precision FDA dataset: https://precision.fda.gov/challenges/truth/results as well

@tyamaguchi-ucla
Copy link

@sorelfitzgibbon @pboutros @yashpatel6 Alternatively, it seems like we can genotype the gVCFs without combining them, which will make the process simpler and faster. Then, we can use https://github.com/uclahs-cds/pipeline-regenotype-gSNP for cohort level genotyping.

@sorelfitzgibbon
Copy link
Contributor Author

@sorelfitzgibbon @pboutros @yashpatel6 Alternatively, it seems like we can genotype the gVCFs without combining them, which will make the process simpler and faster. Then, we can use https://github.com/uclahs-cds/pipeline-regenotype-gSNP for cohort level genotyping.

I don't think this is possible. My understanding is they have to be combined with either CombineGVCFs or GenomicsDBImport before genotyping.

@sorelfitzgibbon
Copy link
Contributor Author

This branch (combine-gvcfs) has now been tested in the following ways (details below):

  1. main branch and combine-gvcfs branch run on HG001 30x downsampled with each result compared to GIAB HG001 benchmark VCF (part of precision FDA), as suggested by @pboutros.
  2. main vs combine-gvcfs on HG001 exomes
  3. combine-gvcfs branch on four samples from a PCAWG-63 donor, 2 tumor and 2 normal.

Details

Benchmark HG001 variants and recommended benchmarking regions dowloaded from:

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38

to:

/hot/user/sfitzgibbon/giab-downloading/HG001/HG001_GRCh38_1_22_v4.2.1_benchmark.bed
/hot/user/sfitzgibbon/giab-downloading/HG001/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz

1. HG001 30x downsampled

  • BAM downloaded from
ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/RMNISTHS_30xdownsample.bam
  • Reads realigned, recalibrated, then call-gSNP.
  • Results:
Call-gSNP main and combine-gvcfs variants compared with benchmark variants (using hap.py: standardize regions to haplotypes)
  • There are slightly more "false positive" and slightly more "false negative" SNVs in the combine-gvcfs branch compared to the main branch.
    • I suspect this is due to the benchmark variants and the main branch variants both being called by the old HaplotypeCaller method, and those discrepant variants are low quality.
    • Supporting this: The TiTv ratio is slightly higher for the combine-gvcfs branch.
  • There are slightly more "false positive" and slightly less "false negative" INDELs in the combine-gvcfs branch
Type Branch TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
SNP main 3254352 3025399 228953 3308188 2735 279297 275 149 0.9296469999999999 0.9990969999999999 0.084426 0.9631219999999999 2.110715142410159 2.0595541024351274 1.526276146858805 1.468632011740415
SNP combine 3254352 2994915 259437 3258372 2974 259731 265 163 0.9202799999999999 0.999008 0.079712 0.9580290000000001 2.110715142410159 2.064118390928878 1.526276146858805 1.614994284283063
INDEL main 467684 452975 14709 842223 5003 367279 892 3076 0.968549 0.989466 0.436083 0.9788959999999999 1.4564622691647133 1.71186535561161
INDEL combine 467684 453032 14652 842327 5005 367343 892 3075 0.968671 0.9894629999999999 0.436105 0.978957 1.4564622691647133 1.7121678625645012
Call-gSNP main and combine-gvcfs branches compared directly against each other
Type Query Truth TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL combine main 458311 457695 616 879053 845 404068 0 773 0.998656 0.9982209999999999 0.459663 0.9984379999999999 1.4146944562280916 1.710720187710512
SNP combine main 3027873 2930405 97468 3364587 67341 365945 0 9154 0.9678100000000001 0.977543 0.10876400000000001 0.9726520000000001 2.1212151194249054 2.0610247841269524 1.473022797923228 1.6170602116421053
pipeline-call-gSNP output:
  • main branch:
/hot/user/sfitzgibbon/giab-downloading/HG001/NA12878_HiSeq_downsampled30X_GRCh37_10262015/pipeline-runs/main-RMNISTHS_30xdownsample-gSNP
  • combine-gvcfs branch:
/hot/user/sfitzgibbon/giab-downloading/HG001/NA12878_HiSeq_downsampled30X_GRCh37_10262015/pipeline-runs/sfitz-combine-gvcfs-RMNISTHS_30xdownsample-gSNP

2. Two HG001 Exomes, NIST7035 and NIST7036

  • FASTQs downloaded from
ftp.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome
  • Reads aligned, recalibrated, then call-gSNP.
  • Results:
Call-gSNP main branch variants compared with combine-gvcfs branch variants
  • I'm not sure yet how to interpret these exome results, but probably the methods were not adequate. I did not have the exome regions bed file and I used a benchmark regions bed file for WGS.
Type Exome Query Truth TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
SNP NIST7035 main combine 165578 162858 2720 230841 42792 25176 67 356 0.9835729999999999 0.791933 0.10906199999999999 0.877411 2.1412257905419505 2.0782943772583633 0.5538806656592298 0.46946168783624614
SNP NIST7035 combine main 205629 162857 42772 185934 2733 20329 64 360 0.791994 0.983497 0.10933399999999999 0.877418 2.143839909498112 2.070116062141949 0.46188905163464683 0.5439527582598296
SNP NIST7086 main combine 180913 177914 2999 244897 40331 26637 60 381 0.9834229999999999 0.8152159999999999 0.108768 0.8914540000000001 2.143892267593397 2.075800577671732 0.524525503152075 0.45642178643017317
SNP NIST7086 combine main 218225 177914 40311 202985 3014 22043 57 383 0.815278 0.9833430000000001 0.10859400000000001 0.891458 2.1425362496220246 2.0719527880759627 0.4484251053801985 0.5160760174505469
INDEL NIST7035 main combine 30083 29695 388 51990 4886 17163 10 475 0.987102 0.859707 0.330121 0.91901 0.49863636363636366 0.5439859053989489
INDEL NIST7035 combine main 34505 29698 4807 44652 401 14309 8 205 0.8606870000000001 0.986784 0.320456 0.9194319999999999 0.47614913841746603 0.5612105022670556
INDEL NIST7086 main combine 32208 31846 362 54688 4526 18037 14 385 0.9887610000000001 0.876511 0.329816 0.929258 0.4915246278818613 0.5381309204015269
INDEL NIST7086 combine main 36318 31847 4471 47990 377 15493 9 166 0.876893 0.988399 0.322838 0.929313 0.46822207643103164 0.5585738539898133

3. PCAWG-63 donor DO35350. combine-gvcfs branch ran successfully on four samples, two tumor, two normal.

main run log: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/sfitz-combine-gvcfs_DO35350_gSNP.log
output: /hot/data/unregistered/PCAWG-63/pipeline-runs/call-gSNP/sfitz-combine-gvcfs-DO35350-gSNP

Interestingly, this set of samples reliably (tested 6+ times) causes errors indicating the detachment of scratch (which has been a somewhat infrequent random error across varied pipeline runs). This was true for both main and combine-gvcfs branch runs and was solved by changing the workDir to /hot.

@yashpatel6 yashpatel6 self-requested a review May 15, 2024 17:22
Copy link
Collaborator

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments for now

main.nf Outdated Show resolved Hide resolved
module/haplotypecaller.nf Outdated Show resolved Hide resolved
@sorelfitzgibbon
Copy link
Contributor Author

The previous exome comparisons had accidentally been done with different GATK versions. Now repeated, both with GATK-4.5.0.0, yielding somewhat better numbers. I was not able to get comparisons to the "truth" set working and feel like I've spent too much time on this already.

Type Exome Query Truth TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
SNP NIST7035 main combine 3301 3264 37 8961 498 5198 1 5 0.988791 0.8676590000000001 0.580069 0.924273 2.3488843813387423 2.2156440617151056 0.875 0.7894947074096266
SNP NIST7035 combine main 3761 3264 497 7919 37 4617 0 2 0.8678540000000001 0.9887950000000001 0.583028 0.924386 2.326259946949602 2.2313341493268055 0.742354031510658 0.925115487478726
INDEL NIST7035 main combine 529 523 6 1882 70 1282 0 10 0.988658 0.883333 0.68119 0.933033 0.8434163701067615 0.9778247096092925
INDEL NIST7035 combine main 589 522 67 1663 8 1126 0 7 0.8862479999999999 0.985102 0.67709 0.9330639999999999 0.7753846153846153 1.0357142857142858

@sorelfitzgibbon
Copy link
Contributor Author

Final test after removing HaplotypeCallerVCF successful: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-gSNP/gSNP_sfitz-combine-gvcfs_DO36881_PCAWG-63.log

@yashpatel6
Copy link
Collaborator

Panel sample comparison of sites (keep in mind this was done with the main branch on GATK v4.2.4.1 and this branch on GATK 4.5.0.0):

main only combine-gvcf-only overlap
Pre-filter 12 5 34476
Filtered SNV 666 82 21043
Filtered INDEL 35 22 5723

Generally speaking, the numbers are in alignment with what we've seen before with GATK version updates. Are we pending any other tests/comparisons? The results seem generally agreeable with the previous method

Anything else to add @tyamaguchi-ucla ?

@tyamaguchi-ucla
Copy link

Anything else to add @tyamaguchi-ucla ?

I'm curious to understand the difference between the original method, which has been deprecated, and this updated method in the context of calling germline SNPs using normal and highly mutated tumor samples (e.g. mets). (Let me know if the tested samples above are highly mutated) Could we ask Jieun to test out this branch and maybe process an average sample and the most mutated one from the SU2C dataset? I think she already has gSNP calls for the older version.

Also, regarding regenotype-gSNP, GATK has advised using the same version of GATK HC whenever possible. The exome benchmarking looks fine but end users should at least be aware of the changes/updates and the possible consequences of using the updated version especially if they are currently processing large datasets.

https://gatk.broadinstitute.org/hc/en-us/articles/16957867036315-Introducing-GATK-Biggest-Practices-for-Joint-Calling-Supersized-Cohorts

"Finally comes single-sample variant calling to produce GVCFs. Aside from entirely new local assembly code, even minor changes and bug fixes from version to version could potentially yield slightly different alignments and/or representations of variants, so we advise using the same version of GATK HaplotypeCaller whenever possible."

@yashpatel6
Copy link
Collaborator

Anything else to add @tyamaguchi-ucla ?

I'm curious to understand the difference between the original method, which has been deprecated, and this updated method in the context of calling germline SNPs using normal and highly mutated tumor samples (e.g. mets). (Let me know if the tested samples above are highly mutated) Could we ask Jieun to test out this branch and maybe process an average sample and the most mutated one from the SU2C dataset? I think she already has gSNP calls for the older version.

The methods aren't really deprecated; this is more of an update to the steps taken without any change to the way calling is done in the context of single-sample vs joint. The current calling is being done on a per-patient level and it's the same with the updated approach here. The only difference being performing the full calling directly with HC from BAM(s) versus going to intermediate GVCF per sample and then doing the genotyping together.

That being said, the results with the GIAB sample (with a given truth set) plus the exome samples would suggest this branch should be functional and follows the recommended practices; I'm not sure what additional information running a highly mutated sample would add for this particular PR as we don't really have a truth set for the SU2C samples to compare against. If the intent is to have some understanding of what the differences might be between the two workflows, that can be done separately from the PR itself here, starting with a release candidate with these changes implemented, rather than hold up this PR for that purpose?

Also, regarding regenotype-gSNP, GATK has advised using the same version of GATK HC whenever possible. The exome benchmarking looks fine but end users should at least be aware of the changes/updates and the possible consequences of using the updated version especially if they are currently processing large datasets.

https://gatk.broadinstitute.org/hc/en-us/articles/16957867036315-Introducing-GATK-Biggest-Practices-for-Joint-Calling-Supersized-Cohorts

"Finally comes single-sample variant calling to produce GVCFs. Aside from entirely new local assembly code, even minor changes and bug fixes from version to version could potentially yield slightly different alignments and/or representations of variants, so we advise using the same version of GATK HaplotypeCaller whenever possible."

Totally agree, users should be made aware of trying to keep the versions the same between HC and regenotyping. We can add it as a note in the regenotype-gSNP pipeline. The update made here will also be a major version change with proper notification of the changes made and things to be aware of for on-going processing runs.

@tyamaguchi-ucla
Copy link

@yashpatel6

rather than hold up this PR for that purpose?

Here the conventional comments are useful. I should've added suggestion/thought/question: (non-blocking).

The methods aren't really deprecated; this is more of an update to the steps taken without any change to the way calling is done in the context of single-sample vs joint. The current calling is being done on a per-patient level and it's the same with the updated approach here. The only difference being performing the full calling directly with HC from BAM(s) versus going to intermediate GVCF per sample and then doing the genotyping together.

That being said, the results with the GIAB sample (with a given truth set) plus the exome samples would suggest this branch should be functional and follows the recommended practices; I'm not sure what additional information running a highly mutated sample would add for this particular PR as we don't really have a truth set for the SU2C samples to compare against. If the intent is to have some understanding of what the differences might be between the two workflows, that can be done separately from the PR itself here, starting with a release candidate with these changes implemented, rather than hold up this PR for that purpose?

Right, I meant the best practices but I'm curious to understand to what extent GenotypeGVCFs efficiently rescues germline SNPs compared to the current method. It seems like the classic HC method that we use does some rescuing but is not as effective as GenotypeGVCFs. (computationally for sure though I'm not sure about the performance of SNP detection. Let me know if GenotypeGVCFs and the direct HC method are using the same function for rescuing variants.)

https://gatk.broadinstitute.org/hc/en-us/articles/360035890431-The-logic-of-joint-calling-for-germline-short-variants (Thanks for the link, @sorelfitzgibbon)

As joint genotyping is performed using one or more tumor samples with normal to predict germline variants in this pipeline, I'm curious to understand how much contamination (as in FPs and TNs) might be introduced based on the purity, PGA, CNAs, LOH, and # somatic variants. At the same time, tumor samples generally have higher coverage, which should positively influence joint genotyping. Also, it's recommended to use > 50 samples to perform GenotypeGVCFs anyway and there might not be much rescuing for a few samples.

Even if we encounter such cases, since we keep the single-sample normal gVCFs for regenotype-gSNP, we can use the regenotyped VCF instead of the VCFs from this pipeline. In any case, this update will generally benefit both us and end users. Thus, it's non-blocking.

@sorelfitzgibbon
Copy link
Contributor Author

sorelfitzgibbon commented Jun 7, 2024

@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result.

@tyamaguchi-ucla
Copy link

tyamaguchi-ucla commented Jun 7, 2024

@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result.

@sorelfitzgibbon Yeah, I guess we can only speculate. To my knowledge, GATK hasn't been really systematically benchmarked in this context. This also applies to Indel Realignment but we haven't done much systematic benchmarking (IR has been already dropped from GATK4). As we're getting more datasets to look into spatial heterogeneity (e.g. a GBM patient with 1 normal and 20 tumors) and datasets including organoids, I've been curious about this issue. Perhaps this could be a project for a summer student.

@sorelfitzgibbon
Copy link
Contributor Author

@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result.

@sorelfitzgibbon Yeah, I guess we can only speculate. To my knowledge, GATK hasn't been really systematically benchmarked in this context. This also applies to Indel Realignment but we haven't done much systematic benchmarking (IR has been already dropped from GATK4). As we're getting more datasets to look into spatial heterogeneity (e.g. a GBM patient with 1 normal and 20 tumors) and datasets including organoids, I've been curious about this issue. Perhaps this could be a project for a summer student.

I like that idea! Would twenty highly mutated tumors cause more harm than good. But how would the closest to truth be determined?

@sorelfitzgibbon
Copy link
Contributor Author

@tyamaguchi-ucla I think you're probably right that the influence of the one additional (tumor) sample won't have much impact on the final calls, perhaps only rescuing very close calls. Given both that and the orders of magnitude more germline variants than somatic variants even in a highly mutated lineage (I assume?), I doubt there will be many important changes with the addition of the tumor genome; changes that do result are more likely to be positive, and I especially doubt that important changes will depend on which method is used. Both methods minimally consider the other sample in constructing the final result.

@sorelfitzgibbon Yeah, I guess we can only speculate. To my knowledge, GATK hasn't been really systematically benchmarked in this context. This also applies to Indel Realignment but we haven't done much systematic benchmarking (IR has been already dropped from GATK4). As we're getting more datasets to look into spatial heterogeneity (e.g. a GBM patient with 1 normal and 20 tumors) and datasets including organoids, I've been curious about this issue. Perhaps this could be a project for a summer student.

I like that idea! Would twenty highly mutated tumors cause more harm than good. But how would the closest to truth be determined?

For dropping Indel Realignment there might be a problem in that MuSE recommends it. As far as I've been able to tell, all other tools within metapipeline-DNA don't need it.

@yashpatel6 yashpatel6 self-requested a review June 7, 2024 20:49
Copy link
Collaborator

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally, the changes look good! I'm good with merging this and getting a release candidate out.

@sorelfitzgibbon sorelfitzgibbon merged commit 3de3df6 into main Jun 7, 2024
5 checks passed
@sorelfitzgibbon sorelfitzgibbon deleted the sfitz-combine-gvcfs branch June 7, 2024 23:14
@tyamaguchi-ucla
Copy link

For dropping Indel Realignment there might be a problem in that MuSE recommends it. As far as I've been able to tell, all other tools within metapipeline-DNA don't need it.

"The results of our investigation favored the co-local realignment of tumor and matched-normal BAMs rather than the local realignment of tumor and matched-normal BAMs separately (data not shown)."

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1029-6

We believe this generally applies to other somatic variant detection algorithms and that's why we keep the IR (data not shown).

I like that idea! Would twenty highly mutated tumors cause more harm than good. But how would the closest to truth be determined?

We can take several approaches but can create GitHub discussions internally.

@tyamaguchi-ucla tyamaguchi-ucla linked an issue Jun 8, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Joint genotyping from GVCFs would substantially reduce computation Update GATK to 4.5.0.0
4 participants