indel matching on rsid #65

Merged
merged 23 commits into from
Sep 18, 2024

Conversation

alkaZeltser
Collaborator

@alkaZeltser alkaZeltser commented Sep 6, 2024

I encountered a systematic issue with PGS Catalog harmonized coordinate data: harmonized (GRCh37 to GRCh38) INDEL coordinates are almost always off by one base pair relative to genotype data called against the GRCh38 GENCODE reference with GATK HaplotypeCaller. I am not sure whether this happens with other aligners, variant callers, or references, but since this is a pretty common workflow, it is worth accounting for. Note that the problem persists even after normalizing indels against the GRCh38 ENSEMBL reference (which the PGS Catalog cites as the source of harmonization).

By default, PGS data is matched to VCF data by genomic coordinate (CHROM, POS). I have added a secondary merge that operates only on variants missed by the primary merge and attempts to match them by rsID. Since rsID is not a consistent label, it is not recommended as a primary matching mechanism; however, for inconsistently harmonized INDEL coordinates it works as a reasonable backup. rsID is also not a required column for PGS Catalog registration, so this method is conditional on the rsID being available in the PGS data.
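
The two-stage matching described above can be sketched roughly as follows. This is a minimal illustration in base R, not the code in this PR: the toy data, the `ID` column name for the rsID, the `beta` column, and the variable names are all assumptions made for the example.

```r
# Toy PGS weight data (hypothetical values); ID is assumed to hold the rsID
pgs <- data.frame(
    CHROM = c('chr1', 'chr1', 'chr2'),
    POS = c(1000L, 2001L, 3000L),
    ID = c('rs1', 'rs2', 'rs3'),
    beta = c(0.10, -0.20, 0.30),
    stringsAsFactors = FALSE
    );

# Toy VCF data; rs2 is an indel whose POS differs from the PGS data by one base pair
vcf <- data.frame(
    CHROM = c('chr1', 'chr1'),
    POS = c(1000L, 2000L),
    ID = c('rs1', 'rs2'),
    REF = c('A', 'AT'),
    ALT = c('G', 'A'),
    stringsAsFactors = FALSE
    );

# Primary merge: match on genomic coordinate, keeping every PGS variant
primary <- merge(
    x = pgs,
    y = vcf,
    by = c('CHROM', 'POS'),
    all.x = TRUE,
    suffixes = c('.pgs', '.vcf')
    );

# PGS variants with no VCF match have NA in the VCF-only columns
missed <- primary[is.na(primary$REF), c('CHROM', 'POS', 'ID.pgs', 'beta')];
names(missed)[names(missed) == 'ID.pgs'] <- 'ID';

# Secondary merge: only for missed variants that actually carry an rsID
has.rsid <- !is.na(missed$ID) & missed$ID != '';
secondary <- merge(
    x = missed[has.rsid, ],
    y = vcf,
    by = 'ID',
    suffixes = c('.pgs', '.vcf')
    );

print(secondary);
```

In this toy case the coordinate merge finds only rs1, and the rsID merge recovers rs2 with both its PGS and VCF positions retained in separate columns; rs3 stays unmatched because it is absent from the VCF data.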

In this PR:

  • Added a new module to combine.vcf.with.pgs() for rsID-based secondary matching
  • Small refactor of the missing-genotype handling methods in apply.polygenic.score() to account for coordinate differences between the PGS and VCF data for rsID-matched SNPs
  • Added rsID as an optional column in input PGS weight data, verified in import.pgs.weight.file
  • Added and updated unit tests and test data for the above
  • Updated documentation of the merge method

Note: data.table is now potentially a more sophisticated dependency. Does anyone know whether a full import is needed to manipulate data tables? @dan-knight?

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added the changes included in this pull request to NEWS under the next release version or unreleased, and updated the date.

  • I have updated the version number in metadata.yaml and DESCRIPTION.

  • Both R CMD build and R CMD check run successfully.

Testing Results

All unit tests PASS

@alkaZeltser alkaZeltser marked this pull request as draft September 6, 2024 18:00
@alkaZeltser alkaZeltser marked this pull request as ready for review September 6, 2024 20:59

@forbiddenpersimmon forbiddenpersimmon left a comment

LGTM!


# keep coordinates from VCF data for matched SNPs with coordinate mismatch
merged.vcf.with.missing.pgs.data[!is.na(merged.vcf.with.missing.pgs.data$REF), 'CHROM'] <- merged.vcf.with.missing.pgs.data[!is.na(merged.vcf.with.missing.pgs.data$REF), 'CHROM.vcf'];
merged.vcf.with.missing.pgs.data[!is.na(merged.vcf.with.missing.pgs.data$REF), 'POS'] <- merged.vcf.with.missing.pgs.data[!is.na(merged.vcf.with.missing.pgs.data$REF), 'POS.vcf'];

Contributor

I wonder if this can be done all at once, without calling is.na(merged.vcf.with.missing.pgs.data$REF) multiple times.
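
One possible way to do this, sketched under the assumption that the object behaves like a plain data.frame (the toy data below is hypothetical, and a data.table would need different assignment syntax): compute the row mask once and assign both columns in a single step.

```r
# Toy stand-in for the merged data (hypothetical values);
# the second row has no VCF match, so its VCF columns are NA
merged.vcf.with.missing.pgs.data <- data.frame(
    CHROM = c('chr1', 'chr2'),
    POS = c(2001L, 3000L),
    CHROM.vcf = c('chr1', NA),
    POS.vcf = c(2000L, NA),
    REF = c('AT', NA),
    stringsAsFactors = FALSE
    );

# compute the mask of rsID-matched rows once
matched.rows <- !is.na(merged.vcf.with.missing.pgs.data$REF);

# keep coordinates from VCF data for matched rows, both columns at once;
# values are assigned by column position, so the column order must correspond
merged.vcf.with.missing.pgs.data[matched.rows, c('CHROM', 'POS')] <-
    merged.vcf.with.missing.pgs.data[matched.rows, c('CHROM.vcf', 'POS.vcf')];

print(merged.vcf.with.missing.pgs.data);
```

Rows without a VCF match keep their PGS coordinates, which matches the behaviour of the two separate assignments.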

R/handle-weight-files.R (outdated, resolved)
@alkaZeltser alkaZeltser merged commit bbe9bf5 into main Sep 18, 2024
2 checks passed
3 participants