feat: gwas catalog processing with google batch operator #12

Closed
wants to merge 37 commits into from


@project-defiant project-defiant commented Jul 17, 2024

First approach to the genetics pipeline for GWAS Catalog processing, together with development of the Airflow utils.

Things implemented

  • Added test coverage threshold (Makefile).
  • Added coverage as a test dependency.
  • Added display of the coverage run to the make test command.
  • Moved the Airflow image to docker/ and added an automatic build to Artifact Registry for the genetics_etl image, which is based on the gentropy Docker image (GitHub Actions + OIDC).
  • Added 3 execution modes to the gwas_catalog_dag:
    • RESUME - rerun the pipeline for manifests that failed previously,
    • CONTINUE - run the pipeline on manifests that were not processed yet,
    • FORCE - rerun all manifests from scratch.
      To set the correct flag, update config/config.yaml.
  • Generated test data in the gs://ot_orchestration bucket.
  • Added a CLI script to fetch GWAS Catalog sumstat paths based on their studyIDs (ot fetch-raw-sumstat-paths).
  • Added batch_processing_job (ot gwas-catalog-pipeline) - a CLI command that takes a single input manifest file and, based on its content, runs the gentropy steps. Two gentropy steps are implemented so far (harmonisation and qc); other steps are in progress.
  • Added IOManager and ProtoPath (with concrete implementations for GCS and Posix) to perform file-system-agnostic concurrent reads and writes.
  • Updates to the config parser - not finished. This will require some level of abstraction in future development, as I would like to make it DAG-agnostic. The strategy I am aiming for is the following: a config resolver object, based on the DAG name, should look up the DAG's config parser and use it to read the correct configuration with some level of config validation.
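
For reviewers, the three execution modes listed above can be sketched as a filter over manifests. This is only an illustration under assumptions, not the code in this PR: the `Manifest` class, its `status` field, the status values ("pending", "failure", "success") and the `select_manifests` helper are all hypothetical names, and the study IDs in the usage are made up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class ExecutionMode(Enum):
    """Execution modes named in the PR description."""

    RESUME = "RESUME"      # rerun manifests that failed previously
    CONTINUE = "CONTINUE"  # run manifests that were not processed yet
    FORCE = "FORCE"        # rerun all manifests from scratch


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest record; the real manifest schema may differ."""

    study_id: str
    status: str  # assumed values: "pending", "failure", "success"


def select_manifests(manifests: Iterable[Manifest], mode: ExecutionMode) -> List[Manifest]:
    """Return the manifests that should be processed under the given mode."""
    items = list(manifests)
    if mode is ExecutionMode.FORCE:
        return items
    if mode is ExecutionMode.RESUME:
        return [m for m in items if m.status == "failure"]
    if mode is ExecutionMode.CONTINUE:
        return [m for m in items if m.status == "pending"]
    raise ValueError(f"Unsupported execution mode: {mode!r}")


if __name__ == "__main__":
    # Made-up study IDs, for illustration only.
    manifests = [
        Manifest("STUDY_A", "failure"),
        Manifest("STUDY_B", "pending"),
        Manifest("STUDY_C", "success"),
    ]
    for mode in ExecutionMode:
        selected = [m.study_id for m in select_manifests(manifests, mode)]
        print(mode.value, selected)
```

Keeping the mode as an enum means an unknown value read from config/config.yaml fails early at `ExecutionMode(value)` instead of silently selecting nothing.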

The gwas-catalog-pipeline is not runnable yet. The major concern came up while testing the batch_processing_job, which cannot run without gentropy. I tried adding gentropy as a test dependency, but this is not possible because of the dependency versions of apache-beam and hail.

The other concern is the step in the gwas_catalog_curation dag that requires reading the variant table (~60 GB), which makes it unsuitable for a batch job in the current implementation.

To resolve the first issue, I want to try splitting the batch_processing_job and all manifest_processing tasks, together with their dependencies, into a separate package and injecting the steps that run them into a PythonVirtualenvOperator or KubernetesPodOperator. This will also allow us to move to Cloud Composer.

To resolve the other issue, I need to understand the curation process.

@project-defiant project-defiant changed the title from "Szsz code cleanup" to "feat: google batch job for gwas_catalog processing" Jul 18, 2024
@project-defiant project-defiant changed the title from "feat: google batch job for gwas_catalog processing" to "feat: google batch job for gwas_catalog processing - harmonisation" Jul 18, 2024
@project-defiant project-defiant changed the title from "feat: google batch job for gwas_catalog processing - harmonisation" to "feat: gwas catalog processing with google batch operator" Jul 23, 2024
@project-defiant project-defiant requested review from javfg and removed request for tskir July 31, 2024 10:40
@project-defiant project-defiant self-assigned this Jul 31, 2024
@project-defiant project-defiant marked this pull request as ready for review July 31, 2024 10:41