fix(finngen_r11): preserve all studyIds (#747)
* fix(finngen_r11): preserve all studyIds

Preserve all studyIds, even if the EFO mapping is missing, so that the
mapping between studyIndex and StudyLocus by the studyId column is accurate.

* fix: typo in docstring

Co-authored-by: Irene López Santiago <[email protected]>

---------

Co-authored-by: Szymon Szyszkowski <[email protected]>
Co-authored-by: Irene López Santiago <[email protected]>
3 people authored Sep 4, 2024
1 parent 3ea47a9 commit 12ff35b
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions src/gentropy/datasource/finngen/study_index.py
@@ -38,6 +38,10 @@ def join_efo_mapping(
All studies without EFO traits are dropped. The EFO mappings are then aggregated into lists per
studyId.
+ NOTE: preserve all studyId entries even if they don't have EFO mappings.
+ This is to avoid discrepancies between `study_index` and `credible_set` `studyId` column.
+ The rows with missing EFO mappings will be dropped in the study_index validation step.
Args:
study_index (StudyIndex): Study index table.
efo_curation_mapping (DataFrame): Dataframe with EFO mappings.
@@ -70,8 +74,10 @@ def join_efo_mapping(
f.col("PROPERTY_VALUE").alias("traitFromSource"),
)
)
- # NOTE: inner join to keep only the studies with EFO mappings
- si_df = study_index.df.join(efo_mappings, on="traitFromSource", how="inner")
+
+ si_df = study_index.df.join(
+     efo_mappings, on="traitFromSource", how="left_outer"
+ )
common_cols = [c for c in si_df.columns if c != "traitFromSourceMappedId"]
si_df = si_df.groupby(common_cols).agg(
f.collect_list("traitFromSourceMappedId").alias("traitFromSourceMappedIds")
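
To illustrate the effect of switching the join from `inner` to `left_outer`, here is a minimal plain-Python sketch. It is not the gentropy implementation and does not use PySpark; it only emulates the join on `traitFromSource` followed by collecting `traitFromSourceMappedId` values per study. The study IDs, trait names, EFO IDs, and the helper `join_and_collect` are hypothetical. The sketch assumes that unmatched studies end up with an empty list, which mirrors how Spark's `collect_list` skips null values.

```python
from collections import defaultdict

# Hypothetical toy data; IDs and trait names are made up for illustration.
studies = [
    {"studyId": "FINNGEN_R11_A", "traitFromSource": "Trait A"},
    {"studyId": "FINNGEN_R11_B", "traitFromSource": "Trait B"},
    {"studyId": "FINNGEN_R11_C", "traitFromSource": "Trait C"},  # no EFO mapping
]
efo_mappings = [
    {"traitFromSource": "Trait A", "traitFromSourceMappedId": "EFO_0000001"},
    {"traitFromSource": "Trait A", "traitFromSourceMappedId": "EFO_0000002"},
    {"traitFromSource": "Trait B", "traitFromSourceMappedId": "EFO_0000003"},
]


def join_and_collect(studies, efo_mappings, how):
    """Emulate a join on traitFromSource, then aggregate mapped IDs per studyId."""
    # Index the mappings by trait so each study can look up its EFO IDs.
    by_trait = defaultdict(list)
    for mapping in efo_mappings:
        by_trait[mapping["traitFromSource"]].append(
            mapping["traitFromSourceMappedId"]
        )

    result = {}
    for study in studies:
        mapped_ids = by_trait.get(study["traitFromSource"], [])
        # An inner join drops studies that have no matching mapping row.
        if not mapped_ids and how == "inner":
            continue
        # A left outer join keeps the study; its list of mapped IDs is empty.
        result[study["studyId"]] = list(mapped_ids)
    return result


inner = join_and_collect(studies, efo_mappings, how="inner")
left = join_and_collect(studies, efo_mappings, how="left_outer")

print(sorted(inner))  # studies kept by the old behaviour
print(sorted(left))   # studies kept by the new behaviour
```

With the inner join, `FINNGEN_R11_C` disappears from the study index while credible sets may still reference it. With the left outer join it is retained with an empty mapping list, so a later validation step can decide what to do with it.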
