Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EHR Compatibility #7

Open
wants to merge 44 commits into
base: master
Choose a base branch
from
Open

EHR Compatibility #7

wants to merge 44 commits into from

Conversation

meliao
Copy link

@meliao meliao commented Aug 7, 2020

This update adds support for the following EHR tables available from the UK Biobank:

  • gp_clinical the main outpatient care dataset
  • gp_scripts outpatient prescription dataset
  • gp_registrations outpatient health insurance registrations
  • hesin the main hospital inpatient dataset
  • hesin_diag hospital inpatient diagnosis dataset

Major changes to the code:

  • Refactoring Pheno2SQL class into a LoadSQL base class and adding EHR2SQL class
  • Refactoring Query classes into YamlQuery, PhenoQuery, and EHRQuery.
  • Updating load_data.py to include a --load_ehr command line option and function.
  • Added testing to reflect these changes

Other notes:

  • I updated the Dockerfile to add new environment variables, but I'm not sure how to write tests for this.
  • The conda environment needed to be updated slightly to fix a conflict between Flask and Werkzeug.

@meliao meliao requested a review from miltondp August 7, 2020 14:41
@meliao
Copy link
Author

meliao commented Aug 7, 2020

@miltondp I now see that the Docker Hub image has failed to build! I don't have experience with Docker Hub -- is there a tutorial I can read somewhere to try to fix this?

@meliao
Copy link
Author

meliao commented Aug 7, 2020

Travis CI build is passing. I'm not sure how to see the build logs for Docker Hub CI (If that is a separate thing from Travis CI)

Copy link
Collaborator

@miltondp miltondp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @meliao! This is a partial review on your changes, since I couldn't finish reviewing all the files and test the code. I guess I won't be able to do that until early October, since I have an important deadline. However, I hope my comments here will be helpeful. My main concern is that several unit tests are disabled now.

.gitignore Outdated
@@ -0,0 +1,3 @@
*/__pycache__/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one should not be necessary, at least according to the standard .gitignore

- eventlet=0.21.0
- ruamel.yaml=0.15.34
- sqlalchemy
- flask
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to fix most of the versions here, since otherwise the Docker image will be built always with different versions of critical packages like flask.

My current approach is to have two environment.yml files:

  1. One with most of the packages with fixed major versions, like, for instance, python==3.8 or numpy=1.13 (note that I'm not fixing the revision part of the version). This one is for production, building the Docker image, etc.
  2. Another one, intended for developers, with the list of needed packages and almost no versions; this one is to easily update the environment when you want to do so. For example, for a major release of ukbREST, you use this file to create a new conda environment, and then export that environment to update the first file (the one for production).

@@ -0,0 +1,159 @@
#!/usr/bin/env python
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this file here? Looks similar than docker/start.py, but without new methods you introduced like _setup_ehr_paths.

@@ -0,0 +1,7 @@
eid data_provider event_dt read_2 read_3 value1 value2 value3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure this is not real data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I would add a proper extension to the file, probably .tsv here?

Comment on lines +17 to +18
self.basic_2sql_config = {'db_uri': POSTGRESQL_ENGINE,
'sql_chunksize': None}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest a minor stylish improvement in the dict creation.

Suggested change
self.basic_2sql_config = {'db_uri': POSTGRESQL_ENGINE,
'sql_chunksize': None}
self.basic_2sql_config = {
'db_uri': POSTGRESQL_ENGINE,
'sql_chunksize': None
}

def test_load_ehr_single_path_gp(self):
ehr2sql = self._get_ehr2sql(self.ehr_path, None,
**self.basic_2sql_config)
ehr2sql.load_data()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you have some assert statements here to make sure the data was correctly loaded?

@@ -718,6 +718,7 @@ def test_sqlite_query_single_table(self):
assert query_result.loc[3, 'c48_0_0'] == '2010-01-01'
assert query_result.loc[4, 'c48_0_0'] == '2011-02-15'

@unittest.skip('Pheno2sql was refactored')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, ok, but the test should be telling you if the refactoring is fine :-) Here I see that basically most of the probably failing tests are disabled. I'd like to see if these tests are passing with your refactored code.

from ukbrest.common.utils.auth import PasswordHasher

class AppTests(DBTest):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add some documentation to the class. What does this do? I guess I'm the culprit and I should have done that at the beginning :-D

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miltondp could you help with this? Even imperfect, I would like to have a working version of the tool.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hakyim, I fixed the Dockerfile to make it work with the new bgen repo. So now it's possible to get a Docker image for this developmental branch from here: https://hub.docker.com/repository/docker/hakyimlab/ukbrest/builds

You should be able to get a Docker image of this branch with: docker pull hakyimlab/ukbrest:ehr-dev
And then use hakyimlab/ukbrest:ehr-dev everywhere instead of just hakyimlab/ukbrest

However, I do not recommend merging this branch into master since several unit tests were disabled, and I can't test the changes in real data. If someone else can put the unit tests back and test it, also with real UK Biobank data, then it would be fine to merge it, but otherwise, it might break the latest Docker image tag and prevent others from using it.

@miltondp
Copy link
Collaborator

miltondp commented Jun 23, 2021 via email

# Conflicts:
#	.gitignore
#	.travis.yml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants