EHR Compatibility #7

Open
wants to merge 44 commits into master
44 commits
b4279a1
Moved some files to use via command line
meliao May 4, 2020
fe2f5a2
Added gitignore; removed error handling
meliao May 4, 2020
cc624ea
load_withdrawals should now accept command line args
meliao May 4, 2020
e0a9286
load_codings should now accept command line parameters
meliao May 4, 2020
6f0b9cc
Printing log messages for pheno query
meliao May 4, 2020
b30a2fd
More logging messages
meliao May 4, 2020
6b19567
Added one logging message
meliao May 5, 2020
143dc27
First commit for pg_clinical loading
meliao May 15, 2020
6dbc8c8
Minor changes from test env
meliao May 15, 2020
61cc7c9
Changes to primary care loading
meliao May 15, 2020
00ebe32
Need to drop duplicates
meliao May 15, 2020
ca3fd29
Need to segment by service provider
meliao May 15, 2020
9421069
I did not think there would be so much data cleaning
meliao May 15, 2020
cedecce
I will refactor later
meliao May 17, 2020
58e5ae4
Found small errors in testing environment. Now for uploading scripts …
meliao May 17, 2020
d8e8a78
PG scripts data parsing complete
meliao May 17, 2020
c3c488f
Changes from bionimbus
meliao May 17, 2020
57ccaef
Added gp_registrations loading
meliao May 19, 2020
7188bfd
Changes from bionimbus
meliao May 19, 2020
03ea9a0
Made some changes before pulling. Whoops
meliao May 19, 2020
a352260
Merge branch 'EHR_loading' of github.com:meliao/ukbrest into EHR_loading
meliao May 19, 2020
acad6e0
Support for Hospital Inpatient records
meliao May 19, 2020
8ff189b
Changes from bionimbus
meliao May 19, 2020
e2f046c
Changes
meliao May 19, 2020
dfd3625
Refactoring
meliao May 19, 2020
74a78d0
Headers for EHR test files
meliao May 20, 2020
0a572dc
Commit before deleting incorrect app.py and load_data.py files
meliao May 21, 2020
0b05ba4
Need to check out main to see how error handling works
meliao May 21, 2020
acb3dac
Beginning to write tests for EHR querying
meliao May 22, 2020
54f6072
More work on testing setup
meliao May 22, 2020
a0ded58
Almost finished; need to update and test dockerfile
meliao May 22, 2020
4765034
Updated environment and Dockerfile
meliao May 27, 2020
8e30a2c
Changed environment file for conda formatting
meliao May 27, 2020
23cc238
Changes to environment file
meliao May 27, 2020
7e5360b
Changes from bionimbus
meliao May 29, 2020
2b0fddb
Now chunking the primary care data loading, and abandoning hopes of a…
meliao Jun 2, 2020
ed728fc
Nonnull row drop warning no longer at chunk level
meliao Jun 2, 2020
69dcb97
Changes from bionimbus
meliao Jun 15, 2020
2140f07
Merge pull request #5 from meliao/master
meliao Jul 17, 2020
f980c24
Merge pull request #6 from meliao/EHR_loading
meliao Jul 22, 2020
9e01a43
Fixed sql schema problems
meliao Aug 7, 2020
c47b23e
Updated travis CI config file with new BGEN installation
meliao Aug 7, 2020
c4f3b84
Added resources to setup_app in wsgi script
meliao Aug 21, 2020
f698227
Merge branch 'master' into EHR_development
miltondp Jul 12, 2021
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,2 +1,3 @@
.DS_Store
__pycache__/
*.py[cod]

8 changes: 4 additions & 4 deletions .travis.yml
@@ -20,24 +20,24 @@ install:
fi
- bash miniconda.sh -b -p $HOME/miniconda
- hash -r
# Install bgenix
# Install bgenix
- wget http://code.enkre.net/bgen/tarball/release/bgen.tgz
- tar -xf bgen.tgz
- tar -xzf bgen.tgz
- cd bgen.tgz
- ./waf configure
- ./waf
- ./build/test/unit/test_bgen
- sudo cp ./build/apps/bgenix /usr/local/bin/
- cd ..
- sudo cp lib/qctool/* /usr/local/bin/
# Install conda and create environment
# Install conda and create environment
- source /home/travis/miniconda/etc/profile.d/conda.sh
- conda config --set always_yes yes --set changeps1 no
- conda update -q conda
- conda info -a
- conda env create -q -n test-environment --file environment.yml
- conda activate test-environment
# - sleep 10
# - sleep 10

script: nosetests --with-coverage

4 changes: 4 additions & 0 deletions Dockerfile
@@ -51,6 +51,10 @@ COPY misc/codings /var/lib/codings
# Other environmental variables
ENV UKBREST_WITHDRAWALS_PATH="/var/lib/withdrawals"

# EHR directories
ENV PRIMARY_CARE_DIR="/var/lib/primary_care"
ENV HOSPITAL_INPATIENT="/var/lib/hospital_inpatient"

WORKDIR /opt

COPY docker/start.py /opt/
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,6 @@

# ukbREST


**Title:** ukbREST: efficient and streamlined data access for reproducible research of large biobanks

**Authors:** Milton Pividori and Hae Kyung Im
@@ -32,6 +31,7 @@ These characteristics make ukbREST an important tool to make biobank’s valuabl
</p>

# News
* 2020-05-22: ukbREST supports [loading](https://github.com/hakyimlab/ukbrest/wiki/Load-real-UK-Biobank-data) and [querying](https://github.com/hakyimlab/ukbrest/wiki/Electronic-health-record-queries) electronic health records from the UK Biobank.
* 2019-12-06: the installation steps for macOS and PostgreSQL have been updated. [Check it out!](https://github.com/hakyimlab/ukbrest/wiki/Installation-instructions)
* 2018-11-25: fix when a dataset has a data-field already loaded. Docker image is now updated.
Check out the [documentation](https://github.com/hakyimlab/ukbrest/wiki/Load-real-UK-Biobank-data) (Section `Duplicated data-fields`).
Binary file removed codemap/.DS_Store
Binary file not shown.
21 changes: 20 additions & 1 deletion docker/start.py
@@ -7,7 +7,7 @@
import re

from ukbrest.config import logger, GENOTYPE_PATH_ENV, PHENOTYPE_PATH, PHENOTYPE_CSV_ENV, DB_URI_ENV, CODINGS_PATH, \
SAMPLES_DATA_PATH, WITHDRAWALS_PATH
SAMPLES_DATA_PATH, WITHDRAWALS_PATH, PRIMARY_CARE_DIR, HOSPITAL_INPATIENT_DIR


parser = argparse.ArgumentParser()
@@ -16,6 +16,7 @@
parser.add_argument('--load-codings', action='store_true', help='Loads a set of codings files (coding_NUM.tsv).')
parser.add_argument('--load-withdrawals', action='store_true', help='Loads a list of participants who has withdrawn consent (*.csv files).')
parser.add_argument('--load-samples-data', action='store_true', help='Loads a set of files containing information about samples.')
parser.add_argument('--load-ehr', action='store_true', help='Loads electronic health records.')

args, unknown_args = parser.parse_known_args()

@@ -118,6 +119,19 @@ def _setup_db_uri():
parser.error('No DB URI was specified. You have to set it using the environment variable UKBREST_DB_URI. For '
'example, for PostgreSQL, the format is: postgresql://user:pass@host:port/dbname')

def _setup_ehr_paths():
primary_care_dir = environ.get(PRIMARY_CARE_DIR, None)
if not isdir(primary_care_dir):
parser.error("The specified primary care directory does not exist.")

hospital_inpatient_dir = environ.get(HOSPITAL_INPATIENT_DIR, None)
if not isdir(hospital_inpatient_dir):
parser.error("The specified hospital inpatient directory does not exist")

if (primary_care_dir is None) and (hospital_inpatient_dir is None):
parser.error("Neither primary care nor hospital inpatient directories were specified.")



if __name__ == '__main__':
if args.load:
@@ -149,6 +163,11 @@ def _setup_db_uri():

commands = ('python', ['python', '/opt/ukbrest/load_data.py', '--load-samples-data'] + unknown_args)

elif args.load_ehr:
_setup_ehr_paths()
_setup_db_uri()

commands = ('python', ['python', '/opt/ukbrest/load_data.py', '--load-ehr'])
else:
_setup_genotype_path()
_setup_db_uri()
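A note on the _setup_ehr_paths hunk above: environ.get() returns None when a variable is unset, so isdir(None) would raise a TypeError before the final "neither directory was specified" error can ever fire. A minimal sketch of the same validation with the checks reordered (hypothetical, not the code in this PR, using the same names imported in docker/start.py):

def _setup_ehr_paths():
    primary_care_dir = environ.get(PRIMARY_CARE_DIR, None)
    hospital_inpatient_dir = environ.get(HOSPITAL_INPATIENT_DIR, None)

    # Fail first if neither directory was configured at all.
    if primary_care_dir is None and hospital_inpatient_dir is None:
        parser.error('Neither primary care nor hospital inpatient directories were specified.')

    # Only hand real paths to isdir(); skip the check for an unset variable.
    if primary_care_dir is not None and not isdir(primary_care_dir):
        parser.error('The specified primary care directory does not exist.')

    if hospital_inpatient_dir is not None and not isdir(hospital_inpatient_dir):
        parser.error('The specified hospital inpatient directory does not exist.')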
47 changes: 24 additions & 23 deletions environment.yml
@@ -1,26 +1,27 @@
name: ukbrest
channels:
- defaults
- conda-forge
- defaults
- conda-forge
dependencies:
- coveralls=1.2.0
- coverage=4.5.1
- eventlet=0.21.0
- flask-restful=0.3.6
- ruamel.yaml=0.15.34
- beautifulsoup4=4.6.0
- flask=0.12.2
- gevent=1.2.2
- gunicorn=19.7.1
- html5lib=0.999999999
- ipython=6.2.1
- joblib=0.11
- lxml=4.1.1
- numpy=1.13.3
- pandas=0.21.0
- psycopg2=2.7.3.2
- python=3.6.3
- sqlalchemy=1.1.13
- sqlite=3.20.1
- nose=1.3.7
- flask-httpauth=3.2
- coverage
- flask-httpauth=3.2
- eventlet=0.21.0
- ruamel.yaml=0.15.34
- sqlalchemy
- flask

Collaborator comment:
I think it's better to fix most of the versions here, since otherwise the Docker image will always be built with different versions of critical packages like flask.

My current approach is to have two environment.yml files:

  1. One with most of the packages with fixed major versions, like, for instance, python==3.8 or numpy=1.13 (note that I'm not fixing the revision part of the version). This one is for production, building the Docker image, etc.
  2. Another one, intended for developers, with the list of needed packages and almost no versions; this one is to easily update the environment when you want to do so. For example, for a major release of ukbREST, you use this file to create a new conda environment, and then export that environment to update the first file (the one for production).

- gevent=1.2.2
- sqlite=3.20.1
- coveralls=1.2.0
- python=3.6.3
- psycopg2=2.7.3.2
- lxml=4.1.1
- html5lib=0.999999999
- gunicorn=19.7.1
- joblib=0.11
- nose=1.3.7
- numpy=1.13.3
- ipython=6.2.1
- beautifulsoup4=4.6.0
- flask-restful=0.3.6
- pandas=0.21.0
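
Regarding the pinning comment above, a quick hypothetical helper (not part of the PR) that lists which conda dependencies in this environment.yml are left unpinned, using the ruamel.yaml package that is already in the environment:

import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='safe')
with open('environment.yml') as f:
    env = yaml.load(f)

# Dependencies here are plain 'name=version' strings; anything without '=' is unpinned.
unpinned = [dep for dep in env['dependencies'] if isinstance(dep, str) and '=' not in dep]
print('Unpinned packages:', unpinned)  # e.g. ['coverage', 'sqlalchemy', 'flask']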

159 changes: 159 additions & 0 deletions start.py
@@ -0,0 +1,159 @@
#!/usr/bin/env python

Collaborator comment:
Why is this file here? It looks similar to docker/start.py, but without the new methods you introduced like _setup_ehr_paths.


from os import listdir, execvp
from os import environ
from os.path import isdir, join, basename
import argparse
import re

from ukbrest.config import logger, GENOTYPE_PATH_ENV, PHENOTYPE_PATH, PHENOTYPE_CSV_ENV, DB_URI_ENV, CODINGS_PATH, \
SAMPLES_DATA_PATH, WITHDRAWALS_PATH


parser = argparse.ArgumentParser()
parser.add_argument('--load', action='store_true', help='Specifies whether data should be loaded into the DB.')
parser.add_argument('--load-sql', action='store_true', help='Loads some useful SQL functions into the database.')
parser.add_argument('--load-codings', action='store_true', help='Loads a set of codings files (coding_NUM.tsv).')
parser.add_argument('--load-withdrawals', action='store_true', help='Loads a list of participants who has withdrawn consent (*.csv files).')
parser.add_argument('--load-samples-data', action='store_true', help='Loads a set of files containing information about samples.')

args, unknown_args = parser.parse_known_args()


def _setup_genotype_path():
genotype_path = environ.get(GENOTYPE_PATH_ENV, None)

if not isdir(genotype_path):
logger.warning('The genotype directory does not exist. You have to mount it using '
'the option "-v hostDir:{}" of "docker run"'.format(genotype_path))
return

bgen_files = [f for f in listdir(genotype_path) if f.lower().endswith('.bgen')]
if len(bgen_files) == 0:
logger.warning('No .bgen files were found in the genotype directory')

bgi_files = [f for f in listdir(genotype_path) if f.lower().endswith('.bgi')]
if len(bgi_files) == 0:
logger.warning('No .bgi files were found in the genotype directory')


def _setup_phenotype_path():
phenotype_path = environ.get(PHENOTYPE_PATH, None)

if not isdir(phenotype_path):
parser.error('The phenotype directory does not exist. You have to mount it using '
'the option "-v hostDir:{}" of "docker run"'.format(phenotype_path))

filename_number_pattern = re.compile('(?P<dataset_id>\d+)')

def sort_datasets(f):
"""Returns the first number found in the filename as a float. If none is found, then return a minimum number."""
filename = basename(f)

m = re.search(filename_number_pattern, filename)
if m is not None:
return float(m.group('dataset_id'))

return float('-inf')

# by default, sort .csv files in reverse order taking the first number found in their names.
# So for instance, these files: ukb00.csv, ukb01.csv and ukb50.csv would be loaded in
# this order: ukb50.csv, ukb01.csv and ukb00.csv
# the number in the file is interpreted as the dataset id, and greater means newer.
phenotype_csv_file = sorted(
[f for f in listdir(phenotype_path) if f.lower().endswith('.csv')],
key=sort_datasets,
reverse=True
)

# check whether there is at least one and only one csv file
if len(phenotype_csv_file) == 0:
parser.error('No .csv files were found in the phenotype directory')

environ[PHENOTYPE_CSV_ENV] = ';'.join([join(phenotype_path, csv_file) for csv_file in phenotype_csv_file])


def _setup_codings():
phenotype_path = environ.get(PHENOTYPE_PATH, None)
coding_path = environ.get(CODINGS_PATH, None)

if coding_path is None:
environ[CODINGS_PATH] = 'codings'
coding_path = 'codings'

coding_path = join(phenotype_path, coding_path)

if not isdir(coding_path):
parser.error('The codings directory does not exist: {}'.format(coding_path))


def _setup_withdrawals():
withdrawals_path = environ.get(WITHDRAWALS_PATH, None)

if withdrawals_path is None:
parser.error('The withdrawals directory was not specified')

if not isdir(withdrawals_path):
parser.error('The withdrawals directory does not exist: {}'.format(withdrawals_path))


def _setup_samples_data():
phenotype_path = environ.get(PHENOTYPE_PATH, None)
samples_data_path = environ.get(SAMPLES_DATA_PATH, None)

if samples_data_path is None:
environ[SAMPLES_DATA_PATH] = 'samples_data'
samples_data_path = 'samples_data'

samples_data_path = join(phenotype_path, samples_data_path)

if not isdir(samples_data_path):
parser.error('The samples data directory does not exist: {}'.format(samples_data_path))


def _setup_db_uri():
db_uri = environ.get(DB_URI_ENV, None)

if db_uri is None:
parser.error('No DB URI was specified. You have to set it using the environment variable UKBREST_DB_URI. For '
'example, for PostgreSQL, the format is: postgresql://user:pass@host:port/dbname')


if __name__ == '__main__':
if args.load:
_setup_phenotype_path()
_setup_db_uri()

commands = ('python', ['python', '/opt/ukbrest/load_data.py'])

elif args.load_sql:
_setup_db_uri()

commands = ('python', ['python', '/opt/ukbrest/load_data.py', '--load-sql'])

elif args.load_codings:
_setup_codings()
_setup_db_uri()

commands = ('python', ['python', '/opt/ukbrest/load_data.py', '--load-codings'])

elif args.load_withdrawals:
_setup_withdrawals()
_setup_db_uri()

commands = ('python', ['python', '/opt/ukbrest/load_data.py', '--load-withdrawals'])

elif args.load_samples_data:
_setup_samples_data()
_setup_db_uri()

commands = ('python', ['python', '/opt/ukbrest/load_data.py', '--load-samples-data'] + unknown_args)

else:
_setup_genotype_path()
_setup_db_uri()
# TODO: check if data was loaded into PostgreSQL

commands = ('gunicorn', ['gunicorn', 'ukbrest.wsgi:app'])

execvp(*commands)
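
For reference, a small standalone illustration (not part of the PR) of how the sort_datasets key above orders phenotype CSV files:

import re

filename_number_pattern = re.compile(r'(?P<dataset_id>\d+)')

def sort_datasets(filename):
    # Use the first number in the filename as the dataset id; no number sorts last.
    m = re.search(filename_number_pattern, filename)
    return float(m.group('dataset_id')) if m is not None else float('-inf')

files = ['ukb00.csv', 'ukb01.csv', 'ukb50.csv']
print(sorted(files, key=sort_datasets, reverse=True))
# ['ukb50.csv', 'ukb01.csv', 'ukb00.csv']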
7 changes: 7 additions & 0 deletions tests/data/ehr/gp_clinical.txt
@@ -0,0 +1,7 @@
eid data_provider event_dt read_2 read_3 value1 value2 value3

Collaborator comment:
Make sure this is not real data.


Collaborator comment:
And I would add a proper extension to the file, probably .tsv here?

1 2 22/03/2014 j550.
2 1 17/03/2016 42Z7. 12.800
2 1 17/03/2016 42Z7. 12.800
3 1 04/11/2013 426.. 4.000
3 1 04/11/2013 4266. 0.000
3 1 04/11/2013 428.. 28.700
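
The commit history above mentions chunking the primary care loading and dropping duplicates. As a rough, hypothetical sketch (the PR's actual loader lives in load_data.py, which is not part of this diff), a tab-delimited gp_clinical extract like the test file above could be cleaned as follows:

import pandas as pd

# Hypothetical sketch; assumes the file is tab-delimited with the header shown above.
for chunk in pd.read_csv('tests/data/ehr/gp_clinical.txt', sep='\t', dtype=str, chunksize=10000):
    chunk = chunk.drop_duplicates()  # e.g. the repeated 42Z7. row for eid 2
    chunk['event_dt'] = pd.to_datetime(chunk['event_dt'], format='%d/%m/%Y', errors='coerce')
    # ...append the cleaned chunk to the gp_clinical table here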
8 changes: 8 additions & 0 deletions tests/data/ehr/gp_registrations.txt
@@ -0,0 +1,8 @@
eid data_provider reg_date deduct_date
1 2 02/02/1902
2 1 12/06/2002 21/03/2014
2 1 25/04/2014
3 1 30/09/1993
3 30/09/1993
4 30/09/1993
5 6
9 changes: 9 additions & 0 deletions tests/data/ehr/gp_scripts.txt
@@ -0,0 +1,9 @@
eid data_provider issue_date read_2 bnf_code dmd_code drug_name quantity
1 2 30/10/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 30/10/2015 02.08.02.00.00 Warfarin 500microgram tablets 56 tablet
1 2 30/10/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 30/10/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 07/07/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 07/07/2015 02.08.02.00.00 Warfarin 500microgram tablets 56 tablet
1 2 30/10/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 07/12/2015 07.04.01.01.00 Tamsulosin 400microgram / Dutasteride 500microgram capsules 30 capsule
5 changes: 5 additions & 0 deletions tests/data/ehr/hesin.txt
@@ -0,0 +1,5 @@
eid ins_index dsource source epistart epiend epidur bedyear epistat epitype epiorder spell_index spell_seq spelbgin spelend speldur pctcode gpprpct category elecdate elecdur admidate admimeth_uni admimeth admisorc_uni admisorc firstreg classpat_uni classpat intmanag_uni intmanag mainspef_uni mainspef tretspef_uni tretspef operstat disdate dismeth_uni dismeth disdest_uni disdest carersi
1 1 HES 18 20120305 20120306 1 1 3
1 2 HES 18 20120305 20120306 1 2 3
1 3 GGG 19 20120305 20120306 1 1 3
2 1 HES 18 20120305 20120306 1 1 3
6 changes: 6 additions & 0 deletions tests/data/ehr/hesin_diag.txt
@@ -0,0 +1,6 @@
eid ins_index arr_index level diag_icd9 diag_icd9_nb diag_icd10 diag_icd10_nb
1 1 1 Code_A
1 2 1 Code_B
1 2 2 Code_A
2 1 1 Code_C
3 1 1 Code_B
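
The hesin and hesin_diag tables share the eid and ins_index keys, so each diagnosis can be linked back to its hospital episode. A small hypothetical pandas sketch (not the PR's query code; column names are taken from the headers above):

import pandas as pd

hesin = pd.read_csv('tests/data/ehr/hesin.txt', sep='\t', dtype=str)
hesin_diag = pd.read_csv('tests/data/ehr/hesin_diag.txt', sep='\t', dtype=str)

# One row per diagnosis, annotated with the episode start and end dates from hesin.
diag_episodes = hesin_diag.merge(
    hesin[['eid', 'ins_index', 'epistart', 'epiend']],
    on=['eid', 'ins_index'],
    how='left',
)
print(diag_episodes.head())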
7 changes: 7 additions & 0 deletions tests/data/ehr_missing/gp_clinical.txt
@@ -0,0 +1,7 @@
eid data_provider event_dt read_2 read_3 value1 value2 value3
1 2 22/03/2014 j550.
2 1 17/03/2016 42Z7. 12.800
2 1 17/03/2016 42Z7. 12.800
3 1 04/11/2013 426.. 4.000
3 1 04/11/2013 4266. 0.000
3 1 04/11/2013 428.. 28.700
9 changes: 9 additions & 0 deletions tests/data/ehr_missing/gp_scripts.txt
@@ -0,0 +1,9 @@
eid data_provider issue_date read_2 bnf_code dmd_code drug_name quantity
1 2 30/10/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 30/10/2015 02.08.02.00.00 Warfarin 500microgram tablets 56 tablet
1 2 30/10/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 30/10/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 07/07/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 07/07/2015 02.08.02.00.00 Warfarin 500microgram tablets 56 tablet
1 2 30/10/2015 02.08.02.00.00 Warfarin 1mg tablets 28 tablet
1 2 07/12/2015 07.04.01.01.00 Tamsulosin 400microgram / Dutasteride 500microgram capsules 30 capsule
5 changes: 5 additions & 0 deletions tests/data/ehr_missing/hesin.txt
@@ -0,0 +1,5 @@
eid ins_index dsource source epistart epiend epidur bedyear epistat epitype epiorder spell_index spell_seq spelbgin spelend speldur pctcode gpprpct category elecdate elecdur admidate admimeth_uni admimeth admisorc_uni admisorc firstreg classpat_uni classpat intmanag_uni intmanag mainspef_uni mainspef tretspef_uni tretspef operstat disdate dismeth_uni dismeth disdest_uni disdest carersi
1 1 HES 18 20120305 20120306 1 1 3
1 2 HES 18 20120305 20120306 1 2 3
1 3 GGG 19 20120305 20120306 1 1 3
2 1 HES 18 20120305 20120306 1 1 3
4 changes: 4 additions & 0 deletions tests/data/withdrawals/withdrawals.csv
@@ -0,0 +1,4 @@
1000
1001
1002
1003
1 change: 1 addition & 0 deletions tests/settings.py
@@ -6,5 +6,6 @@
# -e POSTGRES_DB=ukb -p 5432:5432 postgres:9.6
POSTGRESQL_ENGINE='postgresql://test:test@localhost:5432/ukb'


# SQLite
SQLITE_ENGINE='sqlite:///tmp.db'
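
Both values are plain SQLAlchemy connection URIs; a minimal, hypothetical check that the dockerized PostgreSQL instance from the comment above is reachable:

from sqlalchemy import create_engine, text

from tests.settings import POSTGRESQL_ENGINE

engine = create_engine(POSTGRESQL_ENGINE)
with engine.connect() as conn:
    # Prints (1,) when the test database is up.
    print(conn.execute(text('SELECT 1')).fetchone())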