Ktokunaga/30/add marketingyears to distribs #60

Open
wants to merge 46 commits into
base: main
c7ca567
Initial mdt package structure
yevgenybulochnik May 2, 2021
b03342f
Initial database module within mdt package
yevgenybulochnik May 2, 2021
d723c17
Initial synthea module
yevgenybulochnik May 2, 2021
19a0a28
Moved mdt_functions methods into rxclass.py
yevgenybulochnik May 2, 2021
8336a3e
Moved mdt_functions into database.py
yevgenybulochnik May 2, 2021
96dee8b
Moved mdt_functions into synthea.py
yevgenybulochnik May 2, 2021
dc90fd1
Moved mdt_functions into utils.py
yevgenybulochnik May 2, 2021
f2e6f5f
Initial setup.py
yevgenybulochnik May 2, 2021
957669f
Comment out entrypoint setup in setup.py for now
yevgenybulochnik May 2, 2021
2d122e0
Initial run_mdt.py main script/module
yevgenybulochnik May 2, 2021
a20c4fe
added function to download RxNorm,
Bridg109 May 4, 2021
16bd740
adds to download and load RxNorm, Pathlib use
Bridg109 May 4, 2021
1af83aa
ignore .vim, .ds_store and python egg-info
yevgenybulochnik May 5, 2021
49e30dd
Meps utils module with get_dataset function
yevgenybulochnik May 7, 2021
548342e
Move meps_lists vars into new meps module columns.py
yevgenybulochnik May 7, 2021
bdf7ef2
Change meps get_dataset to get response.content vs response
yevgenybulochnik May 7, 2021
2889fcb
Allow modules to be imported from package namespace
yevgenybulochnik May 7, 2021
73dd0e9
Add load_meps function to database.py
yevgenybulochnik May 7, 2021
832ec25
Add load_meps to main function of run_mdt module
yevgenybulochnik May 7, 2021
bd9e747
Require requests and pandas to install if mdt is installed
yevgenybulochnik May 8, 2021
3ce65fe
Change package install name to mdt, include .sql files in packages
yevgenybulochnik May 9, 2021
ceee61e
Add sql packages to rxnorm & meps, include sql files
yevgenybulochnik May 9, 2021
e2defa1
Initial get_sql function to get meps package sql files
yevgenybulochnik May 9, 2021
5f5a454
Import lib requires python >3.7
yevgenybulochnik May 10, 2021
90a7f08
Missing comma in setup.py
yevgenybulochnik May 15, 2021
f303c5c
Use meps utils function to get reference sql
yevgenybulochnik May 16, 2021
fcbe75b
Add get_sql function to rxnorm utils
yevgenybulochnik May 16, 2021
2d08999
Rename synthea.py to utils.py
yevgenybulochnik May 16, 2021
9f46e23
Basic mdt config.py, will need future refactor
yevgenybulochnik May 16, 2021
a1dbc38
Add missing payload constructor import
yevgenybulochnik May 16, 2021
9b3b854
Add missing urllib import
yevgenybulochnik May 16, 2021
25ed016
Add missing imports, re, pandas, meps, database functions
yevgenybulochnik May 16, 2021
1ebb934
Move rx_api script into run_mdt.py, fix imports, currently broken
yevgenybulochnik May 16, 2021
c016d30
Use meps get_sql function
yevgenybulochnik May 16, 2021
ed6483a
Use meps get_sql function in mdt.utils
yevgenybulochnik May 16, 2021
d689303
Monkey patch to read age and age values from python config.py
yevgenybulochnik May 16, 2021
1848d75
Skip loading rxnorm and meps if MDT.db exists
yevgenybulochnik May 16, 2021
270bf4f
Uses system args to pass rxclass_id and rxclass_rela to run_mdt.py
yevgenybulochnik May 16, 2021
23569cd
Add rxnorm dosage form sql
yevgenybulochnik May 16, 2021
fffc61e
Load dfg table with load_rxnorm
yevgenybulochnik May 16, 2021
029fd49
Add filter_by_df function to mdt.utils, missing path.exists added
yevgenybulochnik May 16, 2021
d87c2ca
Initial FDA subpackage setup
yevgenybulochnik May 16, 2021
0880433
fda utils module setup, get_dataset function
yevgenybulochnik May 16, 2021
e1a413e
load_fda function setup
yevgenybulochnik May 16, 2021
5f72506
Dev setup in readme
yevgenybulochnik May 16, 2021
cc1154d
#30 adding marketyears to the generate_module function
kristentaytok May 17, 2021
6 changes: 5 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,10 +1,14 @@
data
__pycache__
.vscode
.vim
.DS_Store
venv
.venv

*.egg-info

.ipynb_checkpoints
*/.ipynb_checkpoints/*

*.ipynb
*.ipynb
8 changes: 8 additions & 0 deletions README.md
Expand Up @@ -85,3 +85,11 @@ src/
│ │ │ ├─ hypothyroidism.json
│ │ │ ├─ ...
```

## Contribution Guide
1. Set up a venv with `python -m venv venv`; this creates a directory called `venv` in your current working directory.
2. Activate your venv with `source venv/bin/activate`, or on Windows `venv/Scripts/Activate`.
3. Install MDT with `pip install -e .`; this sets up mdt as an installed editable package.
4. Run MDT with `python -m mdt.run_mdt D007037 may_treat`.
   - `run_mdt` takes two system args, `rxclass_id` and `rxclass_rela`; both must be specified.
   - The initial run of `run_mdt` downloads all necessary files and builds the database in `data/` in the current working directory.
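
As a rough illustration of step 4, the two positional system args could be read along these lines. This is only a sketch: `parse_mdt_args` is a hypothetical helper name, and the actual argument handling in `run_mdt.py` is not shown in this diff.

```python
def parse_mdt_args(argv):
    """Return (rxclass_id, rxclass_rela) from a list shaped like sys.argv[1:].

    Hypothetical helper for illustration only; not part of the mdt package.
    """
    if len(argv) != 2:
        # Both arguments are required, so fail with a usage message.
        raise SystemExit(
            "usage: python -m mdt.run_mdt <rxclass_id> <rxclass_rela>"
        )
    rxclass_id, rxclass_rela = argv
    return rxclass_id, rxclass_rela


# Same values as the README example invocation.
print(parse_mdt_args(["D007037", "may_treat"]))
```

Exiting with a usage message when either argument is missing matches the README's note that both values must be specified.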
44 changes: 44 additions & 0 deletions setup.py
@@ -0,0 +1,44 @@
from setuptools import setup, find_packages
import pathlib

here = pathlib.Path(__file__).parent.resolve()

# Get the long description from the README file
long_description = (here / 'README.md').read_text(encoding='utf-8')

setup(
    name='mdt',
    version='1.0.0',
    # description='A sample Python project', # Optional
    # long_description=long_description, # Optional
    # long_description_content_type='text/markdown', # Optional
    # url='https://github.com/pypa/sampleproject', # Optional
    # author='A. Random Developer', # Optional
    # author_email='[email protected]', # Optional
    # keywords='sample, setuptools, development', # Optional
    package_dir={'': 'src'},
    packages=find_packages(where='src'),
    python_requires='>=3.7, <4',
    install_requires=[
        'requests',
        'pandas'
    ], # Optional

    # If there are data files included in your packages that need to be
    # installed, specify them here.
    package_data={
        "":['*.sql']
    }
    # Although 'package_data' is the preferred approach, in some case you may
    # need to place data files outside of your packages. See:
    # http://docs.python.org/distutils/setupscript.html#installing-additional-files
    #
    # In this case, 'data_file' will be installed into '<sys.prefix>/my_data'
    # data_files=[('my_data', ['data/data_file'])], # Optional

    # entry_points={ # Optional
    #     'console_scripts': [
    #         'mdt=mdt.cli:entry_point',
    #     ],
    # },
)
Empty file added src/mdt/__init__.py
Empty file.
5 changes: 5 additions & 0 deletions src/mdt/config.py
@@ -0,0 +1,5 @@
MEPS_CONFIG={
    "age": ["0-3", "4-7", "8-11", "12-18", "19-49", "50-64", "65-99"],
    "demographic_distrib_flags" : {"age": "Y", "gender": "Y", "state": "Y"},
    "meps_year" : "18"
}
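
To show how the `age` ranges in `MEPS_CONFIG` might be consumed, here is a small sketch that maps an age to its configured bucket. `age_bucket` is a hypothetical helper that is not defined anywhere in this PR, and the real code may well apply these ranges in SQL instead.

```python
# Copied from the "age" entry of MEPS_CONFIG above.
MEPS_AGE_RANGES = ["0-3", "4-7", "8-11", "12-18", "19-49", "50-64", "65-99"]


def age_bucket(age, ranges=MEPS_AGE_RANGES):
    """Return the 'lo-hi' label whose inclusive range contains age, else None.

    Hypothetical helper for illustration only.
    """
    for label in ranges:
        lo, hi = (int(part) for part in label.split("-"))
        if lo <= age <= hi:
            return label
    return None


print(age_bucket(5))    # 4-7
print(age_bucket(100))  # None
```

Ages outside every configured range (here, anything above 99) return `None`, so a caller would need to decide how to treat them.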
225 changes: 225 additions & 0 deletions src/mdt/database.py
@@ -0,0 +1,225 @@
from mdt import rxnorm, meps, fda
from pathlib import Path
import zipfile
import io
import sqlite3
import pandas as pd
from datetime import datetime


def to_data():
    """creates paths to data folder, making directory if not present"""
    path = Path.cwd() / 'data'
    try:
        path.mkdir(exist_ok=False)
    except:
        pass
    return path


def create_mdt_con():
    """create default connection to the data/MDT.db database. If database does not exist it creates it."""
    conn = sqlite3.connect(to_data() / 'MDT.db')
    return conn


def sql_create_table(table_name, df, conn=None):
    """Creates a table in the connected database when passed a pandas dataframe.
    Note default is to delete dataframe if table name is same as global variable name that stores the df and delete_df is True"""

    if conn == None:
        conn = create_mdt_con()

    try:
        df.to_sql(table_name, conn, if_exists='replace',index=False)
        print('{} table created in DB'.format(table_name))
    except:
        print('Could not create table {0} in DB'.format(table_name))


def db_query(query_str,conn=None):
    """Sends Query to DB and returns results as a dataframe"""

    if conn == None:
        conn = create_mdt_con()

    return pd.read_sql(query_str,conn)


def read_sql_string(file_name):
    """reads the contents of a sql script into a string for python to use in a query"""

    fd = open(file_name, 'r')
    query_str = fd.read()
    fd.close()

    print('Read {0} file as string'.format(file_name))

    return query_str


def load_rxnorm():
    """downloads and loads RxNorm dataset into database"""

    z = zipfile.ZipFile(rxnorm.utils.get_dataset(handler=io.BytesIO))

    col_names = ['RXCUI','LAT','TS','LUI','STT','SUI','ISPREF','RXAUI','SAUI','SCUI','SDUI','SAB','TTY','CODE','STR','SRL','SUPPRESS','CVF','test']
    rxnconso = pd.read_csv(z.open('rrf/RXNCONSO.RRF'),sep='|',header=None,dtype=object,names=col_names)
    sql_create_table('rxnconso',rxnconso)
    del rxnconso

    col_names = ['RXCUI1','RXAUI1','STYPE1','REL','RXCUI2','RXAUI2','STYPE2','RELA','RUI','SRUI','SAB','SL','DIR','RG','SUPPRESS','CVF','test']
    rxnrel = pd.read_csv(z.open('rrf/RXNREL.RRF'),sep='|',dtype=object,header=None,names=col_names)
    sql_create_table('rxnrel',rxnrel)
    del rxnrel

    col_names = ['RXCUI','LUI','SUI','RXAUI','STYPE','CODE','ATUI','SATUI','ATN','SAB','ATV','SUPPRESS','CVF','test']
    rxnsat = pd.read_csv(z.open('rrf/RXNSAT.RRF'),sep='|',dtype=object,header=None,names=col_names)
    sql_create_table('rxnsat',rxnsat)
    del rxnsat

    del z

    rxcui_ndc = db_query(rxnorm.utils.get_sql('rxcui_ndc.sql'))
    sql_create_table('rxcui_ndc', rxcui_ndc)
    del rxcui_ndc

    dfg_df = db_query(rxnorm.utils.get_sql('dfg_df.sql'))
    sql_create_table('dfg_df', dfg_df)
    del dfg_df


def load_meps():
    '''Load Meps data into db'''
    z = zipfile.ZipFile(
        meps.utils.get_dataset('h206adat.zip', handler=io.BytesIO)
    )

    meps_prescription = pd.read_fwf(
        z.open('H206A.dat'),
        header=None,
        names=meps.columns.p_col_names,
        converters={col: str for col in meps.columns.p_col_names},
        colspecs=meps.columns.p_col_spaces,
    )

    sql_create_table('meps_prescription', meps_prescription)
    del meps_prescription
    del z

    z = zipfile.ZipFile(
        meps.utils.get_dataset('h209dat.zip', handler=io.BytesIO)
    )

    meps_demographics = pd.read_fwf(
        z.open('h209.dat'),
        header=None,
        names=meps.columns.d_col_names,
        converters={col: str for col in meps.columns.d_col_names},
        colspecs=meps.columns.d_col_spaces,
        usecols=['DUPERSID', 'PERWT18F', "REGION18", 'SEX', 'AGELAST']
    )

    # removing numbers from meps_demographic column names, since the '18' in region18 and perwt18f in MEPS are year-specific
    meps_demographics.columns = meps_demographics.columns.str.replace(r'\d+', '',regex=True)
    sql_create_table('meps_demographics', meps_demographics)
    del meps_demographics
    del z

    sql_create_table('meps_region_states', meps.columns.meps_region_states)

    meps_reference_str = meps.utils.get_sql('meps_reference.sql')
    meps_reference = db_query(meps_reference_str)
    sql_create_table('meps_reference', meps_reference)
    del meps_reference

    # TEST!!!!!!!!!!!!!!!! reads record count from created database
    meps_prescription = db_query("Select count(*) AS records from meps_prescription")
    print('DB table meps_prescription has {0} records'.format(meps_prescription['records'].iloc[0]))

    meps_demographics = db_query("Select count(*) AS records from meps_demographics")
    print('DB table meps_demographics has {0} records'.format(meps_demographics['records'].iloc[0]))

    meps_reference = db_query("Select count(*) AS records from meps_reference")
    print('DB table meps_reference has {0} records'.format(meps_reference['records'].iloc[0]))

    meps_region_states = db_query("Select count(*) AS records from meps_region_states")
    print('DB table meps_region_states has {0} records'.format(meps_region_states['records'].iloc[0]))


def load_fda():
    '''Load FDA tables into db'''

    z = zipfile.ZipFile(
        fda.utils.get_dataset(handler=io.BytesIO)
    )

    #moves FDA files to sqlite database by reading as dataframes
    product = pd.read_csv(z.open('product.txt'),sep='\t',dtype=object,header=0,encoding='cp1252')
    package = pd.read_csv(z.open('package.txt'),sep='\t',dtype=object,header=0,encoding='cp1252')
    sql_create_table('product',product)
    sql_create_table('package',package)


    #deletes FDA ZIP
    del z



    #join product table with the rxcui_ndc table
    rxcui_ndc_string = read_sql_string('rxcui_ndc.sql')
    rxcui_ndc = db_query(rxcui_ndc_string)
    sql_create_table('rxcui_ndc', rxcui_ndc)


    product['PRODUCTNDC'] = product['PRODUCTNDC'].str.replace('-', '').str.zfill(9)
    rxcui_ndc['medication_ndc'] = rxcui_ndc['medication_ndc'].astype(str).str.zfill(9)
    product_rxcui = product.merge(rxcui_ndc, left_on = 'PRODUCTNDC', right_on = rxcui_ndc['medication_ndc'].str.slice(start=0,stop=9), how = 'left')


    #extract year from startmarketingdate & endmarketingdate
    #fill NULL endmarketingyear with current year
    product_rxcui['STARTMARKETINGYEAR'] = product_rxcui['STARTMARKETINGDATE'].str.slice(start=0, stop=4).astype(int)
    product_rxcui['ENDMARKETINGYEAR'] = product_rxcui['ENDMARKETINGDATE'].str.slice(start=0, stop=4)
    product_rxcui['ENDMARKETINGYEAR'] = product_rxcui['ENDMARKETINGYEAR'].fillna(datetime.now().year)
    product_rxcui['ENDMARKETINGYEAR'] = product_rxcui['ENDMARKETINGYEAR'].astype(int)
    product_rxcui = product_rxcui[['medication_ingredient_rxcui', 'medication_ingredient_name', 'medication_product_rxcui',
                                   'medication_product_name', 'STARTMARKETINGYEAR', 'ENDMARKETINGYEAR']]

    med_marketing_year_dict = {}
    med_state_level_list = ['medication_ingredient', 'medication_product']

    #create a dictionary of df's (one for ingredient, other for product) that contains a range of years that each rxcui was available on the market
    def med_marketing_year(med_state_level_list):
        for med_state_level in med_state_level_list:
            #takes MIN startmarketingdate and MAX endmarketingdate for each rxcui
            med_marketing_year_dict[med_state_level+'_max_marketingyear_range'] = product_rxcui.groupby([med_state_level+'_rxcui', med_state_level+'_name']).agg({'STARTMARKETINGYEAR': 'min', 'ENDMARKETINGYEAR': 'max'}).reset_index()

            #creates a row for each year between startmarketingdate and endmarketingdate for each rxcui
            zipped = zip(med_marketing_year_dict[med_state_level+'_max_marketingyear_range'][med_state_level+'_rxcui'], med_marketing_year_dict[med_state_level+'_max_marketingyear_range']['STARTMARKETINGYEAR'], med_marketing_year_dict[med_state_level+'_max_marketingyear_range']['ENDMARKETINGYEAR'])
            med_marketing_year_dict[med_state_level+'_rxcui_years'] = pd.DataFrame([(i, y) for i, s, e in zipped for y in range(s, e+1)],
                                                                                   columns=[med_state_level+'_rxcui','year'])
            sql_create_table(med_state_level+'_rxcui_years',med_marketing_year_dict[med_state_level+'_rxcui_years'])
            print(med_state_level+'_rxcui_years')

    med_marketing_year(med_state_level_list)

    #deletes other dataframes
    del product
    del package
    del rxcui_ndc
    del medication_ingredient_rxcui_years
    del medication_product_rxcui_years
Comment on lines +211 to +212
Member


@kristentaytok - I get the error below because you are deleting a df that doesn't exist yet. Maybe you meant a different df? Please take a look. This only errors out if you are initially loading the DB, but doesn't affect the actual DB load.

UnboundLocalError: local variable 'medication_ingredient_rxcui_years' referenced before assignment

Contributor Author


Agh, I had that error at first, and oddly it was fine after running the exact same code once in a Jupyter notebook. But I think I figured it out, in case we decide to use this (or some variation of it). I was supposed to write it this way:

del med_marketing_year_dict['medication_ingredient_rxcui_years']
del med_marketing_year_dict['medication_product_rxcui_years']

That's because I created the dataframes as dictionary key-value pairs, since that's the only option I'm aware of for dynamically creating a variable name in the for loop (i.e., create a variable called "medication_ingredient_df" in the first loop, and another variable "medication_product_df" in the next loop).
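
For anyone following the thread, the error can be reproduced in isolation with the sketch below. The function names and the list stand-ins for dataframes are made up for illustration. The frames only ever exist as dictionary values, and because the same name is assigned later in the function, Python treats it as a local and raises `UnboundLocalError` on the earlier `del`. It may have worked in the notebook because the name already existed from a previous cell, though that is only a guess.

```python
def load_with_bug():
    frames = {}
    for level in ("medication_ingredient", "medication_product"):
        # Stand-ins for dataframes: stored under dictionary keys,
        # never bound to local variable names.
        frames[level + "_rxcui_years"] = [2019, 2020]

    # The name is assigned further down in this function, so Python treats
    # it as a local; deleting it before that assignment raises
    # UnboundLocalError.
    del medication_ingredient_rxcui_years
    medication_ingredient_rxcui_years = len(frames)


def load_fixed():
    frames = {}
    for level in ("medication_ingredient", "medication_product"):
        frames[level + "_rxcui_years"] = [2019, 2020]

    # Delete through the dictionary keys that actually hold the objects.
    del frames["medication_ingredient_rxcui_years"]
    del frames["medication_product_rxcui_years"]
    return frames


try:
    load_with_bug()
except UnboundLocalError as exc:
    print(type(exc).__name__)  # UnboundLocalError

print(load_fixed())  # {}
```

Since the dictionary goes out of scope when the function returns, the explicit `del` calls may not be needed at all.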


    #TEST!!!!!!!!!!!!!!!! reads record count from created database
    product = db_query("Select count(*) AS records from product limit 1")
    print('DB table product has {0} records'.format(product['records'].iloc[0]))

    package = db_query("Select count(*) AS records from package limit 1")
    print('DB table package has {0} records'.format(package['records'].iloc[0]))

    medication_product_rxcui_years = db_query("Select count(*) AS records from medication_product_rxcui_years limit 1")
    print('DB table medication_product_rxcui_years has {0} records'.format(medication_product_rxcui_years['records'].iloc[0]))

    medication_ingredient_rxcui_years = db_query("Select count(*) AS records from medication_ingredient_rxcui_years limit 1")
    print('DB table medication_ingredient_rxcui_years has {0} records'.format(medication_ingredient_rxcui_years['records'].iloc[0]))
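
The year-expansion step in `load_fda` (min start year and max end year per rxcui, then one row per year in that range) can be sketched in plain Python, without pandas, so the logic is easy to check. `expand_marketing_years` and the sample rows are hypothetical and for illustration only.

```python
def expand_marketing_years(rows):
    """Expand (rxcui, start_year, end_year) rows into (rxcui, year) pairs.

    Mirrors the groupby min/max followed by range(start, end + 1) in the diff.
    Hypothetical helper for illustration only.
    """
    spans = {}
    for rxcui, start, end in rows:
        lo, hi = spans.get(rxcui, (start, end))
        # Keep the earliest start and the latest end seen for this rxcui.
        spans[rxcui] = (min(lo, start), max(hi, end))
    return [
        (rxcui, year)
        for rxcui, (lo, hi) in sorted(spans.items())
        for year in range(lo, hi + 1)
    ]


# Made-up rows: "a" has two products with a gap between 2011 and 2014.
rows = [("a", 2010, 2011), ("a", 2014, 2015), ("b", 2020, 2020)]
print(expand_marketing_years(rows))
```

With a single min/max range per rxcui, the gap years 2012 and 2013 are included for "a" even though no individual product covered them. That looks like a consequence of the approach in the diff and may or may not be intended.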
1 change: 1 addition & 0 deletions src/mdt/fda/__init__.py
@@ -0,0 +1 @@
from . import utils
17 changes: 17 additions & 0 deletions src/mdt/fda/utils.py
@@ -0,0 +1,17 @@
import requests
from pathlib import Path


def get_dataset(
    dest = Path.cwd(),
    handler = None
):
    url = f'https://www.accessdata.fda.gov/cder/ndctext.zip'
    response = requests.get(url)

    if handler:
        return handler(response.content)

    (dest / url.split('/')[-1]).write_bytes(response.content)

    return response
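
The `handler` argument is what lets `load_fda` open the download in memory via `zipfile.ZipFile(get_dataset(handler=io.BytesIO))`. Here is a network-free sketch of that pattern, where `get_dataset_from_bytes` is a hypothetical stand-in for `get_dataset` and the archive contents are made-up placeholder values.

```python
import io
import zipfile


def get_dataset_from_bytes(payload, handler=None):
    """Stand-in for get_dataset; payload plays the role of response.content.

    Hypothetical helper for illustration only.
    """
    if handler:
        return handler(payload)
    return payload


# Build a tiny zip archive in memory to stand in for the downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("product.txt", "PRODUCTNDC\tSTARTMARKETINGDATE\n")
payload = buffer.getvalue()

# io.BytesIO wraps the raw bytes in a file-like object that ZipFile can open.
z = zipfile.ZipFile(get_dataset_from_bytes(payload, handler=io.BytesIO))
print(z.namelist())  # ['product.txt']
first_line = z.open("product.txt").readline().decode()
```

Passing the handler in, rather than hard-coding `io.BytesIO`, keeps the download helper independent of how the caller wants to consume the bytes.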
4 changes: 4 additions & 0 deletions src/mdt/meps/__init__.py
@@ -0,0 +1,4 @@
from . import utils
from . import columns

__all__ = ['utils', 'columns']