Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] Pugh lab main [just to compare] #17

Draft
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

curlup
Copy link
Collaborator

@curlup curlup commented Jun 13, 2023

No description provided.

@@ -613,7 +613,8 @@ def get_clinical_ids_from_sample_ids(self) -> Dict[ClinicalID, str]:
self._clinical_data.items()}
else:
return {clinical_id: clinical_data['SAMPLE_ID'] for clinical_id, clinical_data in
self._clinical_data.items() if clinical_data['VITAL_STATUS'] == 'alive'}
self._clinical_data.items() if (clinical_data['VITAL_STATUS'] is not None and clinical_data['VITAL_STATUS'].lower() == 'alive')}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, we should have that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new ME version moves this logic around a bit. Now it ignores users with VITAL_STATUS "deceased," rather than including only users with VITAL_STATUS "alive" (i.e. everyone is alive by default).

Comment on lines +76 to +99
# recompile the query to be case insensitive
# convert the $in into a list of $or conditions so we can use $regex inside a $in
# mongo has a limitation that cannot use $regex within a $in
# using regex
if "ONCOTREE_PRIMARY_DIAGNOSIS_NAME" in query_part.query:
if "$in" in query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME']:
new_conditions = [
{'ONCOTREE_PRIMARY_DIAGNOSIS_NAME': {'$regex': f'^{old_query}$', '$options': 'i'}} for
old_query in query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME']['$in']]
del query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME'] # Remove old query from query_part
query_part.query['$or'] = new_conditions # Add new conditions to query_part
else:
org_query = query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME'];
ignore_case_query = {'$regex': f'^{org_query}$', '$options': 'i'}
query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME'] = ignore_case_query

# Exclude documents where 'ONCOTREE_PRIMARY_DIAGNOSIS_NAME' is 'NA'
new_query = {
'$and': [
{join_field: {'$in': list(need_new)}},
query_part.query,
{'ONCOTREE_PRIMARY_DIAGNOSIS_NAME': {'$ne': 'NA'}}
]
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasonhansel we should discuss

@@ -34,7 +34,7 @@ def get_genomic_details(genomic_doc: Dict, trial_match: TrialMatch):
alteration.append(hugo_symbol)

# add mutation
if true_protein is not None:
if true_protein is not None and true_protein:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. @jasonhansel do we already have that fixed or 😬

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to incorporate this change without issues.

@@ -3529,6 +3530,7 @@
"Papillary Glioneuronal Tumor",
"Ewing Sarcoma",
"Renal Angiomyolipoma",
"Large Cell Neuroendocrine Carcinoma"
"Large Cell Neuroendocrine Carcinoma",
"Breast Invasive Carcinoma, NOS"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate with the one above?

@@ -315,6 +315,7 @@
"Invasive Breast Carcinoma",
"Phyllodes Tumor of the Breast",
"Breast Invasive Carcinosarcoma, NOS",
"Breast Invasive Carcinoma, NOS",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is generated programmatically from the OncoTree data. We should check with PMATCH to see why they want something other than what OncoTree provides; the config JSON file allows you to specify a separate path within this folder to use for this mapping.

@@ -235,6 +235,9 @@ def create_trial_matches(self, trial_match: TrialMatch, new_trial_match: Dict) -
query = trial_match.match_reason.extract_raw_query()
clinical_doc = self.cache.docs[trial_match.match_reason.clinical_id]
new_trial_match.update({'cancer_type_match': get_cancer_type_match(trial_match)})
# Add in additional fields we need for frontend
if ('arm_description' in trial_match.match_clause_data.match_clause_additional_attributes):
new_trial_match.update({'arm_description': trial_match.match_clause_data.match_clause_additional_attributes['arm_description']})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to incorporate this change without issues.

elif trial_value.upper() == 'FALSE':
return QueryTransformerResult({sample_key: 'Negative'}, False)
else:
return QueryTransformerResult({sample_key: trial_value}, False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to incorporate this change without issues.

@@ -39,7 +39,8 @@ def main(run_args):
me.update_all_matches()

if run_args.csv_output:
me.create_output_csv()
from matchengine.internals.utilities.output import create_output_csv
create_output_csv(me)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest ME version fixes this.

@@ -71,7 +72,7 @@ def main(run_args):
subp_p = subp.add_parser('load', help='Sets up your MongoDB for matching.')
subp_p.add_argument('-t', dest='trial', default=None, help=param_trials_help)
subp_p.add_argument('-c', dest='clinical', default=None, help=param_clinical_help)
subp_p.add_argument('-g', dest='extended_attributes', default=None, help=param_genomic_help)
subp_p.add_argument('-g', dest='genomic', default=None, help=param_genomic_help)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we can fix this, we should be discouraging using this "loading" functionality for anything other than trials, in favor of having users load data into MongoDB directly using e.g. an ETL process of some sort.

@@ -71,7 +72,7 @@ def main(run_args):
subp_p = subp.add_parser('load', help='Sets up your MongoDB for matching.')
subp_p.add_argument('-t', dest='trial', default=None, help=param_trials_help)
subp_p.add_argument('-c', dest='clinical', default=None, help=param_clinical_help)
subp_p.add_argument('-g', dest='extended_attributes', default=None, help=param_genomic_help)
subp_p.add_argument('-g', dest='genomic', default=None, help=param_genomic_help)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we can fix this, we should be discouraging using this "loading" functionality for anything other than trials, in favor of having users load data into MongoDB directly using e.g. an ETL process of some sort.

# if isinstance(identifier, ObjectId) or identifier is None:
# pass
# else:
# sort_array.append(int(identifier.replace("-", "")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new version of ME will move this into TrialMatchDocumentCreator where it can be modified more easily.

if matchengine.report_all_clinical_reasons or \
keys.issubset(matchengine.match_criteria_transform.valid_clinical_reasons):
should_add_reason = True
if should_add_reason:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new version of ME moves this code into TrialMatchDocumentCreator where it can be modified more easily (though it seems like the change here may just be a refactor).

@@ -72,7 +73,30 @@ async def execute_clinical_queries(matchengine: MatchEngine,
matchengine.cache.in_process.setdefault(query_hash, set()).update(need_new)

if need_new:
new_query = {'$and': [{join_field: {'$in': list(need_new)}}, query_part.query]}
# recompile the query to be case insensitive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we don't want to incorporate. There may be ways of doing this through the query transformers, but it will never be performant because of the cost of regex lookups. Ideally this could be fixed at the data ingestion layer by (say) lowercasing all inputs and then searching for lowercase cancer types or by making cancer types match oncotree.

@@ -172,7 +172,7 @@ def load_file(db_rw, filetype: str, path: str, collection: str):
db_rw[collection].insert_one(row)
else:
raw_file_data = file_handle.read()
if filetype == 'yaml':
if filetype == 'yml':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a bugfix that we can incorporate.

@@ -3,7 +3,7 @@
"trial_identifier": "protocol_no",
"match_trial_link_id": "protocol_no",
"trial_status_key": {
"key_name": null,
"key_name": "summary",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the changes to this file can be incorporated without issues. The one exception is the "trial_status_key," which determines how we decide if trials are open or closed; that may be something we need to keep separate for PMATCH.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants