Update doc - semehr_annotate.py
abrooks committed Mar 25, 2024
1 parent 30a49aa commit f8b54ab
Showing 2 changed files with 23 additions and 12 deletions.
33 changes: 22 additions & 11 deletions doc/annotation_creation.md
@@ -45,13 +45,24 @@ semehr_anon.py -i txt_dir -o anon_dir [--xml]
The annotation step can be performed with:

```
-semehr_annotate.sh -i anon_dir/ -o annot_dir/
+semehr_annotate.py -i anon_dir/ -o annot_dir/
```

Input files must be named `*.txt` and output files will be named similarly `*.json`.
It requires a config file specified with `-c` unless CogStack-SemEHR is in a
well-known location typically `/opt/semehr/CogStack-SemEHR`

Usage: `semehr_annotate.py -i input -o output -c semehr_processor.json -s CogStack-SemEHR/ -g gcp/`

```
-i INPUT, --input INPUT directory of *.txt files
-o OUTPUT, --output OUTPUT directory of *.json files
-c CONF, --conf path to semehr_processor.json filename
-s SEMEHR, --semehr /opt/semehr/CogStack-SemEHR
-g GCP, --gcp /opt/gcp (contains bio-yodie-1-2-1, gate, gcp-2.5-18658)
-d, --debug
```
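
As an illustration of the option list above, here is a minimal sketch of how such a command line could be declared with Python's standard `argparse` module. It is not taken from `semehr_annotate.py`: the function name `build_parser`, the `required=True` settings, and the use of the listed paths as defaults are assumptions made for the example.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options (not the real script)."""
    parser = argparse.ArgumentParser(prog="semehr_annotate.py")
    # Assumption: input and output directories are mandatory.
    parser.add_argument("-i", "--input", required=True,
                        help="directory of *.txt files")
    parser.add_argument("-o", "--output", required=True,
                        help="directory of *.json files")
    # The doc says -c is only needed when SemEHR is not in a well-known location.
    parser.add_argument("-c", "--conf", default=None,
                        help="path to semehr_processor.json")
    # Assumption: the paths shown in the usage text act as defaults.
    parser.add_argument("-s", "--semehr", default="/opt/semehr/CogStack-SemEHR",
                        help="CogStack-SemEHR directory")
    parser.add_argument("-g", "--gcp", default="/opt/gcp",
                        help="directory containing bio-yodie, gate and gcp")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="enable debug output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, `-i anon_dir/ -o annot_dir/` alone would parse successfully and leave `conf` unset, which matches the documented behaviour that `-c` is optional when SemEHR is installed in the usual place.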

## DICOM SR annotation

This is similar to Standalone document annotation but with a preceding
@@ -62,7 +73,7 @@ in SMI format).
Use the `CTP_DicomToText.py` script to extract the text, for example from MongoDB in SMI extract all documents with metadata for a given StudyDate:

```
-CTP_DicomToText -y dataLoad.yaml -y dataExtract.yaml \
+CTP_DicomToText.py -y dataLoad.yaml -y dataExtract.yaml \
-i <StudyDate> \
-o txt_dir/ -m meta_dir/
```
@@ -78,7 +89,7 @@ semehr_to_postgres.py -j annot_dir/ -t txt_dir/ -m meta_dir/
```

The `annot_dir` is the directory of annotations in JSON format
-as produced by `semehr_annotate.sh`.
+as produced by `semehr_annotate.py`.
The `txt_dir` is the directory of corresponding text files
which will be added to the database alongside their annotations.
It could be `anon_dir` from `semehr_anon.py` if you want to
@@ -162,32 +173,32 @@ ie. the words matching minor_type will be highlighted.

## Troubleshooting

-Check which version of bio-yodie is used. The path `bio-yodie-1-2-1` is hardcoded. However you need to download the full-size version from Honghan.
+* Check which version of bio-yodie is used. The path `bio-yodie-1-2-1` is hardcoded. However you need to download the full-size version from Honghan.

-`Failed to do SemEHR process [Errno 2] No such file or directory: '/home/ubuntu/SemEHR/data/study/study.json'`
+* `Failed to do SemEHR process [Errno 2] No such file or directory: '/home/ubuntu/SemEHR/data/study/study.json'`
Just comment out the study in the config. (Check what the study config does?)

-`output_docs` has `stroke_study` annotations - why?
+* `output_docs` has `stroke_study` annotations - why?
Because of the supplemental-gazetteer files you left in bio-yodie.
-The study annotations can be ignored if you've already created them in the master database.
+The study annotations can be ignored if you have already created them in the master database.

-nothing in semehr_results
+* nothing in semehr_results -
Because documents needed to be called %s.txt - fix the template in the config file

-run in PICTURES vm - millions of docanalysis lines like this:
+* run in PICTURES vm - millions of docanalysis lines like this:

```bash
docanalysis(587) root 2021-07-05 15:40:19,789 INFO to be developed [2558, 2573] ruled by hypothetical_filters.json
```

see above

-also errors like this:
+* also errors like this:

```bash
docanalysis(587) root 2021-07-05 15:40:19,810 INFO very slow [2662, 2671] ruled by hypothetical_filters.json
error doing <function analyse_doc_anns_file at 0x7fd442960f70> on /run/user/1000/semehr/tmp_semehr_run.sh_31062/output_docs/doc2299.json
'cmp' is an invalid keyword argument for sort()docanalysis(587) root 2021-07-05 15:40:19,811 INFO knee [1285, 1289] ruled by not_mention_filters.json
```

-Fixed the source code to use a different cmp, see the repo commits
+Now fixed the source code to use a different cmp, see the repo commits
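
For context on the `'cmp' is an invalid keyword argument for sort()` message in the log above: Python 3 removed the `cmp` argument from `list.sort()`, and the standard replacement is `functools.cmp_to_key`. The sketch below shows that general technique only. The function `compare_spans` and the sample data are hypothetical, and the actual fix in the repo commits may look different.

```python
from functools import cmp_to_key


def compare_spans(a, b):
    """Old-style comparison (hypothetical): negative, zero or positive result.

    Orders annotations by start offset, then by end offset.
    """
    return (a["start"] - b["start"]) or (a["end"] - b["end"])


# Sample annotation spans, invented for illustration.
anns = [
    {"start": 2662, "end": 2671},
    {"start": 1285, "end": 1289},
    {"start": 2558, "end": 2573},
]

# Python 2 style, which fails under Python 3 with a TypeError:
#   anns.sort(cmp=compare_spans)

# Python 3 style: wrap the comparison function so it can be used as a key.
anns.sort(key=cmp_to_key(compare_spans))
print([a["start"] for a in anns])
```

Wrapping the existing comparison function keeps its ordering logic unchanged, which is usually the least invasive way to port such code.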
2 changes: 1 addition & 1 deletion doc/tools.md
@@ -86,7 +86,7 @@ export PYTHONPATH=/path/to/Smi_Common_Python # if SmiServices is not yet in your

```
# input files must be named *.txt, output files will be *.json
-./semehr_annotate.sh -i ~/SemEHR/structuredreports/src/data/mtsamples_ihi_docs/ -o ~/SemEHR/structuredreports/src/data/mtsamples_ihi_semehr_results/
+./semehr_annotate.py -i ~/SemEHR/structuredreports/src/data/mtsamples_ihi_docs/ -o ~/SemEHR/structuredreports/src/data/mtsamples_ihi_semehr_results/
```

## Import the semehr_results into the MongoDB database