[PDSM] Medical Testcases for benchmarking #157

Merged
merged 98 commits into from
Jul 18, 2024
5a8749e
change iterations
Apr 2, 2024
4ecca83
I love python
SturmCamper Apr 5, 2024
b3b18a9
- Local-Test-Remove in conftest.py
SturmCamper Apr 5, 2024
d79cb67
Working project version
marlis-en Apr 21, 2024
813e2c1
First pytest not working
marlis-en Apr 21, 2024
ddd8bf7
1 test works
SeraLunatic Apr 21, 2024
41ddf2a
test name
SeraLunatic Apr 21, 2024
29a62b8
second test
SeraLunatic Apr 21, 2024
3a38516
added physikum questions
SeraLunatic Apr 23, 2024
fb7c431
regex search
SeraLunatic Apr 23, 2024
0f772de
lowercase back in, oops
SeraLunatic Apr 23, 2024
fb7905d
ADD Emergency Testcase
Apr 27, 2024
827aad8
CHANGE: regex and not regex in one runtest and csv
SeraLunatic Apr 29, 2024
7b1d785
REFACTOR test names changed
SeraLunatic Apr 29, 2024
3356c9e
ADDED new regex EEG-Questions
SeraLunatic Apr 30, 2024
392da50
ADDED new yes_no EEG-Questions
SeraLunatic Apr 30, 2024
a0ca30a
Merge pull request #1 from mehizli/datascience_fragen
ytehran May 14, 2024
9b4cf61
Merge pull request #2 from mehizli/testcase_emergency
ytehran May 14, 2024
a3788b2
test
May 14, 2024
8dcc567
Merge branch 'develop' into meli-test-file
May 14, 2024
8b1321d
indent fixes
May 14, 2024
84abb7d
Merge pull request #3 from mehizli/meli-test-file
ytehran May 14, 2024
4caf83a
FIXED Encoding Bug
SeraLunatic May 15, 2024
886c336
Merge pull request #4 from mehizli/bugfix/data_not_found
SeraLunatic May 15, 2024
1768961
Added new testcases for mental diseases
marlis-en May 15, 2024
372e3c8
test cases fixes
May 15, 2024
d731dba
Merge branch 'develop' into meli-test-file
May 15, 2024
e5be05f
test cases indent fixes
May 15, 2024
20183cb
added English translations
May 20, 2024
b177a7d
test for oncology
May 20, 2024
3dd8930
ADD translation for physikum
SeraLunatic May 22, 2024
5e2d180
ADD translation EEG
SeraLunatic May 22, 2024
a53a8be
REFACTORED and ADDED questions for mental diseases
marlis-en May 22, 2024
0a4b75a
WIP wrong_asnwer csv
SeraLunatic May 22, 2024
d3c6fce
BUGFIX eeg_answer
SeraLunatic May 22, 2024
c853b72
ADD writing csv with expected, wrong and failure_group
SeraLunatic May 22, 2024
f3a1597
ADD regex failure_groups
SeraLunatic May 22, 2024
4d67b3c
ADD regex failure_groups and better synonym tracker
SeraLunatic May 22, 2024
64e27d4
Merge pull request #5 from mehizli/meli-test-file
ytehran May 22, 2024
7ccd4f4
Merge pull request #6 from mehizli/translate_quest
ytehran May 22, 2024
83b4905
Merge pull request #7 from mehizli/testcase_mental_diseases
ytehran May 22, 2024
ad7a0da
Merge pull request #8 from mehizli/develop
ytehran May 22, 2024
b48d03f
Merge branch 'wrong_answers' into develop
SeraLunatic May 22, 2024
861282a
Merge pull request #9 from mehizli/develop
ytehran May 22, 2024
2b4f6ad
Translate the emergency cases and updating to the new schema
May 22, 2024
9f4e039
New cardiology cases in German
May 24, 2024
f871db0
Case translated into English
May 24, 2024
74b821c
Merge pull request #10 from mehizli/meli-test-file
ytehran May 24, 2024
eb73efb
Merge branch 'develop' into cardio_update
May 24, 2024
4a4fd38
Refactor and clean code
May 24, 2024
4b53bb0
Merge pull request #11 from mehizli/cardio_update
ytehran May 24, 2024
d4c433d
Merge pull request #12 from mehizli/develop
ytehran May 24, 2024
72c7429
ADDED new testcase dermatology
marlis-en May 27, 2024
be9cdc9
formatting
slobentanzer Jun 4, 2024
14490f1
revert doubly run tests
slobentanzer Jun 4, 2024
5c10b8b
revert non-skipping
slobentanzer Jun 4, 2024
2714b9a
comment out models for test purposes
slobentanzer Jun 10, 2024
49e4e41
Merge branch 'main' into pr/ytehran/157
slobentanzer Jun 10, 2024
018191e
pre-commit
slobentanzer Jun 10, 2024
5328bd9
correct function name
slobentanzer Jun 10, 2024
af82910
Merge branch 'main' into pr/ytehran/157
slobentanzer Jun 10, 2024
c137e4c
change nomenclature: `wrong_result` -> `failure_mode`
slobentanzer Jun 10, 2024
b1125cb
record single scores, calculate standard deviation
slobentanzer Jun 10, 2024
fffadab
delete venv/.env in gitignore
Jun 11, 2024
8477af7
delete test results
slobentanzer Jun 11, 2024
4caed92
Merge branch 'main' of https://github.com/mehizli/biochatter_pdsm int…
slobentanzer Jun 11, 2024
5cb9e52
one iteration for dev
slobentanzer Jun 11, 2024
b3e2908
bring back text extraction results
slobentanzer Jun 11, 2024
be8562e
rename `correctness` to `medical_exam`, more specific
slobentanzer Jun 11, 2024
d80efe6
run once on openhermes
slobentanzer Jun 11, 2024
b24b8d3
replace case separator (colon)
slobentanzer Jun 11, 2024
bf3808f
reset results
slobentanzer Jun 11, 2024
6ce37f0
run once on representative models
slobentanzer Jun 11, 2024
8aeaaae
REFACTOR testcases mental diseases and dermatology
marlis-en Jun 14, 2024
b73944e
little fixes + uniform format of the questions
Jun 15, 2024
c57d0ee
Refine the categorize_failures method
Jun 16, 2024
f5984db
Merge remote-tracking branch 'origin/develop' into develop
Jun 16, 2024
0651824
Merge branch 'develop' into promt-optimization
Jun 16, 2024
dedc67f
Refactor to part single and multiple choice
Jun 16, 2024
6ff88de
Merge pull request #14 from mehizli/promt-optimization
ytehran Jun 16, 2024
37d0916
Merge pull request #15 from mehizli/develop
ytehran Jun 16, 2024
66293a9
first batch of LLMs run on med exam
slobentanzer Jun 18, 2024
a1f4807
another batch of models run
slobentanzer Jun 19, 2024
ac5845f
another batch of open models run
slobentanzer Jun 20, 2024
112bd8d
+ j-notebook for analysis and graphs
SturmCamper Jun 25, 2024
e9bcc7b
Remove due to Deprecation (Just throwing errors while benchmarking)
SturmCamper Jun 25, 2024
f8393e8
J-Notebook, for simple Graph generation
SturmCamper Jun 25, 2024
99260b6
Added documentation of functions
marlis-en Jun 25, 2024
770802c
Documentation of the data analysis
marlis-en Jun 25, 2024
2cb47c9
Added Lang/cat Graph
SturmCamper Jun 25, 2024
70f7cc8
Merge branch 'stats_graphs' of https://github.com/mehizli/biochatter_…
SturmCamper Jun 25, 2024
f40cae7
Merge branch 'main' into main
slobentanzer Jul 2, 2024
ee0482b
Merge branch 'main' into main
slobentanzer Jul 2, 2024
c1490eb
Merge pull request #16 from mehizli/develop
ytehran Jul 15, 2024
40086de
Changes to fix PR
Jul 15, 2024
c7a4695
Fixed the calc of the std
Jul 15, 2024
e753aec
Merge branch 'stats_graphs' of github.com:mehizli/biochatter_pdsm int…
Jul 15, 2024
ad2070d
Merge pull request #18 from mehizli/stats_graphs
ytehran Jul 16, 2024
3,718 changes: 3,718 additions & 0 deletions benchmark/Data_Analysis.ipynb

Large diffs are not rendered by default.

195 changes: 194 additions & 1 deletion benchmark/benchmark_utils.py
@@ -1,5 +1,6 @@
from datetime import datetime

import re
from nltk.corpus import wordnet
import pytest
import importlib_metadata

@@ -32,6 +33,9 @@ def benchmark_already_executed(
    """
    task_results = return_or_create_result_file(task)

    # make sure the failure mode CSV exists (it is created if missing)
    return_or_create_failure_mode_file(task)

    if task_results.empty:
        return False

@@ -99,6 +103,50 @@ def return_or_create_result_file(
    return results


def return_or_create_failure_mode_file(task: str):
    """
    Returns the failure mode file for the task or creates it if it does not
    exist.

    Args:
        task (str): The benchmark task, e.g. "biocypher_query_generation"

    Returns:
        pd.DataFrame: The failure mode recording file for the task
    """
    file_path = get_failure_mode_file_path(task)
    try:
        results = pd.read_csv(file_path, header=0)
    except (pd.errors.EmptyDataError, FileNotFoundError):
        results = pd.DataFrame(
            columns=[
                "model_name",
                "subtask",
                "actual_answer",
                "expected_answer",
                "failure_modes",
                "md5_hash",
                "datetime",
            ]
        )
        results.to_csv(file_path, index=False)
    return results


def get_failure_mode_file_path(task: str) -> str:
    """
    Returns the path to the failure mode recording file.

    Args:
        task (str): The benchmark task, e.g. "biocypher_query_generation"

    Returns:
        str: The path to the failure mode file
    """
    return f"benchmark/results/{task}_failure_modes.csv"
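The create-if-missing behaviour of the two helpers above can be tried in isolation. The following is a minimal sketch using only the standard library `csv` module instead of pandas, so it is not the code from this PR. The helper name `ensure_failure_mode_csv`, the temporary directory, and the `mkdir` call are assumptions added for illustration; the column names are copied from the diff.

```python
import csv
import tempfile
from pathlib import Path

# Column names copied from the failure mode file in the diff.
FAILURE_MODE_COLUMNS = [
    "model_name", "subtask", "actual_answer", "expected_answer",
    "failure_modes", "md5_hash", "datetime",
]


def ensure_failure_mode_csv(path: Path) -> list:
    """Hypothetical helper: return the existing rows, writing a header-only
    file first if the CSV is missing or empty."""
    if not path.exists() or path.stat().st_size == 0:
        # Assumption: create the results directory too, so a fresh checkout works.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(FAILURE_MODE_COLUMNS)
        return []
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Usage: a throwaway directory stands in for benchmark/results/.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "results" / "medical_exam_failure_modes.csv"
    first = ensure_failure_mode_csv(csv_path)   # creates the file
    second = ensure_failure_mode_csv(csv_path)  # reads the header-only file back
    header = csv_path.read_text(encoding="utf-8").splitlines()[0]
```

Both calls return an empty list here, because the file holds only the header row after the first call.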


def write_results_to_file(
    model_name: str,
    subtask: str,
@@ -130,6 +178,151 @@ def write_results_to_file(
    results.to_csv(file_path, index=False)


def write_failure_modes_to_file(
    model_name: str,
    subtask: str,
    actual_answer: str,
    expected_answer: str,
    failure_modes: str,
    md5_hash: str,
    file_path: str,
):
    """
    Writes the failure modes identified for a given response to a subtask to
    the given file path.

    Args:
        model_name (str): The model name, e.g. "gpt-3.5-turbo"

        subtask (str): The benchmark subtask test case, e.g. "entities"

        actual_answer (str): The actual response given to the subtask question

        expected_answer (str): The expected response for the subtask

        failure_modes (str): The mode of failure, e.g. "Wrong word count",
            "Formatting", etc.

        md5_hash (str): The md5 hash of the test case

        file_path (str): The path to the result file
    """
    results = pd.read_csv(file_path, header=0)
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    new_row = pd.DataFrame(
        [
            [
                model_name,
                subtask,
                actual_answer,
                expected_answer,
                failure_modes,
                md5_hash,
                now,
            ]
        ],
        columns=results.columns,
    )
    results = pd.concat([results, new_row], ignore_index=True).sort_values(
        by=["model_name", "subtask"]
    )
    results.to_csv(file_path, index=False)
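The append-then-sort step can be illustrated without pandas. This is a rough sketch under stated assumptions: `append_sorted` is a hypothetical helper that works on CSV text in memory, and the model names, answers, and hashes are made-up sample values, not benchmark results.

```python
import csv
import io
from datetime import datetime

COLUMNS = [
    "model_name", "subtask", "actual_answer", "expected_answer",
    "failure_modes", "md5_hash", "datetime",
]


def append_sorted(csv_text: str, row: dict) -> str:
    """Hypothetical helper: add one timestamped row, re-sort by
    (model_name, subtask), and return the new CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    rows.append({**row, "datetime": stamp})
    rows.sort(key=lambda r: (r["model_name"], r["subtask"]))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Usage with made-up sample rows, appended out of order.
text = ",".join(COLUMNS) + "\n"
text = append_sorted(text, {
    "model_name": "model-b", "subtask": "case_1", "actual_answer": "a)",
    "expected_answer": "a", "failure_modes": "Format Error", "md5_hash": "0" * 32,
})
text = append_sorted(text, {
    "model_name": "model-a", "subtask": "case_1", "actual_answer": "A",
    "expected_answer": "a", "failure_modes": "Case Sensitivity", "md5_hash": "1" * 32,
})
models = [r["model_name"] for r in csv.DictReader(io.StringIO(text))]
```

As in the function above, every write reads and rewrites the whole file, so the cost of one write grows with the number of recorded failures.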


def categorize_failure_modes(
    actual_answer, expected_answer, regex=False
) -> str:
    """
    Categorises the mode of failure for a given response to a subtask.

    Args:
        actual_answer (str): The actual response given to the subtask question

        expected_answer (str): The expected response for the subtask

        regex (bool): Whether the expected answer is a regular expression

    Returns:
        str: The mode of failure, e.g. "Case Sensitivity", "Partial Match",
            "Partial Match / case Sensitivity", "Format Error", "Synonym",
            "Words Missing", "Entire Answer Incorrect", "Other"
    """
    if not regex:
        # The answer is right apart from letter case (e.g. a / A)
        if actual_answer.lower() == expected_answer.lower():
            return "Case Sensitivity"

        # The answer is the expected answer followed by ")" (e.g. "a)" instead of "a")
        elif actual_answer.strip() == expected_answer + ")":
            return "Format Error"

        # One answer contains the other; only checked if the expected answer is longer than one letter
        elif len(expected_answer) > 1 and (actual_answer in expected_answer or expected_answer in actual_answer):
            return "Partial Match"

        # The answers differ only in whitespace and case (e.g. "a b" instead of "ab")
        elif re.sub(r"\s+", "", actual_answer.lower()) == re.sub(
            r"\s+", "", expected_answer.lower()
        ):
            return "Format Error"

        # The answer is a synonym according to nltk (e.g. Illness / Sickness)
        elif is_synonym(actual_answer, expected_answer):
            return "Synonym"

        # Exactly one of the two answers contains digits (e.g. "123" vs "one two three")
        elif (
            re.search(r"\w+", actual_answer)
            and re.search(r"\w+", expected_answer)
            and any(char.isdigit() for char in actual_answer)
            != any(char.isdigit() for char in expected_answer)
        ):
            return "Format Error"

        # One answer contains the other when case is ignored
        elif (
            actual_answer.lower() in expected_answer.lower()
            or expected_answer.lower() in actual_answer.lower()
        ):
            return "Partial Match / case Sensitivity"

        # Otherwise the answer may be completely wrong
        else:
            return "Other"

    else:
        # Every word of actual_answer occurs in expected_answer (substring check), so expected words are missing
        if all(word in expected_answer for word in actual_answer.split()):
            return "Words Missing"

        # Check if some words in actual_answer are incorrect (present in actual_answer but not in expected_answer)
        # elif any(word not in expected_answer for word in actual_answer.split()):
        #     return "Incorrect Words"

        # The actual_answer contains at least one word that does not occur in expected_answer
        else:
            return "Entire Answer Incorrect"
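Because the non-regex branch of `categorize_failure_modes` is an `if`/`elif` chain, the first matching check decides the category. The reduced sketch below keeps only the pure string checks, in the same order, and leaves out the synonym and digit checks so that it runs without nltk. `categorize_simple` is a hypothetical name and the sample answers are invented.

```python
import re


def categorize_simple(actual: str, expected: str) -> str:
    """Reduced sketch: string-only checks, in the same order as the diff."""
    if actual.lower() == expected.lower():
        return "Case Sensitivity"
    if actual.strip() == expected + ")":
        return "Format Error"
    if len(expected) > 1 and (actual in expected or expected in actual):
        return "Partial Match"
    if re.sub(r"\s+", "", actual.lower()) == re.sub(r"\s+", "", expected.lower()):
        return "Format Error"
    if actual.lower() in expected.lower() or expected.lower() in actual.lower():
        return "Partial Match / case Sensitivity"
    return "Other"


# Invented (actual, expected) pairs and the category each one falls into.
examples = {
    ("A", "a"): categorize_simple("A", "a"),
    ("a)", "a"): categorize_simple("a)", "a"),
    ("a b", "ab"): categorize_simple("a b", "ab"),
    ("The answer is ab", "ab"): categorize_simple("The answer is ab", "ab"),
    ("a, b", "a"): categorize_simple("a, b", "a"),
    ("b", "a"): categorize_simple("b", "a"),
}
for pair, category in examples.items():
    print(pair, "->", category)
```

Two points stand out in this sketch. An exact match is reported as "Case Sensitivity", so the function presumably expects to be called only for answers already judged wrong. And the one-letter guard on the third check does not apply to the final case-insensitive check, so a one-letter expected answer such as "a" still yields "Partial Match / case Sensitivity" for "a, b".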


def is_synonym(word1, word2):
    """
    Tests whether word1 and word2 are synonyms of each other. Returns True if
    any pair of their WordNet synsets has a Wu-Palmer similarity, False
    otherwise.
    """
    # yes/no answers (English or German) are never treated as synonyms
    if word2 in ("yes", "no", "ja", "nein"):
        return False

    synsets1 = wordnet.synsets(word1)
    synsets2 = wordnet.synsets(word2)

    for synset1 in synsets1:
        for synset2 in synsets2:
            if synset1.wup_similarity(synset2) is not None:
                return True
    return False
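A note on guards that test a word against several literals: Python parses `word == "yes" or "no" or "ja" or "nein"` as `(word == "yes") or "no" or ...`, and a non-empty string is always truthy, so a guard written that way fires for every input. Using `is` instead of `==` additionally compares object identity rather than value and raises a `SyntaxWarning` with literals since Python 3.8. The sketch below contrasts the chained form with a membership test; both function names are hypothetical, and `==` is used in place of `is` to keep the example free of warnings.

```python
YES_NO_ANSWERS = {"yes", "no", "ja", "nein"}


def chained_or_guard(word: str) -> bool:
    # Parsed as (word == "yes") or "no" or "ja" or "nein".
    # "no" is a non-empty string, so the expression is truthy for any word.
    return bool(word == "yes" or "no" or "ja" or "nein")


def membership_guard(word: str) -> bool:
    # True only when the word is one of the yes/no answers.
    return word in YES_NO_ANSWERS


for word in ("illness", "nein"):
    print(word, chained_or_guard(word), membership_guard(word))
```

With the chained form, a function that returns early on this guard never reaches its WordNet lookup, whatever the input.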


# TODO should we use SQLite? An online database (REDIS)?
def get_result_file_path(file_name: str) -> str:
    """Returns the path to the result file.