[Feature] Support MMMLU & MMMLU-lite Benchmark #1565
The dataset config name should end with the generation type and a version index, like "mmlu_5_shot_gen/ppl_xxx.py".
done
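The naming rule requested above can be checked mechanically. What follows is a minimal sketch, assuming the suffix is `gen` or `ppl` followed by an underscore and a short alphanumeric version index. The exact index format is not specified in this thread, so the pattern and the helper name `is_valid_config_name` are illustrative assumptions only.

```python
import re

# Assumed pattern: <dataset_name>_<gen|ppl>_<version index>.py
# The version index format is a guess; adjust it to the project's convention.
_CONFIG_NAME = re.compile(r'^[a-z0-9_]+_(gen|ppl)_[a-z0-9]+\.py$')


def is_valid_config_name(filename: str) -> bool:
    """Return True if the name ends with a generation type and version index."""
    return _CONFIG_NAME.fullmatch(filename) is not None
```

Under this assumed pattern, a name such as "mmmlu_5_shot.py" is rejected because it carries neither a generation type nor a version index.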
Same naming error as in "mmmlu_5_shot.py".
done
Please add a default config named "mmmlu_gen.py" for chat model generation, with content like:
opencompass/datasets/mmmlu.py (Outdated)

    @staticmethod
    def load(path: str, name: str):
        path = get_data_path(path, local_mode=True)
A dataset that can be loaded from the Hugging Face Dataset Hub should be loaded with "datasets.load_dataset()". A template looks like this:

    from datasets import Dataset, load_dataset

    from opencompass.registry import LOAD_DATASET

    from ..base import BaseDataset


    @LOAD_DATASET.register_module()
    class LongBench2wikimqaDataset(BaseDataset):

        @staticmethod
        def load(path: str, name: str):  # path is the Hugging Face dataset path
            dataset = load_dataset(path=path,
                                   name=name,
                                   trust_remote_code=True)
            split = 'test'
            raw_data = []
            for i in range(len(dataset[split])):
                question = dataset[split]['input'][i]
                context = dataset[split]['context'][i]
                answers = dataset[split]['answers'][i]
                raw_data.append({
                    'input': question,
                    'context': context,
                    'answers': answers
                })
            dataset[split] = Dataset.from_list(raw_data)
            return dataset
done
    _hint = f'هناك سؤال اختيار واحد. أجب عن السؤال بالرد على A أو B أو C أو D.'
    _prompt = f'يتعلق بـ {{subject}} \nالسؤال: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nالإجابة:'
    _round = [
        dict(role='HUMAN', prompt="هناك سؤال اختيار من متعدد. أجب عن السؤال بالرد A أو B أو C أو D.\nيتعلق بـ الجبر المجرد\nالسؤال: ابحث عن أقصى حد ممكن لترتيب بعض العناصر في Z_4 x Z_6.\n A.4\nB.6\nC.12\nD.24\nلنفكر خطوة بخطوة\nالإجابة:"),
Too many inline prompts hurt readability. I recommend importing these per-language prompts from a separate file, like: https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/MathBench/mathbench_prompt.py
done
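One way to follow the suggestion above is to keep the per-language strings in a separate module and look them up by language code. This is a minimal sketch: the module name `mmmlu_prompt.py`, the dictionary layout, and the helper `get_prompt` are assumptions for illustration and do not reproduce the referenced MathBench file. An English placeholder entry stands in for the translated strings.

```python
# mmmlu_prompt.py (hypothetical module name)
# Maps a language code to the hint and question template for that language.
MMMLU_PROMPTS = {
    'EN-US': {  # placeholder entry; real entries would hold translated text
        'hint': 'There is a single choice question. Answer the question '
                'by replying A, B, C or D.',
        'prompt': 'It is about {subject}\nQuestion: {input}\n'
                  'A. {A}\nB. {B}\nC. {C}\nD. {D}\nAnswer:',
    },
}


def get_prompt(lang: str) -> str:
    """Join the hint and the question template for one language."""
    entry = MMMLU_PROMPTS[lang]
    return entry['hint'] + '\n' + entry['prompt']
```

A config file would then import `get_prompt` and fill the template with `str.format`, which keeps the long literal strings out of the config itself.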
@@ -0,0 +1,5 @@

    categories = ['mmlu_AR-XY','mmlu_BN-BD','mmlu_DE-DE','mmlu_ES-LA','mmlu_FR-FR','mmlu_HI-IN','mmlu_ID-ID','mmlu_IT-IT','mmlu_JA-JP','mmlu_KO-KR','mmlu_PT-BR','mmlu_SW-KE','mmlu_YO-NG','mmlu_ZH-CN']

    mmlu_pro_summary_groups = [
Wrong group name: "mmlu_pro".
done
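For reference, a summary group in OpenCompass is typically a dict with a `name` and a list of `subsets`. Below is a minimal sketch of a group built from the `categories` list quoted above. It assumes the group is called `mmmlu` and that each dataset abbreviation equals its category string. Both are assumptions, since the PR's final naming is not shown in this thread.

```python
categories = [
    'mmlu_AR-XY', 'mmlu_BN-BD', 'mmlu_DE-DE', 'mmlu_ES-LA', 'mmlu_FR-FR',
    'mmlu_HI-IN', 'mmlu_ID-ID', 'mmlu_IT-IT', 'mmlu_JA-JP', 'mmlu_KO-KR',
    'mmlu_PT-BR', 'mmlu_SW-KE', 'mmlu_YO-NG', 'mmlu_ZH-CN',
]

# One group that averages over all language subsets.
# The group name 'mmmlu' is an assumed choice, not taken from the PR.
mmmlu_summary_groups = [
    {'name': 'mmmlu', 'subsets': list(categories)},
]
```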
    _prompt = f'يتعلق بـ {{subject}} \nالسؤال: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nالإجابة:'
    _round = [
        dict(role='HUMAN', prompt="هناك سؤال اختيار من متعدد. أجب عن السؤال بالرد A أو B أو C أو D.\nيتعلق بـ الجبر المجرد\nالسؤال: ابحث عن أقصى حد ممكن لترتيب بعض العناصر في Z_4 x Z_6.\n A.4\nB.6\nC.12\nD.24\nلنفكر خطوة بخطوة\nالإجابة:"),
        dict(role='BOT', prompt='C'),
This is not a CoT few-shot prompt
remove the cot_gen file
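For contrast with the snippet under review, a chain-of-thought few-shot example would have the BOT turn carry the reasoning before the final letter. This is a minimal sketch in English using the same question. The reasoning text and the helper `extract_choice` are illustrative assumptions, not code from this PR.

```python
import re

cot_round = [
    dict(role='HUMAN',
         prompt='There is a multiple choice question. Answer by replying '
                'A, B, C or D.\nIt is about abstract algebra\n'
                'Question: Find the maximum possible order for some element '
                'of Z_4 x Z_6.\nA. 4\nB. 6\nC. 12\nD. 24\n'
                "Let's think step by step.\nAnswer:"),
    # The BOT turn shows the reasoning first, then the final choice.
    dict(role='BOT',
         prompt='The order of (a, b) is lcm(order of a, order of b). '
                'The largest orders in Z_4 and Z_6 are 4 and 6, and '
                'lcm(4, 6) = 12. So the answer is C.'),
]


def extract_choice(text: str) -> str:
    """Return the last standalone A-D letter in a response, or ''."""
    matches = re.findall(r'\b([ABCD])\b', text)
    return matches[-1] if matches else ''
```

Taking the last standalone letter is one simple heuristic for reading the final answer out of a reasoning chain, and a real evaluator may need a stricter rule.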
Pull Request: Multilingual MMMLU Benchmark Evaluation Implementation
Motivation
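The existing MMLU implementation has limitations in multilingual settings. This PR therefore adds support for OpenAI's multilingual evaluation set so that model performance can be observed across tasks in different languages. The goal is to provide a way to evaluate multiple languages (such as Chinese, French, and Spanish).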
Modification
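This PR makes the following changes:
Adds multilingual support to the dataset layer, including downloading and preprocessing the corpora.
Implements a multilingual MMLU evaluation pipeline so that models can be evaluated in multiple languages.
Updates model evaluation and benchmarking to include multilingual evaluation metrics.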
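As an illustration of what per-language metrics can look like, the sketch below computes accuracy for each language and an unweighted mean across languages. The function name `per_language_accuracy` and the record layout are assumptions made for this example. The PR itself relies on the framework's evaluators and summarizer, which are not shown here.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Compute accuracy per language and the unweighted mean across languages.

    `records` is an iterable of (language, prediction, reference) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, pred, ref in records:
        total[lang] += 1
        correct[lang] += int(pred == ref)
    # Accuracy in percent for each language that has at least one record.
    scores = {lang: 100.0 * correct[lang] / total[lang] for lang in total}
    # Unweighted mean, so small language subsets count as much as large ones.
    average = sum(scores.values()) / len(scores) if scores else 0.0
    scores['average'] = average
    return scores
```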
BC-breaking (Optional)
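This change introduces no backward-incompatible changes. All existing APIs and methods remain available, and users can switch freely between the new multilingual features and the original functionality.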
Use cases (Optional)
This PR adds multilingual capability so that developers can evaluate tasks in multiple languages within one unified framework.
Checklist
Before PR:
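Pre-commit or other code-checking tools have been used to fix potential syntax issues.
Bug fixes are fully covered by unit tests, and the case that caused the bug has been added to the unit tests.
The modification is fully covered by unit tests. If not, please add more unit tests to ensure correctness.
The documentation has been updated accordingly, including docstrings or example tutorials.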
After PR:
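If the modification may affect downstream or other related projects, this PR has been tested with those projects.
The CLA has been signed, and all committers in this PR have signed the CLA.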