[Feature] Support MMMLU & MMMLU-lite Benchmark #1565

BobTsang1995 · 2024-09-26T08:14:26Z

Pull Request: Multilingual MMMLU benchmark evaluation implementation

Motivation
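The existing MMLU implementation is limited in multilingual settings. This PR therefore adds support for OpenAI's multilingual evaluation set so that model performance can be observed on tasks in different languages. The goal is to provide a way to evaluate multiple languages (such as Chinese, French, and Spanish).

Modification
This PR makes the following changes:

Adds multilingual support to the dataset layer, including corpus download and preprocessing.
Implements a multilingual MMLU evaluation pipeline so that models can be evaluated in multiple languages.
Updates model evaluation and benchmarking with multilingual evaluation metrics.
BC-breaking (Optional)
This change introduces no backward-incompatible changes. All existing APIs and methods remain available, and users can switch freely between the new multilingual features and the original ones.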

Use cases (Optional)
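This PR adds multilingual capability, allowing developers to evaluate tasks in multiple languages within one unified framework.
Checklist
Before PR:
Pre-commit or other linting tools have been used to fix potential lint issues.
Bug fixes are fully covered by unit tests, and the case that caused the bug has been added to the unit tests.
The modification is fully covered by unit tests. If not, please add more unit tests to ensure correctness.
The documentation has been updated accordingly, including docstrings or example tutorials.
After PR:

If the modification may affect downstream or other related projects, this PR has been tested with those projects.
The CLA has been signed by all committers in this PR.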

liushz · 2024-09-27T09:15:16Z

opencompass/configs/datasets/mmmlu/mmmlu_5_shot.py

Dataset config names should end with the generation type and a version index, like "mmlu_5_shot_gen/ppl_xxx.py".
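The naming rule in this comment can be checked mechanically. The following is a minimal sketch, assuming the suffix is `_gen_` or `_ppl_` followed by an alphanumeric version index; the exact format of the index is not stated in this thread, and the helper name `is_valid_config_name` is hypothetical.

```python
import re

# Assumed pattern: <dataset>_<gen|ppl>_<version index>.py
# The alphanumeric version index is an assumption, not a documented rule.
_CONFIG_NAME = re.compile(r'^[A-Za-z0-9_]+_(gen|ppl)_[A-Za-z0-9]+\.py$')


def is_valid_config_name(filename: str) -> bool:
    """Return True if the name ends with a generation type and a version index."""
    return _CONFIG_NAME.match(filename) is not None
```

Under this assumed pattern, `mmmlu_5_shot.py` and `mmmlu_5_shot_cot.py` are both rejected. A default entry config such as `mmmlu_gen.py` carries no version index and would not match either, so it would need to be handled separately.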

liushz · 2024-09-27T09:15:59Z

opencompass/configs/datasets/mmmlu/mmmlu_5_shot_cot.py

Same naming error as "mmmlu_5_shot.py"

liushz · 2024-09-27T09:19:07Z

Please add a default config named "mmmlu_gen.py" for chat model generation, with content like:

from mmengine.config import read_base

with read_base():
    from .mmmlu_gen_xxx import mmmlu_datasets  # noqa: F401, F403

liushz · 2024-09-27T09:32:30Z

opencompass/datasets/mmmlu.py

+
+    @staticmethod
+    def load(path: str, name: str):
+        path = get_data_path(path, local_mode=True)


Datasets that can be loaded from the Hugging Face Dataset Hub should be loaded with "datasets.load_dataset()". A template looks like:

from datasets import Dataset, load_dataset

from opencompass.registry import LOAD_DATASET

from ..base import BaseDataset


@LOAD_DATASET.register_module()
class LongBench2wikimqaDataset(BaseDataset):

    @staticmethod
    def load(path: str, name: str):
        # path is the Hugging Face dataset path
        dataset = load_dataset(path=path, name=name, trust_remote_code=True)
        split = 'test'
        raw_data = []
        for i in range(len(dataset[split])):
            question = dataset[split]['input'][i]
            context = dataset[split]['context'][i]
            answers = dataset[split]['answers'][i]
            raw_data.append({
                'input': question,
                'context': context,
                'answers': answers
            })
        dataset[split] = Dataset.from_list(raw_data)
        return dataset

liushz · 2024-09-27T09:37:26Z

opencompass/configs/datasets/mmmlu/mmmlu_5_shot.py

+        _hint = f'هناك سؤال اختيار واحد. أجب عن السؤال بالرد على A أو B أو C أو D.'
+        _prompt = f'يتعلق بـ {{subject}} \nالسؤال: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nالإجابة:'
+        _round = [
+                dict(role='HUMAN', prompt="هناك سؤال اختيار من متعدد. أجب عن السؤال بالرد A أو B أو C أو D.\nيتعلق بـ الجبر المجرد\nالسؤال: ابحث عن أقصى حد ممكن لترتيب بعض العناصر في Z_4 x Z_6.\n A.4\nB.6\nC.12\nD.24\nلنفكر خطوة بخطوة\nالإجابة:"),


Too many prompts will affect readability, and I recommend importing these different language prompts from outside, like: https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/MathBench/mathbench_prompt.py
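One way to follow this suggestion is to keep the per-language strings in a separate module and look them up by language code. The sketch below is illustrative only: the table name `MMMLU_PROMPTS` and the helper `get_prompt` are hypothetical, and only the Arabic entry is filled in, copied from the diff under review.

```python
# Hypothetical prompt module kept next to the dataset configs.
# Only the Arabic entry is shown; other languages would be added the same way.
MMMLU_PROMPTS = {
    'AR-XY': {
        'hint': 'هناك سؤال اختيار واحد. أجب عن السؤال بالرد على A أو B أو C أو D.',
        'prompt': 'يتعلق بـ {subject} \nالسؤال: {input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nالإجابة:',
    },
}


def get_prompt(lang: str) -> dict:
    """Return the hint and prompt templates registered for a language code."""
    if lang not in MMMLU_PROMPTS:
        raise KeyError(f'No MMMLU prompt registered for language {lang!r}')
    return MMMLU_PROMPTS[lang]
```

The dataset config could then import the table and build each language's template in a loop, which keeps the config short and makes each prompt easy to review in isolation.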

liushz · 2024-09-27T09:43:53Z

opencompass/configs/summarizers/groups/mmmlu.py

@@ -0,0 +1,5 @@
+categories = ['mmlu_AR-XY','mmlu_BN-BD','mmlu_DE-DE','mmlu_ES-LA','mmlu_FR-FR','mmlu_HI-IN','mmlu_ID-ID','mmlu_IT-IT','mmlu_JA-JP','mmlu_KO-KR','mmlu_PT-BR','mmlu_SW-KE','mmlu_YO-NG','mmlu_ZH-CN']
+
+mmlu_pro_summary_groups = [


Wrong group name "mmlu_pro"
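A renamed group file might look like the sketch below. It assumes each summary group is a dict with `name` and `subsets` keys; the variable name `mmmlu_summary_groups` and the group name `'mmmlu'` are assumptions for illustration, not the code that was merged.

```python
# Category names copied from the diff under review.
categories = [
    'mmlu_AR-XY', 'mmlu_BN-BD', 'mmlu_DE-DE', 'mmlu_ES-LA', 'mmlu_FR-FR',
    'mmlu_HI-IN', 'mmlu_ID-ID', 'mmlu_IT-IT', 'mmlu_JA-JP', 'mmlu_KO-KR',
    'mmlu_PT-BR', 'mmlu_SW-KE', 'mmlu_YO-NG', 'mmlu_ZH-CN',
]

# Assumed structure: one group that aggregates every language subset.
mmmlu_summary_groups = [
    {'name': 'mmmlu', 'subsets': list(categories)},
]
```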

liushz · 2024-09-27T10:12:21Z

opencompass/configs/datasets/mmmlu/mmmlu_5_shot_cot.py

+        _prompt = f'يتعلق بـ {{subject}} \nالسؤال: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nالإجابة:'
+        _round = [
+                dict(role='HUMAN', prompt="هناك سؤال اختيار من متعدد. أجب عن السؤال بالرد A أو B أو C أو D.\nيتعلق بـ الجبر المجرد\nالسؤال: ابحث عن أقصى حد ممكن لترتيب بعض العناصر في Z_4 x Z_6.\n A.4\nB.6\nC.12\nD.24\nلنفكر خطوة بخطوة\nالإجابة:"),
+                dict(role='BOT', prompt='C'),


This is not a CoT few-shot prompt
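For contrast, a CoT few-shot round would place the worked reasoning in the BOT turn before the final letter, rather than the bare answer 'C'. The sketch below uses an English rendering of the same abstract-algebra question from the diff; the reasoning text is written here for illustration and is not taken from the PR.

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


QUESTION = (
    'There is a single-choice question. Answer by replying A, B, C or D.\n'
    'It is about abstract algebra.\n'
    'Question: Find the maximum possible order for some element of Z_4 x Z_6.\n'
    'A. 4\nB. 6\nC. 12\nD. 24\n'
    "Let's think step by step.\nAnswer:"
)

# The order of (a, b) is the lcm of the orders of a and b, so the maximum
# over Z_4 x Z_6 is lcm(4, 6).
REASONING = (
    'The order of an element (a, b) is the least common multiple of the '
    'orders of a and b. The largest orders available are 4 in Z_4 and 6 in '
    f'Z_6, so the maximum order is lcm(4, 6) = {lcm(4, 6)}.'
)

cot_round = [
    dict(role='HUMAN', prompt=QUESTION),
    # The BOT turn shows the reasoning first, then the final choice.
    dict(role='BOT', prompt=REASONING + '\nAnswer: C'),
]
```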

remove the cot_gen file

…tmp_mmmlu

rm folder

d8cc738

mm-assistant bot assigned acylam Sep 26, 2024

tonysy requested a review from liushz September 26, 2024 08:15

BobTsang1995 temporarily deployed to prod September 26, 2024 08:15 — with GitHub Actions Inactive

tonysy changed the title ~~mmmlu benchmark eval~~ [Feature] Support MMMLU Benchmark Sep 26, 2024

liushz reviewed Sep 27, 2024


BobTsang added 5 commits September 27, 2024 21:24

modify format according to reviewer

811a187

modify format according to reviewer

3f9f688

modify format according to reviewer

f36bf5d

add some files requirement

71deab7

fix some bug

b19eeef

BobTsang1995 temporarily deployed to prod September 30, 2024 07:35 — with GitHub Actions Inactive

BobTsang added 2 commits October 7, 2024 19:27

fix bug

5247d17

change load type

a916e52

BobTsang1995 temporarily deployed to prod October 8, 2024 03:06 — with GitHub Actions Inactive

liushz added 2 commits October 8, 2024 07:43

Update MMMLU Dataset

48a0d9c

Merge branch 'mmmlu_dev' of github.com:BobTsang1995/opencompass into …

7b52277

…tmp_mmmlu

MaiziXiao temporarily deployed to prod October 8, 2024 07:54 — with GitHub Actions Inactive

liushz added 2 commits October 8, 2024 11:22

Update MMMLU Dataset

6cab94d

Add MMMLU-Lite Dataset

b759550

MaiziXiao temporarily deployed to prod October 8, 2024 12:51 — with GitHub Actions Inactive

liushz changed the title ~~[Feature] Support MMMLU Benchmark~~ [Feature] Support MMMLU & MMMLU-lite Benchmark Oct 8, 2024
