[Feature] Support MMMLU & MMMLU-lite Benchmark #1565

Status: Open (12 commits, base: main)

BobTsang1995

Pull Request: Multilingual MMMLU Benchmark Evaluation

Motivation
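In multilingual settings, the existing MMLU implementation has limitations. This PR therefore adds support for OpenAI's multilingual evaluation set, so that model performance can be observed across tasks in different languages. The goal is a method that can evaluate multiple languages (such as Chinese, French, and Spanish).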

Modification
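This PR makes the following changes:

Adds multilingual support to the dataset layer, including downloading and preprocessing the corpora.
Implements a multilingual MMLU evaluation pipeline, so that models can be evaluated in multiple languages.
Updates model evaluation and benchmarking with multilingual evaluation metrics.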
BC-breaking (Optional)
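This change introduces no backward-incompatible changes. All existing APIs and methods remain available, and users can switch freely between the new multilingual features and the existing functionality.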

Use cases (Optional)
This PR adds multilingual support, so developers can evaluate tasks in multiple languages within a single unified framework.
Checklist
Before PR:
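Pre-commit or other code-checking tools have been used to fix potential syntax issues.
The bug fix is fully covered by unit tests, and the case that caused the bug has been added to the unit tests.
The modification is fully covered by unit tests. If not, please add more unit tests to ensure correctness.
The documentation has been updated accordingly, including docstrings or example tutorials.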
After PR:

If the modification could affect downstream or other related projects, this PR has been tested with those projects.
The CLA has been signed by all committers on this PR.

@tonysy tonysy requested a review from liushz September 26, 2024 08:15
@tonysy tonysy changed the title mmmlu benchmark eval [Feature] Support MMMLU Benchmark Sep 26, 2024
Collaborator

The dataset config name should end with the generation type and a version index, e.g. "mmlu_5_shot_gen_xxx.py" or "mmlu_5_shot_ppl_xxx.py".

Author

done
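
The naming rule from the review can be checked mechanically. The following is a minimal sketch, assuming the "version index" is a short lowercase alphanumeric token (the `xxx` in the reviewer's example) and that `gen` and `ppl` are the only generation types. `parse_config_name` is a hypothetical helper for illustration, not part of OpenCompass.

```python
import re
from typing import Optional

# Assumed convention: <stem>_<gen|ppl>_<version>.py
_CONFIG_NAME = re.compile(
    r'^(?P<stem>\w+?)_(?P<kind>gen|ppl)_(?P<version>[0-9a-z]+)\.py$')


def parse_config_name(filename: str) -> Optional[dict]:
    """Return the name parts if `filename` follows the convention, else None."""
    match = _CONFIG_NAME.match(filename)
    return match.groupdict() if match else None


if __name__ == '__main__':
    for name in ('mmmlu_5_shot.py', 'mmmlu_5_shot_gen_abc123.py'):
        print(name, '->', parse_config_name(name))
```

An unversioned default entry point such as "mmmlu_gen.py" would not match this pattern and would need to be allowed separately.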

Collaborator

Same naming problem as in "mmmlu_5_shot.py".

Author

done

Collaborator

liushz commented Sep 27, 2024

Please add a default config named "mmmlu_gen.py" for chat model generation, with content like:

from mmengine.config import read_base

with read_base():
    from .mmmlu_gen_xxx import mmmlu_datasets  # noqa: F401, F403


    @staticmethod
    def load(path: str, name: str):
        path = get_data_path(path, local_mode=True)
Collaborator

A dataset that can be loaded from the Hugging Face Dataset Hub should be loaded with "datasets.load_dataset()". A template looks like this:

from datasets import Dataset, load_dataset

from opencompass.registry import LOAD_DATASET

from ..base import BaseDataset


@LOAD_DATASET.register_module()
class LongBench2wikimqaDataset(BaseDataset):

    @staticmethod
    def load(path: str, name: str):  # path is the Hugging Face dataset path
        dataset = load_dataset(path=path,
                               name=name,
                               trust_remote_code=True)
        split = 'test'
        raw_data = []
        for i in range(len(dataset[split])):
            question = dataset[split]['input'][i]
            context = dataset[split]['context'][i]
            answers = dataset[split]['answers'][i]
            raw_data.append({
                'input': question,
                'context': context,
                'answers': answers
            })
        dataset[split] = Dataset.from_list(raw_data)
        return dataset

Author

done

_hint = f'هناك سؤال اختيار واحد. أجب عن السؤال بالرد على A أو B أو C أو D.'
_prompt = f'يتعلق بـ {{subject}} \nالسؤال: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nالإجابة:'
_round = [
dict(role='HUMAN', prompt="هناك سؤال اختيار من متعدد. أجب عن السؤال بالرد A أو B أو C أو D.\nيتعلق بـ الجبر المجرد\nالسؤال: ابحث عن أقصى حد ممكن لترتيب بعض العناصر في Z_4 x Z_6.\n A.4\nB.6\nC.12\nD.24\nلنفكر خطوة بخطوة\nالإجابة:"),
Collaborator

Keeping this many prompts inline hurts readability. I recommend importing the per-language prompts from a separate file, as in: https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/MathBench/mathbench_prompt.py

Author

done
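
One way to follow this suggestion is to keep all language-specific strings in a single lookup table that the dataset config imports. The sketch below is an assumption about how such a module could look, not the code merged in this PR: `PROMPTS` and `build_prompt` are hypothetical names, only the Arabic entry from the diff is shown, and `str.format` stands in for whatever templating the config actually uses.

```python
# Hypothetical shared prompt module (for example a file like mmmlu_prompt.py).
# Keys follow the language codes in the dataset names, e.g. 'AR-XY'.
PROMPTS = {
    'AR-XY': {
        'hint': 'هناك سؤال اختيار واحد. أجب عن السؤال بالرد على A أو B أو C أو D.',
        'template': ('يتعلق بـ {subject} \nالسؤال: {input}\n'
                     'A. {A}\nB. {B}\nC. {C}\nD. {D}\nالإجابة:'),
    },
    # Other languages would be added here, one entry per language code.
}


def build_prompt(lang: str, **fields: str) -> str:
    """Fill the hint and question template for one language code."""
    entry = PROMPTS[lang]  # raises KeyError for an unsupported language
    return entry['hint'] + '\n' + entry['template'].format(**fields)


if __name__ == '__main__':
    print(build_prompt('AR-XY', subject='algebra', input='1 + 1 = ?',
                       A='1', B='2', C='3', D='4'))
```

With this layout the config file only iterates over language codes, and adding a language means adding one table entry rather than another block of inline strings.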

@@ -0,0 +1,5 @@
categories = ['mmlu_AR-XY','mmlu_BN-BD','mmlu_DE-DE','mmlu_ES-LA','mmlu_FR-FR','mmlu_HI-IN','mmlu_ID-ID','mmlu_IT-IT','mmlu_JA-JP','mmlu_KO-KR','mmlu_PT-BR','mmlu_SW-KE','mmlu_YO-NG','mmlu_ZH-CN']

mmlu_pro_summary_groups = [
Collaborator

Wrong group name: "mmlu_pro".

Author

done
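
For reference, a renamed summary group could look like the sketch below. It assumes the `name`/`subsets` dictionary layout used by other OpenCompass summarizer group files, and it assumes the subset abbreviations are identical to the category names. The abbreviations actually used by the dataset configs in this PR may differ.

```python
categories = [
    'mmlu_AR-XY', 'mmlu_BN-BD', 'mmlu_DE-DE', 'mmlu_ES-LA', 'mmlu_FR-FR',
    'mmlu_HI-IN', 'mmlu_ID-ID', 'mmlu_IT-IT', 'mmlu_JA-JP', 'mmlu_KO-KR',
    'mmlu_PT-BR', 'mmlu_SW-KE', 'mmlu_YO-NG', 'mmlu_ZH-CN',
]

# The group is named after this benchmark ("mmmlu") instead of "mmlu_pro".
# Assumption: each subset abbreviation equals its category name.
mmmlu_summary_groups = [
    {'name': 'mmmlu', 'subsets': list(categories)},
]
```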

_prompt = f'يتعلق بـ {{subject}} \nالسؤال: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nالإجابة:'
_round = [
dict(role='HUMAN', prompt="هناك سؤال اختيار من متعدد. أجب عن السؤال بالرد A أو B أو C أو D.\nيتعلق بـ الجبر المجرد\nالسؤال: ابحث عن أقصى حد ممكن لترتيب بعض العناصر في Z_4 x Z_6.\n A.4\nB.6\nC.12\nD.24\nلنفكر خطوة بخطوة\nالإجابة:"),
dict(role='BOT', prompt='C'),
Collaborator

This is not a CoT few-shot prompt

Author

Removed the cot_gen file.
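
To illustrate the reviewer's point: in a chain-of-thought few-shot exemplar the BOT turn contains the reasoning followed by the answer, whereas the exemplar in the diff answers with the bare letter. The sketch below contrasts the two. The question text is an English rendering of the Arabic example from the diff, and the reasoning sentence is written here for illustration only.

```python
QUESTION = (
    'Find the maximum possible order for some element of Z_4 x Z_6.\n'
    'A. 4\nB. 6\nC. 12\nD. 24\n'
)

# Direct-answer exemplar: the BOT turn is only the option letter.
direct_round = [
    dict(role='HUMAN', prompt=QUESTION + 'Answer:'),
    dict(role='BOT', prompt='C'),
]

# CoT exemplar: the BOT turn demonstrates the reasoning, then the answer.
cot_round = [
    dict(role='HUMAN',
         prompt=QUESTION + "Let's think step by step.\nAnswer:"),
    dict(role='BOT',
         prompt=('The order of (a, b) in Z_4 x Z_6 is the least common '
                 'multiple of the orders of a and b. The largest possible '
                 'value is lcm(4, 6) = 12. So the answer is C.')),
]
```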

@liushz liushz changed the title [Feature] Support MMMLU Benchmark [Feature] Support MMMLU & MMMLU-lite Benchmark Oct 8, 2024
4 participants