Releases · open-compass/opencompass

22 Sep 11:25

0.1.5

9b21613

OpenCompass v0.1.5

Dive into our newly improved features, bug fixes, and most notably our enhanced dataset support, coming together to refine your experience.

🆕 Highlights:

Boosted Dataset Integrations: This release paves the way for support on numerous datasets like ds1000, promptbench, antropics evals, kaoshi, and many more, making OpenCompass more versatile than ever.
More Evaluation Types: We starts integrating subjective and agent-adied LLM evaluation into OpenCompass. Stay tuned!

Explore the detailed changes:

🌟 New Features:

📦 New Datasets and Features:
- ds1000 dataset support (#395)
- promptbench dataset implementation (#239)
- antropics evals dataset support (#422)
- kaoshi dataset introduction (#392)
- Initial support for subjective evaluation (#421)
- Support for GSM8k evaluation tools (#277)
- scibench evaluation added (#393)

📖 Documentation:

News updates and introduction figure in README (#375, #413)
Updated get_started.md and fixed naming issues (#377, #380)
New FAQ section added (#384)
README addition in longeval (#389)
Multimodal documentation introduced (#334)

🛠️ Bug Fixes:

Addressed a potential OOM issue (#387)
Added has_image fix to scienceqa (#391)
Resolved performance issues of visualglm (#424)
Debug logger fix for summarizer (#417)
Addressed errors in keep keys (#431)

⚙ Enhancements and Refactors:

Refinement in docs and codes for better user guidance (#409)
Custom summarizer argument added in CLI mode (#411)
mlugowl llamaadapter introduced (#405)
Enhanced mm models support on public datasets (#412)
Customized config path support (#423)

🎉 New Contributors:

A heartfelt welcome to our first-time contributors:

@wangxidong06 (First PR)
@so2liu (First PR)
@HoBeedzc (First PR)
@CuteyThyme (First PR)
@chenbohua3 (First PR)

To all contributors, old and new, thank you for continually enhancing OpenCompass! Your efforts are deeply valued. 🙌 🎉

If you love OpenCompass, don't forget to star 🌟 our GitHub repository! Your feedback, reviews, and contributions immensely help in shaping the product.

Changelog

[Doc] Update News by @tonysy in #375
Update get_started.md by @liushz in #377
[CI] Publish to Pypi by @gaotongxiao in #366
[Docs] Fix incorrect name in get_started by @gaotongxiao in #380
fix potential OOM issue by @cdpath in #387
[Docs] Add FAQ by @gaotongxiao in #384
Add CMB by @wangxidong06 in #376
[Fix]: Add has_image to scienceqa by @YuanLiuuuuuu in #391
[Feat] support ds1000 dataset by @yingfhu in #395
[Feat] implementation for support promptbench by @yingfhu in #239
[Feat] refine docs and codes for more user guides by @yingfhu in #409
[Docs] Readme in longeval by @philipwangOvO in #389
feat: add custom summarizer argument in CLI run mode 在CLI启动模式中添加自定义Summarizer参数 by @so2liu in #411
Yhzhang/add mlugowl llamaadapter by @ZhangYuanhan-AI in #405
[Feat] Support mm models on public dataset and fix several issues. by @yyk-wew in #412
[Docs] Add intro figure to README by @gaotongxiao in #413
[fix] summarizer debug logger by @HoBeedzc in #417
[Doc] Update news by @Leymore in #420
[Feature] Use local accuracy from hf implements by @Leymore in #416
[Feat] support antropics evals dataset by @yingfhu in #422
[Fix] Fix performance issue of visualglm. by @yyk-wew in #424
[Feature] Log gold answer in prediction output by @gaotongxiao in #419
Support GSM8k evaluation with tools by Lagent and LangChain by @mzr1996 in #277
[Sync] Initial support of subjective evaluation by @gaotongxiao in #421
[Fix] P0: errors in keep keys by @gaotongxiao in #431
add evaluation of scibench by @CuteyThyme in #393
[Feature] Add kaoshi dataset by @liushz in #392
[Docs] Add multimodal docs by @fangyixiao18 in #334
support customize config path by @chenbohua3 in #423

Full Changelog: 0.1.4...0.1.5

Contributors

tonysy, so2liu, and 15 other contributors

Assets 2

08 Sep 13:18

gaotongxiao

0.1.4

c7a8b8f

OpenCompass v0.1.4

OpenCompass v0.1.4 is here with an array of features, documentation improvements, and key fixes! Dive in to see what's in store:

🆕 Highlights:

More Tools and Features: OpenCompass continues to expand its repertoire with the addition of tools like update suffix, codellama, preds collection tools, qwen & qwen-chat support, and more. Not forgetting our attention to Otter and the MMBench Evaluation!
Documentation Facelift: We've made several updates to our documentation, ensuring it stays relevant, user-friendly, and aesthetically pleasing.
Essential Bug Fixes: We’ve tackled numerous bugs, especially those concerning tokens, triviaqa, nq postprocess, and qwen config.
Enhancements: From simplifying execution logic to suppressing warnings, we’re always on the lookout for ways to improve our product.

Dive deeper to learn more:

🌟 New Features:

📦 Tools and Integrations:

Application of update suffix tool (#280).
Support for codellama and preds collection tools (#335).
Addition of qwen & qwen-chat support (#286).
Introduction of Otter to OpenCompass MMBench Evaluation (#232).
Support for LLaVA and mPLUG-Owl (#331).

🛠 Utilities and Functionality:

Enhanced sample count in prompt_viewer (#273).
Ignored ZeroRetriever error when id_list provided (#340).
Improved default task size (#360).

📝 Documentation:

Updated communication channels: WeChat and Discord (#328).
Documentation theme revamped for a fresh look (#332).
Detailed documentation for the new entry script (#246).
MMBench documentation updated (#336).

🛠️ Bug Fixes:

Resolved issue when missing both pad and eos token (#287).
Addressed triviaqa & nq postprocess glitches (#350).
Fixed qwen configuration inaccuracies (#358).
Default value added for zero retriever (#361).

⚙ Enhancements and Refactors:

Streamlined execution logic in run.py and ensured temp files cleanup (#337).
Suppressed unnecessary warnings raised by get_logger (#353).
Import checks of multimodal added (#352).

🎉 New Contributors:

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@Luodian (First PR)
@ZhangYuanhan-AI (First PR)
@HAOCHENYE (First PR)

Thank you to the entire community for pushing OpenCompass forward. Make sure to star 🌟 our GitHub repository if OpenCompass aids your endeavors! We treasure your feedback and contributions.

Changelog

[Feature] Add and apply update suffix tool by @Leymore in #280
support sample count in prompt_viewer by @cdpath in #273
docs: update wechat and discord by @vansin in #328
[Docs] Update doc theme by @gaotongxiao in #332
[Feat] support codellama and preds collection tools by @yingfhu in #335
[Feature] Add qwen & qwen-chat support by @Leymore in #286
[Feat] Add Otter to OpenCompass MMBench Evaluation by @Luodian in #232
[Docs] Update docs for new entry script by @gaotongxiao in #246
[Fix] Fix when missing both pad and eos token by @Leymore in #287
[Doc] Update MMBench.md by @kennymckormick in #336
[Feat] Support LLaVA and mPLUG-Owl by @ZhangYuanhan-AI in #331
[Feature] Ignore ZeroRetriever error when id_list provided by @Leymore in #340
[Enhance] Add import check of multimodal by @fangyixiao18 in #352
[Sync] [Enhancement] Simplify execution logic in run.py; use finally to clean up temp files by @gaotongxiao in #337
[Fix] Fix triviaqa & nq postprocess by @Leymore in #350
[Enhance] Supress warning raised by get_logger by @HAOCHENYE in #353
[Fix] Update qwen config by @Leymore in #358
[Fix] zero retriever add default value by @Leymore in #361
[Enhancement] Increase default task size by @gaotongxiao in #360
[Fix] Quick lint fix by @Leymore in #362
[Docs] update code evaluator docs by @yingfhu in #354
[Feat] support wizardcoder series by @yingfhu in #344
[Feat] Support Qwen-VL-Chat on MMBench. by @yyk-wew in #312
[Feature] Update claude2 postprocessor by @gaotongxiao in #365
[Doc] Update Overview by @tonysy in #242
[Feat] Update URL by @tonysy in #368
[Feature] Update llama2 implement by @Leymore in #372
[Feature] Add open source dataset eval config of instruct-blip by @fangyixiao18 in #370
[Fix] Update bbh implement & Fix bbh suffix by @Leymore in #371
[Feaure] Add new models: baichuan2, tigerbot, vicuna v1.5 by @Leymore in #373
Bump version to 0.1.4 by @gaotongxiao in #367

For an exhaustive list of changes, kindly check our Full Changelog.

Contributors

tonysy, cdpath, and 10 other contributors

Assets 2

25 Aug 10:56

gaotongxiao

0.1.3

b2d602f

OpenCompass v0.1.3

OpenCompass keeps getting better! v0.1.3 brings a variety of enhancements, new features, and crucial fixes. Here’s a summary of what we've packed into this release:

🆕 Highlights:

Extended Dataset Support: OpenCompass now integrates a broader range of public datasets, including but not limited to adv_glue, codegeex2, Humanevalx, SEED-Bench, LongBench, and LEval. We aim to provide extensive coverage to cater to a variety of research needs.
Utility Additions: From the inclusion of multi-modal evaluations on MME benchmark to the Tree-of-Thought method, this release comes packed with functionality enhancements.
Bug Extermination: Your feedback helps us grow. We’ve squashed a series of bugs to improve your experience.
More Evaluation Benchmark for Multimodal Models. We support another 10 evaluation benchmarks for multimodal models, including COCO Caption and ScienceQA, and provide corresponding evaluation code.

Let's delve deeper into what's new:

🌟 New Features:

📦 Extended Dataset Support:

Introduction of other public datasets (#206, #214).
Support for adv_glue dataset focused on adversarial robustness (#205).
Added codegeex2, Humanevalx (#210).
Integration of SEED-Bench (#203).
LongBench support (#236).
Reconstruct LEval dataset (#266).
Support another 10 public evaluation benchmarks for multimodal models (#214)

🛠 Utilities and Functionality:

Launch script added for ease of operations (#222).
Multi-modal evaluation on MME benchmark (#197).
Support for visualglm and llava on MMBench evaluation (#211).
Tree-of-Thought method introduced (#173).
Introduction of llama2 native implementations (#235).
Flamingo and Claude support added (#258, #253).

📝 Documentation:

Navigation bar language type updated for better clarity (#212).
News updates for keeping users informed (#241, #243).
Summarizer documentation added (#231).

🛠️ Bug Fixes:

Addressed an issue with multiple rounds of inference using mm_eval (#201).
Miscellaneous fixes such as name adjustments, requirements, and bin_trim corrections (#223, #229, #237).
Local runner debug issue fixed (#238).
Resolved bugs for PeftModel generate (#252).

⚙ Enhancements and Refactors:

Refactored instructblip for better performance and readability (#227).
Improved crowspairs postprocess (#251).
Optimization to use sympy only when necessary (#255).

🎉 New Contributors:

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@yyk-wew (First PR)
@fangyixiao18 (First PR)
@philipwangOvO (First PR)
@cdpath (First PR)

Thank you to our dedicated contributors for making OpenCompass even more comprehensive and user-friendly! 🙌 🎉

Remember to star 🌟 our GitHub repository if you find OpenCompass helpful! Your feedback and contributions are invaluable.

Change log

[Fix] Fix bugs of multiple rounds of inference when using mm_eval by @yyk-wew in #201
[Feature]: Add other public datasets by @YuanLiuuuuuu in #206
[Doc] Update Navigation bar language type by @Ezra-Yu in #212
[Feat] support adv_glue dataset for adversarial robustness by @yingfhu in #205
[Feat] Add codegeex2 and Humanevalx by @Ezra-Yu in #210
[Feature]: Add other public datasets config by @YuanLiuuuuuu in #214
[Feature] Support SEED-Bench by @fangyixiao18 in #203
[Feature]: Add launch script by @YuanLiuuuuuu in #222
[Fix]: Fix name by @YuanLiuuuuuu in #223
[Fix] requirements by @gaotongxiao in #229
[Dataset] LongBench by @philipwangOvO in #236
[Fix] bin_trim by @philipwangOvO in #237
[Feat] Support multi-modal evaluation on MME benchmark. by @yyk-wew in #197
[Feat] Support visualglm and llava for MMBench evaluation. by @yyk-wew in #211
[Fix] fix local runner debug by @Leymore in #238
Update News by @tonysy in #241
[Doc]update news by @tonysy in #243
Update run.py by @liushz in #247
[Doc] Add summarizer doc by @Leymore in #231
[Feature] Add llama2 native implements by @Leymore in #235
[Feature] Add Tree-of-Thought method by @liushz in #173
[Refactor] Refactor instructblip by @fangyixiao18 in #227
[Enhancement] Update crowspairs postprocess by @gaotongxiao in #251
[Fix] use sympy only when necessary by @gaotongxiao in #255
Update .owners.yml by @tonysy in #261
[Fix] Fix bugs for PeftModel generate by @LZHgrla in #252
[Feature]: Add Flamingo by @YuanLiuuuuuu in #258
[Feature] Add Claude support by @gaotongxiao in #253
[Dataset] Reconstruct LEval by @philipwangOvO in #266
[Feature]: Verify the acc of these public datasets by @YuanLiuuuuuu in #269
- [Feat] Support public dataset of visualglm and llava. by @yyk-wew in #265
[Fix] wrong path in dataset collections by @gaotongxiao in #272
[Fix] update descriptions of tools by @cdpath in #270
[Feature] Support model-bound prediction postprocessor, use it in Claude by @gaotongxiao in #268
[Feature] Simplify entry script by @gaotongxiao in #204
Update README.md by @tonysy in #262

For a complete list of changes, please refer to our Full Changelog.

Contributors

tonysy, cdpath, and 10 other contributors

Assets 2

11 Aug 10:45

gaotongxiao

0.1.2

4fc1701

OpenCompass v0.1.2

This release continues the evolution of OpenCompass, bringing a mix of new features, optimizations, documentation improvements, and bug fixes.

🆕Highlights

🏆 Leaderboard: The evaluation results of Qwen-7B, XVERSE-13B, LLaMA-2, and GPT-4 has been posted to our leaderboard. Now it's also possible to conduct model comparison online. We hope this feature offers deeper insights!

📊 Datasets: Introduction of Xiezhi, SQuAD2.0, ANLI, LEval datasets, and more for diverse applications. (#101, #192) Add datasets related to safety to collections. [#185]

🎭New modality: Support for MMBench is introduced, and the evaluation of multi-modal models is on the way! (#56 ,#161) Besides, Intern language model is introduced. (#51)

⚙️Enhancement: Several enhancements on OpenAI models, including key deprecation, temperature setting, etc. [#121] [#128] Supporting multiple tasks on one GPU, filtering messages by levels, and more. [#148] [#187]

📝 Documentation: Comprehensive updates and fixes across READMEs, issue templates, prompt docs, metric documentation, and more.

🛠️ Bug Fixes: Including seed fixes in HFEvaluator, addressing issues in AGIEval multiple choice questions, and more. [#122] [#137]

🎉 New Contributors

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@go-with-me000 (First Contribution)
@anakin-skywalker-Joseph (First Contribution)
@zhouzaida (First Contribution)
@dependabot (First Contribution)

Changelog

[Feat] add auto assignee bot by @yingfhu in #105
[Doc] Update Readme and Fix failed links by @Ezra-Yu in #108
Doc: add twitter link by @vansin in #111
Support intern lanuage model by @go-with-me000 in #51
[Docs] Update issue templates for proper guidance to discussions by @gaotongxiao in #116
[Feature] Allow explicitly setting the temperature for API model by @kennymckormick in #121
[Fix] Fix seed in HFEvaluator by @kennymckormick in #122
[Feature] Update SC by @Leymore in #126
说明文档标题修改 by @anakin-skywalker-Joseph in #125
[Docs] Update prompt docs by @Leymore in #46
[Enhancement] Update README.md by @tonysy in #119
[DOC] Add metric doc by @Ezra-Yu in #118
[Feature] Evaluating acc based on minimum edit distance, update SIQA by @gaotongxiao in #130
[Feature] Several enhancements by @gaotongxiao in #142
[Doc] update acknowledgements by @Leymore in #147
Fix typo in readme by @zhouzaida in #152
[Feature]: Use multimodal by @YuanLiuuuuuu in #73
[Refine] Refine PR #122 by @kennymckormick in #123
[Enhancement] Optimize OpenAI models by @gaotongxiao in #128
Update pre-commit ignore-word list by @gaotongxiao in #162
[Script] Add scripts to evaluate MMBench by @kennymckormick in #161
[Doc] Update Readme by @tonysy in #165
[Feature]: Add mm suport for local runner by @YuanLiuuuuuu in #169
Calculate max_out_len without hard code for OpenAI model by @zhouzaida in #158
[API] Refine OpenAI by @kennymckormick in #175
[Fix] Use a copy of the config object in Task by @gaotongxiao in #174
Bump requests from 2.28.1 to 2.31.0 by @dependabot in #178
[Fix] Fix AGIEval multiple choice by @Leymore in #137
[Feature]: Refactor input and output by @YuanLiuuuuuu in #176
[Feature] Add Xiezhi SQuAD2.0 ANLI by @Leymore in #101
[Feature] Support turbomind by @tonysy in #166
[Enhancement] Add humaneval postprocessor for GPT models & eval config for GPT4, enhance the original humaneval postprocessor by @gaotongxiao in #129
[Fix] Fix some sc errors by @liushz in #177
Fix meta template & unit tests by @gaotongxiao in #170
[Feature] Support CUDA_VISIBLE_DEVICES and multiple tasks on one GPU by @mzr1996 in #148
[Docs] Enhance issue template by @gaotongxiao in #183
Skip invalid keys to avoid requesting API by @zhouzaida in #184
[Feature] update news by @tonysy in #186
[Feature] Support filtering specified levels message by @zhouzaida in #187
[Feat] add safety to collections by @yingfhu in #185
[Docs] Update contribution guide & toc, improve user experience by @gaotongxiao in #188
[Feature] add llama-oriented dataset configs by @Leymore in #82
[Feat] update postprocessor to get first option more accurately by @yingfhu in #193
[Feature] Add LEval datasets by @gaotongxiao in #192
Bump version to 0.1.2 by @gaotongxiao in #190
[Fix] fix bug for postprocessor by @yingfhu in #195
[Doc] update readme by @Leymore in #196

Full Changelog: 0.1.1...0.1.2

Contributors

tonysy, Leymore, and 12 other contributors

Assets 2

26 Jul 07:11

Leymore

0.1.1

b7184e9

v0.1.1

Add some more datasets.

AGIEval
anli
cmmlu
jigsawmultilingual
realtoxicprompts
SQuAD2.0
TheoremQA
triviaqa
xiezhi
Xsum

Assets 3

06 Jul 07:23

Leymore

0.1.0

d1025c3

v0.1.0

First release with some datasets.

ARC
BBH
ceval
CLUE
FewCLUE
GAOKAO-BENCH
LCSTS
math
mbpp
mmlu
nq
summedits
SuperGLUE

Assets 3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🆕 Highlights:

🌟 New Features:

📖 Documentation:

🛠️ Bug Fixes:

⚙ Enhancements and Refactors:

🎉 New Contributors:

Changelog

Contributors

🆕 Highlights:

🌟 New Features:

📝 Documentation:

🛠️ Bug Fixes:

⚙ Enhancements and Refactors:

🎉 New Contributors:

Changelog

Contributors

🆕 Highlights:

🌟 New Features:

📝 Documentation:

🛠️ Bug Fixes:

⚙ Enhancements and Refactors:

🎉 New Contributors:

Change log

Contributors

🆕Highlights

🎉 New Contributors

Changelog

Contributors

Releases: open-compass/opencompass

OpenCompass v0.1.5

🆕 Highlights:

🌟 New Features:

📖 Documentation:

🛠️ Bug Fixes:

⚙ Enhancements and Refactors:

🎉 New Contributors:

Changelog

Contributors

OpenCompass v0.1.4

🆕 Highlights:

🌟 New Features:

📝 Documentation:

🛠️ Bug Fixes:

⚙ Enhancements and Refactors:

🎉 New Contributors:

Changelog

Contributors

OpenCompass v0.1.3

🆕 Highlights:

🌟 New Features:

📝 Documentation:

🛠️ Bug Fixes:

⚙ Enhancements and Refactors:

🎉 New Contributors:

Change log

Contributors

OpenCompass v0.1.2

🆕Highlights

🎉 New Contributors

Changelog

Contributors

v0.1.1

v0.1.0