[MoE/ZeRO] fix .github conflict with main branch. #5827

Hz188 · 2024-06-18T02:36:09Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs
I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

* [release] update version
* [devops] update compatibility test
* [devops] update compatibility test
* [devops] update compatibility test
* [devops] update compatibility test
* [test] fix ddp plugin test
* [test] fix gptj and rpc test
* [devops] fix cuda ext compatibility
* [inference] fix flash decoding test
* [inference] fix flash decoding test

* [test] smaller gpt2 test case
* [test] reduce test cases: tests/test_zero/test_gemini/test_zeroddp_state_dict.py
* [test] reduce test cases: tests/test_zero/test_gemini/test_grad_accum.py
* [test] reduce test cases tests/test_zero/test_gemini/test_optim.py
* Revert "[test] smaller gpt2 test case"
  Some tests might depend on the size of model (num of chunks)
  This reverts commit df705a5.
* [test] reduce test cases: tests/test_checkpoint_io/test_gemini_checkpoint_io.py
* [CI] smaller test model for two mwo the two modifid cases
* [CI] hardcode gpt model for tests/test_zero/test_gemini/test_search.py since we need a fixed answer there

* [fix] branch for fix testcase;
* [fix] fix test_analyzer & test_auto_parallel;
* [fix] remove local change about moe;
* [fix] rm local change moe;
* [fix] fix test_deepfm_model & test_dlrf_model;
* [fix] fix test_hf_albert & test_hf_gpt;

* [gemini] optimize reduce scatter d2h copy
* [fix] fix missing reduce variable
* [refactor] remove legacy async reduce scatter code
* [gemini] missing sync
* Revert "[refactor] remove legacy async reduce scatter code"
  This reverts commit 58ad76d.
* [gemini] further optimize with async all reduce
* [fix] pass flag from manager to chunk

* Fix torch int32 dtype
  Signed-off-by: char-1ee <[email protected]>
* Fix flash-attn import
  Signed-off-by: char-1ee <[email protected]>
* Add generalized model test
  Signed-off-by: char-1ee <[email protected]>
* Remove exposed path to model
  Signed-off-by: char-1ee <[email protected]>
* Add default value for use_flash_attn
  Signed-off-by: char-1ee <[email protected]>
* Rename model test
  Signed-off-by: char-1ee <[email protected]>

---------
Signed-off-by: char-1ee <[email protected]>

* [shardformer]upgrade transformers for gpt2/gptj/whisper (hpcaitech#5807)
  * [shardformer] fix modeling of gpt2 and gptj
  * [shardformer] fix whisper modeling
  * [misc] update requirements
  ---------
  Co-authored-by: ver217 <[email protected]>
* [shardformer]upgrade transformers for mistral (hpcaitech#5808)
  * upgrade transformers for mistral
  * fix
  * fix
* [shardformer]upgrade transformers for llama (hpcaitech#5809)
  * update transformers fix
  * fix
  * fix
* [inference] upgrade transformers (hpcaitech#5810)
  * update transformers fix
  * fix
  * fix
  * fix
  * fix
* [gemini] update transformers for gemini (hpcaitech#5814)

---------
Co-authored-by: ver217 <[email protected]>

yuanheng-zhao and others added 30 commits May 30, 2024 13:48

[Fix/Example] Fix Llama Inference Loading Data Type (hpcaitech#5763)

677cbfa

* [fix/example] fix llama inference loading dtype
* revise loading dtype of benchmark llama3

fix (hpcaitech#5765)

3f2be80

[test] Fix/fix testcase (hpcaitech#5770)

1b76564

* [fix] branch for fix testcase;
* [fix] fix test_analyzer & test_auto_parallel;
* [fix] remove local change about moe;
* [fix] rm local change moe;

[Hotfix] Add missing init file in inference.executor (hpcaitech#5774)

4064432

[CI/tests] simplify some test case to reduce testing time (hpcaitech#5755)

e22b827

* [ci/tests] simplify some test case to reduce testing time
* [ci/tests] continue to remove test case to reduce ci time cost
* restore some test config
* [ci/tests] continue to reduce ci time cost

[misc] update dockerfile (hpcaitech#5776)

32f4187

* [misc] update dockerfile
* [misc] update dockerfile

[devops] fix docker ci (hpcaitech#5780)

ee6fd38

[Inference]Add Streaming LLM (hpcaitech#5745)

b45000f

* Add Streaming LLM
* add some parameters to llama_generation.py
* verify streamingllm config
* add test_streamingllm.py
* modified according to the opinions of review
* add Citation
* change _block_tables tolist

[hotfix] fix llama flash attention forward (hpcaitech#5777)

50b4c8e

[misc] Accelerate CI for zero and dist optim (hpcaitech#5758)

79f7a7b

* remove fp16 from lamb
* remove d2h copy in checking states

---------
Co-authored-by: Edenzzzz <[email protected]>

Allow building cuda extension without a device. (hpcaitech#5535)

c46e097

Added FORCE_CUDA environment variable support, to enable building extensions where a GPU device is not present but cuda libraries are.
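
The build decision described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from hpcaitech#5535: the function name `should_build_cuda_ext`, its parameters, and the convention that the value `"1"` enables the override are all assumptions made for the example.

```python
import os
from typing import Mapping, Optional


def should_build_cuda_ext(
    device_available: bool,
    cuda_home: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide whether to compile CUDA extensions (hypothetical helper)."""
    env = os.environ if env is None else env
    # Without the CUDA libraries there is nothing to compile against.
    if not cuda_home:
        return False
    # A visible GPU is the ordinary case: build as usual.
    if device_available:
        return True
    # Assumption: FORCE_CUDA=1 opts in on machines that have the CUDA
    # libraries installed but no GPU device (for example a CI builder).
    return env.get("FORCE_CUDA", "0") == "1"
```

Under these assumptions, a GPU-less build host with the toolkit installed would pass `device_available=False` and a non-empty `cuda_home`, and the extension build proceeds only when `FORCE_CUDA` is set to `1` in the environment.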

[misc] fix dist logger (hpcaitech#5782)

b9d646f

[install]fix setup (hpcaitech#5786)

a1e39f4

* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[misc] update requirements (hpcaitech#5787)

5ead00f

[shardformer] fix import (hpcaitech#5788)

73e88a5

upgrade colossal-chat support tp_group>1, add sp for sft

7a7e869

upgrade ppo dpo rm script

929e1e3

run pre-commit

7e65b71

moupdate ci tests, st ci test cases passed, tp failed in generation for ppo, sp is buggy

0b4a335

fix training script

7ae87b3

fix ci

b1031f7

[pre-commit.ci] auto fixes from pre-commit.com hooks

1b880ce

for more information, see https://pre-commit.ci

fix transformers version

b8b5cac

remove duplicated test

62eb28b

fix datasets version

0bbac15

remove models that require huggingface auth from ci

bf57b13

char-1ee and others added 27 commits June 10, 2024 11:52

Merge pull request hpcaitech#5771 from char-1ee/refactor/modeling

77a219a

[Inference] Refactor modeling attention layer by abstracting attention backends

update sft training script

84eab13

[Inference]refactor baichuan (hpcaitech#5791)

c0948af

* refactor baichuan
* remove unused code and add TODO for lazyinit

Merge pull request hpcaitech#5759 from hpcaitech/colossalchat_upgrade

74f4a29

[ColossalChat] Colossalchat upgrade

[test] fix chatglm test kit (hpcaitech#5793)

587bbf4

[shardformer] fix modeling of bloom and falcon (hpcaitech#5796)

aa125bc

[test] fix qwen2 pytest distLarge (hpcaitech#5797)

aac941e

[moe refactor] update unit test with the refactored ZeRO and remove useless test

b6ea9e7

sync with upstream

79d63ec

move moe checkpoint to checkpoint folder and exchange global axis to class member

ec99700

[Gemini] Use async stream to prefetch and h2d data moving (hpcaitech#5781)

d9dddf5

* use async stream to prefetch and h2d data moving
* Remove redundant code

[gemini] quick fix on possible async operation (hpcaitech#5803)

3bcbba9

* [gemini] quick fix on possible async operation
* [gemini] quick fix on possible async operation

Merge branch 'hpcaitech:feature/moe' into feature/moe

be92747

Merge branches 'feature/moe' and 'feature/moe' of https://github.com/Hz188/ColossalAI into feature/moe

76aeec3

update moe hybrid parallel plugin with newest version of zero & fix zero working/master params bug

64fc0f7

fix zero unit test

8b277cc

Add an assertion to prevent users from using it incorrectly

ed42193

Merge remote-tracking branch 'upstream/feature/moe' into feature/moe

88b78fa

Modify function parameter names to resolve compatibility issues

419d25e

remove useless code: MoECheckpoint

3364ac9

update github workflow config file

f7298bc

fix typo

e6839fb

Merge branch 'hpcaitech:feature/moe' into feature/moe

cc9d0bb

Support 4d parallel + flash attention (hpcaitech#5789)

8795bb2

* support tp + sp + pp
* remove comments

---------
Co-authored-by: Edenzzzz <[email protected]>

fix .github workflow conflict with main branch

1405cf1

Hz188 requested a review from a team as a code owner June 18, 2024 02:36

Hz188 closed this Jun 18, 2024

Hz188 deleted the feature/moe branch June 18, 2024 02:50
