MoE #639

Muennighoff · 2024-06-30T21:38:10Z

Replaces #541

Notes:

I didn't find norm_after to work well, but added it to conform with other parts of the code. I can also remove it.
I only left in the config file used for the final 5T run.
I didn't include all configurations that we ran for OLMoE (e.g. expert choice). I will probably put instructions for those in a separate olmoe repository for people who want to reproduce our results exactly.

epwalsh · 2024-08-01T20:29:33Z

olmo/model.py

+            from megablocks.layers.moe import MoE
+        except ImportError:
+            raise ImportError(
+                "To train MoEs, run `pip install git+https://github.com/Muennighoff/megablocks.git@olmoe`"


What's different about your branch from the original source?

It includes zloss, which we use during training for better stability.

You can view the exact difference here: databricks/megablocks@main...Muennighoff:megablocks:olmoe. Besides zloss it also has expert choice, which is currently not used, but I think we may want to try it in the future when we go multimodal.

Can you upstream this, so we don't have to depend on a private fork?

Sure, I opened a PR here: databricks/megablocks#133. If/when it gets merged, I will update the install instructions. If people don't want to use zloss, it also works with the regular megablocks; it's not a big difference.

@Muennighoff , so they decided to merge their version instead. Is our version compatible? Will the model you trained work with their implementation of zloss?
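
The guarded import in the hunk above follows a common pattern for optional dependencies. A minimal generic sketch, assuming a hypothetical helper named `require_module` (it is not part of OLMo or megablocks), could look like this:

```python
import importlib
from types import ModuleType


def require_module(name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message.

    `require_module` is a hypothetical helper used for illustration only.
    """
    try:
        return importlib.import_module(name)
    except ImportError as err:
        # Chain the original error so the root cause stays visible.
        raise ImportError(install_hint) from err
```

Keeping the install command in the error message means that only one string has to change if the dependency later moves from the fork to the upstream package.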

dirkgr · 2024-08-02T16:06:20Z

olmo/config.py

+    The number of experts to use in the MoE block.
+    """
+
+    moe_top_k: Optional[int] = 2


If these are Optional, what does it mean when it's None?

They're optional when no MoE is used, otherwise required. Is this not an acceptable usage of Optional[int]? I can change it.

In my opinion, when we have a config setting that is not always required we should either 1) always make it optional type, set it to None by default, and set it in every config when it is needed; or 2) don't make it optional type unless None is needed. I prefer 1 since it makes our config more readable (less irrelevant settings) and slightly more backwards compatible.

I can change it to option 1) if others agree? Note that there's other params not following this:

embedding_size: Optional[int] = 50304
gen1_gc_interval: Optional[int] = 1
distributed_strategy: Optional[DistributedStrategy] = DistributedStrategy.fsdp
fsdp: Optional[FSDPConfig] = field(default_factory=FSDPConfig)
auxiliary_loss_multiplier: Optional[float] = 1e-4

Do you actually rely on the defaults you put in here anywhere? If not, let's go with Shane's version, and default these to None. I assume something somewhere will fail if they are not set and you need them.

Do you actually rely on the defaults you put in here anywhere?

Yes, quite a lot, e.g. the loss weights; the use of dropless MoEs (moe_dropless); and leaving moe_interleave, moe_lbl_in_fp32, and moe_shared_expert as False.

Actually, I don't think setting them all to None is a good idea: every time we add a new MoE-specific configuration parameter, all existing MoE configs become outdated, since every MoE-specific configuration parameter is Optional in that sense.

I can also remove the Optional, since they have defaults anyway, but as the examples I pasted above show, the codebase already has Optional config params with default values.

If it doesn't break everything, I'd prefer to have a special config object for MoE, which is Optional, but none of the items inside of that object are Optional. This may break backwards compatibility with the model we already released though?

Yes, it would break compatibility with the configs we released, but we can pin a commit in our released repo if people want to reuse our configs to reproduce things exactly.

Hm, that's unfortunate, but I think I prefer the MoEConfigObject. It reduces the impact on old-school dense model training.
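
A minimal sketch of the nested-config idea discussed in this thread, under stated assumptions: the class names `MoEConfig` and `ModelConfigSketch`, the field names, and every default except `top_k = 2` (taken from the `moe_top_k` hunk above) are hypothetical and do not reflect the merged code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoEConfig:
    """Hypothetical grouping of MoE settings; no field here is Optional."""

    num_experts: int = 8  # assumed default, not taken from the PR
    top_k: int = 2  # mirrors the moe_top_k default shown in the diff
    interleave: bool = False
    shared_expert: bool = False


@dataclass
class ModelConfigSketch:
    """Stand-in for a model config; only the `moe` field matters here."""

    d_model: int = 768  # placeholder dense setting
    moe: Optional[MoEConfig] = None  # None means a dense model

    @property
    def is_moe(self) -> bool:
        return self.moe is not None
```

With this layout, dense configs carry a single `moe: None` entry, and adding a new MoE field with a default does not invalidate existing MoE configs.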

olmo/initialization.py

dirkgr · 2024-08-02T16:08:50Z

olmo/model.py

+            from megablocks.layers.moe import MoE
+        except ImportError:
+            raise ImportError(
+                "To train MoEs, run `pip install git+https://github.com/Muennighoff/megablocks.git@olmoe`"


Can you upstream this, so we don't have to depend on a private fork?

dirkgr · 2024-08-02T16:15:53Z

olmo/model.py

+                x = self._activation_checkpoint_fn(self.ff_norm, x)  # type: ignore
+            else:
+                x = self.ff_norm(x)
+            # Activation checkpointing for the MoE FFN is not supported


Why not? If there is a technical problem with it, will it affect whole_layer activation checkpointing as well?

It fails with

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Unpack is being triggered for a tensor that was already unpacked once. If you are calling ctx.saved_tensors in backward, make sure to do so only once. Otherwise please open an issue with details on your use case.
2024-05-15T20:15:01.172963498Z 2024-05-15 13:15:01.171 jupiter-cs-aus-133.reviz.ai2.in:3 olmo.util:158 CRITICAL Uncaught CheckpointError: torch.utils.checkpoint: Unpack is being triggered for a tensor that was already unpacked once. If you are calling ctx.saved_tensors in backward, make sure to do so only once. Otherwise please open an issue with details on your use case.

This paper explains why activation checkpointing is difficult for MoEs: https://dspace.mit.edu/bitstream/handle/1721.1/153897/wisdom-dwisdom-meng-eecs-2024-thesis.pdf

whole_layer is not supported with MoE, only fine_grained - I added code to raise an error if it's not fine_grained & MoE is configured.

Ok, I see. Interesting. It would be fixable I think (by saving the active experts per token in the forward pass), but out of scope for this PR.

This is probably a fairly big blocker to going bigger though. For dense models, our fastest settings still use a lot of checkpointing.
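
The guard described in this thread (reject MoE blocks unless the checkpointing strategy is fine_grained) can be sketched as follows. The enums and the function name `check_moe_checkpointing` are stand-ins defined for this example, and treating "no checkpointing" (`None`) as allowed is an assumption rather than something the thread states.

```python
from enum import Enum
from typing import Optional


class BlockType(str, Enum):
    # Stand-in enum; only the members needed for the sketch.
    sequential = "sequential"
    moe = "moe"


class CheckpointStrategy(str, Enum):
    # Stand-in enum with the two strategies named in the thread.
    whole_layer = "whole_layer"
    fine_grained = "fine_grained"


def check_moe_checkpointing(
    block_type: BlockType, strategy: Optional[CheckpointStrategy]
) -> None:
    """Raise if an MoE model is combined with an unsupported strategy."""
    if block_type != BlockType.moe or strategy is None:
        return  # dense models and "no checkpointing" pass (assumption)
    if strategy != CheckpointStrategy.fine_grained:
        raise ValueError(
            "MoE blocks only support fine_grained activation checkpointing, "
            f"got {strategy.value}"
        )
```

Failing at configuration time surfaces the limitation before any GPU time is spent, instead of as a CheckpointError in the backward pass.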

olmo/model.py

olmo/train.py

scripts/train.py

Muennighoff · 2024-08-20T17:51:37Z

Linking this related PR that we should merge after: #707

If this PR here looks good to you, could you approve it @epwalsh / @dirkgr ? :)

2015aroras · 2024-08-20T19:50:40Z

olmo/config.py

+    The number of experts to use in the MoE block.
+    """
+
+    moe_top_k: Optional[int] = 2


In my opinion, when we have a config setting that is not always required we should either 1) always make it optional type, set it to None by default, and set it in every config when it is needed; or 2) don't make it optional type unless None is needed. I prefer 1 since it makes our config more readable (less irrelevant settings) and slightly more backwards compatible.

2015aroras · 2024-08-20T23:45:16Z

olmo/config.py

@@ -1273,3 +1334,41 @@ def update_legacy_settings(cls, config: D) -> D:
                new_config.optimizer = OptimizerConfig.update_legacy_settings(new_config.optimizer)

        return new_config
+
+
+def config_to_moe_args(config: ModelConfig) -> Dict[str, Any]:


I think it would be better to have this as an instance method of ModelConfig that can be invoked with something like config.build_moe_args()

I think the MoE args may include things outside of the ModelConfig in the future. Currently, I put some things that could be considered TrainingConfig params, like moe_zloss_weight, in the ModelConfig. If we move them to TrainingConfig later, the function would no longer depend only on the ModelConfig.
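
For illustration, a free function of this kind could simply collect the MoE-prefixed fields of a config. This is a sketch under assumptions: `ModelConfigSketch`, its fields other than the `moe_top_k` default, and the prefix-based selection are hypothetical and not the PR's actual implementation of `config_to_moe_args`.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class ModelConfigSketch:
    """Stand-in config; defaults other than moe_top_k are assumed."""

    d_model: int = 768
    moe_num_experts: int = 8
    moe_top_k: int = 2
    moe_zloss_weight: float = 0.0


def config_to_moe_args(config: Any) -> Dict[str, Any]:
    """Collect every dataclass field whose name starts with `moe_`."""
    return {
        f.name: getattr(config, f.name)
        for f in fields(config)
        if f.name.startswith("moe_")
    }
```

Because the function takes the config as an argument rather than living on the class, it could later accept a second config object without changing any class definition, which is the trade-off raised in the reply above.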

olmo/model.py

olmo/optim.py

configs/official/OLMoE-7B-A1B.yaml

Co-authored-by: Shane A

Muennighoff · 2024-09-04T23:36:33Z

All tests are passing except the GPU test which I assume is expected to fail. Feel free to merge 😊

dirkgr · 2024-09-04T23:43:42Z

olmo/config.py

+    moe_interleave: Optional[bool] = False
+    """
+    Interleave sequential with MoE blocks starting with sequential.
+    """


You tried this? Do we need this setting? I am interested in interleaving, especially with SSM layers, but I don't think we'd want to do it like this. If we don't need this for any config you have run or described in the paper, I'd rather take out this functionality.

dirkgr · 2024-09-08T21:20:06Z

olmo/model.py

-            device=config.init_device,
-        )
-        self.ff_out._is_residual = True  # type: ignore
+        if self.config.block_type != BlockType.moe:


Can you make this dependent on whether the block has a ff_out, instead of the block type?

with hasattr(), I mean
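
A minimal sketch of the attribute-based check suggested here, using hypothetical stand-in classes (`DenseBlockSketch`, `MoEBlockSketch`) and a hypothetical helper name; only the `ff_out` attribute and the `_is_residual` flag come from the hunk above.

```python
from types import SimpleNamespace
from typing import Any


class DenseBlockSketch:
    """Stand-in for a dense block that owns an output projection."""

    def __init__(self) -> None:
        self.ff_out = SimpleNamespace()


class MoEBlockSketch:
    """Stand-in for an MoE block, which has no ff_out attribute."""

    def __init__(self) -> None:
        self.ffn = SimpleNamespace()


def mark_residual_projection(block: Any) -> bool:
    """Flag ff_out as a residual projection if the block has one."""
    if hasattr(block, "ff_out"):
        block.ff_out._is_residual = True
        return True
    return False
```

Checking for the attribute instead of the block type means a future block type without `ff_out` is handled without another branch.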

dirkgr · 2024-09-08T21:21:42Z

olmo/model.py

+            from megablocks.layers.moe import MoE
+        except ImportError:
+            raise ImportError(
+                "To train MoEs, run `pip install git+https://github.com/Muennighoff/megablocks.git@olmoe`"


@Muennighoff , so they decided to merge their version instead. Is our version compatible? Will the model you trained work with their implementation of zloss?

dirkgr · 2024-09-08T21:23:28Z

olmo/model.py

+        if hasattr(self.ffn.experts.mlp, "v1"):
+            init_normal(self.ffn.experts.mlp.v1, std=in_std, init_cutoff_factor=cutoff_factor)


What is v1 and why is it optional? Maybe I'll keep reading and find out.

It is present when SwiGLU is used with MoEs.

dirkgr · 2024-09-08T21:31:40Z

olmo/model.py

+                x = self._activation_checkpoint_fn(self.ff_norm, x)  # type: ignore
+            else:
+                x = self.ff_norm(x)
+            # Activation checkpointing for the MoE FFN is not supported


Ok, I see. Interesting. It would be fixable I think (by saving the active experts per token in the forward pass), but out of scope for this PR.

dirkgr · 2024-09-08T21:32:55Z

olmo/model.py

+        if self.config.moe_interleave:
+            blocks = []
+            for i in range(config.n_layers):
+                if i % 2 == 0:
+                    blocks.append(OLMoSequentialBlock(i, config, self.__cache))
+                else:
+                    blocks.append(OLMoEBlock(i, config, self.__cache))


If we don't need this, I'd rather not have it.

dirkgr · 2024-09-08T21:41:06Z

olmo/train.py

+                (self.model.config.block_type != BlockType.moe)
+                or (self.model.config.moe_log_expert_assignment is False)
+            )
+            else torch.zeros((self.model.config.n_layers, self.model.config.moe_num_experts))


This does not put it on CPU. This puts it on the default device.

dirkgr · 2024-09-08T21:43:14Z

olmo/train.py

                # Run backward pass.
                loss.backward()

            # Remove output hooks
            for hook in output_hooks:
                hook.remove()

-        return ce_batch_loss, z_batch_loss
+        return ce_batch_loss, z_batch_loss, lb_batch_loss, moe_z_batch_loss, expert_assignments


@epwalsh, does the new trainer support all of this stuff? This seems like a lot of extra things.

Not directly but I think it could be supported through the callback system.

dirkgr · 2024-09-08T21:49:20Z

olmo/train.py

+                        lb_loss = batched_load_balancing_loss(self.moe_args) / len(micro_batches)
+                    if self.model.config.moe_log_expert_assignment:
+                        if self.model.config.moe_zloss_weight:
+                            tokens_per_expert, _, _ = zip(*get_load_balancing_loss())


Does get_load_balancing_loss() take care of reducing the expert assignments across ranks?

dirkgr · 2024-09-08T21:50:23Z

olmo/train.py

+                            tokens_per_expert, _, _ = zip(*get_load_balancing_loss())
+                        else:
+                            tokens_per_expert, _ = zip(*get_load_balancing_loss())
+                        expert_assignments += torch.stack(tokens_per_expert, dim=0).cpu()


Not a big deal, but are you sure that this is faster on CPU? Back in the day I always thought this too, keep small stuff on the CPU, but in practice doing it all on GPU was always faster.

If tokens_per_expert were on GPU this will trigger a host-device sync. In that case it's almost certainly better to keep on GPU.

were on GPU this will trigger a host-device sync

Don't we want to avoid these syncs?

Muennighoff added 30 commits June 19, 2024 22:13

Clean MoE implementation

e725eb9

Add conf

db24750

Fix return args

18450de

Rmv outdated kwarg

4ab7f77

Rmv legacy kwarg

dba42fd

Merge branch 'Muennighoff/MoE' of github.com:allenai/LLM into Muennighoff/MoE

6c5f8a3

Add distributed_strategy

6a8e089

Allow w/o weight attr

1a9a317

Merge branch 'Muennighoff/MoE' of github.com:allenai/LLM into Muennighoff/MoE

ddf6fd4

Allow w/o weight attr

ab55e07

Add MoE params

7aeefd4

Rmv kwarg

3eab45c

Reduce lb & moe losses

6d736da

LN & Emb Dec

d07c638

Merge branch 'Muennighoff/MoE' of github.com:allenai/LLM into Muennighoff/MoE

cdb592f

Do not decay emb

1399841

Tmp - debug throughput

a13b5b8

Fix

935167e

Fix

b96972d

maintain init order

0079490

Merge branch 'Muennighoff/MoE' of github.com:allenai/LLM into Muennighoff/MoE

8b1c441

Decay emb

e2c7286

Keep EA on CPU

d39a37c

Do not decay emb

3acfc04

Change norm

4432261

Confs

cef7707

Adapt wrap

2a6df33

Add conf

7421890

decemb conf

021974e

Updates

d5a0626

Muennighoff added 4 commits August 1, 2024 12:22

Simplify

ae6f16a

Sort

02781be

Fix type checks

a62a1ee

Format

0ecd4b8

Muennighoff requested review from epwalsh and dirkgr August 1, 2024 20:20

epwalsh reviewed Aug 1, 2024

dirkgr requested changes Aug 2, 2024

Muennighoff added 3 commits August 2, 2024 18:17

Fix typo; MoEArgs func

d8452a0

Format

8a28ced

Check for act ckpt strategy & moe; fix typo

91f5553

Muennighoff requested a review from dirkgr August 3, 2024 01:33

fix import

61ac104

Muennighoff added 2 commits August 20, 2024 10:52

Sort impot

f4faf8a

Merge branch 'main' into Muennighoff/MoE

fdc1021

2015aroras reviewed Aug 21, 2024

Muennighoff and others added 3 commits August 20, 2024 20:27

Fix typo

ed82181

Co-authored-by: Shane A

Simplify isinstance

b0cc754

Co-authored-by: Shane A

Clean conf & move constructor

ca9b41f

Muennighoff mentioned this pull request Aug 29, 2024

Add OLMoE huggingface/transformers#32406

Merged

3 tasks

Muennighoff added 4 commits September 4, 2024 16:24

Add ref

215c0f5

Merge main

43baf74

Sort imports

775e514

Format

cd0004b

dirkgr requested changes Sep 8, 2024

Muennighoff added 3 commits September 11, 2024 20:46

No exp ass

acb23dd

Revert

a143469

Simplify

1a4bdae
