Calculating GNS for other optimizers #143

Open
HariSeldon11988 opened this issue Aug 20, 2024 · 0 comments

@HariSeldon11988

Dear all,

I have a question about your calculation of the GNS. You use different classes for the calculation (GradientNoiseScale, AdamGradientNoiseScale). As far as I understand, the preconditioner matrices differ between the two classes: in GradientNoiseScale the preconditioner is the identity, while AdamGradientNoiseScale adjusts it for Adam.

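For reference, this is how I picture the "adjusted" preconditioner in the Adam case. It is only a sketch based on Adam's published update rule (ignoring bias correction), with a function name I made up; I do not claim this is what AdamGradientNoiseScale actually does:

import torch

def adam_diag_preconditioner(optimizer, eps=1e-8):
    """Hypothetical sketch: a per-parameter diagonal preconditioner built from
    Adam's second-moment state, roughly 1 / (sqrt(v) + eps).
    For plain SGD the analogous preconditioner is just all ones."""
    precond = []
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            if "exp_avg_sq" in state:
                precond.append(1.0 / (state["exp_avg_sq"].sqrt() + eps))
            else:
                # No second-moment statistics yet (or a non-Adam optimizer):
                # fall back to ones, i.e. the vanilla-SGD case.
                precond.append(torch.ones_like(p))
    return precond
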
I have three questions regarding this:

  • For which optimizers does the code work and deliver correct results, and what would need to be done to correctly calculate the GNS when using other optimizers? Is it correct that it only works for SGD, Adam, and Adagrad?

  • To what extent does the scheduler or scaling rule influence the calculation of the GNS? From the code below, the scaling rule seems to be the criterion that decides how the GNS is calculated.

  • If I use an optimizer other than Adam or AdamW, the "normal" GradientNoiseScale class is used for calculating the GNS (with the preconditioner equal to the identity). Does this work for all other optimizers, such as SGD, LAMB, and others, or is it only valid for SGD? (A preconditioner of 1 should correspond to the vanilla SGD case from the original GNS paper "An Empirical Model of Large-Batch Training"; see the sketch after this list.)

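For context, this is how I read the vanilla (preconditioner = identity) estimator from Appendix A.1 of that paper; a minimal sketch with names I made up, just to make sure I understand the SGD case correctly:

def gns_simple(gnorm_small_sq, gnorm_big_sq, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from the squared gradient norms
    measured at two batch sizes (per 'An Empirical Model of Large-Batch
    Training', Appendix A.1)."""
    # unbiased estimate of the true squared gradient norm |G|^2
    g_sq = (b_big * gnorm_big_sq - b_small * gnorm_small_sq) / (b_big - b_small)
    # unbiased estimate of the per-example gradient noise tr(Sigma)
    s = (gnorm_small_sq - gnorm_big_sq) / (1.0 / b_small - 1.0 / b_big)
    return s / g_sq
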
Here is the relevant code:

if not scaling_rule and (isinstance(optimizer, torch.optim.Adam) or
                         isinstance(optimizer, torch.optim.AdamW)):
    self.scaling_rule = AdamScale()
else:
    self.scaling_rule = scaling_rule or AdaScale()

if isinstance(scaling_rule, AdamScale):
    self.gns = AdamGradientNoiseScale(self, optimizer,
                                      mp_scaler=mp_scaler)
else:
    self.gns = GradientNoiseScale(self, optimizer, mp_scaler=mp_scaler)
self.scaling_rule.initialize(self, optimizer, patch_optimizer=True)

I would appreciate any kind of help.
