
Add GPT-NeoX and SAE #791

Open

StellaAthena wants to merge 4 commits into main

Conversation

@StellaAthena (Author) opened this PR:

Adds two libraries developed by EleutherAI, GPT-NeoX and SAE.

@Wauplin (Contributor) left a comment:

Hi @StellaAthena! Thanks for opening this PR, it would be really nice to have those two libraries officially supported on the Hub! I've left a couple of comments to help with the integration. Let me know if you have any questions :)

@@ -222,6 +222,13 @@ export const MODEL_LIBRARIES_UI_ELEMENTS = {
filter: false,
countDownloads: `path:"checkpoints/byt5_model.pt"`,
},
"gpt-neox": {
@Wauplin (Contributor) commented on Jul 10, 2024:

I noticed that gpt_neox has 3,919 models while gpt-neox has 33. For consistency with other libraries, naming it gpt-neox as you did makes more sense, but it would mean most GPT-NeoX models are not listed. A solution is to open a PR on all 3.9k models to update the metadata in their model cards (we can provide a script for that), but it might be a bit too much. Any idea @osanseviero?

@osanseviero (Member) commented:

This PR is actually a bit problematic. Since we have ~4k models with the gpt-neox tag (automatically determined from model_type in config.json), this will lead to some issues. I wonder if we could explore a different name for the tag, e.g. gpt-neox-original (to be added to all repos that can run inference with the gpt-neox library). WDYT?

@StellaAthena (Author) commented on Jul 23, 2024:

As a side note, the above two comments are contradictory. To be clear, it looks to me like gpt-neox is the small one and gpt_neox is the big one.

I'm not deeply attached to either style. If we can rename 33 models and use a slightly different tag, that's fine with me. I think the core question is whether y'all value style consistency more than you dislike making a very large number of small edits. That said, if most of these models are getting tagged automatically, there may need to be a CI job that runs on new models to correct the tag each time. Would it be possible to hack the GPT-NeoX model class definition so that using gpt_neox induces the metadata entry gpt-neox? Or is that too wonky of a patch? I assume changing the config name is non-viable at this point.

repoName: "sae",
repoUrl: "https://github.com/EleutherAI/sae",
filter: true,
countDownloads: `path_extension:"safetensors" OR path_extension:"bin"`,
@Wauplin (Contributor) commented:

Two questions:

  • Are there SAE repos where weights are saved as .bin? I haven't found any, but I might have missed them.
  • If a user loads all layers, counting .safetensors files will count X downloads (for instance, 30 layers == 30 downloads in sae-llama-3-8b-32x). It might be difficult to interpret the download number in the end, but I don't have a proper solution for that at the moment.

@osanseviero (Member) commented:

Maybe let's use the embed_tokens/cfg.json file? https://huggingface.co/EleutherAI/sae-llama-3-8b-32x/tree/main/embed_tokens
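
If that pattern works, the entry could look roughly like this (a sketch, assuming embed_tokens/cfg.json appears exactly once in every SAE repo; this is not code from the PR):

"sae": {
	repoName: "sae",
	repoUrl: "https://github.com/EleutherAI/sae",
	filter: true,
	// Count a single fixed file per repo, so one usage ≈ one counted download
	countDownloads: `path:"embed_tokens/cfg.json"`,
},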

@StellaAthena (Author) commented:

Like the other example, this was me guessing. To make sure I understand best practices, we're looking for a file path pattern that

  1. Appears in every model
  2. Only matches one file in each model

Is that correct? Perhaps the easiest solution is to just create a new metadata file which (for now, at least) primarily exists to support the integration.

@Wauplin (Contributor) commented on Jul 30, 2024:

To be more precise, we are looking for a file path pattern that:

  1. Appears in every model
  2. Only matches one file when loading the model

For example, on Meta-Llama-3.1-8B-Instruct-GGUF any download of a *.gguf file is counted, because we know that users won't download all GGUF files when instantiating the model (i.e. 1 usage == 1 GGUF file download).
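
To illustrate the contrast (illustrative sketches, not the actual configuration of either library):

// GGUF-style case: instantiating a model downloads exactly one .gguf file,
// so counting every matching file is safe (1 usage == 1 counted download):
countDownloads: `path_extension:"gguf"`,

// SAE case: loading one model can fetch one .safetensors file per layer,
// so the same per-extension rule counts a single usage as many downloads:
countDownloads: `path_extension:"safetensors" OR path_extension:"bin"`,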

@@ -358,6 +365,13 @@ export const MODEL_LIBRARIES_UI_ELEMENTS = {
filter: false,
countDownloads: `path:"tokenizer.model"`,
},
"sae": {
@Wauplin (Contributor) commented:

I haven't seen any models on the Hub tagged as sae yet. It would be good to start adding a few (by adding library_name: sae or tags: [sae, ...] to the model card metadata of existing SAE models).

@osanseviero (Member) commented:

Yes, it would be great to add library_name: sae to https://huggingface.co/EleutherAI/sae-llama-3-8b-32x and others

@StellaAthena (Author) commented on Jul 23, 2024:

@Wauplin Yes, we have trained and uploaded a bunch of models using this library (and are actively working on more), but I figured we should add the connection to the back-end first and then add the tag to the models? Is that not the preferred order?

@Wauplin (Contributor) commented:

We usually recommend tagging at least a few models before merging the back-end PR. There is no hard requirement, but asking for it ensures that the library definition will actually be used. We don't want to add libraries and then realize model authors forgot to tag their models. Once the PR is merged, we know we don't have to check anything else afterwards.

repoUrl: "https://github.com/EleutherAI/sae",
filter: true,
countDownloads: `path_extension:"safetensors" OR path_extension:"bin"`,
},
@Wauplin (Contributor) commented on Jul 10, 2024:

If you want, you can also provide a snippet to let users know how to instantiate the models. Snippets are usually generated from the model id, making them easy for end users to copy-paste from the browser. For instance, here you could have something like:

from sae import Sae

# Load specific layer
sae = Sae.load_from_hub("EleutherAI/sae-llama-3-8b-32x", layer=10)

# Load all layers
saes = Sae.load_many_from_hub("EleutherAI/sae-llama-3-8b-32x")
saes["layer_10"]

It is not necessary to explain/document everything, since the SAE GitHub repo is better suited for that, but having the "Use this model" button on your repos could help them get some traction.
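
For reference, such snippets are usually defined in packages/tasks/src/model-libraries-snippets.ts and referenced from the library entry. A rough sketch (the exact shape is an assumption here, not something from this PR; the layer index is illustrative):

import type { ModelData } from "./model-data";

// Returns the Python snippet(s) shown under the "Use this model" button
export const sae = (model: ModelData): string[] => [
	`from sae import Sae

# Load the SAE for a specific layer (layer index is illustrative)
sae = Sae.load_from_hub("${model.id}", layer=10)`,
];

The "sae" entry in model-libraries.ts would then reference it via snippets: snippets.sae.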

@StellaAthena (Author) commented:

This sounds like a great idea. Tagging @norabelrose for suggested example code.

@osanseviero (Member) left a comment:

🔥

@@ -222,6 +222,13 @@ export const MODEL_LIBRARIES_UI_ELEMENTS = {
filter: false,
countDownloads: `path:"checkpoints/byt5_model.pt"`,
},
"gpt-neox": {
@osanseviero (Member) commented:

It might be easier to split the PR into the two different libraries


@Wauplin (Contributor) commented on Jul 30, 2024:

Hi @StellaAthena, thanks for your comments and changes. I've replied to a few, but I'm still unsure how to deal with the gpt-neox models. Osanseviero is off for a few days, so we'll have to discuss it when he's back. In the meantime, would it be possible to split this PR in two so we can deal with the SAE integration first? Thanks in advance!
