hf_olmo: support flash attn 2 #471

Open
Wants to merge 1 commit into main

Conversation

wade3han

For #460, tested with the simple snippet below:

import transformers, torch

# Load OLMo with the flash-attention-2 implementation (requires the flash-attn package).
model = transformers.AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B-Instruct",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).cuda()
tokenizer = transformers.AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct", trust_remote_code=True)

# Quick generation sanity check.
print(tokenizer.decode(model.generate(torch.tensor(tokenizer.encode("Hello World! My name is")).unsqueeze(0).cuda())[0]))
# Hello World! My name is Emily and I am a second year student at the University of California,
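One additional sanity check, assuming a transformers version that records the chosen attention backend on the config object (the attribute is private and may differ across versions):

# Should print "flash_attention_2" if the flash-attn path was actually selected.
print(model.config._attn_implementation)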

@epwalsh (Member) commented Mar 1, 2024

Someone familiar with transformers internals should review this (maybe @AkshitaB). I'm not sure what transformers does with this flag, but I'd be very cautious if they're monkey-patching our attention mechanism, since flash-attn expects a different input shape (the head and sequence dimensions are flipped compared to what we normally do).
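For context on the shape concern, here is a minimal illustrative sketch (not the PR's actual code; it assumes the flash-attn 2 flash_attn_func API and uses made-up tensor names) of the layout difference being referred to:

import torch
from flash_attn import flash_attn_func  # flash-attn 2.x

B, H, S, D = 2, 8, 16, 64
# Typical layout inside the attention block: (batch, heads, seq, head_dim).
q = torch.randn(B, H, S, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, S, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, S, D, dtype=torch.bfloat16, device="cuda")

# flash_attn_func expects (batch, seq, heads, head_dim), so transpose on the way in...
out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), causal=True)
# ...and transpose back to (batch, heads, seq, head_dim) for the rest of the block.
out = out.transpose(1, 2)

If a generic flash-attention path were applied to tensors already in the (batch, heads, seq, head_dim) layout without these transposes, the kernel could still run but would attend over the wrong axis, so the review should confirm where the transposes happen.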
