-
Past-keys caching (e.g. #409) is the first thing to try to improve AR generation speed (cc @gpucce)
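To illustrate what past-key/value caching buys you, here is a minimal sketch with a toy single-head attention in plain Python (no frameworks, and no OpenCLIP internals assumed; all names are illustrative). The cached decoder appends each step's key/value instead of re-feeding the whole prefix, and produces the same outputs:

```python
# Toy sketch of past-key/value caching for autoregressive decoding.
# Pure Python, single head, identity projections -- illustrative only.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, values):
    # scaled dot-product attention for a single query vector
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

def decode_no_cache(tokens):
    # recompute attention over the full prefix at every step
    outs = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        outs.append(attend(prefix[-1], prefix, prefix))
    return outs

def decode_with_cache(tokens):
    # append each new key/value to a cache; only the new query is processed
    cache_k, cache_v, outs = [], [], []
    for tok in tokens:
        cache_k.append(tok)
        cache_v.append(tok)
        outs.append(attend(tok, cache_k, cache_v))
    return outs

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
assert decode_no_cache(tokens) == decode_with_cache(tokens)
```

In a real transformer the cache holds the projected keys/values per layer, so each generation step avoids re-running the projections over the whole prefix.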
-
For this model specifically, caching would definitely help for generate. For all models, recent updates in PyTorch 2.0 (nightlies right now) could help: the default transformers use PyTorch MHA, which on the nightlies supports flash attention and the memory-efficient kernel from xFormers. Also, torch.compile often gives a pretty significant speedup. You should definitely use an AMP context, but you could try pure fp16 or bf16 as well; just verify the outputs are similar.
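A minimal sketch of the "use an AMP context, then verify the outputs are similar" advice, on a stand-in module (the tiny model here is illustrative, not CoCa). CPU bfloat16 is used so the snippet runs without a GPU; on CUDA you would pass `device_type="cuda"` and could also try `dtype=torch.float16`, and wrap the module with `torch.compile(model)` on PyTorch 2.0:

```python
import torch

torch.manual_seed(0)
# stand-in for the real model, just to demonstrate the autocast pattern
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
x = torch.randn(4, 16)

with torch.no_grad():
    ref = model(x)  # fp32 reference output
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        out = model(x)  # mixed-precision output

assert out.dtype == torch.bfloat16
# verify the outputs are similar before committing to mixed precision
assert torch.allclose(ref, out.float(), atol=1e-1)
```

The tolerance you accept depends on the task; for captioning, comparing generated captions on a handful of images is a reasonable end-to-end check.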
-
I tried out image captioning with coca_ViT-L-14 / mscoco_finetuned_laion2B-s13B-b90k. The results are pretty impressive!
The inference speed, however, is not: I measure about 600 ms per image on a T4.
Has anyone tried optimizing it and can give some advice?
Would, for example, FasterTransformer or running it in float16 help?