-
Past-keys caching (e.g. #409) is the first thing to try to improve AR generation speed (cc @gpucce)
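To illustrate what past-key/value caching buys you, here is a minimal sketch with a toy single-head attention in plain Python (no frameworks, and no OpenCLIP internals assumed; all names are illustrative). The cached decoder appends each step's key/value instead of re-feeding the whole prefix, and produces the same outputs:

```python
# Toy sketch of past-key/value caching for autoregressive decoding.
# Pure Python, single head, identity projections -- illustrative only.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, values):
    # scaled dot-product attention for a single query vector
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

def decode_no_cache(tokens):
    # recompute attention over the full prefix at every step
    outs = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        outs.append(attend(prefix[-1], prefix, prefix))
    return outs

def decode_with_cache(tokens):
    # append each new key/value to a cache; only the new query is processed
    cache_k, cache_v, outs = [], [], []
    for tok in tokens:
        cache_k.append(tok)
        cache_v.append(tok)
        outs.append(attend(tok, cache_k, cache_v))
    return outs

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
assert decode_no_cache(tokens) == decode_with_cache(tokens)
```

In a real transformer the cache holds the projected keys/values per layer, so each generation step avoids re-running the projections over the whole prefix.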
-
For this model specifically, caching would definitely help for generate. For all models, recent updates in PyTorch 2.0 (nightlies right now) could help: the default transformers use PyTorch MHA, which on the nightlies supports flash attention and the memory-efficient kernel from xFormers. Also, torch.compile often gives a pretty significant speedup. You should definitely use an AMP context, but you could try pure fp16 or bf16 as well; just verify the outputs are similar.
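A minimal sketch of the "use an AMP context, then verify the outputs are similar" advice, on a stand-in module (the tiny model here is illustrative, not CoCa). CPU bfloat16 is used so the snippet runs without a GPU; on CUDA you would pass `device_type="cuda"` and could also try `dtype=torch.float16`, and wrap the module with `torch.compile(model)` on PyTorch 2.0:

```python
import torch

torch.manual_seed(0)
# stand-in for the real model, just to demonstrate the autocast pattern
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
x = torch.randn(4, 16)

with torch.no_grad():
    ref = model(x)  # fp32 reference output
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        out = model(x)  # mixed-precision output

assert out.dtype == torch.bfloat16
# verify the outputs are similar before committing to mixed precision
assert torch.allclose(ref, out.float(), atol=1e-1)
```

The tolerance you accept depends on the task; for captioning, comparing generated captions on a handful of images is a reasonable end-to-end check.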
-
I tried out image captioning with coca_ViT-L-14 / mscoco_finetuned_laion2B-s13B-b90k. The results are pretty impressive!
The inference speed, however, is not: I measure about 600 ms per image on a T4.
Has anyone tried optimizing it and can give some advice?
Would, for example, FasterTransformer or running it in float16 help?