
This repo is great thank you for sharing! #15

Open
vgoklani opened this issue Apr 5, 2024 · 3 comments

Comments


vgoklani commented Apr 5, 2024

Do you know of a good example of continuous batching? We would like to combine it with the paged attention kernel to build our own simple serving solution.

Thanks!
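
For context, the core of continuous batching is a scheduler loop that admits new requests into the in-flight batch between decode steps and frees a slot the moment a sequence finishes, rather than waiting for a whole batch to drain. A minimal Python 3.10+ sketch of that idea follows; the `decode_step` callback, the `EOS` id, and all names here are illustrative stand-ins, not the API of any particular framework:

```python
import asyncio
from dataclasses import dataclass, field

EOS = 2  # illustrative end-of-sequence token id (an assumption, not a real vocab)

@dataclass
class Request:
    prompt: list[int]                       # prompt token ids
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)
    done: asyncio.Event = field(default_factory=asyncio.Event)

class ContinuousBatcher:
    """Tops up the in-flight batch between decode steps instead of per-batch."""

    def __init__(self, decode_step, max_batch_size: int = 32):
        # decode_step: callable taking the list of in-flight requests and
        # returning one next-token id per request -- the GPU forward pass
        # in a real server, a stand-in here.
        self.decode_step = decode_step
        self.max_batch_size = max_batch_size
        self.waiting: asyncio.Queue = asyncio.Queue()
        self.running: list[Request] = []

    async def submit(self, req: Request) -> list[int]:
        await self.waiting.put(req)
        await req.done.wait()               # resolved when the sequence finishes
        return req.generated

    async def run(self):
        while True:
            if not self.running and self.waiting.empty():
                self.running.append(await self.waiting.get())
            # The "continuous" part: refill free slots from the wait queue
            # between steps; never wait for the whole batch to finish first.
            while len(self.running) < self.max_batch_size and not self.waiting.empty():
                self.running.append(self.waiting.get_nowait())
            next_tokens = self.decode_step(self.running)  # one step for all sequences
            still_running = []
            for req, tok in zip(self.running, next_tokens):
                req.generated.append(tok)
                if tok == EOS or len(req.generated) >= req.max_new_tokens:
                    req.done.set()          # slot is freed immediately
                else:
                    still_running.append(req)
            self.running = still_running
            await asyncio.sleep(0)          # yield control to the event loop
```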

lessw2020 (Contributor) commented

Hi @vgoklani - let me check and get back to you this week. I believe we have continuous batching in TorchServe, but let me verify.


vgoklani commented Apr 8, 2024

Hi @lessw2020 - first, I want to say thank you for your YouTube videos on FSDP!!!

For continuous/dynamic batching, we really want something that's in Python :) where it's easy to tweak the server. Since the main bottleneck is the GPU-bound generation (at least for LLMs), there is only a marginal benefit to using a Rust- or Java-based web server framework. Nevertheless, the main frameworks (e.g. TGI and vLLM) are not in Python. Thanks!
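
To illustrate that point: with a scheduler loop like the sketch above, a pure-Python front end only enqueues requests and awaits results, so the web-framework choice barely matters next to the GPU decode step. A hypothetical FastAPI wiring, where `tokenize`, `detokenize`, and `dummy_decode_step` are stand-ins and `Request`/`ContinuousBatcher`/`EOS` come from the earlier sketch:

```python
import asyncio
from fastapi import FastAPI
# Reuses Request, ContinuousBatcher, and EOS from the sketch above.

def tokenize(s: str) -> list[int]:          # stand-in tokenizer
    return [ord(c) for c in s]

def detokenize(ids: list[int]) -> str:      # stand-in detokenizer
    return "".join(chr(i) for i in ids if i != EOS)

def dummy_decode_step(batch):               # stand-in for the GPU forward pass
    return [EOS for _ in batch]             # ends every sequence after one step

app = FastAPI()
batcher = ContinuousBatcher(decode_step=dummy_decode_step)

@app.on_event("startup")
async def start_scheduler():
    # The decode loop runs as a background task; request handlers stay trivial.
    asyncio.create_task(batcher.run())

@app.post("/generate")
async def generate(prompt: str, max_new_tokens: int = 64):
    req = Request(prompt=tokenize(prompt), max_new_tokens=max_new_tokens)
    return {"text": detokenize(await batcher.submit(req))}
```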

lessw2020 (Contributor) commented

Hi @vgoklani - got it, thanks for your feedback.
This has generated a discussion about possibly making a reference architecture to showcase these types of features.
Let me leave this issue open, and I will update it if this turns into a real effort.
