
This repo is great thank you for sharing! #15

Open
vgoklani opened this issue Apr 5, 2024 · 3 comments

Comments


vgoklani commented Apr 5, 2024

Do you know of a good example of continuous batching? We would like to combine it with the paged attention kernel to build our own simple serving solution.

Thanks!
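
For context, the core of continuous batching is a scheduler loop that admits new requests into the in-flight batch between decode steps and frees a slot the moment a sequence finishes, rather than waiting for a whole batch to drain. A minimal Python 3.10+ sketch of that idea follows; the `decode_step` callback, the `EOS` id, and all names here are illustrative stand-ins, not the API of any particular framework:

```python
import asyncio
from dataclasses import dataclass, field

EOS = 2  # illustrative end-of-sequence token id (an assumption, not a real vocab)

@dataclass
class Request:
    prompt: list[int]                       # prompt token ids
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)
    done: asyncio.Event = field(default_factory=asyncio.Event)

class ContinuousBatcher:
    """Tops up the in-flight batch between decode steps instead of per-batch."""

    def __init__(self, decode_step, max_batch_size: int = 32):
        # decode_step: callable taking the list of in-flight requests and
        # returning one next-token id per request -- the GPU forward pass
        # in a real server, a stand-in here.
        self.decode_step = decode_step
        self.max_batch_size = max_batch_size
        self.waiting: asyncio.Queue = asyncio.Queue()
        self.running: list[Request] = []

    async def submit(self, req: Request) -> list[int]:
        await self.waiting.put(req)
        await req.done.wait()               # resolved when the sequence finishes
        return req.generated

    async def run(self):
        while True:
            if not self.running and self.waiting.empty():
                self.running.append(await self.waiting.get())
            # The "continuous" part: refill free slots from the wait queue
            # between steps; never wait for the whole batch to finish first.
            while len(self.running) < self.max_batch_size and not self.waiting.empty():
                self.running.append(self.waiting.get_nowait())
            next_tokens = self.decode_step(self.running)  # one step for all sequences
            still_running = []
            for req, tok in zip(self.running, next_tokens):
                req.generated.append(tok)
                if tok == EOS or len(req.generated) >= req.max_new_tokens:
                    req.done.set()          # slot is freed immediately
                else:
                    still_running.append(req)
            self.running = still_running
            await asyncio.sleep(0)          # yield control to the event loop
```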

lessw2020 (Contributor) commented

Hi @vgoklani - let me check and get back to you this week. I believe we have continuous batching in TorchServe, but let me verify.


vgoklani commented Apr 8, 2024

Hi @lessw2020 - first, I want to say thank you for your YouTube videos on FSDP!!!

For continuous/dynamic batching, we really want something that's in Python :) where it's easy to tweak the server. Since the main bottleneck is the GPU-bound generation (at least for LLMs), there is only a marginal benefit to using a Rust- or Java-based web server framework. Nevertheless, the main frameworks (e.g. TGI and vLLM) are not in Python. Thanks!
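
To illustrate that point: with a scheduler loop like the sketch above, a pure-Python front end only enqueues requests and awaits results, so the web-framework choice barely matters next to the GPU decode step. A hypothetical FastAPI wiring, where `tokenize`, `detokenize`, and `dummy_decode_step` are stand-ins and `Request`/`ContinuousBatcher`/`EOS` come from the earlier sketch:

```python
import asyncio
from fastapi import FastAPI
# Reuses Request, ContinuousBatcher, and EOS from the sketch above.

def tokenize(s: str) -> list[int]:          # stand-in tokenizer
    return [ord(c) for c in s]

def detokenize(ids: list[int]) -> str:      # stand-in detokenizer
    return "".join(chr(i) for i in ids if i != EOS)

def dummy_decode_step(batch):               # stand-in for the GPU forward pass
    return [EOS for _ in batch]             # ends every sequence after one step

app = FastAPI()
batcher = ContinuousBatcher(decode_step=dummy_decode_step)

@app.on_event("startup")
async def start_scheduler():
    # The decode loop runs as a background task; request handlers stay trivial.
    asyncio.create_task(batcher.run())

@app.post("/generate")
async def generate(prompt: str, max_new_tokens: int = 64):
    req = Request(prompt=tokenize(prompt), max_new_tokens=max_new_tokens)
    return {"text": detokenize(await batcher.submit(req))}
```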

lessw2020 (Contributor) commented

Hi @vgoklani - got it, thanks for your feedback.
This has generated a discussion about possibly making a reference architecture to showcase these types of features.
Let me leave this issue open, and I will update it if this turns into a real effort.
