feat: vllm llama integration #129
Merged
Conversation
Cifko approved these changes on Sep 18, 2024
LGTM, I will test that later
Integrate Llama Model for Fast, Batched Inference in atoma-vllm
Motivation
The atoma-vllm library aims to provide efficient, batched inference capabilities for large language models. Integrating the Llama model into our system will expand our support for popular and powerful language models, enabling users to leverage Llama's capabilities within our high-performance inference framework.
Description
This PR introduces Llama model support to the atoma-vllm library, implementing the necessary components for model loading, execution, and integration with our existing infrastructure. Key changes include:
- Added a new `llama.rs` file in the `models` directory to implement Llama-specific functionality.
- Implemented the `ModelLoader`, `ModelMetadata`, and `ModelExecutor` traits for the `LlamaModel` struct (see the sketch after this list).
- Updated the `ModelFilePaths` struct to accommodate Llama's file structure.
- Modified the `fetch` and `load` functions to handle Llama model files and configurations.
- Added a `llama.rs` file in the `tests` directory to ensure proper functionality of the Llama integration.
- Updated `Cargo.toml` to include the necessary Llama-related libraries.

These changes allow users to easily load and use Llama models within the atoma-vllm framework, benefiting from our optimized, batched inference capabilities.
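For orientation, here is a minimal sketch of how these traits might fit together for `LlamaModel`. The method signatures, field names, and the use of `anyhow` below are illustrative assumptions, not the crate's actual definitions:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical shape of `ModelFilePaths`: the files a model needs on
/// disk (the real field names may differ).
pub struct ModelFilePaths {
    pub config_path: PathBuf,
    pub tokenizer_path: PathBuf,
    pub weight_paths: Vec<PathBuf>, // e.g. sharded weight files
}

/// Hypothetical trait shapes, for illustration only.
pub trait ModelLoader: Sized {
    /// Download or locate the model files and return their paths.
    fn fetch(model_id: &str, cache_dir: &Path) -> anyhow::Result<ModelFilePaths>;
    /// Build the model in memory from the fetched files.
    fn load(paths: &ModelFilePaths) -> anyhow::Result<Self>;
}

pub trait ModelMetadata {
    fn num_layers(&self) -> usize;
    fn hidden_size(&self) -> usize;
}

pub trait ModelExecutor {
    /// Run one batched forward pass over the given token sequences.
    fn forward(&mut self, batch: &[Vec<u32>]) -> anyhow::Result<Vec<Vec<u32>>>;
}

pub struct LlamaModel {
    // weights, KV caches, config, ...
}

impl ModelLoader for LlamaModel {
    fn fetch(model_id: &str, cache_dir: &Path) -> anyhow::Result<ModelFilePaths> {
        // Resolve the config, tokenizer, and weight shards for the
        // requested Llama checkpoint under `cache_dir`.
        todo!()
    }

    fn load(paths: &ModelFilePaths) -> anyhow::Result<Self> {
        // Parse the config, load the weight shards, and assemble the
        // transformer layers.
        todo!()
    }
}
```

In this shape a caller would run something like `let paths = LlamaModel::fetch(model_id, cache_dir)?;` followed by `let model = LlamaModel::load(&paths)?;`, after which the executor drives batched inference.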
Breaking Changes
- The `ModelLoader` trait's `fetch` and `load` functions now have slightly different signatures to accommodate the new `ModelFilePaths` struct and additional parameters.
- The `ModelMetadata` trait has been updated with new method names and signatures; any custom model implementations will need to be updated to match the new trait definitions.

These breaking changes are necessary to support a wider range of models and to improve overall system flexibility. Users of the library may need to update their code if they have implemented custom models or interact directly with low-level components of the system; a migration sketch follows.
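As a rough illustration of the migration, continuing the hypothetical definitions from the sketch above, a custom model implementation might change along these lines (the before/after signatures are assumptions based on this description, not the exact API):

```rust
use std::path::Path;

struct MyModel; // stand-in for a user-defined custom model

// Before (hypothetical): `fetch` and `load` worked with loose paths.
//
// impl ModelLoader for MyModel {
//     fn fetch(model_id: &str) -> anyhow::Result<PathBuf> { ... }
//     fn load(weights: PathBuf, config: PathBuf) -> anyhow::Result<Self> { ... }
// }

// After: both operations go through the new `ModelFilePaths` struct, so
// a custom model updates its impl block to the new signatures.
impl ModelLoader for MyModel {
    fn fetch(model_id: &str, cache_dir: &Path) -> anyhow::Result<ModelFilePaths> {
        // Gather all of the model's files into a single ModelFilePaths value.
        todo!()
    }

    fn load(paths: &ModelFilePaths) -> anyhow::Result<Self> {
        // Read everything the model needs from `paths` instead of
        // separate weight/config arguments.
        todo!()
    }
}
```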