The Atoma Node Inference repository provides optimized infrastructure for Large Language Model (LLM) compute. It relies on highly optimized KV cache memory management, using block-based (paged) allocation techniques such as PagedAttention and FlashAttention-2. The codebase is written mostly in Rust, enabling safe and highly optimized scheduling of inference requests and enhancing LLM inference serving.
- Implements PagedAttention for efficient KV cache management
- Supports Llama 3.1 models
- Optimized for inference serving in distributed systems
- Integrates with the Candle ML framework for high-performance Rust-based machine learning
- Scalable architecture for handling multiple concurrent requests
- Efficient memory management for improved performance
PagedAttention is a technique for managing KV cache memory in LLMs: the cache is divided into fixed-size blocks that are allocated on demand rather than reserved up front. This significantly improves inference efficiency, especially in long-context scenarios. For more details, see the original paper.
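To make the idea concrete, here is a minimal, hypothetical Rust sketch of on-demand block allocation; the names (`BlockTable`, `BlockAllocator`, `BLOCK_SIZE`) are illustrative only and are not this repository's actual API.

```rust
// Sketch of the core PagedAttention idea: the KV cache is split into
// fixed-size blocks, and each sequence keeps a "block table" mapping
// logical token positions to physical blocks, so memory is allocated
// on demand rather than reserved for the maximum sequence length.

const BLOCK_SIZE: usize = 16; // tokens per KV cache block (illustrative)

#[derive(Default)]
struct BlockTable {
    physical_blocks: Vec<usize>, // one physical block id per logical block
    num_tokens: usize,
}

struct BlockAllocator {
    free_blocks: Vec<usize>,
}

impl BlockAllocator {
    fn new(total_blocks: usize) -> Self {
        Self { free_blocks: (0..total_blocks).rev().collect() }
    }

    /// Appends one token's KV entry, grabbing a fresh physical block
    /// only when the current block is full.
    fn append_token(&mut self, table: &mut BlockTable) -> Option<(usize, usize)> {
        let offset = table.num_tokens % BLOCK_SIZE;
        if offset == 0 {
            let block = self.free_blocks.pop()?; // None => out of cache memory
            table.physical_blocks.push(block);
        }
        let block = *table.physical_blocks.last().unwrap();
        table.num_tokens += 1;
        Some((block, offset)) // slot where this token's K/V is written
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(4);
    let mut table = BlockTable::default();
    for _ in 0..20 {
        alloc.append_token(&mut table);
    }
    // 20 tokens at 16 tokens per block occupy 2 physical blocks.
    println!("blocks used: {}", table.physical_blocks.len());
}
```

Because blocks are handed out only as tokens arrive, short sequences never pay for the memory of the longest possible context.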
Flash Attention 2 is a highly optimized algorithm for efficient attention computation in transformers. It builds on the observation that reads and writes to the GPU's HBM (high-bandwidth memory) are the main bottleneck in computing attention efficiently, especially for the softmax's intermediate values. The algorithm uses custom CUDA kernels that compute attention scores block-wise, keeping intermediates in fast on-chip shared memory and thereby minimizing HBM writes. For more details, see the original paper.
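The block-wise computation rests on the online (streaming) softmax recurrence, which can be illustrated on the CPU in plain Rust; this is a sketch of the math only, not of the repository's CUDA kernels, and the function name is invented for illustration.

```rust
// Online softmax: consume attention scores one block at a time while
// keeping a running max `m`, a running normalizer `l`, and a running
// weighted sum `acc`. The full score row is never materialized, which
// is what lets Flash Attention avoid round-trips to slow HBM.

/// Computes softmax(scores) . values for a single query row,
/// streaming over blocks of `block_size` scores.
fn online_softmax_weighted_sum(scores: &[f32], values: &[f32], block_size: usize) -> f32 {
    let mut m = f32::NEG_INFINITY; // running max (numerical stability)
    let mut l = 0.0f32;            // running sum of exp(score - m)
    let mut acc = 0.0f32;          // running softmax-weighted sum

    for (s_blk, v_blk) in scores.chunks(block_size).zip(values.chunks(block_size)) {
        let m_blk = s_blk.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let m_new = m.max(m_blk);
        let scale = (m - m_new).exp(); // rescale previous partial results
        l = l * scale + s_blk.iter().map(|s| (s - m_new).exp()).sum::<f32>();
        acc = acc * scale
            + s_blk.iter().zip(v_blk).map(|(s, v)| (s - m_new).exp() * v).sum::<f32>();
        m = m_new;
    }
    acc / l
}

fn main() {
    let scores = [0.1f32, 0.5, -0.2, 0.3];
    let values = [1.0f32, 2.0, 3.0, 4.0];
    let streamed = online_softmax_weighted_sum(&scores, &values, 2);

    // Reference: materialize the full softmax in one shot.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let reference: f32 = exps.iter().zip(&values).map(|(e, v)| e / sum * v).sum();

    assert!((streamed - reference).abs() < 1e-5);
    println!("streamed = {streamed}, reference = {reference}");
}
```

The block-wise result matches the one-shot softmax exactly (up to floating-point rounding), which is why the kernel can process arbitrarily long rows with a fixed amount of fast memory.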
This project leverages Candle, HuggingFace's Rust-based ML framework. Candle offers several advantages:
- Blazing-fast performance from Rust
- Memory safety guarantees
- Seamless integration with distributed AI inference systems
- Fully open-source and community-driven development
- Fork and star the repository.
- Clone your forked repository:
git clone https://github.com/your-username/atoma-node-inference.git
- Install Rust: Follow the instructions at https://www.rust-lang.org/tools/install
- Navigate to the project directory:
cd atoma-node-inference
- Initialize the git submodules:
git submodule init
git pull --recurse-submodules
- Build the project:
cargo build --release
- Run tests:
cargo test
For more detailed instructions, please refer to our documentation.
Under no circumstances should a single PR mix different purposes: your PR is either a bug fix, a new feature, or a performance improvement, never a combination. Likewise, do not include, for example, two unrelated performance improvements in one PR; please submit separate PRs instead. The goal is to make reviewing your PR as simple as possible, so think about how to compose the PR to minimise the burden on the reviewer.
Here are a few specific guidelines for the three main categories of PRs that we expect:
In the PR description, please clearly but briefly describe
- the bug (could be a reference to a GH issue; if it is from a discussion (on Discord/email/etc. for example), please copy in the relevant parts of the discussion);
- what turned out to cause the bug; and
- how the PR fixes the bug.
Wherever possible, PRs that fix bugs should include additional tests that (i) trigger the original bug and (ii) pass after applying the PR.
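As a hypothetical sketch of that pattern (the function, the bug, and the test name below are invented for illustration, not taken from this codebase):

```rust
// Suppose the bug was that the KV cache block count was computed with
// truncating integer division, under-allocating whenever the token
// count wasn't a multiple of the block size. The fix uses div_ceil,
// and the regression test pins down the previously failing case.

fn num_blocks_needed(num_tokens: usize, block_size: usize) -> usize {
    num_tokens.div_ceil(block_size) // was: num_tokens / block_size
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn partial_block_is_still_allocated() {
        // 17 tokens with 16-token blocks need 2 blocks, not 1.
        // This assertion fails on the pre-fix code (i) and passes
        // after the fix (ii).
        assert_eq!(num_blocks_needed(17, 16), 2);
    }
}
```

A test like this both documents the bug and prevents it from silently returning.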
In the PR description, please clearly but briefly describe
- what the feature does
- the approach taken to implement it
All PRs for new features must include a suitable test suite.
Performance improvements are particularly welcome! In the PR description, please clearly but briefly describe:
- The target bottleneck (only one per PR to avoid confusing things!)
- How performance is measured
- Characteristics of the machine used (CPU, OS, GPU, etc.)
- Performance gains in terms of speedup and memory usage (e.g. a 2x speedup and a 50% memory reduction)
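As a rough illustration of "how performance is measured", a simple wall-clock benchmark using `std::time::Instant` might look like the following; the workload here is a stand-in, not the repository's actual inference loop, and a real PR would measure the code path it changes.

```rust
// Time a hot path over many iterations and report throughput.
// std::hint::black_box keeps the compiler from optimizing the
// stand-in workload away entirely.

use std::time::Instant;

fn main() {
    let iterations = 1_000;
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..iterations {
        acc = acc.wrapping_add(std::hint::black_box(i * i)); // stand-in workload
    }
    let elapsed = start.elapsed();
    println!(
        "{} iterations in {:?} ({:.1} iters/ms), checksum {}",
        iterations,
        elapsed,
        iterations as f64 / elapsed.as_secs_f64() / 1e3,
        acc
    );
}
```

Reporting the raw iteration count, wall time, and machine characteristics together makes the claimed speedup reproducible by reviewers.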
If you find a bug, please open an issue in our GitHub repository with a clear description and steps to reproduce.
Help us enhance our documentation by fixing typos, clarifying explanations, or adding examples.
Participate in discussions, answer questions, and share your expertise with the community.