
Deploying LLMs with NVIDIA GPUs on OCI Compute Bare Metal

Introduction


Have you ever wondered how to deploy a Large Language Model (LLM) on OCI? In this solution, you will learn how to deploy Large Language Models on OCI Bare Metal Compute instances with NVIDIA A10 Tensor Core GPUs, using an inference server called vLLM.

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API, meaning we can use common routes like /v1/completions and /v1/chat/completions across various LLMs, according to the application's requirements.
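Because the server speaks the OpenAI API protocol, an application built on the official OpenAI SDKs can often be pointed at the vLLM endpoint simply by overriding the standard environment variables. The values below are placeholders; the real URL and key come from the Terraform outputs described in section 3:

    # Placeholder values - substitute the LLM_URL and API_KEY Terraform outputs (see section 3)
    export OPENAI_BASE_URL="<LLM_URL value>/v1"
    export OPENAI_API_KEY="<API_KEY value>"
    # Any OpenAI-SDK-based application launched from this shell now targets the vLLM server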

These LLMs can come from any well-formed Hugging Face repository of your choice, so we will need to authenticate to Hugging Face with an access token to pull the models (unless you have built them locally).
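In this solution the token is supplied through the huggingface_access_token Terraform variable (see section 1). For manual testing on the instance, the same READ token can also be passed to the Hugging Face tooling directly; a small illustrative sketch:

    # Illustrative only - the Terraform stack already passes the token for you
    export HF_TOKEN="<your Hugging Face READ token>"
    # or log in interactively with the CLI shipped with huggingface_hub:
    huggingface-cli login --token "$HF_TOKEN"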

0. Prerequisites & Docs

Prerequisites

To run the setup, you will need the following:

  1. An Oracle Cloud Infrastructure (OCI) Account

  2. The user's tenancy and region must have the necessary service limits for GPU shapes. By default, the stack uses BM.GPU.A10.4, a bare metal shape with 4 NVIDIA A10 GPUs.

  3. If the user is not a tenancy administrator, the user must have been granted OCI access (via policies) to the following resource families:

    logging-family
    virtual-network-family
    instance-family
    volume-family
    api-gateway-family
    
  4. A valid API token from Hugging Face (READ permission is enough; read + write is not needed).

  5. If you are using an access-protected repository (like LLaMA-2 or LLaMA-3), make sure to accept the terms and conditions and get access to the gated repository in advance, through the Hugging Face portal.

Docs

1. Deploy Infrastructure

We'll use the Terraform stack to deploy the required infrastructure.

(Recommended) Option 1. Using the OCI Terraform provider and Terraform CLI

  1. Clone the repository:

    git clone https://github.com/oracle-devrel/oci-terraform-genai-llm-on-gpuvms
  2. Create or edit a terraform.tfvars file with the following 9 variables: the tenancy, user, and compartment OCIDs, the API key fingerprint and private key path, the region, model_path, huggingface_access_token, and ssh_public_key:

    cd oci-terraform-genai-llm-on-gpuvms
    vi terraform.tfvars
    # Authentication
    tenancy_ocid         = "OCID of your OCI tenancy"
    user_ocid            = "OCID of your OCI user"
    fingerprint          = "Fingerprint of your OCI API key"
    private_key_path     = "Path to your OCI API private key"
    # Region
    region = "OCI region"
    # Compartment
    compartment_ocid = "OCID of your OCI compartment"
    # LLM information
    model_path = "Path of your LLM - for example meta-llama/Meta-Llama-3-8B"
    huggingface_access_token = "READ access token from Hugging Face"
    ssh_public_key = "SSH public key used to access the instance"

    The API signing key needs to be registered with your OCI user in your tenancy, under Identity >> Domains >> OracleIdentityCloudService >> Users. You can use the "API Keys" section there to create a key pair, obtain the fingerprint, and look up the tenancy and user OCIDs.
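    Alternatively, if you have the OCI CLI installed locally, it can generate the API signing key pair and a config file for you; the generated file contains the same tenancy/user OCIDs, region, fingerprint, and private key path that terraform.tfvars expects. A minimal sketch, assuming the CLI is already installed:

    # Interactive setup: prompts for user OCID, tenancy OCID, and region, and can generate a key pair
    oci setup config
    # Review the generated values (default location)
    cat ~/.oci/config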

    If you don't have one already, you can create an SSH public-private key pair (for the ssh_public_key variable) by running the following command in bash:

    ssh-keygen 
  3. Depending on the compute shape you want to use, modify variables.tf (the instance_shape variable) and setup.sh (parallel_gpu_count). If the shape has n GPUs, parallel_gpu_count should also be n; see the sketch below.
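    A minimal sketch for locating the two values that must stay in sync (the shape in the comment is purely illustrative; verify availability and service limits in your region):

    # Locate the current settings before editing
    grep -n "instance_shape" variables.tf
    grep -n "parallel_gpu_count" setup.sh
    # Example pairing (hypothetical): instance_shape "VM.GPU.A10.2" -> parallel_gpu_count=2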

  4. Execute the Terraform plan & apply:

    terraform init
    terraform plan
    terraform apply
  5. (Optional) After you're done with development and want to delete the stack, run the following command:

    terraform destroy

    Note: this action is irreversible!

2. Execution Workflow

  1. These are the OCI resources created by the Terraform stack:

    • OCI VM based on a GPU Image
    • OCI API Gateway and Deployment, which exposes the calls to the LLM through an API hosted on OCI
  2. setup.sh is executed as part of the Terraform run (you don't need to run it yourself), using your own HF access token to pull the Large Language Model.

This script installs all the necessary software libraries and modules and loads the LLM.

It then starts an inference endpoint, so we can later call the model through an API if we want to.
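The exact launch command lives in setup.sh, but as a rough sketch, a vLLM OpenAI-compatible server on a multi-GPU shape is typically started along these lines (the model name and port are illustrative; the tensor parallel size corresponds to parallel_gpu_count):

    # Rough sketch only - setup.sh performs the real installation and service setup
    pip install vllm
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Meta-Llama-3-8B \
        --tensor-parallel-size 4 \
        --port 8000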

3. How to use the LLM

  • By default, the startup script exposes LLM inference through an OpenAI-compatible route. This lets us plug and play models easily and brings the exposed API endpoints under a common interface, even for models that are very different from standard GPT models. It also enables supported features like streaming, chat completion, embeddings, and detokenization.

  • Some of the OpenAI-compatible routes exposed by vLLM are:

    • /v1/models
    • /v1/completions
    • /v1/chat/completions
  • After Terraform has completed, refer to the execution output to fetch the URL and API key:

    terraform output LLM_URL
    "https://XXXX.<OCIREGION>/path/name"
    terraform output API_KEY
    "AlphaNumeric..."

We can now query the models currently enabled on our LLM server.

Make sure to set the URL and TOKEN variables with the values from the previous step.

  1. Ask for the available models using curl:

    export URL="<LLM_URL value>"
    export TOKEN="<API_KEY value>"
    curl -k $URL/v1/models  -H "Authorization: Bearer $TOKEN"
  2. Ask for the available models using Python:

    python scripts/api_gw_llm.py
  3. Run a chat completion using the Python OpenAI library:

    python scripts/completions_llm.py
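    For reference, the equivalent raw chat-completion request can also be sent with curl; the model name and prompt below are placeholders (use a model returned by /v1/models):

    curl -k "$URL/v1/chat/completions" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
            "model": "meta-llama/Meta-Llama-3-8B",
            "messages": [{"role": "user", "content": "Say hello in one sentence."}]
          }'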

Basic Troubleshooting

  1. LLM inference is not ready, or you get a 504 error code when calling the URL:

    • Log in to the VM where you've deployed the solution, using the SSH private key you created in section 1.

    • Check the startup logs; the default path is /home/opc/llm-init/init.log.

    • For any failures, refer to setup.sh to see which exact steps to re-run manually.

    • Check that the vLLM service is up and running:

      sudo systemctl status vllm-inference-openai.service
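      If the service is active but the endpoint still fails, the systemd journal for the unit is another place to look (standard systemd command, shown as a suggestion):

      # Follow the service logs; press Ctrl+C to stop
      sudo journalctl -u vllm-inference-openai.service -f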
  2. Details about LLM usage or responses: inference logs are captured under the path /home/opc/vllm-master, in the file vllm_log_file.log. If you wish to write the logs to a different path, update the files bash.sh and bash_openai.sh in that same path and restart the service:

    # Update the line below in bash.sh or bash_openai.sh
    export vllm_log_file=<new absolute path to the logs>
    sudo systemctl restart vllm-inference-openai.service

Contributors

Author: Rahul M R.

Editor: Nacho Martinez

Last release: May 2024

This project is open source. Please submit your contributions by forking this repository and submitting a pull request! Oracle appreciates any contributions that are made by the open source community.

License

Copyright (c) 2024 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See LICENSE for more details.

ORACLE AND ITS AFFILIATES DO NOT PROVIDE ANY WARRANTY WHATSOEVER, EXPRESS OR IMPLIED, FOR ANY SOFTWARE, MATERIAL OR CONTENT OF ANY KIND CONTAINED OR PRODUCED WITHIN THIS REPOSITORY, AND IN PARTICULAR SPECIFICALLY DISCLAIM ANY AND ALL IMPLIED WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. FURTHERMORE, ORACLE AND ITS AFFILIATES DO NOT REPRESENT THAT ANY CUSTOMARY SECURITY REVIEW HAS BEEN PERFORMED WITH RESPECT TO ANY SOFTWARE, MATERIAL OR CONTENT CONTAINED OR PRODUCED WITHIN THIS REPOSITORY. IN ADDITION, AND WITHOUT LIMITING THE FOREGOING, THIRD PARTIES MAY HAVE POSTED SOFTWARE, MATERIAL OR CONTENT TO THIS REPOSITORY WITHOUT ANY REVIEW. USE AT YOUR OWN RISK.
