Merge pull request nutanix#24 from saileshd1402/clean_readme
Clean README - Redirect to Opendocs
johnugeorge authored Nov 16, 2023
2 parents 5f0389a + fc0f0d8 commit 3b30121
Showing 3 changed files with 13 additions and 212 deletions.
219 changes: 10 additions & 209 deletions README.md
@@ -1,214 +1,15 @@
# NAI-LLM
## Nutanix GPT-in-a-Box: Virtual Machine

## Setup
This is the official repository for the virtual machine version of Nutanix GPT-in-a-Box.

Install OpenJDK 17 and pip3:
```
sudo apt-get install openjdk-17-jdk python3-pip
```
Nutanix GPT-in-a-Box is a new turnkey solution that includes everything needed to build AI-ready infrastructure for organizations wanting to implement GPT capabilities while maintaining control of their data and applications.

Install required packages:
This new solution includes:
- Software-defined Nutanix Cloud Platform™ infrastructure supporting GPU-enabled server nodes for seamless scaling of virtualized compute, storage, and networking for both traditional virtual machines and Kubernetes-orchestrated containers
- Files and Objects storage to fine-tune and run a choice of GPT models
- Open source software to deploy and run AI workloads, including the PyTorch framework and the KubeFlow MLOps platform
- A management interface with an enhanced terminal UI or a standard CLI
- Support for a curated set of LLMs, including Llama2, Falcon, and MPT

```
pip install -r requirements.txt
```

Install NVIDIA Drivers:

[Installation Reference](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#runfile) <br />

Download the latest [Datacenter Nvidia drivers](https://www.nvidia.com/download/index.aspx) for the GPU type

For Nvidia A100, select A100 under Data Center / Tesla for Linux 64-bit with CUDA Toolkit 11.7; at the time of writing, the latest driver is 515.105.01

```
curl -fsSL -O https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run
sudo sh NVIDIA-Linux-x86_64-515.105.01.run -s
```

Note: The CUDA toolkit does not need to be installed separately, as it is bundled with the PyTorch installation. Installing the Nvidia driver is sufficient.
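
As a quick sanity check, the driver installation can be verified by querying the GPU:
```
nvidia-smi
```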


## Scripts

### Download model files and Generate MAR file
Run the following command to download the model files and/or generate the MAR file:
```
python3 llm/download.py [--no_download --repo_version <REPO_VERSION>] --model_name <MODEL_NAME> --model_path <MODEL_PATH> --mar_output <MAR_EXPORT_PATH> --hf_token <Your_HuggingFace_Hub_Token>
```
- no_download: Set flag to skip downloading the model files
- model_name: Name of the model
- repo_version: Commit ID of the model's HuggingFace repository (optional; if not provided, the default set in model_config is used)
- model_path: Absolute path of the model files directory (should be empty if downloading)
- mar_output: Absolute path of the directory to which the MAR file (.mar) is exported
- hf_token: Your HuggingFace token, needed to download and verify Llama2 models

The available LLMs are mpt_7b, falcon_7b, and llama2_7b.

#### Examples
Download MPT-7B model files and generate model archive for it:
```
python3 llm/download.py --model_name mpt_7b --model_path /home/ubuntu/models/mpt_7b/model_files --mar_output /home/ubuntu/models/model_store
```
Download Falcon-7B model files and generate model archive for it:
```
python3 llm/download.py --model_name falcon_7b --model_path /home/ubuntu/models/falcon_7b/model_files --mar_output /home/ubuntu/models/model_store
```
Download Llama2-7B model files and generate model archive for it:
```
python3 llm/download.py --model_name llama2_7b --model_path /home/ubuntu/models/llama2_7b/model_files --mar_output /home/ubuntu/models/model_store --hf_token <Your_HuggingFace_Hub_Token>
```

### Start Torchserve and run inference
Run the following command to start Torchserve and run inference on the given input:
```
bash llm/run.sh -n <MODEL_NAME> -a <MAR_EXPORT_PATH> [OPTIONAL -d <INPUT_PATH> -v <REPO_VERSION>]
```
- n: Name of model
- v: Commit ID of the model's HuggingFace repository (optional; if not provided, the default set in model_config is used)
- d: Absolute path of input data folder (optional)
- a: Absolute path to the Model Store directory

For model names, we support MPT-7B, Falcon-7B, and Llama2-7B.
The script should print "Ready For Inferencing" at the end.

#### Examples
For Inference with official MPT-7B model:
```
bash llm/run.sh -n mpt_7b -d data/translate -a /home/ubuntu/models/model_store
```
For Inference with official Falcon-7B model:
```
bash llm/run.sh -n falcon_7b -d data/qa -a /home/ubuntu/models/model_store
```
For Inference with official Llama2-7B model:
```
bash llm/run.sh -n llama2_7b -d data/summarize -a /home/ubuntu/models/model_store
```

### Describe registered model
curl http://{inference_server_endpoint}:{management_port}/models/{model_name} <br />

For MPT-7B model
```
curl http://localhost:8081/models/mpt_7b
```
For Falcon-7B model
```
curl http://localhost:8081/models/falcon_7b
```
For Llama2-7B model
```
curl http://localhost:8081/models/llama2_7b
```

### Inference Check
For inferencing with a text file:<br />
curl -v -H "Content-Type: application/text" http://{inference_server_endpoint}:{inference_port}/predictions/{model_name} -d @path/to/data.txt

For inferencing with a json file:<br />
curl -v -H "Content-Type: application/json" http://{inference_server_endpoint}:{inference_port}/predictions/{model_name} -d @path/to/data.json

Test input files can be found in the data folder. <br />

For MPT-7B model
```
curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/mpt_7b -d @data/qa/sample_text1.txt
```
```
curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/mpt_7b -d @data/qa/sample_text4.json
```

For Falcon-7B model
```
curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/falcon_7b -d @data/summarize/sample_text1.txt
```
```
curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/falcon_7b -d @data/summarize/sample_text3.json
```

For Llama2-7B model
```
curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/llama2_7b -d @data/translate/sample_text1.txt
```
```
curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/llama2_7b -d @data/translate/sample_text3.json
```

### Register additional models
For loading multiple unique models, make sure that the MAR files (.mar) for the concerned models are stored in the same directory. <br />

curl -X POST "http://{inference_server_endpoint}:{management_port}/models?url={mar_name}.mar&initial_workers=1&synchronous=true"

For MPT-7B model
```
curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true"
```
For Falcon-7B model
```
curl -X POST "http://localhost:8081/models?url=falcon_7b.mar&initial_workers=1&synchronous=true"
```
For Llama2-7B model
```
curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true"
```

### Edit registered model configuration
curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}?min_worker={number}&max_worker={number}&batch_size={number}&max_batch_delay={delay_in_ms}"

For MPT-7B model
```
curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2"
```
For Falcon-7B model
```
curl -v -X PUT "http://localhost:8081/models/falcon_7b?min_worker=2&max_worker=2"
```
For Llama2-7B model
```
curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2"
```

### Unregister a model
curl -X DELETE "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}"
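
For example, to unregister a registered Llama2-7B model (the repo_version below is a placeholder for the HuggingFace commit ID the model was registered with):
```
# <repo_version> is a placeholder; use the commit ID of the registered version
curl -X DELETE "http://localhost:8081/models/llama2_7b/<repo_version>"
```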

### Stop Torchserve and Cleanup
If the keep-alive flag was set in the bash script, you can run the following command to stop the server and clean up temporary files:
```
python3 llm/cleanup.py
```

## Model Version Support

We provide the capability to download and register various commits of a single model from HuggingFace. By specifying the commit ID as "repo_version", you can produce MAR files for multiple iterations of the same model and register them simultaneously. To transition between these versions, you can set a default version within Torchserve while it is running.
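
For example, a MAR file for a specific commit of Llama2-7B can be produced by passing that commit ID as repo_version (the <commit_id> below is a placeholder, and model_path should point to an empty directory as described above):
```
# <commit_id> is a placeholder for a HuggingFace commit ID of the llama2_7b repo
python3 llm/download.py --model_name llama2_7b --repo_version <commit_id> --model_path /home/ubuntu/models/llama2_7b/model_files --mar_output /home/ubuntu/models/model_store --hf_token <Your_HuggingFace_Hub_Token>
```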

### Set Default Model Version
If multiple versions of the same model are registered, we can set a particular version as the default for inferencing.<br />

curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}/set-default"
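
For example, to make a specific registered version of the MPT-7B model the default (the <repo_version> below is a placeholder for the commit ID used when that version was registered):
```
# <repo_version> is a placeholder for the registered version's commit ID
curl -v -X PUT "http://localhost:8081/models/mpt_7b/<repo_version>/set-default"
```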

## Custom Model Support

We provide the capability to generate a MAR file with custom models and start an inference server using it with Torchserve.
A model is recognised as a custom model if its model name is not present in the model_config file.<br />

### Generate MAR file for custom model
To generate the MAR file, run the following:
```
python3 download.py --no_download [--repo_version <REPO_VERSION> --handler <CUSTOM_HANDLER_PATH>] --model_name <CUSTOM_MODEL_NAME> --model_path <MODEL_PATH> --mar_output <MAR_EXPORT_PATH>
```
- no_download: Set flag to skip downloading the model files, must be set for custom models
- model_name: Name of custom model
- repo_version: Any model version, defaults to "1.0" (optional)
- model_path: Absolute path of the custom model files (should be non-empty, since the model files are supplied by the user)
- mar_output: Absolute path of export of MAR file (.mar)
- handler: Path to custom handler, defaults to llm/handler.py (optional)<br />
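
For example, to generate a model archive for a locally stored custom model (the model name and paths below are placeholders):
```
# "custom_model" and the paths below are illustrative placeholders
python3 download.py --no_download --model_name custom_model --model_path /home/ubuntu/models/custom_model/model_files --mar_output /home/ubuntu/models/model_store
```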

### Start Torchserve and run inference for custom model
To start Torchserve and run inference on the given input with a custom MAR file, run the following:
```
bash run.sh -n <CUSTOM_MODEL_NAME> -a <MAR_EXPORT_PATH> [OPTIONAL -d <INPUT_PATH>]
```
- n: Name of custom model
- d: Absolute path of input data folder (optional)
- a: Absolute path to the Model Store directory
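
For example, to start Torchserve with the custom model archive generated above (the model name and paths below are placeholders):
```
# "custom_model" and the paths below are illustrative placeholders
bash run.sh -n custom_model -a /home/ubuntu/models/model_store -d /home/ubuntu/models/custom_model/input_data
```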
Refer to the official [GPT-in-a-Box Documentation](https://opendocs.nutanix.com/gpt-in-a-box/vm/getting_started/) to deploy and validate the inference server on a virtual machine.
2 changes: 1 addition & 1 deletion llm/handler.py
@@ -118,7 +118,7 @@ def initialize(self, context):

def preprocess(self, data: str) -> torch.Tensor:
"""
This method tookenizes input text using the associated tokenizer.
This method tokenizes input text using the associated tokenizer.
Args:
text (str): The input text to be tokenized.
Returns:
4 changes: 2 additions & 2 deletions llm/utils/tsutils.py
@@ -154,8 +154,8 @@ class with relevant information.
f'"minWorkers":{initial_workers or 1},'
f'"maxWorkers":{initial_workers or 1},'
f'"batchSize":{batch_size or 1},'
f'"maxBatchDelay":{max_batch_delay or 1},'
f'"responseTimeout":{response_timeout or 120}}}}}}}}}',
f'"maxBatchDelay":{max_batch_delay or 200},'
f'"responseTimeout":{response_timeout or 2000}}}}}}}}}',
]
with open(dst_config_path, "a", encoding="utf-8") as config_file:
config_file.writelines(config_info)
