Merge pull request nutanix#24 from saileshd1402/clean_readme
Clean README - Redirect to Opendocs
johnugeorge authored Nov 16, 2023
2 parents 5f0389a + fc0f0d8 commit 3b30121
Showing 3 changed files with 13 additions and 212 deletions.
219 changes: 10 additions & 209 deletions README.md
@@ -1,214 +1,15 @@
# NAI-LLM
## Nutanix GPT-in-a-Box: Virtual Machine

## Setup
This is the official repository for the virtual machine version of Nutanix GPT-in-a-Box.

Install OpenJDK 17 and pip3:
```
sudo apt-get install openjdk-17-jdk python3-pip
```
Nutanix GPT-in-a-Box is a new turnkey solution that includes everything needed to build AI-ready infrastructure for organizations wanting to implement GPT capabilities while maintaining control of their data and applications.

Install required packages:
This new solution includes:
- Software-defined Nutanix Cloud Platform™ infrastructure supporting GPU-enabled server nodes for seamless scaling of virtualized compute, storage, and networking for both traditional virtual machines and Kubernetes-orchestrated containers
- Files and Objects storage to fine-tune and run a choice of GPT models
- Open source software to deploy and run AI workloads, including the PyTorch framework and the KubeFlow MLOps platform
- A management interface with an enhanced terminal UI or a standard CLI
- Support for a curated set of LLMs, including Llama2, Falcon, and MPT

```
pip install -r requirements.txt
```

Install NVIDIA Drivers:

[Installation Reference](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#runfile) <br />

Download the latest [Datacenter Nvidia drivers](https://www.nvidia.com/download/index.aspx) for the GPU type

For Nvidia A100, select A100 under Data Center / Tesla for Linux 64-bit with CUDA Toolkit 11.7; at the time of writing, the latest driver is 515.105.01

```
curl -fsSL -O https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run
sudo sh NVIDIA-Linux-x86_64-515.105.01.run -s
```

Note: The CUDA toolkit does not need to be installed separately, as it is bundled with the PyTorch installation. Installing the Nvidia driver is sufficient.
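
As a quick sanity check, the driver installation can be verified by querying the GPU:
```
nvidia-smi
```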


## Scripts

### Download model files and Generate MAR file
Run the following command to download the model files and/or generate the MAR file:
```
python3 llm/download.py [--no_download --repo_version <REPO_VERSION>] --model_name <MODEL_NAME> --model_path <MODEL_PATH> --mar_output <MAR_EXPORT_PATH> --hf_token <Your_HuggingFace_Hub_Token>
```
- no_download: Set flag to skip downloading the model files
- model_name: Name of the model
- repo_version: Commit ID of the model's HuggingFace repository (optional; if not provided, the default set in model_config is used)
- model_path: Absolute path of the model files directory (should be empty if downloading)
- mar_output: Absolute path of the directory to which the MAR file (.mar) is exported
- hf_token: Your HuggingFace token, needed to download and verify Llama2 models

The available LLMs are mpt_7b, falcon_7b, and llama2_7b.

#### Examples
Download MPT-7B model files and generate model archive for it:
```
python3 llm/download.py --model_name mpt_7b --model_path /home/ubuntu/models/mpt_7b/model_files --mar_output /home/ubuntu/models/model_store
```
Download Falcon-7B model files and generate model archive for it:
```
python3 llm/download.py --model_name falcon_7b --model_path /home/ubuntu/models/falcon_7b/model_files --mar_output /home/ubuntu/models/model_store
```
Download Llama2-7B model files and generate model archive for it:
```
python3 llm/download.py --model_name llama2_7b --model_path /home/ubuntu/models/llama2_7b/model_files --mar_output /home/ubuntu/models/model_store --hf_token <Your_HuggingFace_Hub_Token>
```

### Start Torchserve and run inference
Run the following command to start Torchserve and run inference on the given input:
```
bash llm/run.sh -n <MODEL_NAME> -a <MAR_EXPORT_PATH> [OPTIONAL -d <INPUT_PATH> -v <REPO_VERSION>]
```
- n: Name of model
- v: Commit ID of the model's HuggingFace repository (optional; if not provided, the default set in model_config is used)
- d: Absolute path of input data folder (optional)
- a: Absolute path to the Model Store directory

For model names, we support MPT-7B, Falcon-7B, and Llama2-7B.
The script should print "Ready For Inferencing" at the end.

#### Examples
For Inference with official MPT-7B model:
```
bash llm/run.sh -n mpt_7b -d data/translate -a /home/ubuntu/models/model_store
```
For Inference with official Falcon-7B model:
```
bash llm/run.sh -n falcon_7b -d data/qa -a /home/ubuntu/models/model_store
```
For Inference with official Llama2-7B model:
```
bash llm/run.sh -n llama2_7b -d data/summarize -a /home/ubuntu/models/model_store
```

### Describe registered model
curl http://{inference_server_endpoint}:{management_port}/models/{model_name} <br />

For MPT-7B model
```
curl http://localhost:8081/models/mpt_7b
```
For Falcon-7B model
```
curl http://localhost:8081/models/falcon_7b
```
For Llama2-7B model
```
curl http://localhost:8081/models/llama2_7b
```

### Inference Check
For inferencing with a text file:<br />
curl -v -H "Content-Type: application/text" http://{inference_server_endpoint}:{inference_port}/predictions/{model_name} -d @path/to/data.txt

For inferencing with a json file:<br />
curl -v -H "Content-Type: application/json" http://{inference_server_endpoint}:{inference_port}/predictions/{model_name} -d @path/to/data.json

Test input files can be found in the data folder. <br />

For MPT-7B model
```
curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/mpt_7b -d @data/qa/sample_text1.txt
```
```
curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/mpt_7b -d @data/qa/sample_text4.json
```

For Falcon-7B model
```
curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/falcon_7b -d @data/summarize/sample_text1.txt
```
```
curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/falcon_7b -d @data/summarize/sample_text3.json
```

For Llama2-7B model
```
curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/llama2_7b -d @data/translate/sample_text1.txt
```
```
curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/llama2_7b -d @data/translate/sample_text3.json
```

### Register additional models
For loading multiple unique models, make sure that the MAR files (.mar) for the concerned models are stored in the same directory. <br />

curl -X POST "http://{inference_server_endpoint}:{management_port}/models?url={mar_name}.mar&initial_workers=1&synchronous=true"

For MPT-7B model
```
curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true"
```
For Falcon-7B model
```
curl -X POST "http://localhost:8081/models?url=falcon_7b.mar&initial_workers=1&synchronous=true"
```
For Llama2-7B model
```
curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true"
```

### Edit registered model configuration
curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}?min_worker={number}&max_worker={number}&batch_size={number}&max_batch_delay={delay_in_ms}"

For MPT-7B model
```
curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2"
```
For Falcon-7B model
```
curl -v -X PUT "http://localhost:8081/models/falcon_7b?min_worker=2&max_worker=2"
```
For Llama2-7B model
```
curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2"
```

### Unregister a model
curl -X DELETE "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}"
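
For example, to unregister a registered Llama2-7B model (the repo_version below is a placeholder for the HuggingFace commit ID the model was registered with):
```
# <repo_version> is a placeholder; use the commit ID of the registered version
curl -X DELETE "http://localhost:8081/models/llama2_7b/<repo_version>"
```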

### Stop Torchserve and Cleanup
If the keep-alive flag was set in the bash script, you can run the following command to stop the server and clean up temporary files:
```
python3 llm/cleanup.py
```

## Model Version Support

We provide the capability to download and register various commits of a single model from HuggingFace. By specifying the commit ID as "repo_version", you can produce MAR files for multiple iterations of the same model and register them simultaneously. To transition between these versions, you can set a default version within Torchserve while it is running.
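
For example, a MAR file for a specific commit of Llama2-7B can be produced by passing that commit ID as repo_version (the <commit_id> below is a placeholder, and model_path should point to an empty directory as described above):
```
# <commit_id> is a placeholder for a HuggingFace commit ID of the llama2_7b repo
python3 llm/download.py --model_name llama2_7b --repo_version <commit_id> --model_path /home/ubuntu/models/llama2_7b/model_files --mar_output /home/ubuntu/models/model_store --hf_token <Your_HuggingFace_Hub_Token>
```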

### Set Default Model Version
If multiple versions of the same model are registered, we can set a particular version as the default for inferencing.<br />

curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}/set-default"
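
For example, to make a specific registered version of the MPT-7B model the default (the <repo_version> below is a placeholder for the commit ID used when that version was registered):
```
# <repo_version> is a placeholder for the registered version's commit ID
curl -v -X PUT "http://localhost:8081/models/mpt_7b/<repo_version>/set-default"
```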

## Custom Model Support

We provide the capability to generate a MAR file with custom models and start an inference server using it with Torchserve.
A model is recognised as a custom model if its model name is not present in the model_config file.<br />

### Generate MAR file for custom model
To generate the MAR file, run the following:
```
python3 download.py --no_download [--repo_version <REPO_VERSION> --handler <CUSTOM_HANDLER_PATH>] --model_name <CUSTOM_MODEL_NAME> --model_path <MODEL_PATH> --mar_output <MAR_EXPORT_PATH>
```
- no_download: Set flag to skip downloading the model files, must be set for custom models
- model_name: Name of custom model
- repo_version: Any model version, defaults to "1.0" (optional)
- model_path: Absolute path of the custom model files (should be non-empty, since the model files are supplied by the user)
- mar_output: Absolute path of export of MAR file (.mar)
- handler: Path to custom handler, defaults to llm/handler.py (optional)<br />
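
For example, to generate a model archive for a locally stored custom model (the model name and paths below are placeholders):
```
# "custom_model" and the paths below are illustrative placeholders
python3 download.py --no_download --model_name custom_model --model_path /home/ubuntu/models/custom_model/model_files --mar_output /home/ubuntu/models/model_store
```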

### Start Torchserve and run inference for custom model
To start Torchserve and run inference on the given input with a custom MAR file, run the following:
```
bash run.sh -n <CUSTOM_MODEL_NAME> -a <MAR_EXPORT_PATH> [OPTIONAL -d <INPUT_PATH>]
```
- n: Name of custom model
- d: Absolute path of input data folder (optional)
- a: Absolute path to the Model Store directory
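
For example, to start Torchserve with the custom model archive generated above (the model name and paths below are placeholders):
```
# "custom_model" and the paths below are illustrative placeholders
bash run.sh -n custom_model -a /home/ubuntu/models/model_store -d /home/ubuntu/models/custom_model/input_data
```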
Refer to the official [GPT-in-a-Box Documentation](https://opendocs.nutanix.com/gpt-in-a-box/vm/getting_started/) to deploy and validate the inference server on a virtual machine.
2 changes: 1 addition & 1 deletion llm/handler.py
@@ -118,7 +118,7 @@ def initialize(self, context):

def preprocess(self, data: str) -> torch.Tensor:
"""
This method tookenizes input text using the associated tokenizer.
This method tokenizes input text using the associated tokenizer.
Args:
text (str): The input text to be tokenized.
Returns:
4 changes: 2 additions & 2 deletions llm/utils/tsutils.py
@@ -154,8 +154,8 @@ class with relevant information.
f'"minWorkers":{initial_workers or 1},'
f'"maxWorkers":{initial_workers or 1},'
f'"batchSize":{batch_size or 1},'
f'"maxBatchDelay":{max_batch_delay or 1},'
f'"responseTimeout":{response_timeout or 120}}}}}}}}}',
f'"maxBatchDelay":{max_batch_delay or 200},'
f'"responseTimeout":{response_timeout or 2000}}}}}}}}}',
]
with open(dst_config_path, "a", encoding="utf-8") as config_file:
config_file.writelines(config_info)
