support zipformer2 offline triton recipe (#639)
yuekaizhang authored Aug 23, 2024
1 parent 1de064e commit 771c1ca
Showing 19 changed files with 175 additions and 963 deletions.
45 changes: 14 additions & 31 deletions triton/Dockerfile/Dockerfile.server
@@ -1,41 +1,24 @@
FROM nvcr.io/nvidia/tritonserver:22.12-py3
FROM nvcr.io/nvidia/tritonserver:24.07-py3
# https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
# Please choose a previous tritonserver:xx.xx image if you encounter a CUDA driver mismatch issue

LABEL maintainer="NVIDIA"
LABEL repository="tritonserver"

RUN apt-get update && apt-get -y install \
python3-dev \
cmake \
libsndfile1
RUN pip3 install \
torch==1.13.1+cu117 \
torchaudio==0.13.1+cu117 \
--index-url https://download.pytorch.org/whl/cu117
RUN pip3 install \
kaldialign \
tensorboard \
sentencepiece \
lhotse \
kaldifeat
RUN pip3 install \
k2==1.24.4.dev20240223+cuda11.7.torch1.13.1 -f https://k2-fsa.github.io/k2/cuda.html
# Dependency for client
RUN pip3 install soundfile grpcio-tools tritonclient pyyaml
RUN apt-get update && apt-get install -y cmake
RUN python3 -m pip install k2==1.24.4.dev20240725+cuda12.4.torch2.4.0 -f https://k2-fsa.github.io/k2/cuda.html && \
python3 -m pip install -r https://raw.githubusercontent.com/k2-fsa/icefall/master/requirements.txt && \
pip install -U "huggingface_hub[cli]" lhotse colored onnx_graphsurgeon polygraphy
# Delete the CUDA version check; see https://github.com/k2-fsa/k2/blob/master/k2/python/k2/__init__.py#L13
RUN sed -i '/if (/,/^ )/d' /usr/local/lib/python3.10/dist-packages/k2/__init__.py
WORKDIR /workspace

# #install k2 from source
# #"sed -i ..." line tries to turn off the cuda check
# RUN git clone https://github.com/k2-fsa/k2.git && \
# cd k2 && \
# sed -i 's/FATAL_ERROR/STATUS/g' cmake/torch.cmake && \
# sed -i 's/in running_cuda_version//g' get_version.py && \
# python3 setup.py install && \
# cd -
RUN git clone https://github.com/csukuangfj/kaldifeat && \
cd kaldifeat && \
sed -i 's/in running_cuda_version//g' get_version.py && \
python3 setup.py install && \
cd -

RUN git clone https://github.com/k2-fsa/icefall.git
ENV PYTHONPATH "${PYTHONPATH}:/workspace/icefall"
# https://github.com/k2-fsa/icefall/issues/674
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION "python"

COPY ./scripts scripts
COPY ./scripts scripts
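
As a quick sanity check after building the image (a minimal sketch; the tag `sherpa_triton_server:latest` follows the build command shown in the README below, and `python3 -m k2.version` is the usual way to dump k2's build information):

```bash
# Confirm that the torch / CUDA / k2 combination inside the image is consistent.
docker run --rm --gpus all sherpa_triton_server:latest \
  python3 -c "import torch, k2; print(torch.__version__, torch.version.cuda)"
docker run --rm --gpus all sherpa_triton_server:latest python3 -m k2.version
```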
34 changes: 11 additions & 23 deletions triton/README.md
@@ -34,19 +34,14 @@ Build the server docker image:
cd $SHERPA_SRC/triton
docker build . -f Dockerfile/Dockerfile.server -t sherpa_triton_server:latest --network host
```
Alternatively, you could directly pull the pre-built image based on tritonserver 22.12.
Alternatively, you could directly pull the pre-built image, which is built on top of the tritonserver image.
```
docker pull soar97/triton-k2:22.12.1
```

If you are planning to use TRT to accelerate the inference speed, you can use the following prebuilt image:
```
docker pull wd929/sherpa_wend_23.04:v1.1
docker pull soar97/triton-k2:24.07
```
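
The `tritonserver:24.07` base image needs a correspondingly recent NVIDIA driver (see the support-matrix link in the Dockerfile). A quick pre-flight check, as a sketch:

```bash
# Check the host driver version against the 24.07 entry in NVIDIA's framework support matrix.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Confirm the pulled image can see the GPU at all.
docker run --rm --gpus all soar97/triton-k2:24.07 nvidia-smi -L
```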

Start the docker container:
```bash
docker run --gpus all -v $SHERPA_SRC:/workspace/sherpa --name sherpa_server --net host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it soar97/triton-k2:22.12.1
docker run --gpus all -v $SHERPA_SRC:/workspace/sherpa --name sherpa_server --net host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it soar97/triton-k2:24.07
```
You should now be inside the container.

@@ -69,8 +64,7 @@ apt-get install git-lfs
pip3 install -r ./requirements.txt
export CUDA_VISIBLE_DEVICES="your_gpu_id"

bash scripts/build_wenetspeech_pruned_transducer_stateless5_streaming.sh
bash scripts/build_librispeech_pruned_transducer_stateless3_streaming.sh
bash scripts/build_wenetspeech_zipformer_offline_trt.sh
```
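
The last stage of the build script launches the Triton server. A minimal readiness check, assuming the HTTP port 10086 set via `--http-port` in `scripts/build_wenetspeech_zipformer_offline_trt.sh` and a model named `encoder` in the repository:

```bash
# Expect HTTP 200 once all models are loaded.
curl -s -o /dev/null -w "%{http_code}\n" localhost:10086/v2/health/ready
# Inspect the generated config of one model, e.g. the encoder.
curl -s localhost:10086/v2/models/encoder/config | python3 -m json.tool | head -n 20
```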

## Using TensorRT acceleration
@@ -83,26 +77,20 @@ You can directly use the following script to export TRT engine and start Triton
bash scripts/build_librispeech_pruned_transducer_stateless3_offline_trt.sh
```

### Export to TensorRT Step by Step

If you want to build TensorRT for your own model, you can try the following steps:
### Export to TensorRT

#### Preparation for TRT

First of all, you have to install TensorRT. We suggest using a docker container to run TRT. Just run the following command:

```bash
docker run --gpus '"device=0"' -it --rm --net host -v $PWD/:/k2 nvcr.io/nvidia/tensorrt:23.04-py3
```
You can also follow [this guide](https://github.com/NVIDIA/TensorRT#build) to build TRT on your machine.
If you want to build TensorRT for your own service, you can try the following steps:

#### Model export

You have to prepare the ONNX model by referring [here](https://github.com/k2-fsa/sherpa/blob/master/triton/scripts/build_librispeech_pruned_transducer_stateless3_offline.sh#L41C1-L41C1) to export your models into ONNX format. Assume you have put your ONNX model in the `$model_dir` directory.
You have to prepare the ONNX model by following [this guide](https://icefall.readthedocs.io/en/latest/model-export/export-onnx.html#export-the-model-to-onnx) to export your models into ONNX format. Assume you have put your ONNX model in the `$model_dir` directory.
Then, just run the command:

```bash
bash scripts/build_trt.sh 128 $model_dir/encoder.onnx model_repo_offline/encoder/1/encoder.trt
# First, use polygraphy to simplify the ONNX model.
polygraphy surgeon sanitize $model_dir/encoder.onnx --fold-constants -o encoder_sanitized.onnx
# Then build the TRT engine with the /usr/src/tensorrt/bin/trtexec tool shipped in the tritonserver docker image.
bash scripts/build_trt.sh 16 encoder_sanitized.onnx model_repo_offline/encoder/1/encoder.trt
```

The generated TRT model will be saved into `model_repo_offline/encoder/1/encoder.trt`.
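
`scripts/build_trt.sh` drives `/usr/src/tensorrt/bin/trtexec` with dynamic-shape profiles built from its `MIN_BATCH`/`OPT_BATCH`/`MAX_BATCH` settings. A rough sketch of such an invocation is shown below; the input tensor names (`x`, `x_lens`) and the feature/frame bounds are assumptions and must match what `polygraphy inspect model encoder_sanitized.onnx` reports for your exported encoder:

```bash
# Not a verbatim copy of build_trt.sh; a sketch of the kind of trtexec call it issues.
/usr/src/tensorrt/bin/trtexec \
  --onnx=encoder_sanitized.onnx \
  --minShapes=x:1x100x80,x_lens:1 \
  --optShapes=x:4x1000x80,x_lens:4 \
  --maxShapes=x:16x3000x80,x_lens:16 \
  --fp16 \
  --saveEngine=model_repo_offline/encoder/1/encoder.trt
```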
@@ -44,7 +44,7 @@ input [
},
{
name: "wav_lens"
data_type: TYPE_INT64
data_type: TYPE_INT32
dims: [1]
}
]
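
With `wav_lens` switched from `TYPE_INT64` to `TYPE_INT32`, clients must now send this input as int32. A quick way to confirm what the running server expects (a sketch; `transducer` is an assumed model name, substitute whichever model declares this input):

```bash
# Query the loaded model's config and print the declared datatype of wav_lens.
curl -s localhost:10086/v2/models/transducer/config \
  | python3 -c 'import json, sys; print([i for i in json.load(sys.stdin)["input"] if i["name"] == "wav_lens"])'
```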
16 changes: 16 additions & 0 deletions triton/model_repo_offline/scorer/config.pbtxt.template
@@ -32,6 +32,22 @@ parameters [
{
key: "decoding_method",
value: { string_value: "greedy_search"}
},
{
key: "beam",
value: { string_value: "4"}
},
{
key: "max_contexts",
value: { string_value: "4"}
},
{
key: "max_states",
value: { string_value: "32"}
},
{
key: "temperature",
value: { string_value: "1.0"}
}
]
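
The new `beam`, `max_contexts`, `max_states`, and `temperature` parameters are the usual knobs for icefall's fast beam search and are presumably ignored while `decoding_method` stays at `greedy_search`. A minimal sketch of switching the generated scorer config over (assuming the scorer's `model.py` accepts `fast_beam_search` as a decoding method, which these parameters suggest):

```bash
# Flip the scorer from greedy search to fast beam search after config.pbtxt
# has been rendered from the template; the parameter values above then apply.
sed -i 's|string_value: "greedy_search"|string_value: "fast_beam_search"|' \
  model_repo_offline/scorer/config.pbtxt
```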

@@ -4,7 +4,7 @@ stop_stage=2

# change to your own model directory
pretrained_model_dir=/mnt/samsung-t7/wend/github/icefall/egs/librispeech/ASR/pruned_transducer_stateless7/exp/
model_repo_path=./zipformer/model_repo_offline
model_repo_path=./model_repo_offline

# modify model specific parameters according to $pretrained_model_dir/exp/onnx_export.log
VOCAB_SIZE=500
2 changes: 1 addition & 1 deletion triton/scripts/build_trt.sh
@@ -14,7 +14,7 @@

# parameters for TRT engines
MIN_BATCH=1
OPT_BATCH=32
OPT_BATCH=4
MAX_BATCH=$1
onnx_model=$2
trt_model=$3
131 changes: 131 additions & 0 deletions triton/scripts/build_wenetspeech_zipformer_offline_trt.sh
@@ -0,0 +1,131 @@
#!/bin/bash
stage=-1
stop_stage=3

export CUDA_VISIBLE_DEVICES=1

pretrained_model_dir=/workspace/icefall-asr-zipformer-wenetspeech-20230615
model_repo_path=./model_repo_offline

# modify model specific parameters according to $pretrained_model_dir/exp/ log files
VOCAB_SIZE=5537

DECODER_CONTEXT_SIZE=2
DECODER_DIM=512
ENCODER_DIM=512 # max(_to_int_tuple(params.encoder_dim))


if [ -d "$pretrained_model_dir/data/lang_char" ]
then
echo "pretrained model using char"
TOKENIZER_FILE=$pretrained_model_dir/data/lang_char
else
echo "pretrained model using bpe"
TOKENIZER_FILE=$pretrained_model_dir/data/lang_bpe_500/bpe.model
fi

MAX_BATCH=16
# model instance num
FEATURE_EXTRACTOR_INSTANCE_NUM=2
ENCODER_INSTANCE_NUM=1
JOINER_INSTANCE_NUM=1
DECODER_INSTANCE_NUM=1
SCORER_INSTANCE_NUM=2


icefall_dir=/workspace/icefall
export PYTHONPATH=$PYTHONPATH:$icefall_dir
recipe_dir=$icefall_dir/egs/wenetspeech/ASR/zipformer

if [ ${stage} -le -2 ] && [ ${stop_stage} -ge -2 ]; then
if [ -d "$pretrained_model_dir" ]
then
echo "skip download pretrained model"
else
echo "downloading pretrained model"
cd /workspace
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/pkufool/icefall-asr-zipformer-wenetspeech-20230615
pushd icefall-asr-zipformer-wenetspeech-20230615
git lfs pull --include "exp/pretrained.pt"
ln -s ./exp/pretrained.pt ./exp/epoch-9999.pt
popd
cd -
fi
fi

if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
echo "export onnx"
cd ${recipe_dir}
# Workaround: please comment out https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/zipformer/zipformer.py#L1422-L1427
# if you would like to use the exported ONNX to build a TRT engine later.
python3 ./export-onnx.py \
--tokens $TOKENIZER_FILE/tokens.txt \
--use-averaged-model 0 \
--epoch 9999 \
--avg 1 \
--exp-dir $pretrained_model_dir/exp/ \
--num-encoder-layers "2,2,3,4,3,2" \
--downsampling-factor "1,2,4,8,4,2" \
--feedforward-dim "512,768,1024,1536,1024,768" \
--num-heads "4,4,4,8,4,4" \
--encoder-dim "192,256,384,512,384,256" \
--query-head-dim 32 \
--value-head-dim 12 \
--causal False || exit 1

cd -
fi

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "auto gen config.pbtxt"
dirs="encoder decoder feature_extractor joiner scorer transducer"

if [ ! -d $model_repo_path ]; then
echo "Please cd to $model_repo_path"
exit 1
fi

cp -r $TOKENIZER_FILE $model_repo_path/scorer/
TOKENIZER_FILE=$model_repo_path/scorer/$(basename $TOKENIZER_FILE)
for dir in $dirs
do
cp $model_repo_path/$dir/config.pbtxt.template $model_repo_path/$dir/config.pbtxt

sed -i "s|VOCAB_SIZE|${VOCAB_SIZE}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|DECODER_CONTEXT_SIZE|${DECODER_CONTEXT_SIZE}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|DECODER_DIM|${DECODER_DIM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_LAYERS|${ENCODER_LAYERS}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_DIM|${ENCODER_DIM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_LEFT_CONTEXT|${ENCODER_LEFT_CONTEXT}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_RIGHT_CONTEXT|${ENCODER_RIGHT_CONTEXT}|g" $model_repo_path/$dir/config.pbtxt

sed -i "s|TOKENIZER_FILE|${TOKENIZER_FILE}|g" $model_repo_path/$dir/config.pbtxt

sed -i "s|MAX_BATCH|${MAX_BATCH}|g" $model_repo_path/$dir/config.pbtxt

sed -i "s|FEATURE_EXTRACTOR_INSTANCE_NUM|${FEATURE_EXTRACTOR_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_INSTANCE_NUM|${ENCODER_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|JOINER_INSTANCE_NUM|${JOINER_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|DECODER_INSTANCE_NUM|${DECODER_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|SCORER_INSTANCE_NUM|${SCORER_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
done
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
cp $pretrained_model_dir/exp/encoder-epoch-9999-avg-1.onnx $model_repo_path/encoder/1/encoder.onnx
cp $pretrained_model_dir/exp/decoder-epoch-9999-avg-1.onnx $model_repo_path/decoder/1/decoder.onnx
cp $pretrained_model_dir/exp/joiner-epoch-9999-avg-1.onnx $model_repo_path/joiner/1/joiner.onnx
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "Buiding TRT engine..., skip the stage if you would like to use onnxruntime"
polygraphy surgeon sanitize $pretrained_model_dir/exp/encoder-epoch-9999-avg-1.onnx --fold-constants -o $pretrained_model_dir/exp/encoder.onnx
bash scripts/build_trt.sh $MAX_BATCH $pretrained_model_dir/exp/encoder.onnx $model_repo_path/encoder/1/encoder.trt || exit 1

sed -i "s|onnxruntime|tensorrt|g" $model_repo_path/encoder/config.pbtxt
sed -i "s|encoder.onnx|encoder.trt|g" $model_repo_path/encoder/config.pbtxt
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
tritonserver --model-repository=$model_repo_path --pinned-memory-pool-byte-size=512000000 --cuda-memory-pool-byte-size=0:1024000000 --http-port 10086
fi
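
A usage sketch for the new recipe (run inside the server container from the mounted sherpa checkout; stage numbers follow the `stage`/`stop_stage` variables at the top of the script):

```bash
cd /workspace/sherpa/triton
# Defaults (stage=-1, stop_stage=3): export ONNX, generate configs, copy models,
# build the TRT engine, and start the Triton server.
bash scripts/build_wenetspeech_zipformer_offline_trt.sh
# Set stage=-2 at the top of the script to also download the pretrained model first.
```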
44 changes: 0 additions & 44 deletions triton/zipformer/model_repo_offline/decoder/config.pbtxt.template

This file was deleted.
