add refact llama.cpp tutorial #122

50 changes: 50 additions & 0 deletions inference.md
# Refact Inference

## llama.cpp (ggml)
We have integrated Refact into llama.cpp for efficient inference, with support for Intel, Apple Silicon, and Nvidia hardware. Please read through [llama.cpp](https://github.com/ggerganov/llama.cpp) first to understand its design.

### Setup
Once the [refact PR](https://github.com/ggerganov/llama.cpp/pull/3329) is officially merged, switch the repo to `https://github.com/ggerganov/llama.cpp`. Until then, please use the fork below for efficient inference.

```shell
git clone https://github.com/ds5t5/llama.cpp.git
cd llama.cpp
git checkout -b add.refact origin/add.refact
```
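Build the binaries before moving on (a minimal sketch, assuming the fork keeps upstream llama.cpp's Makefile; CMake works as well). This produces the `main` binary used in the run step below:

```shell
# Build main (and quantize) with the default Makefile;
# add LLAMA_CUBLAS=1 for Nvidia GPUs
make -j
```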

### Download the Hugging Face Refact model
Run the script below, or manually download the model and tokenizer to a local path.
```shell
pip3 install transformers torch accelerate
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "smallcloudai/Refact-1_6B-fim"

# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, low_cpu_mem_usage=True)

# Save both to a local directory for the gguf conversion step
model.save_pretrained("./Refact-1_6B-fim")
tokenizer.save_pretrained("./Refact-1_6B-fim")
```
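When the script finishes, the weights and tokenizer files should be in `./Refact-1_6B-fim`. A quick sanity check (the exact file names depend on your `transformers` version):

```shell
ls ./Refact-1_6B-fim
# expect config.json, tokenizer files, and the model weight file(s)
```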

### Convert the model to gguf
Please use a Python 3.8+ environment.
```shell
pip3 install transformers torch sentencepiece
cd gguf-py && pip install -e . && cd ..
# use 0 at the end for fp32, 1 for fp16
python3 convert-refact-hf-to-gguf.py ./Refact-1_6B-fim 1
```
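If the conversion succeeds, the gguf file is written next to the original weights; the f16 name below is the one the run command in the next section expects:

```shell
ls ./Refact-1_6B-fim/ggml-model-f16.gguf
```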

### Run the process
See llama.cpp for more advanced inference features, such as quantization and sampling parameters.

```shell
./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 300 -p "write a function to multiply two integers in python" --temp 1.0 --top-p 1.0 --top-k 1 --repeat_penalty 1.0
```
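Since quantization is mentioned above, here is a hedged sketch using the `quantize` tool from upstream llama.cpp (we assume the fork builds it unchanged); 4-bit quantization trades some output quality for a much smaller memory footprint:

```shell
# Quantize the f16 gguf to 4-bit (q4_0), then point main at the new file
./quantize ./Refact-1_6B-fim/ggml-model-f16.gguf ./Refact-1_6B-fim/ggml-model-q4_0.gguf q4_0
./main -m ./Refact-1_6B-fim/ggml-model-q4_0.gguf -n 300 -p "write a function to multiply two integers in python"
```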

### Known Issues
- Special tokens such as `<fim_middle>` are not tokenized as a single id by the llama.cpp `main` binary examples. The community is working on a [fix](https://github.com/ggerganov/llama.cpp/issues/2820) to support special tokens; an illustration of the intended prompt format is sketched below.
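For illustration, a fill-in-the-middle prompt for Refact-1_6B-fim would look roughly like the sketch below (the StarCoder-style token names are our assumption from the model card). Until the fix above lands, `main` splits these markers into several ids, so results may be degraded:

```shell
# Hypothetical FIM prompt; <fim_prefix>/<fim_suffix>/<fim_middle> are assumed
# from the model card and are NOT tokenized as single ids by main yet
./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 64 \
  -p "<fim_prefix>def multiply(a, b):<fim_suffix>    return result<fim_middle>"
```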