An Easy-to-Use Toolkit for LLM Quantization that can be executed on a MacBook

LLMEasyQuant

LLMEasyQuant is a package developed for easy quantization deployment of LLM applications. Existing packages such as TensorRT and Quanto have many underlying structures and self-invoking internal functions, which makes them hard to customize and hard to learn from when deploying quantized models. LLMEasyQuant is developed to tackle this problem.

Authors: Dong Liu, Meng Jiang, Kaiser Pister

Deployment Methods:

Define the model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig


# Set device to CPU for now
device = 'cpu'
# device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load model and tokenizer
model_id = 'gpt2'  # 137m F32 params
# model_id = 'facebook/opt-1.3b' # 1.3b f16 params
# model_id = 'mistralai/Mistral-7B-v0.1'  # 7.24b bf16 params, auth required
# model_id = 'meta-llama/Llama-2-7b-hf' # auth required

model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model_int8 = AutoModelForCausalLM.from_pretrained(model_id,
                                                  device_map='auto',
                                                  quantization_config=BitsAndBytesConfig(
                                                      load_in_8bit=True)
                                                  )
model_int8.name_or_path += "_int8"

Quantization method deployment

absmax

absq = Quantizer(model, tokenizer, absmax_quantize)
quantizers.append(absq)
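
The body of absmax_quantize is not reproduced in this README. Conceptually it is symmetric INT8 quantization: each tensor is scaled by 127 / max|W|, rounded, and clipped to [-127, 127]. A minimal per-tensor sketch (an illustration, not necessarily the package's exact implementation):

import torch

def absmax_quantize(W: torch.Tensor):
    # Illustrative sketch; the package's absmax_quantize may differ.
    # Symmetric quantization: scale by the largest absolute value so the
    # tensor maps onto the int8 range [-127, 127] with zero mapped to zero.
    scale = 127 / W.abs().max().clamp(min=1e-8)
    W_q = (W * scale).round().clamp(-127, 127).to(torch.int8)
    # Dequantized copy, useful for measuring the quantization error.
    W_dq = W_q.to(torch.float32) / scale
    return W_q, W_dq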

zeropoint

zpq = Quantizer(model, tokenizer, zeropoint_quantize)
quantizers.append(zpq)
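
zeropoint_quantize is the asymmetric counterpart: the tensor's full [min, max] range is mapped onto [-128, 127] with a scale and a zero point, so skewed weight distributions waste none of the INT8 range. Again a minimal sketch rather than the package's exact code:

import torch

def zeropoint_quantize(W: torch.Tensor):
    # Illustrative sketch; the package's zeropoint_quantize may differ.
    # Asymmetric quantization: map [min, max] onto the 256 int8 levels.
    x_range = (W.max() - W.min()).clamp(min=1e-8)
    scale = 255 / x_range
    zeropoint = (-scale * W.min() - 128).round()
    W_q = (W * scale + zeropoint).round().clamp(-128, 127).to(torch.int8)
    # Dequantized copy for error inspection.
    W_dq = (W_q.to(torch.float32) - zeropoint) / scale
    return W_q, W_dq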

smoothquant

smooth_quant = SmoothQuantMatrix(alpha=0.5)
smoothq = Quantizer(model, tokenizer, smooth_quant.smooth_quant_apply)
quantizers.append(smoothq)
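
SmoothQuantMatrix(alpha=0.5) follows the SmoothQuant idea of migrating quantization difficulty from activations to weights with a per-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha). The helper below is a hypothetical sketch of that scale computation, not the class's actual code:

import torch

def smoothquant_scales(act_absmax: torch.Tensor, w_absmax: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    # Per input channel j: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    # alpha = 0.5 splits the quantization difficulty evenly between
    # activations and weights.
    return (act_absmax.clamp(min=1e-5).pow(alpha)
            / w_absmax.clamp(min=1e-5).pow(1.0 - alpha))

# A layer then computes (X / s) @ (diag(s) @ W): mathematically equivalent to
# X @ W, but the scaled activations have a much flatter range and therefore
# quantize to INT8 with far less error.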

simquant

simq = Quantizer(model, tokenizer, sim_quantize)
quantizers.append(simq)

symquant, zeroquant, and knowledge distillation of each

symq = Quantizer(model, tokenizer, sym_quantize_8bit)
zeroq = Quantizer(model, tokenizer, sym_quantize_8bit, zeroquant_func)
quantizers.extend([symq, zeroq])
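
ZeroQuant-style pipelines pair low-bit quantization with layer-by-layer knowledge distillation: each quantized layer is tuned to reproduce the output of its full-precision counterpart on the same inputs. The snippet below is a generic sketch of that distillation step (assuming layers that map a tensor to a tensor), not the zeroquant_func used above:

import torch
import torch.nn.functional as F

def layerwise_kd_step(fp_layer, q_layer, hidden, optimizer):
    # Layer-by-layer distillation: the quantized layer chases the output of
    # the frozen full-precision layer on the same hidden states.
    with torch.no_grad():
        target = fp_layer(hidden)
    loss = F.mse_loss(q_layer(hidden), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()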

AWQ

awq = Quantizer(model, tokenizer, awq_quantize)
quantizers.append(awq)

BiLLM

billmq = Quantizer(model, tokenizer, billm_quantize)
quantizers.append(billmq)

QLora

qloraq = Quantizer(model, tokenizer, qlora_quantize)
quantizers.append(qloraq)

model computation

[q.quantize() for q in quantizers]
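
The Quantizer class itself lives inside LLMEasyQuant and is not shown in this README. Purely to illustrate the calling convention used above (model, tokenizer, and one or two quantization callbacks go in; quantize() fills the .quant attribute), here is a hypothetical sketch for per-tensor callbacks; the real class is likely richer:

import copy

class QuantizerSketch:
    # Hypothetical stand-in for LLMEasyQuant's Quantizer, shown only to make
    # the calling convention above concrete.
    def __init__(self, model, tokenizer, quant_fn, post_fn=None):
        self.model = model
        self.tokenizer = tokenizer
        self.quant_fn = quant_fn   # per-tensor callback returning (int8, dequantized)
        self.post_fn = post_fn     # optional extra pass, e.g. distillation
        self.quant = None

    def quantize(self):
        # Work on a deep copy so the original full-precision model is untouched.
        self.quant = copy.deepcopy(self.model)
        for _, param in self.quant.named_parameters():
            _, dequantized = self.quant_fn(param.data)
            param.data = dequantized
        if self.post_fn is not None:
            self.post_fn(self.quant)
        return self.quant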

visualization

dist_plot([model, model_int8] + [q.quant for q in quantizers])
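
dist_plot overlays the weight distributions of the full-precision and quantized models, which is what the "Quant Weights Distribution" figure below shows. Its implementation is not reproduced here; a minimal matplotlib sketch with the same call pattern could be:

import torch
import matplotlib.pyplot as plt

def dist_plot(models, bins=200):
    # Illustrative sketch; the package's dist_plot may differ.
    # Overlay weight-value histograms to see how quantization reshapes them.
    for m in models:
        weights = torch.cat([p.detach().float().cpu().flatten() for p in m.parameters()])
        plt.hist(weights.numpy(), bins=bins, alpha=0.4, density=True, label=m.name_or_path)
    plt.xlabel("Weight value")
    plt.ylabel("Density")
    plt.legend()
    plt.show()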

model comparison

generated = compare_generation([model, model_int8] + [q.quant for q in quantizers], tokenizer, max_length=200, temperature=0.8)
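
compare_generation samples a continuation from every model so the outputs can be read side by side; the returned dict is reused for the perplexity step below. Its body is not shown in this README; a plausible sketch on top of transformers' generate API (the prompt string is a placeholder) is:

import torch

def compare_generation(models, tokenizer, prompt="I have a dream",
                       max_length=200, temperature=0.8):
    # Illustrative sketch; the package's compare_generation may differ.
    outputs = {}
    for m in models:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        with torch.no_grad():
            ids = m.generate(**inputs, max_length=max_length, do_sample=True,
                             temperature=temperature,
                             pad_token_id=tokenizer.eos_token_id)
        outputs[m.name_or_path] = tokenizer.decode(ids[0], skip_special_tokens=True)
    return outputs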

perplexity analysis

ppls = compare_ppl([model, model_int8] + [q.quant for q in quantizers], tokenizer, list(generated.values()))
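
Perplexity here is the exponential of the mean token-level negative log-likelihood a model assigns to a text, so lower is better; compare_ppl presumably evaluates each model on each generated text. A minimal sketch of the per-text computation (not the package's exact helper):

import torch

def perplexity(model, tokenizer, text):
    # exp(mean cross-entropy of the model's next-token predictions on `text`).
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()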

Results:

Quant Weights Distribution

Performance Comparison

PPL Analysis

Conclusion:

In this work, we develop LLMEasyQuant, a package that aims at easy quantization deployment: it is user-friendly and straightforward to deploy even when computational resources are limited.

Deployment Simplicity Comparison Table

| Feature / Package | AWQ | BiLLM | QLora | TensorRT | Quanto | LLMEasyQuant |
|---|---|---|---|---|---|---|
| Hardware Requirements | GPU required | GPU required | GPU required | GPU required | GPU required | Supports CPU and GPU |
| Deployment Steps | Multiple complex steps | Detailed setup and tuning required | Intricate steps and parameter adjustments | Complex setup with CUDA dependencies | Complex setup with multiple dependencies | Streamlined, minimal setup; includes AWQ, BiLLM, QLora |
| Quantization Methods | Manual adjustments and configurations | Detailed configurations needed | Specific configurations for each method | Limited to specific optimizations | Limited to specific optimizations | Variety of methods with a simple interface; includes AWQ, BiLLM, QLora |
| Supported Methods | AWQ | BiLLM | QLora | TensorRT-specific methods | Quanto-specific methods | Absmax, Zeropoint, SmoothQuant, SimQuant, SymQuant, ZeroQuant, AWQ, BiLLM, QLora |
| Integration Process | Complex library installation and setup | Extensive documentation and dependencies | Intricate library setup | Requires integration with NVIDIA stack | Requires integration with specific frameworks | Simple integration with transformers |
| Visualization Tools | Additional setup required | Additional setup required | Additional setup required | External tools needed | External tools needed | Built-in visualization functions |
| Performance Analysis | External tools needed | External tools needed | External tools needed | External tools needed | External tools needed | Built-in performance analysis functions |

Summary of LLMEasyQuant Advantages

  1. Hardware Flexibility: Supports both CPU and GPU, providing flexibility for developers with different hardware resources.
  2. Simplified Deployment: Requires minimal setup steps, making it user-friendly and accessible.
  3. Comprehensive Quantization Methods: Offers a wide range of quantization methods, including AWQ, BiLLM, and QLora, with easy-to-use interfaces.
  4. Built-in Visualization and Analysis: Includes tools for visualizing and comparing model performance, simplifying the evaluation process.

Citation

If you find LLMEasyQuant useful or relevant to your project and research, please kindly cite our paper:

@misc{liu2024llmeasyquanteasyuse,
      title={LLMEasyQuant -- An Easy to Use Toolkit for LLM Quantization}, 
      author={Dong Liu and Meng Jiang and Kaiser Pister},
      year={2024},
      eprint={2406.19657},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.19657}, 
}
