
Blog post: A Hands-On Walkthrough on Model Quantization

Table of contents

- What?
- Notebook
- License

What?

Quantization is a technique that reduces the computational and memory overhead of a machine learning model by lowering the precision of the numbers used to represent its parameters. Models typically store parameters as 32-bit floating-point numbers; quantization converts these to 8-bit (or even 4-bit) integers. This can significantly shrink the model and speed up inference, especially on CPUs and other hardware with limited computational resources. The trade-off is a slight reduction in accuracy, which is often worthwhile for faster and more efficient deployments.
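
To make the precision reduction concrete, here is a minimal sketch of the underlying arithmetic using PyTorch's per-tensor affine quantization. The input values, scale, and zero point are illustrative choices, not taken from the blog post.

```python
# A minimal sketch of affine quantization: map float32 values to int8
# via q = round(x / scale) + zero_point. The scale and zero point below
# are illustrative values, not calibrated for any real model.
import torch

x = torch.tensor([0.03, -0.41, 1.53, 0.007], dtype=torch.float32)

q = torch.quantize_per_tensor(x, scale=0.02, zero_point=0, dtype=torch.qint8)

print(q.int_repr())    # the underlying int8 values
print(q.dequantize())  # float32 values recovered from int8; they differ
                       # slightly from x because of rounding
```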

This GitHub repo contains the notebook from the "A Hands-On Walkthrough on Model Quantization" blog post. The notebook demonstrates how to quantize and save a Transformer model to improve inference speed on a CPU and reduce model size.
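
For readers who want the gist before opening the notebook, the sketch below shows one common way to do this with PyTorch's post-training dynamic quantization; the model checkpoint and output path are illustrative assumptions, not necessarily what the blog post uses.

```python
# A sketch of post-training dynamic quantization for a Transformer model.
# The checkpoint name and output filename are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased"  # hypothetical choice of checkpoint
)
model.eval()

# Replace the float32 weights of every nn.Linear layer with int8 weights;
# activations are quantized on the fly at inference time (CPU-oriented).
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Persist the quantized weights; the saved state dict is markedly smaller
# than the float32 original.
torch.save(quantized.state_dict(), "quantized_model.pt")
```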

Notebook

| Description | Link |
| --- | --- |
| A Hands-On Walkthrough on Model Quantization | Open In Colab |

License

See our LICENSE for more details.