Divergent

Quantization

> [!tip] Quantization methods are developed to lower the GPU-resource barrier for serving LLMs.

Prerequisites

Float32 vs Float16

`np.float32(2 ** 23)` evaluates to `8388608.0` exactly: float32 stores a 23-bit mantissa (24 significant bits), so every integer up to 2**24 is representable exactly, whereas float16 has only a 10-bit mantissa and a much smaller dynamic range.
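A small NumPy check makes the trade-off concrete (a minimal sketch; the printed values follow directly from the IEEE 754 formats):

```python
import numpy as np

# Precision: float32 keeps 24 significant bits, float16 only 11.
# Adding 1 beyond that point is silently lost to rounding.
print(np.float32(2 ** 24) + np.float32(1) == np.float32(2 ** 24))  # True
print(np.float16(2 ** 11) + np.float16(1) == np.float16(2 ** 11))  # True

# Range: float16 overflows to inf far earlier than float32.
print(np.finfo(np.float16).max)  # 65504.0
print(np.finfo(np.float32).max)  # ~3.4028235e+38
```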

[[#Reference]]

Now let's take a look at some of the commonly used quantization techniques.
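They all build on the same basic idea: replace the fp16/fp32 weight matrices with low-bit integers plus a scale, and dequantize when the weights are used. Below is a deliberately simple per-tensor absmax int8 sketch of that idea (illustrative only; AWQ and GPTQ are considerably smarter about how the scales are chosen):

```python
import numpy as np

def quantize_absmax_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: w ≈ scale * q with q in int8."""
    scale = np.float32(np.abs(w).max() / 127.0)   # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # stand-in for an fp32 weight matrix
q, scale = quantize_absmax_int8(w)

print(q.nbytes / w.nbytes)                     # 0.25 -> 4x less memory than fp32
print(np.abs(dequantize(q, scale) - w).max())  # worst-case rounding error, about scale / 2
```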

AWQ
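AWQ (Activation-aware Weight Quantization) is a weight-only method: it picks per-channel scales from activation statistics so that the most salient weights lose little accuracy at 4-bit. A minimal sketch of using it in practice is loading an already-quantized AWQ checkpoint through 🤗 Transformers, assuming the `autoawq` package is installed; the model ID below is only an example placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pre-quantized AWQ checkpoint; substitute any AWQ model you use.
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Transformers reads the quantization_config stored in the checkpoint and
# dispatches to the AWQ kernels to run the 4-bit weights.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization lowers serving cost because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```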

GPTQ
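GPTQ is a post-training, weight-only scheme that quantizes the model layer by layer, using a small calibration set and approximate second-order information to minimize the resulting output error. A minimal sketch using the `GPTQConfig` integration in 🤗 Transformers (it requires the `optimum` and `auto-gptq` backends; the small OPT model and the `c4` calibration set are illustrative choices only):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model, just for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a calibration dataset: weights are quantized layer by layer
# while minimizing the output error on these calibration samples.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
model.save_pretrained("opt-125m-gptq-4bit")  # weights are stored already quantized
```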

