Divergent

Quantization

> [!tip] Quantization methods are developed to lower the GPU-resource barrier for serving LLMs.

Prerequisites

Float32 vs Float16

`np.float32(2 ** 23)` evaluates to `8388608.0` exactly: float32 stores a 23-bit mantissa (24 significant bits), so every integer up to 2**24 is representable exactly, whereas float16 has only a 10-bit mantissa and a much smaller dynamic range.
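A small NumPy check makes the trade-off concrete (a minimal sketch; the printed values follow directly from the IEEE 754 formats):

```python
import numpy as np

# Precision: float32 keeps 24 significant bits, float16 only 11.
# Adding 1 beyond that point is silently lost to rounding.
print(np.float32(2 ** 24) + np.float32(1) == np.float32(2 ** 24))  # True
print(np.float16(2 ** 11) + np.float16(1) == np.float16(2 ** 11))  # True

# Range: float16 overflows to inf far earlier than float32.
print(np.finfo(np.float16).max)  # 65504.0
print(np.finfo(np.float32).max)  # ~3.4028235e+38
```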

[[#Reference]]

Now let's take a look at some of the commonly used quantization techniques.
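They all build on the same basic idea: replace the fp16/fp32 weight matrices with low-bit integers plus a scale, and dequantize when the weights are used. Below is a deliberately simple per-tensor absmax int8 sketch of that idea (illustrative only; AWQ and GPTQ are considerably smarter about how the scales are chosen):

```python
import numpy as np

def quantize_absmax_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: w ≈ scale * q with q in int8."""
    scale = np.float32(np.abs(w).max() / 127.0)   # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # stand-in for an fp32 weight matrix
q, scale = quantize_absmax_int8(w)

print(q.nbytes / w.nbytes)                     # 0.25 -> 4x less memory than fp32
print(np.abs(dequantize(q, scale) - w).max())  # worst-case rounding error, about scale / 2
```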

AWQ
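AWQ (Activation-aware Weight Quantization) is a weight-only method: it picks per-channel scales from activation statistics so that the most salient weights lose little accuracy at 4-bit. A minimal sketch of using it in practice is loading an already-quantized AWQ checkpoint through 🤗 Transformers, assuming the `autoawq` package is installed; the model ID below is only an example placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pre-quantized AWQ checkpoint; substitute any AWQ model you use.
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Transformers reads the quantization_config stored in the checkpoint and
# dispatches to the AWQ kernels to run the 4-bit weights.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization lowers serving cost because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```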

GPTQ
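GPTQ is a post-training, weight-only scheme that quantizes the model layer by layer, using a small calibration set and approximate second-order information to minimize the resulting output error. A minimal sketch using the `GPTQConfig` integration in 🤗 Transformers (it requires the `optimum` and `auto-gptq` backends; the small OPT model and the `c4` calibration set are illustrative choices only):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model, just for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a calibration dataset: weights are quantized layer by layer
# while minimizing the output error on these calibration samples.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
model.save_pretrained("opt-125m-gptq-4bit")  # weights are stored already quantized
```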

