Nov 28, 2022
Large language models have been widely adopted but require significant GPU memory for inference and finetuning. We develop methods for Int8 matrix multiplication for transformer multi-layer perceptron (MLP) and attention projection layers, which cut the required memory for inference by half while retaining full precision performance. With our method, a 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation – no post-quantization training is required. The key challenge, which we empirically show for the first time, is that existing quantization methods perform poorly at scale due to emergent outlier feature dimensions. We find that standard quantization techniques for matrix multiplication fail beyond 1.3B parameters. To overcome this barrier, we develop vector-wise quantization, which keeps separate normalization constants for each inner product in the matrix multiplication. Additionally, we identify layer and input invariant feature dimensions in the hidden states, which heavily influence attention and disrupt quantization methods starting at 13B parameters. To scale to 13B, we develop a new mixed-precision matrix decomposition scheme, which allows scaling without performance degradation to at least 13B parameters. This result makes large transformers more accessible, for example, by enabling inference with GPT-J and T5-11B on a single free cloud GPU, GPT-NeoX-20B on a single gaming-grade GPU, and OPT-30B on a single data-center-grade GPU. We open source our software.
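The two ingredients described above (vector-wise quantization with separate normalization constants for each inner product, and a mixed-precision decomposition that routes outlier feature dimensions through a 16/32-bit matmul) can be illustrated with a minimal sketch. This is not the released kernels: the function name, the 6.0 outlier threshold, and the CPU Int32 accumulation are illustrative assumptions.

```python
import torch

def vectorwise_int8_matmul(X, W, outlier_threshold=6.0):
    """Reference sketch (CPU, not optimized kernels): vector-wise Int8 matmul
    with mixed-precision decomposition of outlier feature dimensions.

    X: (tokens, hidden) activations; W: (hidden, out) weights, floating point.
    """
    # Hidden dimensions whose activation magnitude exceeds the threshold are
    # treated as outlier features and kept in the original precision.
    outlier_cols = X.abs().amax(dim=0) > outlier_threshold
    regular_cols = ~outlier_cols

    # --- Int8 path with vector-wise quantization ---
    X_reg, W_reg = X[:, regular_cols], W[regular_cols, :]
    cx = X_reg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # one constant per row of X
    cw = W_reg.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)  # one constant per column of W
    X_i8 = torch.round(X_reg / cx * 127).to(torch.int8)
    W_i8 = torch.round(W_reg / cw * 127).to(torch.int8)
    # Accumulate in Int32, then rescale each output element by the outer
    # product of its row/column constants, i.e. (cx * cw) / 127^2.
    acc = X_i8.to(torch.int32) @ W_i8.to(torch.int32)
    out_int8 = acc.to(X.dtype) * (cx * cw) / (127 * 127)

    # --- 16/32-bit path for the outlier feature dimensions ---
    out_fp = X[:, outlier_cols] @ W[outlier_cols, :]

    return out_int8 + out_fp

# Tiny usage check against a full-precision matmul.
X = torch.randn(4, 64)
W = torch.randn(64, 16)
X[:, 3] *= 20.0  # inject an "outlier" feature dimension
print((vectorwise_int8_matmul(X, W) - X @ W).abs().max())
```

Because every inner product gets its own pair of scaling constants, and the few large-magnitude feature dimensions are pulled out into the higher-precision path, a single outlier can no longer force the quantization range of an entire tensor, which is the failure mode the abstract attributes to emergent outlier features at scale.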