Understanding and Overcoming the Challenges of Efficient Transformer Quantization


This repository contains the implementation and experiments for the paper presented in

Yelysei Bondarenko1, Markus Nagel1, Tijmen Blankevoort1, "Understanding and Overcoming the Challenges of Efficient Transformer Quantization", EMNLP 2021. [ACL Anthology] [ArXiv]

1 Qualcomm AI Research (Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.)

Reference

If you find our work useful, please cite

@inproceedings{bondarenko-etal-2021-understanding,
    title = "Understanding and Overcoming the Challenges of Efficient Transformer Quantization",
    author = "Bondarenko, Yelysei  and
      Nagel, Markus  and
      Blankevoort, Tijmen",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.627",
    pages = "7947--7969",
    abstract = "Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges {--} namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme {--} per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at \url{https://github.com/qualcomm-ai-research/transformer-quantization}.",
}

How to install

First, ensure locale variables are set as follows:

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

Second, make sure you have Python ≥3.6 (tested with Python 3.6.8) and the latest version of pip (tested with 21.2.4):

pip install --upgrade --no-deps pip

Next, install PyTorch 1.4.0 with the appropriate CUDA version (tested with CUDA 10.0, cuDNN 7.6.3):

pip install torch==1.4.0 torchvision==0.5.0 -f https://download.pytorch.org/whl/torch_stable.html

Finally, install the remaining dependencies using pip:

pip install -r requirements.txt

To run the code, the project root directory needs to be added to your PYTHONPATH:

export PYTHONPATH="${PYTHONPATH}:/path/to/this/dir"

Running experiments

The main run file to reproduce all experiments is main.py. It contains 4 commands for training and validating FP32 and quantized models:

Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  train-baseline
  train-quantized
  validate-baseline
  validate-quantized

You can see the full list of options for each command using python main.py [COMMAND] --help.

A. FP32 fine-tuning

To start with, you need to get the fine-tuned model(s) for the GLUE task of interest. Example run command for fine-tuning:

python main.py train-baseline --cuda --save-model --model-name bert_base_uncased --task rte \
    --learning-rate 3e-05 --batch-size 8 --eval-batch-size 8 --num-epochs 3 --max-seq-length 128 \
    --seed 1000 --output-dir /path/to/output/dir/

You can also fine-tune directly using the HuggingFace library [examples]. In all experiments we used seeds 1000-1004 and reported the median score. The sample output directory looks as follows:

/path/to/output/dir
├── config.out
├── eval_results_rte.txt
├── final_score.txt
├── out
│   ├── config.json  # Huggingface model config
│   ├── pytorch_model.bin  # PyTorch model checkpoint
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json  # Huggingface tokenizer config
│   ├── training_args.bin
│   └── vocab.txt  # Vocabulary
└── tb_logs  # TensorBoard logs
    ├── 1632747625.1250594
    │   └── events.out.tfevents.*
    └── events.out.tfevents.*

For validation (both full-precision and quantized), it is assumed that these output directories with the fine-tuned checkpoints are arranged as follows (you can also use a subset of GLUE tasks):

/path/to/saved_models/
├── rte/rte_model_dir
│   ├── out
│   │   ├── config.json  # Huggingface model config
│   │   ├── pytorch_model.bin  # PyTorch model checkpoint
│   │   ├── tokenizer_config.json  # Huggingface tokenizer config
│   │   ├── vocab.txt  # Vocabulary
│   │   ├── (...)
├── cola/cola_model_dir
│   ├── out
│   │   ├── (...)
├── mnli/mnli_model_dir
│   ├── out
│   │   ├── (...)
├── mrpc/mrpc_model_dir
│   ├── out
│   │   ├── (...)
├── qnli/qnli_model_dir
│   ├── out
│   │   ├── (...)
├── qqp/qqp_model_dir
│   ├── out
│   │   ├── (...)
├── sst2/sst2_model_dir
│   ├── out
│   │   ├── (...)
└── stsb/stsb_model_dir
    ├── out
    │   ├── (...)

Note that you have to create this file structure manually.
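For example, a minimal sketch for arranging a fine-tuned RTE checkpoint (assuming the fine-tuning output directory from section A; the rte_model_dir name is arbitrary):

mkdir -p /path/to/saved_models/rte/rte_model_dir
cp -r /path/to/output/dir/out /path/to/saved_models/rte/rte_model_dir/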

The model can then be validated as follows:

python main.py validate-baseline --eval-batch-size 32 --seed 1000 --model-name bert_base_uncased \
    --model-path /path/to/saved_models/ --task rte

You can also validate multiple checkpoints by repeating the task flag (--task [TASK1] --task [TASK2] ...), or all checkpoints at once with --task all; see the example below.
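For example, a sketch validating both the RTE and MRPC checkpoints in one run (assuming both are fine-tuned and arranged under /path/to/saved_models/ as shown above):

python main.py validate-baseline --eval-batch-size 32 --seed 1000 --model-name bert_base_uncased \
    --model-path /path/to/saved_models/ --task rte --task mrpc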

B. Post-training quantization (PTQ)

1) Standard (naïve) W8A8 per-tensor PTQ / base run command for all PTQ experiments

python main.py validate-quantized --act-quant --weight-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --n-bits 8 --n-bits-act 8 --qmethod symmetric_uniform \
	--qmethod-act asymmetric_uniform --weight-quant-method MSE --weight-opt-method golden_section \
	--act-quant-method current_minmax --est-ranges-batch-size 1 --num-est-batches 1 \
	--quant-setup all

Note that the range estimation settings are slightly different for each task.

2) Mixed precision W8A{8,16} PTQ

Specify --quant-dict "{'y': 16, 'h': 16, 'x': 16}" (see the full example command below):

  • 'x': 16 will set FFN's input to 16-bit
  • 'h': 16 will set FFN's output to 16-bit
  • 'y': 16 will set FFN's residual sum to 16-bit

For the STS-B regression task, you will need to specify --quant-dict "{'y': 16, 'h': 16, 'x': 16, 'P': 16, 'C': 16}" and --quant-setup MSE_logits, which will also quantize the pooler and the final classifier to 16-bit and use an MSE estimator for the output.
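For example, a sketch of the full W8A{8,16} run: the base PTQ command from (1) with the mixed-precision flag appended, all other options unchanged:

python main.py validate-quantized --act-quant --weight-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --n-bits 8 --n-bits-act 8 --qmethod symmetric_uniform \
	--qmethod-act asymmetric_uniform --weight-quant-method MSE --weight-opt-method golden_section \
	--act-quant-method current_minmax --est-ranges-batch-size 1 --num-est-batches 1 \
	--quant-setup all --quant-dict "{'y': 16, 'h': 16, 'x': 16}"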

3) Per-embedding and per-embedding-group (PEG) activation quantization

The following options enable PEG activation quantization (an example run command is given after the list):

  • --per-embd -- Per-embedding quantization for all activations
  • --per-groups [N_GROUPS] -- PEG quantization for all activations, no permutation
  • --per-groups [N_GROUPS] --per-groups-permute -- PEG quantization for all activations, apply range-based permutation (separate for each quantizer)
  • --quant-dict "{'y': 'ng6', 'h': 'ng6', 'x': 'ng6'}" -- PEG quantization using 6 groups for FFN's input, output and residual sum, no permutation
  • --quant-dict "{'y': 'ngp6', 'h': 'ngp6', 'x': 'ngp6'}" --per-groups-permute-shared-h -- PEG quantization using 6 groups for FFN's input, output and residual sum, apply range-based permutation (shared between tensors in the same layer)
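For example, a sketch applying PEG quantization with 6 groups and no permutation on top of the base PTQ command from (1):

python main.py validate-quantized --act-quant --weight-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --n-bits 8 --n-bits-act 8 --qmethod symmetric_uniform \
	--qmethod-act asymmetric_uniform --weight-quant-method MSE --weight-opt-method golden_section \
	--act-quant-method current_minmax --est-ranges-batch-size 1 --num-est-batches 1 \
	--quant-setup all --quant-dict "{'y': 'ng6', 'h': 'ng6', 'x': 'ng6'}"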

4) W4A32 PTQ with AdaRound

python main.py validate-quantized --weight-quant --no-act-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --qmethod symmetric_uniform --qmethod-act asymmetric_uniform --n-bits 4 \
	--weight-quant-method MSE --weight-opt-method grid --num-candidates 100 --quant-setup all \
	--adaround all --adaround-num-samples 1024 --adaround-init range_estimator \
	--adaround-mode learned_hard_sigmoid --adaround-asym --adaround-iters 10000 \
	--adaround-act-quant no_act_quant

C. Quantization-aware training (QAT)

Base run command for QAT experiments (using W4A8 for example):

python main.py train-quantized --cuda --do-eval --logging-first-step --weight-quant --act-quant \
	--pad-to-max-length --learn-ranges --tqdm --batch-size 8 --seed 1000 \
	--model-name bert_base_uncased --learning-rate 5e-05 --num-epochs 6 --warmup-steps 186 \
	--weight-decay 0.0 --attn-dropout 0.0 --hidden-dropout 0.0 --max-seq-length 128 --n-bits 4 \
	--n-bits-act 8 --qmethod symmetric_uniform --qmethod-act asymmetric_uniform \
	--weight-quant-method MSE --weight-opt-method golden_section --act-quant-method current_minmax \
	--est-ranges-batch-size 16 --num-est-batches 1 --quant-setup all \
	--model-path /path/to/saved_models/rte/out --task rte --output-dir /path/to/qat_output/dir

Note that the settings are slightly different for each task (see Appendix).

To run mixed-precision QAT with 2-bit embeddings and 4-bit weights, add --quant-dict "{'Et': 2}".
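For example, a sketch of the base QAT command above with the 2-bit embedding flag added, all other options unchanged:

python main.py train-quantized --cuda --do-eval --logging-first-step --weight-quant --act-quant \
	--pad-to-max-length --learn-ranges --tqdm --batch-size 8 --seed 1000 \
	--model-name bert_base_uncased --learning-rate 5e-05 --num-epochs 6 --warmup-steps 186 \
	--weight-decay 0.0 --attn-dropout 0.0 --hidden-dropout 0.0 --max-seq-length 128 --n-bits 4 \
	--n-bits-act 8 --qmethod symmetric_uniform --qmethod-act asymmetric_uniform \
	--weight-quant-method MSE --weight-opt-method golden_section --act-quant-method current_minmax \
	--est-ranges-batch-size 16 --num-est-batches 1 --quant-setup all --quant-dict "{'Et': 2}" \
	--model-path /path/to/saved_models/rte/out --task rte --output-dir /path/to/qat_output/dir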
