ACL'22: Structured Pruning Learns Compact and Accurate Models

CoFiPruning: Structured Pruning Learns Compact and Accurate Models

This repository contains the code and pruned models for our ACL'22 paper Structured Pruning Learns Compact and Accurate Models.

**************************** Updates ****************************

  • 05/09/2022: We released the pruned model checkpoints for RTE, MRPC and CoLA!
  • 04/01/2022: We released our paper along with pruned model checkpoints on SQuAD, SST-2, QNLI and MNLI. Check it out!

Overview

We propose CoFiPruning (Coarse and Fine-grained Pruning), a task-specific structured pruning approach. We show that structured pruning can produce highly compact subnetworks with large speedups and accuracy competitive with distillation approaches, while requiring much less computation. Our key insight is to prune coarse-grained units (e.g., self-attention or feed-forward layers) and fine-grained units (e.g., heads, hidden dimensions) jointly. Unlike existing work, our approach controls the pruning decision for every parameter through multiple masks of different granularity. This is the key to large compression: it allows the greatest flexibility in the pruned structures and eases optimization compared to pruning only small units. We also devise a layerwise distillation strategy to transfer knowledge from the unpruned model to the pruned model during optimization.
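To make the idea of "multiple masks of different granularity" concrete, here is a minimal toy sketch. It assumes binary masks and plain Python lists; the helper names (`effective_head_masks`, `masked_weight`) are hypothetical and are not part of this repository, which learns its masks during training.

```python
from typing import List


def effective_head_masks(z_layer: float, z_heads: List[float]) -> List[float]:
    """Combine a coarse layer mask with fine-grained per-head masks.

    A head survives only if both its own mask and the mask of the
    enclosing attention layer are non-zero.
    """
    return [z_layer * z_head for z_head in z_heads]


def masked_weight(
    weight: List[List[float]], z_rows: List[float], z_cols: List[float]
) -> List[List[float]]:
    """Scale weight[i][j] by the mask of row unit i and the mask of column unit j."""
    return [
        [w * z_row * z_col for w, z_col in zip(row, z_cols)]
        for row, z_row in zip(weight, z_rows)
    ]


# Toy setup (all values are made up): one attention layer, 3 heads, hidden size 4.
z_layer = 1.0
z_heads = [1.0, 0.0, 1.0]        # head 1 is pruned
z_hidden = [1.0, 1.0, 0.0, 1.0]  # hidden dimension 2 is pruned

head_masks = effective_head_masks(z_layer, z_heads)
weight = [[1.0] * 4 for _ in range(3)]  # one row per head, one column per hidden dim
pruned = masked_weight(weight, head_masks, z_hidden)
remaining = sum(1 for row in pruned for w in row if w != 0.0)
```

In this sketch a single weight is removed as soon as any mask covering it is zero, so setting `z_layer = 0.0` drops the whole layer regardless of the head masks.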

Main Results

We show the main results of CoFiPruning along with results of popular pruning and distillation methods, including Block Pruning, DynaBERT, DistilBERT and TinyBERT. See our paper for more detailed results.

Model List

Our released models are listed below, together with their names on the model hub. We use a batch size of 128 and V100 32GB GPUs for speedup evaluation. We report F1 for SQuAD and accuracy for GLUE datasets. s60 denotes that the sparsity of the model is roughly 60%.

| model name | task | sparsity | speedup | score |
| --- | --- | --- | --- | --- |
| princeton-nlp/CoFi-MNLI-s60 | MNLI | 60.2% | 2.1× | 85.3 |
| princeton-nlp/CoFi-MNLI-s95 | MNLI | 94.3% | 12.1× | 80.6 |
| princeton-nlp/CoFi-QNLI-s60 | QNLI | 60.3% | 2.1× | 91.8 |
| princeton-nlp/CoFi-QNLI-s95 | QNLI | 94.5% | 12.1× | 86.1 |
| princeton-nlp/CoFi-SST2-s60 | SST-2 | 60.1% | 2.1× | 93.0 |
| princeton-nlp/CoFi-SST2-s95 | SST-2 | 94.5% | 12.2× | 90.4 |
| princeton-nlp/CoFi-SQuAD-s60 | SQuAD | 59.8% | 2.0× | 89.1 |
| princeton-nlp/CoFi-SQuAD-s93 | SQuAD | 92.4% | 8.7× | 82.6 |
| princeton-nlp/CoFi-RTE-s60 | RTE | 60.2% | 2.0× | 72.6 |
| princeton-nlp/CoFi-RTE-s96 | RTE | 96.2% | 12.8× | 66.1 |
| princeton-nlp/CoFi-CoLA-s60 | CoLA | 60.4% | 2.0× | 60.4 |
| princeton-nlp/CoFi-CoLA-s95 | CoLA | 95.1% | 12.3× | 38.9 |
| princeton-nlp/CoFi-MRPC-s60 | MRPC | 61.5% | 2.0× | 86.8 |
| princeton-nlp/CoFi-MRPC-s95 | MRPC | 94.9% | 12.2× | 83.6 |
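When choosing among these checkpoints, the trade-off is between speedup and score. The sketch below shows one way to pick the fastest checkpoint that still meets a score threshold. The `Checkpoint` type and `fastest_checkpoint` helper are hypothetical conveniences, not part of this repository; the numbers are copied from a few rows of the table above.

```python
from typing import List, NamedTuple, Optional


class Checkpoint(NamedTuple):
    name: str
    task: str
    sparsity: float  # fraction of parameters removed, as reported in the table
    speedup: float   # speedup factor, as reported in the table
    score: float     # F1 for SQuAD, accuracy for GLUE datasets


# A few rows copied from the model list.
CHECKPOINTS: List[Checkpoint] = [
    Checkpoint("princeton-nlp/CoFi-MNLI-s60", "MNLI", 0.602, 2.1, 85.3),
    Checkpoint("princeton-nlp/CoFi-MNLI-s95", "MNLI", 0.943, 12.1, 80.6),
    Checkpoint("princeton-nlp/CoFi-SQuAD-s60", "SQuAD", 0.598, 2.0, 89.1),
    Checkpoint("princeton-nlp/CoFi-SQuAD-s93", "SQuAD", 0.924, 8.7, 82.6),
]


def fastest_checkpoint(task: str, min_score: float) -> Optional[Checkpoint]:
    """Return the checkpoint with the largest speedup that still reaches min_score."""
    candidates = [c for c in CHECKPOINTS if c.task == task and c.score >= min_score]
    return max(candidates, key=lambda c: c.speedup) if candidates else None


choice = fastest_checkpoint("MNLI", min_score=85.0)
```

With a threshold of 85.0 on MNLI only the s60 checkpoint qualifies; lowering the threshold to 80.0 makes the s95 checkpoint the faster valid choice.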

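You can load these models through the HuggingFace-style from_pretrained interface:

from CoFiPruning.models import CoFiBertForSequenceClassification
model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95")
# `inputs` is a dict of tokenized tensors (e.g., input_ids, attention_mask) produced by a tokenizer
output = model(**inputs)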

Train CoFiPruning

In the following section, we provide instructions on training CoFi with our code.

Requirements

Run the following command to install the dependencies.

pip install -r requirements.txt

Training

Training scripts

We provide an example script for training with CoFiPruning with different combinations of pruning units and objectives in scripts/run_CoFi.sh. The script only supports single-GPU training. Its arguments are explained below:

  • --task_name: we support sequence classification tasks and extractive question answering tasks. You can pass a GLUE task name, e.g., MNLI, or use the --train_file and --validation_file arguments for other tasks (supported by HuggingFace).
  • --ex_name_suffix: experiment name (for output dir)
  • --ex_cate: experiment category name (for output dir)
  • --pruning_type: we support all combinations of the following four types of pruning units. Default pruning type is structured_heads+structured_mlp+hidden+layer. Setting it to None falls back to standard fine-tuning.
    • structured_heads: head pruning
    • structured_mlp: mlp intermediate dimension pruning
    • hidden: hidden states pruning
    • layer: layer pruning
  • --target_sparsity: target sparsity of the pruned model
  • --distillation_path: the directory of the teacher model
  • --distillation_layer_loss_alpha: weight for layer distillation
  • --distillation_ce_loss_alpha: weight for cross entropy distillation
  • --layer_distill_version: we recommend version 4 for small datasets, since it imposes an explicit restriction on layer order; for larger datasets, versions 3 and 4 make little difference.
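The --pruning_type value is a "+"-joined combination of the four units listed above, or None for standard fine-tuning. A small sketch of how such a value could be checked before launching a run is shown below. `parse_pruning_type` is a hypothetical helper written for illustration; the repository's own argument handling may differ.

```python
from typing import List

# The four pruning units named in the argument list.
PRUNING_UNITS = ("structured_heads", "structured_mlp", "hidden", "layer")


def parse_pruning_type(value: str) -> List[str]:
    """Split a '+'-joined pruning type into units; 'None' means plain fine-tuning."""
    if value == "None":
        return []
    units = value.split("+")
    unknown = [u for u in units if u not in PRUNING_UNITS]
    if unknown:
        raise ValueError(f"unknown pruning unit(s): {unknown}")
    if len(set(units)) != len(units):
        raise ValueError("each pruning unit may appear only once")
    return units


default_units = parse_pruning_type("structured_heads+structured_mlp+hidden+layer")
```

A check like this catches misspelled unit names (for example `structured_head` instead of `structured_heads`) before any training time is spent.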

After pruning the model, the same script can be used to further fine-tune the pruned model with the following arguments:

  • --pretrained_pruned_model: directory of the pruned model
  • --learning_rate: learning rate of the fine-tuning stage

Note that during the fine-tuning stage, pruning_type should be set to None.

An example for training (pruning) is as follows:

TASK=MNLI
SUFFIX=sparsity0.95
EX_CATE=CoFi
PRUNING_TYPE=structured_heads+structured_mlp+hidden+layer
SPARSITY=0.95
DISTILL_LAYER_LOSS_ALPHA=0.9
DISTILL_CE_LOSS_ALPHA=0.1
LAYER_DISTILL_VERSION=4

bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY [DISTILLATION_PATH] $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION

An example for fine-tuning after pruning is as follows:

PRUNED_MODEL_PATH=$proj_dir/$TASK/$EX_CATE/${TASK}_${SUFFIX}/best
PRUNING_TYPE=None # Setting the pruning type to be None for standard fine-tuning.
LEARNING_RATE=3e-5

bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY [DISTILLATION_PATH] $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION [PRUNED_MODEL_PATH] $LEARNING_RATE
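The two invocations above differ only in the pruning type and in the two extra trailing arguments. The following sketch assembles both command lines from Python, assuming the positional order shown in the examples. `cofi_command` is a hypothetical helper, and the paths `path/to/teacher` and `out/MNLI/best` are placeholders rather than real locations.

```python
import shlex
from typing import List, Optional


def cofi_command(
    task: str,
    suffix: str,
    ex_cate: str,
    pruning_type: str,
    sparsity: str,
    distillation_path: str,
    layer_alpha: str,
    ce_alpha: str,
    layer_distill_version: str,
    pruned_model_path: Optional[str] = None,
    learning_rate: Optional[str] = None,
) -> List[str]:
    """Assemble the positional arguments in the order used in the examples."""
    args = [
        "bash", "scripts/run_CoFi.sh",
        task, suffix, ex_cate, pruning_type, sparsity,
        distillation_path, layer_alpha, ce_alpha, layer_distill_version,
    ]
    if pruned_model_path is not None:
        # Fine-tuning stage: pruning_type should be None at this point.
        if pruning_type != "None":
            raise ValueError("set pruning_type to 'None' when fine-tuning a pruned model")
        if learning_rate is None:
            raise ValueError("fine-tuning needs a learning rate")
        args += [pruned_model_path, learning_rate]
    return args


# Placeholder paths; replace them with your own teacher and pruned-model directories.
prune_cmd = cofi_command(
    "MNLI", "sparsity0.95", "CoFi",
    "structured_heads+structured_mlp+hidden+layer",
    "0.95", "path/to/teacher", "0.9", "0.1", "4",
)
finetune_cmd = cofi_command(
    "MNLI", "sparsity0.95", "CoFi", "None",
    "0.95", "path/to/teacher", "0.9", "0.1", "4",
    pruned_model_path="out/MNLI/best", learning_rate="3e-5",
)
print(shlex.join(prune_cmd))
print(shlex.join(finetune_cmd))
```

All values are passed as strings so that they reach the shell script exactly as written (for instance `3e-5` rather than a reformatted float). The resulting lists can be handed to `subprocess.run` if you want to launch the script from Python.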

The training process saves the model with the best validation accuracy under $PRUNED_MODEL_PATH/best. You can then evaluate it with the evaluation.py script.

Evaluation

Our pruned models are hosted on Huggingface's model hub. You can use the script evaluation.py to get the sparsity, inference time and development set results of a pruned model.

python evaluation.py [TASK] [MODEL_NAME_OR_DIR]

An example of evaluating a sequence classification model is as follows:

python evaluation.py MNLI princeton-nlp/CoFi-MNLI-s95

The expected output of the script is as follows:

Task: MNLI
Model path: princeton-nlp/CoFi-MNLI-s95
Model size: 4920106
Sparsity: 0.943
mnli/acc: 0.8055
seconds/example: 0.010151

Hyperparameters

We use the following hyperparameters for training CoFiPruning:

| | GLUE (small) | GLUE (large) | SQuAD |
| --- | --- | --- | --- |
| Batch size | 32 | 32 | 16 |
| Pruning learning rate | 2e-5 | 2e-5 | 3e-5 |
| Fine-tuning learning rate | 1e-5, 2e-5, 3e-5 | 1e-5, 2e-5, 3e-5 | 1e-5, 2e-5, 3e-5 |
| Layer distill. alpha | 0.9, 0.7, 0.5 | 0.9, 0.7, 0.5 | 0.9, 0.7, 0.5 |
| Cross entropy distill. alpha | 0.1, 0.3, 0.5 | 0.1, 0.3, 0.5 | 0.1, 0.3, 0.5 |
| Pruning epochs | 100 | 20 | 20 |
| Pre-finetuning epochs | 4 | 1 | 1 |
| Sparsity warmup epochs | 20 | 2 | 2 |
| Finetuning epochs | 20 | 20 | 20 |

GLUE (small) denotes the smaller GLUE tasks (CoLA, STS-B, MRPC and RTE), and GLUE (large) denotes the remaining GLUE tasks (SST-2, MNLI, QQP and QNLI). Note that hyperparameter search is essential for small datasets but less important for large ones.
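For the small datasets, where the search matters most, the candidate values in the table can be enumerated as a grid. The sketch below assumes that the two distillation alphas are paired in table order (0.9 with 0.1, 0.7 with 0.3, 0.5 with 0.5) and that they are searched jointly with the fine-tuning learning rate. Both are assumptions made for illustration, and `search_grid` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, List

# Candidate values taken from the hyperparameter table.
FINETUNE_LEARNING_RATES = [1e-5, 2e-5, 3e-5]
# (layer distill. alpha, cross entropy distill. alpha), paired in table order (assumption).
DISTILL_ALPHAS = [(0.9, 0.1), (0.7, 0.3), (0.5, 0.5)]


def search_grid() -> List[Dict[str, float]]:
    """Enumerate every combination of learning rate and alpha pair."""
    return [
        {"learning_rate": lr, "layer_alpha": layer_alpha, "ce_alpha": ce_alpha}
        for lr, (layer_alpha, ce_alpha) in product(FINETUNE_LEARNING_RATES, DISTILL_ALPHAS)
    ]


grid = search_grid()
```

Under these assumptions the grid has nine configurations, each of which maps directly onto the learning rate and alpha arguments of scripts/run_CoFi.sh.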

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Mengzhou ([email protected]) and Zexuan ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!

Citation

Please cite our paper if you use CoFiPruning in your work:

@inproceedings{xia2022structured,
   title={Structured Pruning Learns Compact and Accurate Models},
   author={Xia, Mengzhou and Zhong, Zexuan and Chen, Danqi},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2022}
}