Awesome Efficient PLM Papers

Must-read papers on improving efficiency for pre-trained language models.

The paper list is mainly maintained by Lei Li and Shuhuai Ren.

Knowledge Distillation

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter NeurIPS 2019 Workshop

Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf [pdf] [project]
Patient Knowledge Distillation for BERT Model Compression EMNLP 2019

Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu [pdf] [project]
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models Preprint

Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova [pdf] [project]
TinyBERT: Distilling BERT for Natural Language Understanding Findings of EMNLP 2020

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu [pdf] [project]
BERT-of-Theseus: Compressing BERT by Progressive Module Replacing EMNLP 2020

Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou [pdf] [project]
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers NeurIPS 2020

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou [pdf] [project]
BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance EMNLP 2020

Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin [pdf] [project]
MixKD: Towards Efficient Distillation of Large-scale Language Models ICLR 2021

Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, Lawrence Carin [pdf]
Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains ACL-IJCNLP 2021

Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li, Jun Huang [pdf]
MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation ACL-IJCNLP 2021

Ahmad Rashid, Vasileios Lioutas, Mehdi Rezagholizadeh [pdf]
Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor ACL-IJCNLP 2021

Xinyu Wang, Yong Jiang, Zhaohui Yan, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, Kewei Tu [pdf] [project]
Weight Distillation: Transferring the Knowledge in Neural Network Parameters ACL-IJCNLP 2021

Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu [pdf]
Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation ACL-IJCNLP 2021

Yuanxin Liu, Fandong Meng, Zheng Lin, Weiping Wang, Jie Zhou [pdf]
MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers Findings of ACL-IJCNLP 2021

Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, Furu Wei [pdf] [project]
One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers Findings of ACL-IJCNLP 2021

Chuhan Wu, Fangzhao Wu, Yongfeng Huang [pdf]

Dynamic Early Exiting

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference ACL 2020

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin [pdf] [project]
FastBERT: a Self-distilling BERT with Adaptive Inference Time ACL 2020

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, Qi Ju [pdf] [project]
The Right Tool for the Job: Matching Model and Instance Complexities ACL 2020

Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, Noah A. Smith [pdf] [project]
A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models NAACL 2021

Kaiyuan Liao, Yi Zhang, Xuancheng Ren, Qi Su, Xu Sun, Bin He [pdf] [project]
CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade Preprint

Lei Li, Yankai Lin, Deli Chen, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun [pdf] [project]
Early Exiting BERT for Efficient Document Ranking SustaiNLP 2020

Ji Xin, Rodrigo Nogueira, Yaoliang Yu, Jimmy Lin [pdf] [project]
BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression EACL 2021

Ji Xin, Raphael Tang, Yaoliang Yu, Jimmy Lin [pdf] [project]
Accelerating BERT Inference for Sequence Labeling via Early-Exit ACL 2021

Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang [pdf] [project]
BERT Loses Patience: Fast and Robust Inference with Early Exit NeurIPS 2020

Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei [pdf] [project]
Early Exiting with Ensemble Internal Classifiers Preprint

Tianxiang Sun, Yunhua Zhou, Xiangyang Liu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu [pdf]

Quantization

Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT AAAI 2020

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer [pdf] [project]
TernaryBERT: Distillation-aware Ultra-low Bit BERT EMNLP 2020

Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu [pdf] [project]
Q8BERT: Quantized 8Bit BERT NeurIPS 2019 Workshop

Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat [pdf] [project]
BinaryBERT: Pushing the Limit of BERT Quantization ACL-IJCNLP 2021

Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, Irwin King [pdf] [project]
I-BERT: Integer-only BERT Quantization ICML 2021

Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer [pdf] [project]

Pruning

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned ACL 2019

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov [pdf] [project]
Are Sixteen Heads Really Better than One? NeurIPS 2019

Paul Michel, Omer Levy, Graham Neubig [pdf] [project]
The Lottery Ticket Hypothesis for Pre-trained BERT Networks NeurIPS 2020

Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin [pdf] [project]
Movement Pruning: Adaptive Sparsity by Fine-Tuning NeurIPS 2020

Victor Sanh, Thomas Wolf, Alexander M. Rush [pdf] [project]
Reducing Transformer Depth on Demand with Structured Dropout Preprint

Angela Fan, Edouard Grave, Armand Joulin [pdf]
When BERT Plays the Lottery, All Tickets Are Winning EMNLP 2020

Sai Prasanna, Anna Rogers, Anna Rumshisky [pdf] [project]
Structured Pruning of a BERT-based Question Answering Model Preprint

J.S. McCarley, Rishav Chakravarti, Avirup Sil [pdf]
Structured Pruning of Large Language Models EMNLP 2020

Ziheng Wang, Jeremy Wohlwend, Tao Lei [pdf] [project]
Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm NAACL 2021

Dongkuan Xu, Ian E.H. Yen, Jinxi Zhao, Zhibin Xiao [pdf]
Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization ACL 2021

Chen Liang, Simiao Zuo, Minshuo Chen, Haoming Jiang, Xiaodong Liu, Pengcheng He, Tuo Zhao, Weizhu Chen [pdf] [project]

Contribution

If you find any related work not included in the list, do not hesitate to raise a PR to help us complete the list.
