This is a repository with the code for the ACL 2019 paper

Overview

The Story of Heads

This is the official repo for the following papers:

In this README, we discuss the ACL 2019 heads paper. Read the official blog post for the details!

For the contributions paper, go to the source_target_contributions folder.

Bibtex

@inproceedings{voita-etal-2019-analyzing,
    title = "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned",
    author = "Voita, Elena  and
      Talbot, David  and
      Moiseev, Fedor  and
      Sennrich, Rico  and
      Titov, Ivan",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1580",
    pages = "5797--5808",
}

Table of Contents

Introduction

In the paper, we:

  • evaluate the importance of attention heads in Transformer,

  • identify functions of the most important encoder heads,

  • prune the vast majority of attention heads in Transformer without seriously affecting quality using a method based on stochastic gates and a differentiable relaxation of the L0 penalty,

  • show which types of model attention are most sensitive to the number of attention heads and on which layers.

In this repo, we provide code and describe steps needed to reproduce our experiments with the L0 head pruning.

Pruning Attention Heads

In the standard Transformer, results of different attention heads in a layer are concatenated:

MultiHead(Q, K, V ) = Concat(head_i)W^O.

We modify the original Transformer architecture by multiplying the representation computed by each head_i by a scalar gate g_i:

MultiHead(Q, K, V ) = Concat(g_i * head_i)W^O.

Unlike usual gates, g_i are parameters specific to heads and are independent of the input (i.e. the sentence). As we would like to disable less important heads completely, we would ideally apply L0 regularization to the scalars g_i. The L0 norm equals the number of non-zero components and would push the model to switch off less important heads.

Unfortunately, the L0 norm is nondifferentiable and so cannot be directly incorporated as a regularization term in the objective function. Instead, we use a stochastic relaxation. Each gate g_i is a random variable drawn independently from a head-specific Hard Concrete distribution. The distributions have non-zero probability mass at 0 and 1; look at the illustration.

concrete_gif

We use the sum of the probabilities of heads being non-zero (L_C) as a stochastic relaxation of the non-differentiable L0 norm. The resulting training objective is:

L = L_xent + λ * L_C.

When applying the regularizer, we start from the converged model trained without the L_C penalty (i.e. the parameters are initialized with the parameters of the converged model) and then add the gates and continue training the full objective. By varying the coefficient λ in the optimized objective, we obtain models with different numbers of retained heads. Below is shown how the probabilities of encoder heads being completely closed (P(g_i)=0) change in training for different values of λ (pruning starts from a converged model). White color denotes P(g_i=0) = 1, which means that a head is completely removed from the model.

enc_head_gif

(Gif is for the model trained on EN-RU WMT. For other datasets, values of λ can be different.)

We observe that the model converges to solutions where gates are either almost completely closed or completely open. This means that at test time we can treat the model as a standard Transformer and use only a subset of heads.


Experiments

Requirements

Operating System: This implementation works on the most popular Linux distributions (tested on Ubuntu 14, 16). It will also likely to work on Mac OS. For other operating systems we recommend using Docker.

Hardware: The model can be trained on one or several GPUs. Training on CPU is also supported.

OpenMPI(optional): To train on several GPUs, you have to install OpenMPI. The code was tested on OpenMPI 3.1.2(download). See build instructions here.

Python: The code works with Python 3.5 and 3.6; we recommend using anaconda. Install the rest of python packages with pip install -r requirements.txt. If you haven't build OpenMPI, remove horovod from the list of requirements.

Data preprocessing

The model training config requires the data to be preprocessed, i.e. tokenized and bpeized.

Tokenization

Here is an example of how to tokenize (and lowercase) you data:

text_lines.en.tok ">
cat text_lines.en | moses-tokenizer en | python3 -c "import sys; print(sys.stdin.read().lower())" > text_lines.en.tok

For the OpenSubtitles18 dataset, you do not need this step since the data is already tokenized (you can just lowercase it).

BPE-ization

Learn BPE rules:

subword-nmt learn-bpe -s 32000 < text_lines.en.tok > bpe_rules.en

Apply BPE rules to your data:

/path_to_this_repo/lib/tools/apply_bpe.py  --bpe_rules ./bpe_rules.en  < text_lines.en.tok > text_lines.en.bpeized

Model training

In the scripts folder you can find files train_baseline.sh, train_concrete_heads.sh and train_fixed_alive_heads.sh with configs for training baseline, model with heads pruning using relaxation of the L0 penalty, and model with a fixed configuration of open and closed heads.

To launch an experiment, do the following (example is for the heads pruning experiment):

mkdir exp_dir_name && cd exp_dir_name
cp the-story-of-heads_dir/scripts/train_concrete_heads.sh .
bash train_concrete_heads.sh

After that, checkpoints will be in the exp_dir_name/build/checkpoint directory, summary for tensorboard - in exp_dir_name/build/summary, translations of dev set for checkpoints (if specified; see below) in exp_dir_name/build/translations.


Notebooks: how to use a model

In the notebooks folder you can find notebooks showing how to deal with your trained model. From a notebook name it's content has to be clear, but I'll write this just in case.

1_Load_model_and_translate - how to load model and translate sentences;

2_Look_at_attention_maps - how to draw attention maps for encoder heads;

3_Look_which_heads_are_dead - if you are pruning heads, you might want to know which ended up dead; this notebook shows you how to do this.


Training config tour

Each training script has a thorough description of the parameters and explanation of the things you need to change for your experiment. Here we'll provide a tour of the config files and explain the parameters once again.

Data

First, you need to specify your directory with the the-story-of-heads repo, data directory and train/dev file names.

REPO_DIR="../" # insert the dir to the the-story-of-heads repo
DATA_DIR="../" # insert your datadir

NMT="${REPO_DIR}/scripts/nmt.py"

# path to preprocessed data (tokenized, bpe-ized)
train_src="${DATA_DIR}/train.src"
train_dst="${DATA_DIR}/train.dst"
dev_src="${DATA_DIR}/dev.src"
dev_dst="${DATA_DIR}/dev.dst"

After that, in the config you'll see the code for creating vocabularies from your data and shuffling the data.


Model

params=(
...
--model lib.task.seq2seq.models.transformer_head_gates.Model
...)

This is the Transformer model with extra options for attention head gates: stochastic, fixed or no extra parameters for the baseline. Model hyperparameters are split into groups:

  • main model hp,
  • minor model hp (probably you do not want to change them)
  • regularization and label smoothing
  • inference params (beam search with a beam of 4)
  • head gates parameters (for the baseline, nothing is here)

For the baseline, the parameters are as follows:

hp = {
     "num_layers": 6,
     "num_heads": 8,
     "ff_size": 2048,
     "ffn_type": "conv_relu",
     "hid_size": 512,
     "emb_size": 512,
     "res_steps": "nlda", 
    
     "rescale_emb": True,
     "inp_emb_bias": True,
     "normalize_out": True,
     "share_emb": False,
     "replace": 0,
    
     "relu_dropout": 0.1,
     "res_dropout": 0.1,
     "attn_dropout": 0.1,
     "label_smoothing": 0.1,
    
     "translator": "ingraph",
     "beam_size": 4,
     "beam_spread": 3,
     "len_alpha": 0.6,
     "attn_beta": 0,
    }

This set of parameters corresponds to Transformer-base (Vaswani et al., 2017).

To train the model with heads pruning, you need to specify the types of attention heads you want to prune. For encoder self-attention heads only,

    "concrete_heads": {"enc-self"},

and for all attention types, it's

    "concrete_heads": {"enc-self", "dec-self", "dec-enc"},

For fixed head configuration, specify gate values for each head:

     "alive_heads": {"enc-self": [[1,0,1,0,1,0,1,0],
                                  [1,1,1,1,1,1,1,1],
                                  [0,0,0,0,0,0,0,0],
                                  [1,1,1,0,0,1,0,0],
                                  [0,0,0,0,1,1,1,1],
                                  [0,0,1,1,0,0,1,1]],
                    },

In this case, only encoder self-attention heads will be masked. For all attention types, specify all gates:

     "alive_heads": {"enc-self": [[1,0,1,0,1,0,1,0],
                                  [1,1,1,1,1,1,1,1],
                                   ...
                                  [0,0,1,1,0,0,1,1]],
                     "dec-self": [[...],
                                   ...,
                                  [...]],
                     "dec-enc": [[...],
                                  ...,
                                 [...]],
                    },

Problem (loss function)

You need to set the training objective for you model. For the baseline and fixed head configuration, it's the standard cross-entropy loss with no extra options:

params=(
    ...
    --problem lib.task.seq2seq.problems.default.DefaultProblem
    --problem-opts '{}'
    ...)

For pruning heads, loss function is L = L_xent + λ * L_C.. You need to set another problem and specify the value of λ:

params=(
    ...
     --problem lib.task.seq2seq.problems.concrete.ConcreteProblem
     --problem-opts '{'"'"'concrete_coef'"'"': 0.1,}'
    ...)

Starting checkpoint

If you start model training from already trained model (for example, we start pruning heads from the trained baseline model), specify the initial checkpoint:

params=(
    ...
     --pre-init-model-checkpoint 'dir_to_your_trained_baseline_checkpoint.npz'
    ...)

You do not need this if you start from scratch.


Variables to optimize

If you want to freeze some sets of parameters in the model (for example, when pruning encoder heads we freeze the decoder parameters to ensure that heads functions do not move to the decoder), you have to specify which parameters you want to optimize. To optimize only encoder, add variables to --optimizer-opts:

params=(
    ...
    --optimizer-opts '{'"'"'beta1'"'"': 0.9, '"'"'beta2'"'"': 0.998,
                       '"'"'variables'"'"': ['"'"'mod/emb_inp*'"'"',
                                             '"'"'mod/enc*'"'"',],}'
    ...)

(Here beta1 and beta2 are parameters of the adam optimizer).


Batch size

It has been shown that Transformer’s performance depends heavily on a batch size (see for example Popel and Bojar, 2018), and we chose a large value of batch size to ensure that models show their best performance. In our experiments, each training batch contained a set of translation pairs containing approximately 16000 source tokens. This can be reached by using several of GPUs or by accumulating the gradients for several batches and then making an update. Our implementation enables both these options.

Batch size per one gpu is set like this:

params=(
    ...
     --batch-len 4000
    ...)

The effective batch size will be then batch-len * num_gpus. For example, with --batch-len 4000 and 4 gpus you would get the desirable batch size of 16000.

If you do not have several gpus (often, we don't have either :) ), you still have to have models of a proper quality. For this, accumulate the gradients for several batches and then make an update. Add average_grads: True and sync_every_steps: N to the optimizer options like this:

params=(
    ...
    --optimizer-opts '{'"'"'beta1'"'"': 0.9, '"'"'beta2'"'"': 0.998,
                       '"'"'sync_every_steps'"'"': 4,
                       '"'"'average_grads'"'"': True, }'
    ...)

The effective batch size will be then batch-len * sync_every_steps. For example, with --batch-len 4000 and sync_every_steps: 4 you would get the desirable batch size of 16000.


Other options

If you want to see dev BLEU score on your tensorboard:

params=(
    ...
      --translate-dev
      --translate-dev-every 2048
    ...)

Specify how often you want to save a checkpoint:

params=(
    ...
      --checkpoint-every-steps 2048
    ...)

Specify how often you want to score the dev set (eval loss values):

params=(
    ...
      --score-dev-every 256
    ...)

How many last checkpoints to keep:

params=(
    ...
       --keep-checkpoints-max 10
    ...)

Comments

  • lib.task.seq2seq.models.transformer_head_gates model enables you to train baseline as well as other versions, but if you want Transformer model without any modifications, you can find it here: lib.task.seq2seq.models.transformer.
Owner
PhD student at Edinburgh Uni and Amsterdam Uni, ex-research scientist at Yandex Research
A Closer Look at Reference Learning for Fourier Phase Retrieval

A Closer Look at Reference Learning for Fourier Phase Retrieval This repository contains code for our NeurIPS 2021 Workshop on Deep Learning and Inver

Tobias Uelwer 1 Oct 28, 2021
SuperSDR: multiplatform KiwiSDR + CAT transceiver integrator

SuperSDR SuperSDR integrates a realtime spectrum waterfall and audio receive from any KiwiSDR around the world, together with a local (or remote) cont

Marco Cogoni 30 Nov 29, 2022
The code of Zero-shot learning for low-light image enhancement based on dual iteration

Zero-shot-dual-iter-LLE The code of Zero-shot learning for low-light image enhancement based on dual iteration. You can get the real night image tests

1 Mar 18, 2022
Out of Distribution Detection on Natural Adversarial Examples

OOD-on-NAE Research project on out of distribution detection for the Computer Vision course by Prof. Rob Fergus (CSCI-GA 2271) Paper out on arXiv - ht

Anugya 1 Jun 08, 2022
This program uses trial auth token of Azure Cognitive Services to do speech synthesis for you.

🗣️ aspeak A simple text-to-speech client using azure TTS API(trial). 😆 TL;DR: This program uses trial auth token of Azure Cognitive Services to do s

Levi Zim 359 Jan 05, 2023
Codes for “A Deeply Supervised Attention Metric-Based Network and an Open Aerial Image Dataset for Remote Sensing Change Detection”

DSAMNet The pytorch implementation for "A Deeply-supervised Attention Metric-based Network and an Open Aerial Image Dataset for Remote Sensing Change

Mengxi Liu 41 Dec 14, 2022
Classical OCR DCNN reproduction based on PaddlePaddle framework.

Paddle-SVHN Classical OCR DCNN reproduction based on PaddlePaddle framework. This project reproduces Multi-digit Number Recognition from Street View I

1 Nov 12, 2021
Versatile Generative Language Model

Versatile Generative Language Model This is the implementation of the paper: Exploring Versatile Generative Language Model Via Parameter-Efficient Tra

Zhaojiang Lin 17 Dec 02, 2022
Conversational text Analysis using various NLP techniques

PyConverse Let me try first Installation pip install pyconverse Usage Please try this notebook that demos the core functionalities: basic usage noteb

Rita Anjana 158 Dec 25, 2022
Tensorflow implementation of "Learning Deep Features for Discriminative Localization"

Weakly_detector Tensorflow implementation of "Learning Deep Features for Discriminative Localization" B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and

Taeksoo Kim 363 Jun 29, 2022
Code for: https://berkeleyautomation.github.io/bags/

DeformableRavens Code for the paper Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks. Here is the

Daniel Seita 121 Dec 30, 2022
This is the official Pytorch-version code of FlatGCN (Flattened Graph Convolutional Networks for Recommendation).

FlatGCN This is the official Pytorch-version code of FlatGCN (Flattened Graph Convolutional Networks for Recommendation, submitted to ICASSP2022). Req

Dreamer 2 Aug 09, 2022
Official Pytorch implementation of the paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space"

MotionCLIP Official Pytorch implementation of the paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space". Please visit our webpage for mor

Guy Tevet 173 Dec 26, 2022
A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL

🌟 HNSW + PostgreSQL Indexer HNSWPostgreSQLIndexer Jina is a production-ready, scalable Indexer for the Jina neural search framework. It combines the

Jina AI 25 Oct 14, 2022
Ranking Models in Unlabeled New Environments (iccv21)

Ranking Models in Unlabeled New Environments Prerequisites This code uses the following libraries Python 3.7 NumPy PyTorch 1.7.0 + torchivision 0.8.1

14 Dec 17, 2021
Toontown: Galaxy, a new Toontown game based on Disney's Toontown Online

Toontown: Galaxy The official archive repo for Toontown: Galaxy, a new Toontown

1 Feb 15, 2022
Implementation of Bidirectional Recurrent Independent Mechanisms (Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules)

BRIMs Bidirectional Recurrent Independent Mechanisms Implementation of the paper Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neura

Sarthak Mittal 26 May 26, 2022
Practical Single-Image Super-Resolution Using Look-Up Table

Practical Single-Image Super-Resolution Using Look-Up Table [Paper] Dependency Python 3.6 PyTorch glob numpy pillow tqdm tensorboardx 1. Training deep

Younghyun Jo 116 Dec 23, 2022
Doosan robotic arm, simulation, control, visualization in Gazebo and ROS2 for Reinforcement Learning.

Robotic Arm Simulation in ROS2 and Gazebo General Overview This repository includes: First, how to simulate a 6DoF Robotic Arm from scratch using GAZE

David Valencia 12 Jan 02, 2023
An addon uses SMPL's poses and global translation to drive cartoon character in Blender.

Blender addon for driving character The addon drives the cartoon character by passing SMPL's poses and global translation into model's armature in Ble

犹在镜中 153 Dec 14, 2022