When BERT Plays the Lottery, All Tickets Are Winning

Last update: Nov 10, 2022

Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.

Environment

Install the requirements in a Python 3.7.7 virtual environment.

pip install -r requirements.txt

These experiments were run in a multi-GPU environment, where some experiments and benchmarks ran in parallel. You may need to change the bash scripts to make them work in your environment.

Dataset

Download the GLUE dataset using data/download_glue.py and data/download_mnli_data.py. For MRPC, follow the instructions in the data/download_glue.py docstring.
Organize the data for each task under data/glue/task_name/.
Extract the labelled data for attention pattern classification:
```
cd data
tar -xvf head_classification_data.tar.gz
```
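Before training, it can help to confirm that every task has its own directory under data/glue/. The following is a minimal sketch, not part of the repository: the function name and the task directory names are assumptions (they follow the names commonly produced by GLUE download scripts), so adjust them to match what you actually downloaded.

```python
from pathlib import Path

# Assumed task directory names; edit to match your download.
GLUE_TASKS = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE", "WNLI"]


def missing_task_dirs(glue_root, tasks=GLUE_TASKS):
    """Return the task names that have no directory under glue_root."""
    root = Path(glue_root)
    return [task for task in tasks if not (root / task).is_dir()]
```

Calling missing_task_dirs("data/glue") from the repository root returns an empty list when the layout is complete.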

Training, Masking, and Evaluation

Change the working directory to src (cd src), since many paths are relative to that directory.

Fine-tune BERT on the GLUE tasks

./train.sh

Obtain the masks

./find_masks.sh
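For readers unfamiliar with magnitude pruning, the idea can be sketched in a few lines of plain Python. This is an illustration only, not the code that find_masks.sh runs: the function names are hypothetical, the weights are a flat list, and the iterative version does not retrain between pruning steps as a full lottery-ticket procedure would.

```python
def magnitude_mask(weights, prune_fraction):
    """Return a 0/1 mask that removes the smallest-magnitude weights."""
    n_prune = int(len(weights) * prune_fraction)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def iterative_magnitude_masks(weights, step_fraction, n_steps):
    """Prune step_fraction of the surviving weights at each step.

    Returns the mask after every step. Weights stay fixed here; a real
    pipeline would fine-tune the model between steps.
    """
    mask = [1] * len(weights)
    masks = []
    for _ in range(n_steps):
        alive = [i for i, m in enumerate(mask) if m == 1]
        n_prune = int(len(alive) * step_fraction)
        alive.sort(key=lambda i: abs(weights[i]))
        for i in alive[:n_prune]:
            mask[i] = 0
        masks.append(list(mask))
    return masks
```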

Train models with the masks applied in good, random and bad settings.

./train_with_masks.sh
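The three settings can be pictured as equally sized masks over the same set of components (for example, attention heads). The sketch below is a simplification under stated assumptions, not the repository's procedure: make_masks is a hypothetical helper, "good" keeps the highest-scoring components, "bad" keeps the lowest-scoring ones, and "random" keeps a uniform sample of the same size.

```python
import random


def make_masks(importance, keep, seed=0):
    """Build 'good', 'random' and 'bad' keep-masks of equal size.

    importance: per-component scores (higher means more important)
    keep: number of components each mask retains
    """
    n = len(importance)
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    good = set(ranked[:keep])
    # Guard against keep == 0, where ranked[-0:] would select everything.
    bad = set(ranked[-keep:]) if keep else set()
    rand = set(random.Random(seed).sample(range(n), keep))

    def to_mask(kept):
        return [1 if i in kept else 0 for i in range(n)]

    return {"good": to_mask(good), "random": to_mask(rand), "bad": to_mask(bad)}
```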

Evaluate the trained models

./evaluate.sh

Note: These experiments were run over a period of time and have since been stitched together into single scripts, so it may be better to run the training and evaluation commands in them one by one.

Train the CNN classifier on raw and normed attention patterns.

python classify_attention_patterns.py
python classify_normed_patterns.py

These only train the classifier.

Evaluation Analysis and Final Results

The analysis is done primarily in Jupyter notebooks in the experiment_analysis directory. That directory contains many experimental notebooks; the important ones, used to generate the results included in the paper, are:

Importance pruning Heatmaps. Ignore the final "train_subset" and "hans" settings.
Magnitude pruning Heatmap
Overlap of surviving components
Generate the random baseline
Attention Classification Patterns
Evaluation Result Comparisons and table
Statistics on mask correlation across seeds
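As a rough picture of what comparing masks across seeds involves, the sketch below computes the Jaccard overlap of surviving components between masks. The statistic used in the notebooks may differ; Jaccard overlap and both function names are assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(mask_a, mask_b):
    """Jaccard overlap between the surviving (value 1) entries of two masks."""
    kept_a = {i for i, m in enumerate(mask_a) if m}
    kept_b = {i for i, m in enumerate(mask_b) if m}
    union = kept_a | kept_b
    if not union:
        return 1.0  # two empty masks are treated as identical
    return len(kept_a & kept_b) / len(union)


def mean_pairwise_overlap(masks):
    """Average Jaccard overlap over all pairs of masks (e.g. one per seed).

    Expects at least two masks.
    """
    pairs = list(combinations(masks, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value near 1 means the same components survive regardless of seed; a value near 0 means the surviving sets barely overlap.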

Owner

Sai
