DeLighT: Very Deep and Light-Weight Transformers

Last update: Dec 18, 2022

This repository contains the source code of our work on building efficient sequence models: DeFINE (ICLR'20) and DeLighT (preprint).

Table of contents

Overview
Requirements and installation
Training, evaluation, and results
Multiplication-addition operations
Citation
Acknowledgements
Issues

Overview

In this repository, we share the source code of our paper DeLighT, which delivers similar or better performance than transformer-based models with significantly fewer parameters. DeLighT allocates parameters more efficiently both (1) within each Transformer block, using DExTra, a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. For details, see our papers: DeFINE and DeLighT.
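To make the idea of block-wise scaling concrete, here is a minimal sketch that assigns each block a depth growing from the input side to the output side. The linear schedule and the names `blockwise_depths`, `n_min`, and `n_max` are assumptions chosen for illustration; they are not taken from this repository's code.

```python
def blockwise_depths(num_blocks, n_min, n_max):
    """Return one depth per block, growing linearly from n_min to n_max.

    Hypothetical helper: block 0 (nearest the input) gets n_min layers and
    the last block (nearest the output) gets n_max layers.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be positive")
    if num_blocks == 1:
        return [n_min]
    step = (n_max - n_min) / (num_blocks - 1)
    return [int(round(n_min + step * b)) for b in range(num_blocks)]


depths = blockwise_depths(num_blocks=5, n_min=4, n_max=8)
print(depths)  # [4, 5, 6, 7, 8]
```

With a uniform network every block would have the same depth; a schedule like the one above instead spends fewer layers near the input and more near the output.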

Requirements and Installation

PyTorch version >= 1.4.0
Python version >= 3.6
For training new models, you'll also need an NVIDIA GPU and NCCL
To use DeLighT, you need to install fairseq and develop locally:

git clone https://github.com/sacmehta/delight
cd delight
pip install --editable ./

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
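Before training, it can help to confirm that the environment meets the version requirements listed above. The following is a minimal sketch, not part of this repository; `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the version parsing only looks at the leading numeric components.

```python
import re
import sys


def parse_version(text):
    """Extract leading numeric components, e.g. '1.13.1+cu117' -> (1, 13, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("unrecognized version string: {!r}".format(text))
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(installed, minimum):
    """Compare versions as integer tuples so that 3.10 ranks above 3.6."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Print whether Python and PyTorch satisfy the stated minimum versions."""
    python_version = "{}.{}.{}".format(*sys.version_info[:3])
    print("Python >= 3.6:", meets_minimum(python_version, "3.6"))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    print("PyTorch >= 1.4.0:", meets_minimum(torch.__version__, "1.4.0"))
    print("CUDA available:", torch.cuda.is_available())


if __name__ == "__main__":
    check_environment()
```

Comparing integer tuples rather than raw strings avoids the common mistake of ranking "3.10" below "3.6".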

Training, Evaluation, and Results

For training, evaluation, and results, see the links below. To ease reproduction of our results, we also provide links to training logs.

Neural machine translation

Language Modeling

WikiText-103

Multiplication-Addition Operations

We have added module profiling for both Transformer and DeLighT networks. It can be enabled with the --print-stats argument. A model summary will be printed (by default for 20 tokens), similar to the screenshot below. To profile with larger source and target sequence lengths, use the --src-len-ps and --tgt-len-ps flags.
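As a rough illustration of what a multiplication-addition count measures, the sketch below hand-counts the operations and parameters of two linear layers for a 20-token sequence. It is an assumption-laden example, not the repository's profiler: `linear_macs` and `linear_params` are hypothetical helpers, the 512 and 2048 dimensions are illustrative values, and the built-in profiler may count some operations differently.

```python
def linear_macs(in_features, out_features, seq_len=20):
    """Multiply-accumulate operations of one linear layer applied to every token."""
    return seq_len * in_features * out_features


def linear_params(in_features, out_features, bias=True):
    """Weight matrix entries plus an optional bias vector."""
    return in_features * out_features + (out_features if bias else 0)


# Illustrative feed-forward sub-layer: expand 512 -> 2048, then reduce 2048 -> 512.
ffn_layers = [(512, 2048), (2048, 512)]

total_macs = sum(linear_macs(i, o) for i, o in ffn_layers)
total_params = sum(linear_params(i, o) for i, o in ffn_layers)

print("MACs for 20 tokens:", total_macs)   # 41943040
print("Parameters:", total_params)         # 2099712
```

The count scales linearly with sequence length, which is why profiling with longer source and target lengths changes the reported operations but not the parameter count.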

Citation

If you find our work useful, please consider citing the following works:

@misc{mehta2020delight,
    title={DeLighT: Very Deep and Light-weight Transformer},
    author={Sachin Mehta and Marjan Ghazvininejad and Srinivasan Iyer and Luke Zettlemoyer and Hannaneh Hajishirzi},
    year={2020},
    eprint={2008.00623},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

@inproceedings{mehta2019define,
  title={DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling},
  author={Mehta, Sachin and Koncel-Kedziorski, Rik and Rastegari, Mohammad and Hajishirzi, Hannaneh},
  booktitle={International Conference on Learning Representations},
  year={2019}
}

Acknowledgements

We would like to thank the Fairseq team for building an easy-to-use sequence library.

Issues

Thanks for your interest in our work. For any issues, please raise a request.

Owner

Sachin Mehta