PyTorch implementation of Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Last update: Jul 27, 2022

Overview

ALiBi

PyTorch implementation of Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Quickstart

Clone this repository.

git clone https://github.com/jaketae/alibi.git

Navigate to the cloned directory. You can use the bare-bone ALiBi decoder via

>>> import torch; from alibi import ALiBiConfig, ALiBiTransformer
>>> config  = ALiBiConfig()
>>> model = ALiBiTransformer(config)
>>> x = torch.randn(8, 100, 256)
>>> model(x).shape
torch.Size([8, 100, 256])

By default, the model comes with the following parameters:

ALiBiConfig(
    num_layers=6, 
    d_model=256, 
    num_heads=8, 
    max_len=256, 
    dropout=0.1, 
    causal=True, 
    expansion_factor=1
)

To use an encoder instead of a decoder, simply toggle causal=False.

Abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.

Citation

@misc{press2021train,
	title        = {Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
	author       = {Ofir Press and Noah A. Smith and Mike Lewis},
	year         = 2021,
	eprint       = {2108.12409},
	archiveprefix = {arXiv},
	primaryclass = {cs.CL}
}

PyTorch implementation of Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Related tags

Overview

ALiBi

Quickstart

Abstract

Citation

Owner

Jake Tae

An LSTM based GAN for Human motion synthesis

Frequency Spectrum Augmentation Consistency for Domain Adaptive Object Detection

Towards Flexible Blind JPEG Artifacts Removal (FBCNN, ICCV 2021)

Sync2Gen Code for ICCV 2021 paper: Scene Synthesis via Uncertainty-Driven Attribute Synchronization

Implementation of ViViT: A Video Vision Transformer

Official repository for the ICCV 2021 paper: UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model.

Generalized and Efficient Blackbox Optimization System.

Official Repsoitory for "Activate or Not: Learning Customized Activation." [CVPR 2021]

Official code for "End-to-End Optimization of Scene Layout" -- including VAE, Diff Render, SPADE for colorization (CVPR 2020 Oral)

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021

PyTorch implementation of Densely Connected Time Delay Neural Network

This is the official implementation of TrivialAugment and a mini-library for the application of multiple image augmentation strategies including RandAugment and TrivialAugment.

Thermal Control of Laser Powder Bed Fusion using Deep Reinforcement Learning

Neural Module Network for VQA in Pytorch

The dataset of tweets pulling from Twitters with keyword: Hydroxychloroquine, location: US, Time: 2020

Distributing reference energies for SMIRNOFF implementations

The source code of the paper "SHGNN: Structure-Aware Heterogeneous Graph Neural Network"

Code for our paper "Sematic Representation for Dialogue Modeling" in ACL2021

Catbird is an open source paraphrase generation toolkit based on PyTorch.

NALSM: Neuron-Astrocyte Liquid State Machine