Implementation of a Transformer using ReLA (Rectified Linear Attention)

Last update: Oct 14, 2022

ReLA (Rectified Linear Attention) Transformer

Implementation of a Transformer using ReLA (Rectified Linear Attention), which replaces the softmax in attention with a ReLU. It will also contain an attempt to fold the feedforward into the ReLA layer as memory key / values, as proposed in All Attention, following a suggestion made by Charles Foster.
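To make the idea concrete, here is a minimal, dependency-free sketch of how ReLU attention weights differ from softmax weights for a single query. It is not the code in this package (which is written in PyTorch). The scores and values are made-up numbers, and the helper names (softmax_weights, relu_weights, attend, rms_norm) are hypothetical. Because ReLU weights are not renormalized, the sketch follows the weighted sum with a plain RMS normalization as a stand-in for the normalization the paper applies to the attention output.

```python
import math

def softmax_weights(scores):
    """Standard attention weights: strictly positive and summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """ReLA-style weights: negative scores become exactly 0, no renormalization."""
    return [max(0.0, s) for s in scores]

def attend(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def rms_norm(vec, eps=1e-8):
    """Divide by the root mean square so the output scale stays bounded."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec))
    return [x / max(rms, eps) for x in vec]

# Scaled dot-product scores of one query against four keys (illustrative values).
scores = [1.5, -0.3, 0.0, -2.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

soft = softmax_weights(scores)         # every key receives some weight
rela = relu_weights(scores)            # only keys with positive scores are attended to
out = rms_norm(attend(rela, values))   # normalize the unnormalized weighted sum
```

With softmax every key gets a nonzero weight, whereas the ReLU variant assigns exact zeros to keys with non-positive scores, which is where the sparsity in the paper's title comes from.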

Install

$ pip install rela-transformer

Usage

import torch
from rela_transformer.rela_transformer import ReLATransformer

model = ReLATransformer(
    num_tokens = 20000,
    dim = 512,
    depth = 8,
    max_seq_len = 1024,
    dim_head = 64,
    heads = 8,
    causal = True
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x) # (1, 1024, 20000)

Enwik8

$ python train.py

Citations

@misc{zhang2021sparse,
    title   = {Sparse Attention with Linear Units},
    author  = {Biao Zhang and Ivan Titov and Rico Sennrich},
    year    = {2021},
    eprint  = {2104.07012},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}

Comments

LayerNorm/GatedRMS inconsistency
Hi! Looking through the pipeline, there seem to be some inconsistencies with normalisation:

# ReLA input to GRMSNorm
# att code output: Linear(inner_dim, dim) + GRMSNorm
# next in FF module input to LayerNorm

Here we have a problem with double normalisation, since the last layer in the attention block is GRMSNorm and the first layer in the FF block is LayerNorm.

Looking at the paper, it seems that in ReLA the GRMSNorm is applied to the result of mult(attn, v) before the output projection, not after the projection as in this code. I am also confused about the use of LayerNorm in FF: should it be GRMSNorm instead? That is not clear from the paper either.
opened by inspirit
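The two orderings discussed in this comment can be compared with a small, dependency-free sketch. It assumes a gated RMSNorm of the form sigmoid(w * z) * g * z / rms(z), which is my reading of the paper and should be checked against it. The projection matrix, the input vector, and the helper names (gated_rms_norm, linear) are hypothetical and not taken from this repository.

```python
import math

def gated_rms_norm(z, gain, gate_w, eps=1e-8):
    """Elementwise sigmoid(gate_w * z) * gain * z / rms(z) (assumed form)."""
    rms = math.sqrt(sum(x * x for x in z) / len(z))
    denom = max(rms, eps)
    return [
        (1.0 / (1.0 + math.exp(-w * x))) * g * x / denom
        for x, g, w in zip(z, gain, gate_w)
    ]

def linear(x, weight):
    """y = weight @ x for a list-of-rows weight matrix, without bias."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]

# Stand-in for the (attn @ v) result of one token, inner_dim = 4.
attn_out = [1.5, 0.0, -0.5, 2.0]
# Hypothetical output projection from inner_dim = 4 to dim = 2.
w_out = [[0.5, 0.0, 0.0, 0.5],
         [0.0, 1.0, -1.0, 0.0]]

# Ordering the comment attributes to the paper: normalize, then project.
norm_then_project = linear(
    gated_rms_norm(attn_out, [1.0] * 4, [1.0] * 4), w_out
)

# Ordering the comment attributes to this code: project, then normalize.
project_then_norm = gated_rms_norm(
    linear(attn_out, w_out), [1.0] * 2, [1.0] * 2
)
```

The two results differ, so the placement of the normalization is not just a cosmetic choice. In the second ordering the normalized output also feeds straight into the next block's own normalization, which is the double norm the comment points out.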

Releases (latest: 0.0.7)

0.0.7 (Apr 6, 2022)
0.0.6 (Feb 22, 2022)
0.0.5 (Jan 13, 2022)
0.0.4 (Jan 11, 2022)
0.0.3 (Jan 10, 2022)
0.0.2a (Jan 10, 2022)
0.0.2 (Jan 10, 2022)
0.0.1 (Jan 10, 2022)

Owner

Phil Wang

Working with Attention. It's all we need
