Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Last update: Jan 05, 2023

Overview

Memory Efficient Attention Pytorch

Implementation of a memory efficient multi-head attention as proposed in the paper "Self-attention Does Not Need O(n²) Memory". In addition, the module takes care of masking, causal masking, and cross attention.
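To make the idea concrete, here is a minimal sketch of chunked attention for a single query vector, written in plain Python rather than PyTorch. It is not the library's code: the function names, the list-based inputs, and the omission of the usual 1/sqrt(d) scaling are all assumptions made for illustration. It keeps a running maximum, a running softmax denominator, and a running weighted sum, so only one bucket of scores exists at a time.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def chunked_attention(q, keys, values, k_bucket_size):
    """Same result, but scores are computed one key/value bucket at a time."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    running_out = [0.0] * dim     # unnormalized weighted sum of values
    for start in range(0, len(keys), k_bucket_size):
        k_chunk = keys[start:start + k_bucket_size]
        v_chunk = values[start:start + k_bucket_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        running_out = [
            o * correction + sum(w * v[d] for w, v in zip(weights, v_chunk))
            for d, o in enumerate(running_out)
        ]
        running_max = new_max
    return [o / running_sum for o in running_out]
```

Calling chunked_attention with any bucket size should agree with naive_attention up to floating point error; the bucket size only changes how many scores are held in memory at once.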

Install

$ pip install memory-efficient-attention-pytorch

Usage

For an autoregressive language model

import torch
from memory_efficient_attention_pytorch import Attention

attn = Attention(
    dim = 512,
    dim_head = 64,                # dimension per head
    heads = 8,                    # number of attention heads
    causal = True,                # autoregressive or not
    memory_efficient = True,      # whether to use memory efficient attention (can be turned off to test against normal attention)
    q_bucket_size = 1024,         # bucket size along queries dimension
    k_bucket_size = 2048          # bucket size along key / values dimension
).cuda()

x = torch.randn(1, 65536, 512).cuda()
out = attn(x) # (1, 65536, 512)
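A rough back-of-the-envelope calculation shows why bucketing matters at this sequence length. The sketch below assumes float32 scores, a batch size of 1, and that one query bucket by key bucket tile is held for all heads at once; the helper name attention_scores_bytes is hypothetical and not part of the library.

```python
def attention_scores_bytes(q_len, k_len, heads, batch=1, bytes_per_element=4):
    """Bytes needed to hold a (batch, heads, q_len, k_len) score tensor."""
    return batch * heads * q_len * k_len * bytes_per_element

# Full attention matrix for the example above: 65536 x 65536 scores per head.
full_bytes = attention_scores_bytes(65536, 65536, heads=8)

# One tile with q_bucket_size = 1024 and k_bucket_size = 2048.
tile_bytes = attention_scores_bytes(1024, 2048, heads=8)

print(full_bytes / 2**30)  # 128.0 (GiB)
print(tile_bytes / 2**20)  # 64.0 (MiB)
```

Under these assumptions the full score tensor alone would need 128 GiB, while a single tile needs 64 MiB. Actual peak memory also depends on activations kept for the backward pass and on the dtype in use.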

Cross attention

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
mask = torch.ones(1, 65536).bool().cuda()

out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512)

Todo

benchmark and see how much torch jit helps
look at Triton and Keops and see if either can be a fit

Citations

@misc{rabe2021selfattention,
    title   = {Self-attention Does Not Need $O(n^2)$ Memory}, 
    author  = {Markus N. Rabe and Charles Staats},
    year    = {2021},
    eprint  = {2112.05682},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{liu2021swin,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    year    = {2021},
    eprint  = {2111.09883},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

[feature request] Combining with flash attention?

There is a new algorithm that optimizes the qkv attention computation: https://github.com/HazyResearch/flash-attention (paper: https://arxiv.org/abs/2205.14135). Maybe you can look into integrating it with this.

opened by Vbansal21 15

I did this, we could build on top

Hi there!

It seems I had already written some of this code: https://github.com/CHARM-Tx/linear_mem_attention_pytorch. Could we build on top of this? I talked to https://github.com/Chillee about an experimental functionality from functorch (https://github.com/pytorch/functorch) that would allow for increased speed (mainly I want to match JAX performance, but it's just difficult with PyTorch's imperative style).

I would love to collaborate on this if you want!

opened by hypnopump 5

Added dropout support to memory efficient variant

Hey Phil,

I have been using this repository for a project and I wanted to add dropout for completeness. I checked consistency with the perceiver-ar implementation. I hope this is helpful.

-Matt

opened by usryokousha 2

Making this work with relative position bias from XTransformers

Is there a way to make this work with RelativePositionBias? Currently this produces an attention bias of size $BHN^2$, where B is the batch size, H is the number of heads and N is the input size. Can this be chunked and computed per chunk?
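One way to picture the per-chunk idea in this question: a relative position bias depends only on the offset between a key index and a query index, so the values for one query bucket against one key bucket can be produced on demand instead of materializing the full N x N matrix. The sketch below is a hypothetical illustration in plain Python for a single head. It uses a simple clipped-distance lookup table, which is an assumption and not the bucketed scheme of RelativePositionBias in x-transformers, and relative_bias_tile is not a function of either library.

```python
def relative_bias_tile(bias_table, q_start, q_end, k_start, k_end, max_distance):
    """Bias for queries [q_start, q_end) against keys [k_start, k_end).

    bias_table holds one value per clipped offset in
    [-max_distance, max_distance], so it has 2 * max_distance + 1 entries.
    """
    tile = []
    for i in range(q_start, q_end):
        row = []
        for j in range(k_start, k_end):
            # Clip the key-minus-query offset into the supported range.
            distance = max(-max_distance, min(max_distance, j - i))
            row.append(bias_table[distance + max_distance])
        tile.append(row)
    return tile

# Example: a 2 x 3 tile taken from a 6 x 6 bias matrix.
table = [0.1 * t for t in range(7)]  # offsets -3 .. 3
tile = relative_bias_tile(table, 2, 4, 3, 6, max_distance=3)
```

Each tile only needs the bucket boundaries and the small lookup table, so its memory cost is the bucket area rather than N squared.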

opened by pfeatherstone 5

save_for_backward can only save variables, but argument 5 is of type bool

Hi,

Thank you for your indescribable work. I was trying to test your method specifically for cross-attention, but it seems I get the error "save_for_backward can only save variables, but argument 5 is of type bool". I am not sure what I am doing wrong. I tried your own examples too but get the same error.

Can you please help me out?

Code:

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

# out = sm_mod(inp1)
x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
# mask = torch.ones(1, 65536).bool().cuda()
out = cross_attn(x

ERROR:

  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/main.py", line 45, in <module>
    cli.main()
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("main"))
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/stars/user/abali/Phd_work/ISBI2023/X3D-Multigrid/CrossAttn_X3d_v2.py", line 872, in <module>
    out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512) print(out)
  File "/home/abali/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 215, in forward
    out = attn_fn(q, k, v, mask = mask, attn_bias = attn_bias, causal = self.causal, q_bucket_size = q_bucket_size, k_bucket_size = k_bucket_size)
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 127, in memory_efficient_attention
    exp_weight_chunk, weighted_value_chunk, weight_max_chunk = summarize_qkv_fn(
  File "/home/abali/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
TypeError: save_for_backward can only save variables, but argument 5 is of type bool

opened by aliabid2243 1

Checkpointing is not compatible with .grad() or when an `inputs` parameter is passed to .backward()

https://github.com/lucidrains/memory-efficient-attention-pytorch/blob/35559a05572f9d4eb982a8e2e399b40a2d61b85c/memory_efficient_attention_pytorch/memory_efficient_attention.py#L95

Should this be:

summarize_qkv_fn = summarize_qkv_chunk if needs_backwards else checkpointed_summarize_qkv_chunk

instead of:

summarize_qkv_fn = checkpointed_summarize_qkv_chunk if needs_backwards else summarize_qkv_chunk

opened by vrobot 0

Releases (0.1.1)

0.1.1 (Dec 30, 2022)
0.1.0 (Dec 30, 2022)
0.0.27 (Nov 1, 2022)
0.0.26 (Jul 23, 2022)
0.0.25 (Jul 23, 2022)
0.0.24 (Jul 23, 2022)
0.0.23 (Jul 23, 2022)
0.0.22 (Jul 23, 2022)
0.0.21 (Jul 23, 2022)
0.0.20 (Jul 23, 2022)
0.0.19 (Jul 23, 2022)
0.0.18 (Jul 23, 2022)
0.0.17 (Mar 22, 2022)
0.0.16 (Mar 21, 2022)
0.0.15 (Mar 13, 2022)
0.0.14 (Mar 4, 2022)
0.0.12 (Mar 4, 2022)
0.0.11 (Mar 4, 2022)
0.0.10 (Mar 4, 2022)
0.0.9 (Mar 4, 2022)
0.0.8 (Mar 4, 2022)
0.0.7 (Mar 4, 2022)
0.0.6 (Mar 4, 2022)
0.0.5 (Mar 4, 2022)
0.0.4 (Mar 4, 2022)
0.0.2 (Mar 4, 2022)
0.0.1 (Mar 3, 2022)

Owner

Phil Wang

Working with Attention. It's all we need

