Explainability for Vision Transformers (in PyTorch)

Last update: Jan 04, 2023

Overview

This repository implements methods for explainability in Vision Transformers.

Currently implemented:

Attention Rollout.
Gradient Attention Rollout for class-specific explainability. This is our attempt to build upon and improve Attention Rollout.
Attention Flow (TBD, work in progress).

Includes some tweaks and tricks to get it working:

Different Attention Head fusion methods,
Removing the lowest attentions.
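
As a rough picture of the basic method listed above: Attention Rollout, as described in "Quantifying Attention Flow in Transformers", mixes each layer's attention matrix with the identity (to account for residual connections) and multiplies the matrices across layers. The following is a minimal NumPy sketch of that idea, not the code in this repository. The function names and the assumption that token 0 is the class token are illustrative.

```python
import numpy as np


def attention_rollout(attentions):
    """Roll out per-layer attention maps into a single (tokens, tokens) matrix.

    attentions: list of arrays shaped (heads, tokens, tokens), ordered from
    the first layer to the last. Each row is assumed to sum to 1.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    result = identity.copy()
    for layer_attention in attentions:
        fused = layer_attention.mean(axis=0)               # average over heads
        fused = 0.5 * fused + 0.5 * identity               # residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # keep rows normalized
        result = fused @ result                            # compose with earlier layers
    return result


def cls_mask(rollout):
    """Attention from the class token (assumed index 0) to the patches, as a square grid."""
    patch_scores = rollout[0, 1:]
    width = int(np.sqrt(patch_scores.size))
    mask = patch_scores.reshape(width, width)
    return mask / mask.max()
```

The head fusion and discard tweaks described later in this README would slot in where the sketch averages over heads.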

Usage

From code

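import torch
from vit_grad_rollout import VITAttentionGradRollout

# Load the pretrained DeiT-Tiny model from Torch Hub.
model = torch.hub.load('facebookresearch/deit:main', 'deit_tiny_patch16_224', pretrained=True)
model.eval()
grad_rollout = VITAttentionGradRollout(model, discard_ratio=0.9, head_fusion='max')
# input_tensor: a preprocessed image batch, e.g. shape (1, 3, 224, 224).
mask = grad_rollout(input_tensor, category_index=243)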
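
To look at the result, the mask can be upsampled and blended over the input image. The sketch below is a hedged NumPy illustration, assuming the mask is a small 2D array with values in [0, 1] and the image is a float RGB array whose height and width are multiples of the mask size. The helper overlay_mask is hypothetical and is not part of this repository.

```python
import numpy as np


def overlay_mask(image, mask, alpha=0.5):
    """Blend a coarse attention mask over an RGB image.

    image: (H, W, 3) float array in [0, 1].
    mask:  (h, w) float array in [0, 1], with H % h == 0 and W % w == 0 (assumed).
    """
    scale_y = image.shape[0] // mask.shape[0]
    scale_x = image.shape[1] // mask.shape[1]
    # Nearest-neighbour upsampling: each mask cell becomes a block of pixels.
    upsampled = np.kron(mask, np.ones((scale_y, scale_x)))
    heat = np.zeros_like(image)
    heat[..., 0] = upsampled  # put the attention in the red channel
    return (1 - alpha) * image + alpha * heat
```

A colormap or smoother interpolation would give a nicer picture; this version only shows the shape handling.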

From the command line:

python vit_explain.py --image_path IMAGE_PATH --head_fusion {mean,min,max} --discard_ratio DISCARD_RATIO --category_index CATEGORY_INDEX

If category_index isn't specified, Attention Rollout is used; otherwise, Gradient Attention Rollout is used.
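
The selection rule above can be sketched with argparse. This is a hypothetical illustration, not the contents of vit_explain.py: the flag names come from the command line shown above, while the default values and the choose_method helper are assumptions.

```python
import argparse


def parse_args(argv=None):
    # Flag names follow the README; the defaults here are assumed for illustration.
    parser = argparse.ArgumentParser(description="Sketch of the command-line options")
    parser.add_argument("--image_path", type=str, required=True)
    parser.add_argument("--head_fusion", type=str, default="max",
                        choices=["mean", "min", "max"])
    parser.add_argument("--discard_ratio", type=float, default=0.9)
    parser.add_argument("--category_index", type=int, default=None)
    return parser.parse_args(argv)


def choose_method(args):
    """Return which explainability method applies (hypothetical helper)."""
    if args.category_index is None:
        return "attention_rollout"
    return "gradient_attention_rollout"
```

For example, choose_method(parse_args(["--image_path", "cat.png"])) selects plain Attention Rollout, and adding "--category_index", "243" switches to the gradient variant.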

Note that by default this uses the 'Tiny' model from "Training data-efficient image transformers & distillation through attention" (DeiT), hosted on Torch Hub.

Where did the Transformer pay attention in this image?

[Figure: input image | vanilla Attention Rollout | Attention Rollout with discard_ratio + max fusion]

Gradient Attention Rollout for class-specific explainability

The attention that flows through the transformer passes along information belonging to different classes. Attention Rollout lets us see which locations the network paid attention to, but it tells us nothing about whether it ended up using those locations for the final classification.

We can multiply the attention by the gradient of the target class output, and take the average across the attention heads (while masking out negative attentions), to keep only the attention that contributes to the target category (or categories).
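
The weighting step described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the repository's implementation: the gradients are taken as given (one array per layer, same shape as the attention), negatives are masked before averaging over heads (the order is an assumption), and the rollout across layers uses the same identity-mixing as plain Attention Rollout.

```python
import numpy as np


def grad_weighted_fusion(attention, gradient):
    """Fuse heads after weighting attention by the class gradient.

    attention, gradient: arrays shaped (heads, tokens, tokens).
    Returns a (tokens, tokens) matrix with no negative entries.
    """
    weighted = attention * gradient
    weighted = np.maximum(weighted, 0.0)  # mask out negative contributions
    return weighted.mean(axis=0)          # average over heads


def gradient_attention_rollout(attentions, gradients):
    """Roll out gradient-weighted attention across layers (first layer first)."""
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    result = identity.copy()
    for attention, gradient in zip(attentions, gradients):
        fused = grad_weighted_fusion(attention, gradient)
        fused = 0.5 * fused + 0.5 * identity               # residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # normalize rows
        result = fused @ result
    return result
```

If every gradient is negative, each layer reduces to the identity, so nothing is attributed to the target class.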

Where does the Transformer see a Dog (category 243), and a Cat (category 282)?

Where does the Transformer see a Musket dog (category 161) and a Parrot (category 87)?

Tricks and Tweaks to get this working

Filtering the lowest attentions in every layer

--discard_ratio

Sets the fraction of the lowest attention values to discard in every layer. This removes noise by keeping only the strongest attentions.
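
A minimal sketch of this filtering step, assuming it is applied to the head-fused attention matrix of one layer and that discarded entries are set to zero. The function discard_lowest is a hypothetical helper, and the repository may treat special cases (such as the class token) differently.

```python
import numpy as np


def discard_lowest(fused, discard_ratio):
    """Zero out the lowest `discard_ratio` fraction of entries.

    fused: (tokens, tokens) head-fused attention for one layer.
    Returns a new array; the input is left unchanged.
    """
    flat = fused.flatten()  # flatten() returns a copy
    num_discard = int(flat.size * discard_ratio)
    if num_discard > 0:
        lowest = np.argsort(flat)[:num_discard]  # indices of the smallest values
        flat[lowest] = 0.0
    return flat.reshape(fused.shape)
```

With discard_ratio=0.9, only the top 10% of attention values in the layer survive.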

Results for different values:

Different Attention Head Fusions

The Attention Rollout method suggests taking the average attention across the attention heads, but empirically it looks like taking the minimum value, or the maximum value combined with --discard_ratio, works better.

--head_fusion
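
The three fusion options can be sketched as a reduction over the head axis. This is a minimal NumPy illustration; the function name fuse_heads and the (heads, tokens, tokens) layout are assumptions.

```python
import numpy as np


def fuse_heads(attention, head_fusion="mean"):
    """Reduce (heads, tokens, tokens) attention to a single (tokens, tokens) map."""
    if head_fusion == "mean":
        return attention.mean(axis=0)
    if head_fusion == "max":
        return attention.max(axis=0)
    if head_fusion == "min":
        return attention.min(axis=0)
    raise ValueError(f"Unsupported head_fusion: {head_fusion!r}")
```

Min fusion keeps only attention that every head agrees on, while max fusion keeps anything at least one head attends to, which may be why it benefits from being paired with --discard_ratio.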

[Figure: input image | mean fusion | min fusion]

References

Quantifying Attention Flow in Transformers
timm: a great collection of models in PyTorch and especially the vision transformer implementation
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Credit to https://github.com/jeonsworld/ViT-pytorch for being a good starting point.

Requirements

pip install timm


Owner

Jacob Gildenblat
