Implementation of Vaswani, Ashish, et al. "Attention is all you need."

Overview

Attention Is All You Need Paper Implementation

This is my from-scratch implementation of the original transformer architecture from the following paper: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

Table of Contents

About

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. " - Abstract

Transformers came to be a groundbreaking advance in neural network architectures which revolutionized what we can do with NLP and beyond. To name a few applications consider the application of BERT to Google search and GPT to Github Copilot. Those architectures are upgrades on the original transformer architecture described in this seminal paper. The goal of this repository is to provide an implementation that is easy to follow and understand while reading the paper. Setup is easy and everything is runnable on CPU for learning purposes.

✔️ Highly customizable configuration and training loop
✔️ Runnable on CPU and GPU
✔️ W&B integration for detailed logging of every metric
✔️ Pretrained models and their training details
✔️ Gradient Accumulation
✔️ Label smoothing
✔️ BPE and WordLevel Tokenizers
✔️ Dynamic Batching
✔️ Batch Dataset Processing
✔️ Bleu-score calculation during training
✔️ Documented dimensions for every step of the architecture
✔️ Shown progress of translation for an example after every epoch
✔️ Tutorial notebook (Coming soon...)

Setup

Environment

Using Miniconda/Anaconda:

  1. cd path_to_repo
  2. conda env create
  3. conda activate attention-is-all-you-need-paper

Note: Depending on your GPU you might need to switch cudatoolkit to version 10.2

Pretrained Models

To download the pretrained model and tokenizer run:

python scripts/download_pretrained.py

Note: If prompted about wandb setting select option 3

Usage

Training

Before starting training you can either choose a configuration out of available ones or create your own inside a single file src/config.py. The available parameters to customize, sorted by categories, are:

  • Run 🚅 :
    • RUN_NAME - Name of a training run
    • RUN_DESCRIPTION - Description of a training run
    • RUNS_FOLDER_PTH - Saving destination of a training run
  • Data 🔡 :
    • DATASET_SIZE - Number of examples you want to include from WMT14 en-de dataset (max 4,500,000)
    • TEST_PROPORTION - Test set proportion
    • MAX_SEQ_LEN - Maximum allowed sequence length
    • VOCAB_SIZE - Size of the vocabulary (good choice is dependant on the tokenizer)
    • TOKENIZER_TYPE - 'wordlevel' or 'bpe'
  • Training 🏋️‍♂️ :
    • BATCH_SIZE - Batch size
    • GRAD_ACCUMULATION_STEPS - Over how many batches to accumulate gradients before optimizing the parameters
    • WORKER_COUNT - Number of workers used in dataloaders
    • EPOCHS - Number of epochs
  • Optimizer 📉 :
    • BETAS - Adam beta parameter
    • EPS - Adam eps parameter
  • Scheduler ⏲️ :
    • N_WARMUP_STEPS - How many warmup steps to use in the scheduler
  • Model 🤖 :
    • D_MODEL - Model dimension
    • N_BLOCKS - Number of encoder and decoder blocks
    • N_HEADS - Number of heads in the Multi-Head attention mechanism
    • D_FF - Dimension of the Position Wise Feed Forward network
    • DROPOUT_PROBA - Dropout probability
  • Other 🧰 :
    • DEVICE - 'gpu' or 'cpu'
    • MODEL_SAVE_EPOCH_CNT - After how many epochs to save a model checkpoint
    • LABEL_SMOOTHING - Whether to apply label smoothing

Once you decide on the configuration edit the config_name in train.py and do:

$ cd src
$ python train.py

Inference

For inference I created a simple app with Streamlit which runs in your browser. Make sure to train or download the pretrained models beforehand. The app looks at the model directory for model and tokenizer checkpoints.

$ streamlit run app/inference_app.py
app.mp4

Data

Same WMT 2014 data is used for the English-to-German translation task. Dataset contains about 4,500,000 sentence pairs but you can manually specify the dataset size if you want to lower it and see some results faster. When training is initiated the dataset is automatically downloaded, preprocessed, tokenized and dataloaders are created. Also, a custom batch sampler is used for dynamic batching and padding of sentences of similar lengths which speeds up training. HuggingFace 🤗 datasets and tokenizers are used to achieve this very fast.

Architecture

The original transformer architecture presented in this paper consists of an encoder and decoder part purposely included to match the seq2seq problem type of machine translation. There are also encoder-only (e.g. BERT) and decoder-only (e.g. GPT) transformer architectures, those won't be covered here. One of the main features of transformers , in general, is parallelized sequence processing which RNN's lack. Main ingredient here is the attention mechanism which enables creating modified word representations (attention representations) that take into account the word's meaning in relation to other words in a sequence (e.g. the word "bank" can represent a financial institution or land along the edge of a river as in "river bank"). Depending on how we think about a word we may choose to represent it differently. This transcends the limits of traditional word embeddings.

For a detailed walkthrough of the architecture check the notebooks/tutorial.ipynb

Weights and Biases Logs

Weights and Biases is a very powerful tool for MLOps. I integrated it with this project to automatically provide very useful logs and visualizations when training. In fact, you can take a look at how the training looked for the pretrained models at this project link. All logs and visualizations are synced real time to the cloud.

When you start training you will be asked:

wandb: (1) Create W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 

For creating and syncing the visualizations to the cloud you will need a W&B account. Creating an account and using it won't take you more than a minute and it's free. If don't want to visualize results select option 3.

Citation

Please use this bibtex if you want to cite this repository:

@misc{Koch2021attentionisallyouneed,
  author = {Koch, Brando},
  title = {attention-is-all-you-need},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bkoch4142/MISSING}},
}

License

This repository is under an MIT License

License: MIT

Owner
Brando Koch
Machine Learning Engineer with experience in ML, DL , NLP & CV specializing in ConversationalAI & NLP.
Brando Koch
Neural Style and MSG-Net

PyTorch-Style-Transfer This repo provides PyTorch Implementation of MSG-Net (ours) and Neural Style (Gatys et al. CVPR 2016), which has been included

Hang Zhang 904 Dec 21, 2022
The official repository for Deep Image Matting with Flexible Guidance Input

FGI-Matting The official repository for Deep Image Matting with Flexible Guidance Input. Paper: https://arxiv.org/abs/2110.10898 Requirements easydict

Hang Cheng 51 Nov 10, 2022
Pixel-level Crack Detection From Images Of Levee Systems : A Comparative Study

PIXEL-LEVEL CRACK DETECTION FROM IMAGES OF LEVEE SYSTEMS : A COMPARATIVE STUDY G

Manisha Panta 2 Jul 23, 2022
这是一个yolox-pytorch的源码,可以用于训练自己的模型。

YOLOX:You Only Look Once目标检测模型在Pytorch当中的实现 目录 性能情况 Performance 实现的内容 Achievement 所需环境 Environment 小技巧的设置 TricksSet 文件下载 Download 训练步骤 How2train 预测步骤

Bubbliiiing 613 Jan 05, 2023
This is the repository for CVPR2021 Dynamic Metric Learning: Towards a Scalable Metric Space to Accommodate Multiple Semantic Scales

Intro This is the repository for CVPR2021 Dynamic Metric Learning: Towards a Scalable Metric Space to Accommodate Multiple Semantic Scales Vehicle Sam

39 Jul 21, 2022
PyTorch implementation of the Deep SLDA method from our CVPRW-2020 paper "Lifelong Machine Learning with Deep Streaming Linear Discriminant Analysis"

Lifelong Machine Learning with Deep Streaming Linear Discriminant Analysis This is a PyTorch implementation of the Deep Streaming Linear Discriminant

Tyler Hayes 41 Dec 25, 2022
Non-Official Pytorch implementation of "Face Identity Disentanglement via Latent Space Mapping" https://arxiv.org/abs/2005.07728 Using StyleGAN2 instead of StyleGAN

Face Identity Disentanglement via Latent Space Mapping - Implement in pytorch with StyleGAN 2 Description Pytorch implementation of the paper Face Ide

Daniel Roich 58 Dec 24, 2022
Implementation for Panoptic-PolarNet (CVPR 2021)

Panoptic-PolarNet This is the official implementation of Panoptic-PolarNet. [ArXiv paper] Introduction Panoptic-PolarNet is a fast and robust LiDAR po

Zixiang Zhou 126 Jan 01, 2023
In the AI for TSP competition we try to solve optimization problems using machine learning.

AI for TSP Competition Goal In the AI for TSP competition we try to solve optimization problems using machine learning. The competition will be hosted

Paulo da Costa 11 Nov 27, 2022
.NET bindings for the Pytorch engine

TorchSharp TorchSharp is a .NET library that provides access to the library that powers PyTorch. It is a work in progress, but already provides a .NET

Matteo Interlandi 17 Aug 30, 2021
An implementation of the WHATWG URL Standard in JavaScript

whatwg-url whatwg-url is a full implementation of the WHATWG URL Standard. It can be used standalone, but it also exposes a lot of the internal algori

314 Dec 28, 2022
This repository includes the official project for the paper: TransMix: Attend to Mix for Vision Transformers.

TransMix: Attend to Mix for Vision Transformers This repository includes the official project for the paper: TransMix: Attend to Mix for Vision Transf

Jie-Neng Chen 130 Jan 01, 2023
Capsule endoscopy detection DACON challenge

capsule_endoscopy_detection (DACON Challenge) Overview Yolov5, Yolor, mmdetection기반의 모델을 사용 (총 11개 모델 앙상블) 모든 모델은 학습 시 Pretrained Weight을 yolov5, yolo

MAILAB 11 Nov 25, 2022
This is the official code of our paper "Diversity-based Trajectory and Goal Selection with Hindsight Experience Relay" (PRICAI 2021)

Diversity-based Trajectory and Goal Selection with Hindsight Experience Replay This is the official implementation of our paper "Diversity-based Traje

Tianhong Dai 6 Jul 18, 2022
Constructing interpretable quadratic accuracy predictors to serve as an objective function for an IQCQP problem that represents NAS under latency constraints and solve it with efficient algorithms.

IQNAS: Interpretable Integer Quadratic programming Neural Architecture Search Realistic use of neural networks often requires adhering to multiple con

0 Oct 24, 2021
The (Official) PyTorch Implementation of the paper "Deep Extraction of Manga Structural Lines"

MangaLineExtraction_PyTorch The (Official) PyTorch Implementation of the paper "Deep Extraction of Manga Structural Lines" Usage model_torch.py [sourc

Miaomiao Li 82 Jan 02, 2023
PyTorch(Geometric) implementation of G^2GNN in "Imbalanced Graph Classification via Graph-of-Graph Neural Networks"

This repository is an official PyTorch(Geometric) implementation of G^2GNN in "Imbalanced Graph Classification via Graph-of-Graph Neural Networks". Th

Yu Wang (Jack) 13 Nov 18, 2022
Navigating StyleGAN2 w latent space using CLIP

Navigating StyleGAN2 w latent space using CLIP an attempt to build sth with the official SG2-ADA Pytorch impl kinda inspired by Generating Images from

Mike K. 55 Dec 06, 2022
Forecasting with Gradient Boosted Time Series Decomposition

ThymeBoost ThymeBoost combines time series decomposition with gradient boosting to provide a flexible mix-and-match time series framework for spicy fo

131 Jan 08, 2023
Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology (LMRL Workshop, NeurIPS 2021)

Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology Self-Supervised Vision Transformers Learn Visual Concepts in Histopatholog

Richard Chen 95 Dec 24, 2022