A Transformer Implementation that is easy to understand and customizable.

Last update: Jan 20, 2022

Overview

Simple Transformer

I've written a series of articles on the transformer architecture and language models on Medium.

This repository contains an implementation of the Transformer architecture presented in the paper Attention Is All You Need by Ashish Vaswani et al.

My goal is to write an implementation that is easy to understand while digging into the nitty-gritty details, which is where the devil is.

Python environment

You can use any Python virtual environment tool, such as venv or conda.

For example, with venv:

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install -e .

Spacy Tokenizer Data Preparation

To use spaCy's tokenizer, make sure to download the language models you need.

For example, the English and German tokenizers can be downloaded as follows:

python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm

Text Data from Torchtext

This project uses text datasets from Torchtext.

from torchtext import datasets

The default configuration uses the Multi30k dataset.

Training

python train.py config_path

The default config path is config/config.yaml.

It is possible to resume training from a checkpoint.

python train.py --checkpoint_path runs/20220108-164720-Multi30k-Transformer/checkpoint-010-2.3343.pt
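
The two invocations above suggest a command-line interface along the following lines. This is a minimal sketch using argparse, not the repository's actual train.py: the function name parse_args and the help strings are hypothetical, and only the default config path and the --checkpoint_path flag come from the text above.

```python
import argparse


def parse_args(argv=None):
    """Parse arguments shaped like the train.py invocations above (assumed layout)."""
    parser = argparse.ArgumentParser(description="Train a Transformer (sketch)")
    # Optional positional argument; falls back to the documented default path.
    parser.add_argument(
        "config_path",
        nargs="?",
        default="config/config.yaml",
        help="path to the YAML configuration file",
    )
    # Optional flag; None means training starts from scratch.
    parser.add_argument(
        "--checkpoint_path",
        default=None,
        help="checkpoint file to resume training from",
    )
    return parser.parse_args(argv)
```

Calling parse_args([]) yields the default config path with no checkpoint, while passing --checkpoint_path fills in the checkpoint and leaves the config path at its default.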

You can run TensorBoard to monitor the training progress.

tensorboard --logdir=runs

The logs are created under runs.

Test

python test.py checkpoint_path

For example:

python test.py runs/20220108-164720-Multi30k-Transformer/checkpoint-010-2.3343.pt

When training starts, config.yaml is copied to the model folder; test.py assumes that this config file exists.
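
One way a script could locate that config file is to look next to the checkpoint it was given. The sketch below assumes the copy is named config.yaml and lives in the same folder as the checkpoint; the helper name find_config is hypothetical, and how test.py actually resolves the file is not shown here.

```python
from pathlib import Path


def find_config(checkpoint_path, config_name="config.yaml"):
    """Return the config file sitting next to a checkpoint (assumed layout)."""
    config_path = Path(checkpoint_path).parent / config_name
    if not config_path.is_file():
        # Fail early with a clear message instead of a later parsing error.
        raise FileNotFoundError(f"Expected a config file at {config_path}")
    return config_path
```

For the checkpoint in the example above, this would look for runs/20220108-164720-Multi30k-Transformer/config.yaml.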

Unit tests

There are some unit tests in the tests folder.

pytest tests

Owner

Naoki Shibuya
