PyTorch library for end-to-end training and serving of transformer models

Overview

Russian GPT-2

Google Colab notebook for finetuning:

https://colab.research.google.com/drive/1jwFks82BLyy8x3oxyKpiNdlL1PfKSQwW?usp=sharing

Google Colab notebook for generating the text corpus:

https://colab.research.google.com/drive/1Hsp2508TXMR0ihYOLjKYOzWm9byqg9ue

1. I just want to play with your models

You can try writing with the model at https://porfirevich.ru and with the Telegram chat bot @PorfBot.

You can try poetry with the Telegram chat bot @NeuroPoetBot.

2. What are the results?

Your perplexity will differ depending on the tokenizer, the vocabulary and the dataset. In fact, the better your tokenizer, the worse your perplexity.

Values in the table are perplexity on the validation set.

Huge dataset

| GPT-2 | Small, 124M (BS 64) | Medium, 355M (BS 32) |
|---|---|---|
| Unfreeze 0, LR 24e-4 | 80 epochs, 85-90 | 80 epochs, 81-85 |
| Unfreeze 0, LR 3e-4 | 80 epochs, 75-76 | 100 epochs, 64-65 |
| Unfreeze 0, LR 6e-5 | 80 epochs, 73-73.5 | 40 epochs, 63-63.5 |
| Unfreeze 1, LR 3e-4 | 118 epochs, 51-52 | 142 epochs, 42.3-43.7 |
| Unfreeze 1, LR 6e-5 | 80 epochs, 49-49.5 | 40 epochs, 41-41.6 |
| Unfreeze 2, LR 3e-4 | 70 epochs, 45.5 | 68 epochs, 37.2-38.6 |
| Unfreeze 2, LR 6e-5 | 200 epochs, 41.18-42.19 | 87 epochs, 35.4-35.9 |
| Unfreeze 7, LR 3e-4 | 90 epochs, 35.3-35.9 | 163 epochs, 28.6-29.6 |
| Unfreeze 7, LR 6e-5 | 88 epochs, 32.6-33 | 90 epochs, 27.2-27.5 |
| Unfreeze -1 (all), LR 6e-5 | 160 epochs, 30.5-30.9 | 163 epochs, 23.8-24.15 |

Classics dataset. It's only 500 MB and GPT-2 overfits it pretty fast.

| GPT-2 | Small, 124M | Medium, 355M |
|---|---|---|
| Unfreeze -1 (all) | 28 epochs, 26.22 | 7 epochs, 20.9722 |

Poetry dataset

| GPT-2 | Small, 124M | Medium, 355M |
|---|---|---|
| Unfreeze -1 (all) | 25 epochs, 26.22 | 7 epochs, 48.36 |

Pelevin dataset

| GPT-2 | Small, 124M | Medium, 355M |
|---|---|---|
| Unfreeze -1 (all) | 5 epochs, 44.55 | 3 epochs, 33.38 |

I've trained the model using gradual unfreezing with the '--unfreeze_level' parameter. The sequence was 0, 1, 2, 7, -1 (as in the table with results). When the loss didn't improve for a day, I switched to the next value (e.g. from 2 to 7). You can find my exact scripts in tpu/schedule_small.txt and tpu/schedule_medium.txt.
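For reference, gradual unfreezing boils down to switching requires_grad on progressively more transformer blocks. Below is a minimal sketch with a Hugging Face GPT2LMHeadModel; the repo's actual logic lives in run_lm_finetuning.py, and exactly which parameters stay trainable at level 0 here is an assumption.

from transformers import GPT2LMHeadModel

def unfreeze_last_blocks(model, unfreeze_level):
    # unfreeze_level = 0: train only the embeddings (and the tied LM head),
    # N > 0: additionally train the last N transformer blocks,
    # -1: train everything.
    if unfreeze_level == -1:
        for p in model.parameters():
            p.requires_grad = True
        return
    # Freeze the whole transformer stack first.
    for p in model.transformer.parameters():
        p.requires_grad = False
    # Keep token and position embeddings trainable (new language, new vocabulary).
    for p in model.transformer.wte.parameters():
        p.requires_grad = True
    for p in model.transformer.wpe.parameters():
        p.requires_grad = True
    # Unfreeze the last `unfreeze_level` blocks.
    blocks = model.transformer.h
    for block in blocks[len(blocks) - unfreeze_level:]:
        for p in block.parameters():
            p.requires_grad = True

model = GPT2LMHeadModel.from_pretrained("gpt2")
unfreeze_last_blocks(model, unfreeze_level=2)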

3. I'd like to download your models

The model that isn't fine-tuned on any particular author is here:

pip install awscli
aws s3 sync --no-sign-request s3://models.dobro.ai/gpt2/ru/unfreeze_all gpt2

Folders with the s_ prefix contain the Small (124M) model; folders with the m_ prefix contain the Medium (355M) model.

To understand how to generate text, start by looking at rest.py.
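Below is a rough generation sketch, assuming the standard Hugging Face checkpoint layout and a YTTM vocabulary file (encoder.model) next to the weights, as in section 5.9. rest.py wraps the tokenizer in its own YTEncoder class, so treat this only as an illustration; the path is also illustrative (see section 5.10).

import torch
import youtokentome as yttm
from transformers import GPT2LMHeadModel

model_dir = "gpt2/pelevin/m_checkpoint-3365357"  # illustrative path, see section 5.10

bpe = yttm.BPE(model=f"{model_dir}/encoder.model")
model = GPT2LMHeadModel.from_pretrained(model_dir)
model.eval()

prompt = "Была тёмная ночь. "
ids = bpe.encode([prompt], output_type=yttm.OutputType.ID)[0]
input_ids = torch.tensor([ids])

with torch.no_grad():
    out = model.generate(input_ids, max_length=100, do_sample=True, top_k=50, top_p=0.95)

print(bpe.decode([out[0].tolist()])[0])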

You can also download all the fine-tuned models:

aws s3 sync --no-sign-request s3://models.dobro.ai/gpt2/ru all

The model you can play with on the site is located in the Pelevin folder.

4. I've got a small Russian dataset and I want to finetune your model on it

Download the models (instructions above), choose the model and put it in your output folder. Use a validation set and be careful with overfitting: on a small dataset the model will overfit very fast, within 3-7 epochs. Follow the instructions below, except that you don't need to train your own tokenization vocabulary, because you already have one.
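Perplexity is just exp of the mean per-token cross-entropy loss, so you can watch for overfitting by comparing that number on the training and validation sets:

import math

# Perplexity = exp(mean cross-entropy loss); e.g. a validation loss of about 3.26
# corresponds to a perplexity of about 26, in the range reported above for the
# classics dataset.
print(math.exp(3.26))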

5. I've got a big dataset on my lang and I want to train GPT-2 on it

If you don't have a bunch of GPUs, I'd suggest considering a rented Google TPU. On my Nvidia Titan RTX an epoch takes 70 minutes; the same epoch takes 12.5 minutes on a TPU v3-8. I used fp16 on the GPU, but I can't use bfloat16 on the TPU, because training in bfloat16 is unstable at the moment (an epoch could have taken 8 minutes if it were implemented properly).

You can ask for access to Google's TensorFlow Research Cloud and use TPUs for free for one month.

In the process, I switched the tokenization library from SentencePiece to YTTM. YTTM is better (10% smaller files) and much faster. If for some reason you want to use SentencePiece, the code is still there; just change the tokenizer in the command line.

First, the GPT-2 model learns Russian on a huge dataset (230 GB), and then it learns good Russian on Russian classical literature (500 MB). I use progressive layer unfreezing for transfer learning. The validation set is the correspondence between Leo Tolstoy and the young Mahatma Gandhi.

5.1. Download a fb2 library

Main link

For finetuning: first, second, Dostoyevskiy, Tolstoy, Pushkin, Bulgakov, Gogol, Pelevin

5.2. Install dependencies

sudo xargs -a apt.txt apt install
conda env create -f environment.yml

5.3. Build and install SentencePiece (skip if you use YTTM)

Follow instructions here https://github.com/google/sentencepiece

5.4. Prepare the dataset files

Use corpus/corpus.ipynb on your dataset.

Or in Google Colab: https://colab.research.google.com/drive/1Hsp2508TXMR0ihYOLjKYOzWm9byqg9ue

5.5. Create vocabulary for the YTTM (and SentencePiece) tokenizer

You can skip this step if you only want to finetune the model with the existing vocab.

yttm bpe --data ./corpus/tmp/russian_corpus_for_vocab.txt --model bpe/yt.model --vocab_size 50257 --coverage 0.9999

# SentencePiece
spm_train --input=./corpus/tmp/russian_corpus_for_vocab.txt --model_prefix=bpe/m50 --vocab_size=50257 --user_defined_symbols='<|n|>'
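A quick sanity check of the freshly trained YTTM vocabulary (assuming it was written to bpe/yt.model as in the command above):

import youtokentome as yttm

bpe = yttm.BPE(model="bpe/yt.model")
print(bpe.vocab_size())           # should print 50257
ids = bpe.encode(["Мама мыла раму."], output_type=yttm.OutputType.ID)[0]
print(ids)
print(bpe.decode([ids])[0])       # should round-trip to the original sentence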

5.6. If you want to use Google TPU, go here https://github.com/mgrankin/ru_transformers/tree/master/tpu

5.7. Install fp16 support

Mixed precision training with opt_level O2 gives the exact same loss, but runs much faster and uses less memory. The downside: APEX with O2 doesn't work with DataParallel yet, see https://github.com/NVIDIA/apex/issues/227

5.7.1 Make sure to install the proper bare-metal CUDA.

wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run -O cuda.run
chmod +x cuda.run
sudo ./cuda.run

5.7.2 Apex

export CUDA_HOME=/usr/local/cuda-10.2
git clone https://github.com/NVIDIA/apex
cd apex
# fix setup.py if it complains about a version mismatch
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
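
Once APEX is in place, opt_level O2 is enabled through amp.initialize and amp.scale_loss. run_lm_finetuning.py already does this when you pass --fp16 --fp16_opt_level O2; the toy example below (a linear layer on a CUDA device) only illustrates the mechanism.

import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Cast the model to fp16, keep fp32 master weights, use dynamic loss scaling.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

x = torch.randn(4, 10).cuda()
loss = model(x).float().sum()
optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()   # backward runs on the scaled loss
optimizer.step()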

5.8. Train your model!

cd ru_transformers
conda activate gpt
export TRAIN_FILE=./data/classic

# GPT-2 124M, final perplexity ?

export CUDA_VISIBLE_DEVICES=1
export MODEL_SIZE=gpt2
export OUTPUT=output_yt/s
export BS=8
export LR=5e-5

# GPT-2 355M, final perplexity 18.99?

export CUDA_VISIBLE_DEVICES=2
export MODEL_SIZE=gpt2-medium
export OUTPUT=output_yt/m
export BS=3
export LR=3e-5

# GPT-2 774M, final perplexity 21.09?

export CUDA_VISIBLE_DEVICES=3
export MODEL_SIZE=gpt2-large
export OUTPUT=output_yt/l
export BS=1
export LR=1e-5

# training script

# You shouldn't use --model_name_or_path=$MODEL_SIZE if you want to start with the pre-trained Russian GPT-2. If you set --model_name_or_path=gpt2, you'll start with the English GPT-2. For the Russian GPT-2, download the model, put it into the output dir and use --model_name_or_path=$OUTPUT.
# This step will download an English GPT-2 into $OUTPUT and start training it.
# If you want to start from the Russian GPT-2, skip this step; instead, download the Russian GPT-2 and put it into $OUTPUT manually.
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$MODEL_SIZE \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=1 \
    --fp16 \
    --fp16_opt_level O2 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --unfreeze_level 0

# My dataset is 230 GB and doesn't fit in RAM, so each epoch is a random sample from it. That's why the loop.
while true
do
    python run_lm_finetuning.py \
        --output_dir=$OUTPUT \
        --model_type=gpt2 \
        --model_name_or_path=$OUTPUT \
        --do_train \
        --train_data_file=$TRAIN_FILE \
        --per_gpu_train_batch_size $BS \
        --save_steps=10000 \
        --logging_steps=10 \
        --fp16 \
        --fp16_opt_level O2 \
        --warmup_samples 16000 \
        --learning_rate $LR \
        --overwrite_output_dir \
        --tokenizer_class YTEncoder \
        --tokenizer_name bpe/yt.model \
        --do_eval \
        --evaluate_during_training \
        --eval_steps 1000 \
        --eval_data_file=./data/classic/valid \
        --save_total_limit 30 \
        --num_train_epochs 10.0 \
        --unfreeze_level 0

    sleep 1
done
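
If you want to see what "a random sample per epoch" could look like in isolation, here is a hypothetical helper (not the repo's code; run_lm_finetuning.py handles the sampling internally) that picks a random subset of text files for one pass:

import random
from pathlib import Path

def sample_epoch_files(corpus_dir, n_files, seed=None):
    # Assumes the corpus is a directory tree of plain-text files.
    files = sorted(Path(corpus_dir).glob("**/*.txt"))
    rng = random.Random(seed)
    return rng.sample(files, min(n_files, len(files)))

epoch_files = sample_epoch_files("./data/classic", n_files=100)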


# with decay
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$OUTPUT \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=10 \
    --fp16 \
    --fp16_opt_level O2 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --overwrite_output_dir \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --save_total_limit 30 \
    --num_train_epochs 3.0 \
    --unfreeze_level 0 \
    --lr_decay

# and then repeat with unfreeze_level 1,2,3...

5.9. Save trained model

aws s3 cp output_s/config.json s3://models.dobro.ai/gpt2/ru/small/
aws s3 cp output_s/encoder.model s3://models.dobro.ai/gpt2/ru/small/
aws s3 cp output_s/pytorch_model.bin s3://models.dobro.ai/gpt2/ru/small/

5.10. Deploy the model

git clone https://github.com/mgrankin/ru_transformers.git
cd ru_transformers
mkdir logs
aws s3 sync --no-sign-request s3://models.dobro.ai/gpt2/ru gpt2
cp -R gpt2/pelevin/m_checkpoint-3365357 gpt2/medium
cp -R gpt2/poetry/m_checkpoint-3397989 gpt2/medium/poetry
conda env create -f environment.yml
conda activate gpt
uvicorn rest:app --reload --host 0.0.0.0
# crontab  DEVICE="cuda:1"
# @reboot /bin/bash -c "cd ru_transformers; git pull; source ~/.bashrc; conda activate gpt; DEVICE="cuda:1" uvicorn rest:app --reload --host 0.0.0.0"
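
In `uvicorn rest:app`, rest is the Python module (rest.py) and app is the ASGI application object defined inside it. Below is a stripped-down illustration of that pattern; the real rest.py loads the GPT-2 checkpoints and defines its own routes, so the /hello endpoint here is purely hypothetical.

# minimal_rest.py
from fastapi import FastAPI

app = FastAPI()

@app.get("/hello")
def hello():
    # The real endpoints and model loading live in rest.py.
    return {"message": "hello"}

# Run with: uvicorn minimal_rest:app --host 0.0.0.0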

6. Additional scripts

evaluate_model.py - evaluate your model using an input file or a prompt.

text_processing.py - process your dataset.

to_token_convertor.py - convert a string to tokens, in case you're curious.

Owner: Mikhail Grankin