Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Last update: Dec 17, 2022

Overview

Training COMET using seq2seq setting

This repository trains COMET with AutoModelForSeq2SeqLM from Hugging Face Transformers. The code is adapted from run_summarization.py in the official example scripts for transformers version 4.16.0.dev0.

The ./deepspeed/ folder is copied from https://github.com/huggingface/transformers/tree/master/tests/deepspeed .

The ATOMIC2020 training data can be downloaded at https://allenai.org/data/atomic-2020. Convert the .tsv files to .csv so that they are compatible with the data loader in transformers.
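The conversion can be sketched as follows. This is a minimal example under stated assumptions, not the repository's own preprocessing: it assumes each .tsv row has three tab-separated fields (head, relation, tail) with no header row, and it writes the head_event and tail_event columns that the training commands in this README pass to --text_column and --summary_column. The function name tsv_to_csv and the way the relation is appended to the head are hypothetical; adjust them to match what comet_seq2seq.py expects.

```python
import csv


def tsv_to_csv(tsv_path, csv_path):
    """Convert a (head, relation, tail) TSV file into a two-column CSV.

    Assumption: three tab-separated fields per row and no header row.
    Rows with a different number of fields are skipped.
    Returns the number of rows written (excluding the header).
    """
    written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE keeps quote characters in the raw text untouched.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)
        writer.writerow(["head_event", "tail_event"])
        for row in reader:
            if len(row) != 3:
                continue
            head, relation, tail = row
            # Hypothetical source format: head text followed by the relation.
            writer.writerow([f"{head} {relation}", tail])
            written += 1
    return written


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        count = tsv_to_csv(sys.argv[1], sys.argv[2])
        print(f"wrote {count} rows to {sys.argv[2]}")
```

Saved under a name of your choice, the script takes the input .tsv path and the output .csv path as its two arguments.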

Dependencies

python

torch==1.7.1
cudatoolkit=11.0
transformers==4.15.0
deepspeed==0.5.10

others

GCC/G++ 5.2.0 (to compile DeepSpeed ops)

Usage

1. Normal training without memory optimization:

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5

2. Train with gradient_checkpointing=True. This reduces memory usage at the cost of slower training.

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --gradient_checkpointing

3. Train with DeepSpeed (either ZeRO stage 2 or ZeRO stage 3)

# google/t5-3B training, on 2080Ti (11GB)
deepspeed --include localhost:0,1 --master_port 30000 models/comet_seq2seq.py \
    --deepspeed deepspeed/ds_config_zero2.json \
    --model_name_or_path google/t5-xl-lm-adapt \
    --do_train \
    --train_file data/kg/atomic2020_data-feb2021/train.csv \
    --source_prefix "" \
    --output_dir data/models/comet/t5_xl_s2_bs32_fp16 \
    --overwrite_output_dir \
    --gradient_accumulation_steps=1 \
    --per_device_train_batch_size=16 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16
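
For orientation, the kind of file passed to --deepspeed can be sketched like this. It is a minimal illustration, not the contents of deepspeed/ds_config_zero2.json in this repository (that file is copied from the transformers tests and may contain more settings). It assumes the Hugging Face Trainer integration, where the value "auto" lets the Trainer fill in a setting from its own command-line arguments. The function name build_zero2_config and the output file name are hypothetical.

```python
import json


def build_zero2_config():
    """Return a small ZeRO stage 2 config as a Python dict.

    "auto" values are resolved by the Hugging Face Trainer from flags such as
    --fp16, --per_device_train_batch_size and --gradient_accumulation_steps.
    """
    return {
        "fp16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,                    # partition optimizer states and gradients
            "overlap_comm": True,          # overlap gradient reduction with backward pass
            "contiguous_gradients": True,  # reduce memory fragmentation
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


if __name__ == "__main__":
    # Hypothetical output name; point --deepspeed at whichever file you write.
    with open("ds_config_zero2_sketch.json", "w", encoding="utf-8") as f:
        json.dump(build_zero2_config(), f, indent=2)
```

Leaving the batch-size and fp16 fields as "auto" keeps the config consistent with the command-line flags, so the same file can be reused across the runs compared below.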

4. Comparison of memory usage of different memory optimization methods

Memory usage is compared on an NVIDIA RTX A6000 (48685MB memory) and an NVIDIA GeForce RTX 3090 (24268MB memory).

1. fp16

T5-3B: effect of fp16. In the A6000 runs below, memory usage drops from 47.5k MB to 31k MB (about 35%).

Method	Device	fp16	Batch Size x Grad-Accum x Num-GPU	Memory Usage	Time to Train a Batch
vanilla	A6000	False	8x4x1	47.5k MB	1.5s/32ex
vanilla	A6000	True	8x4x1	31k MB	1.0s/32ex
vanilla	3090	False	1x32x1	❌	-
vanilla	3090	True	1x32x1	❌	-
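
The "Batch Size x Grad-Accum x Num-GPU" column can be read as an effective batch size: the product of the three numbers is the number of examples that contribute to one optimizer step. A small sketch of that arithmetic follows; the helper names are hypothetical and only the settings strings come from the table.

```python
def parse_setting(setting):
    """Split a string like '8x4x1' into (per_device_batch, grad_accum, num_gpus)."""
    per_device_batch, grad_accum, num_gpus = (int(part) for part in setting.split("x"))
    return per_device_batch, grad_accum, num_gpus


def effective_batch_size(per_device_batch, grad_accum, num_gpus):
    """Number of examples that contribute to a single optimizer step."""
    return per_device_batch * grad_accum * num_gpus


if __name__ == "__main__":
    for setting in ["8x4x1", "1x32x1"]:
        print(setting, "->", effective_batch_size(*parse_setting(setting)), "examples per step")
```

Both settings in the table give the same effective batch size, which is why the A6000 and 3090 rows report time per 32 examples.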

2. gradient_checkpointing

T5-3B: Effects of gradient_checkpointing.

Method	Device	fp16	Batch Size x Grad-Accum x Num-GPU	Memory Usage	Time to Train a Batch
vanilla	A6000	False	8x4x1	47k MB	1.5s/32ex
vanilla	A6000	True	8x4x1	31k MB	1.0s/32ex
grad-ckpt	A6000	False	8x4x1	46.4k MB	1.3s/32ex
grad-ckpt	A6000	True	8x4x1	23.9k MB	1.1s/32ex
vanilla	3090	True	1x32x1	❌	-
grad-ckpt	3090	True	1x32x1	23.8k MB	15s/32ex

3. DeepSpeed stage 2

T5-3B: Effects of DeepSpeed.

Method	Device	fp16	Batch Size x Grad-Accum x Num-GPU	Memory Usage	Time to Train a Batch
vanilla	3090	True	1x32x1	❌	-
grad-ckpt	3090	True	1x32x1	23k MB	13.5s/32ex
stage2	3090	True	32x1x1	20.3k MB	7.5s/32ex
stage2	3090	True	16x1x2	20.3k MB	6.36s/32ex
stage2	3090	True	32x1x2	20.3k MB	3.75s/32ex
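
To compare the 3090 rows directly, the "seconds per 32 examples" figures can be converted to throughput and to a speedup over the gradient-checkpointing baseline. The sketch below only re-expresses the timings from the table; the dictionary labels and function name are hypothetical.

```python
# Seconds needed to process 32 examples, copied from the 3090 rows of the table.
SECONDS_PER_32_EXAMPLES = {
    "grad-ckpt 1x32x1": 13.5,
    "stage2 32x1x1": 7.5,
    "stage2 16x1x2": 6.36,
    "stage2 32x1x2": 3.75,
}


def examples_per_second(seconds_per_32):
    """Convert a 'seconds per 32 examples' timing into examples per second."""
    return 32 / seconds_per_32


if __name__ == "__main__":
    baseline = SECONDS_PER_32_EXAMPLES["grad-ckpt 1x32x1"]
    for name, seconds in SECONDS_PER_32_EXAMPLES.items():
        throughput = examples_per_second(seconds)
        speedup = baseline / seconds
        print(f"{name}: {throughput:.2f} ex/s, {speedup:.2f}x vs grad-ckpt")
```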

4. DeepSpeed stage 3

Stage 3 reduces memory usage further but trains much more slowly.

5. Automatic Evaluation Results on ATOMIC2020 data

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
T5-3B (no deepspeed), lr1e-5, epoch 3	0.346	0.184	0.12	0.084	0.19	0.422	0.646
T5-3B (no deepspeed), lr1e-5, epoch 2	0.348	0.185	0.121	0.085	0.19	0.424	0.651
T5-3B (no deepspeed), lr1e-5, epoch 1	0.343	0.177	0.113	0.079	0.186	0.416	0.629
T5-3B (ds_stage2, fp16) epoch 3	0.340	0.182	0.118	0.083	0.189	0.418	0.637
T5-3B (ds_stage2, fp16) epoch 2	0.337	0.177	0.114	0.078	0.189	0.419	0.633
T5-3B (ds_stage2, fp16) epoch 1	0.335	0.174	0.112	0.076	0.186	0.415	0.632

Useful discussions regarding environment setups

Errors building DeepSpeed Ops: https://github.com/microsoft/DeepSpeed/issues/885

TODO

DeepSpeed without Trainer(): https://huggingface.co/docs/transformers/main_classes/deepspeed#deepspeed-non-trainer-integration

Owner

tqfang