Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Overview

Training COMET using seq2seq setting

Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarization.py in the official example codes for transformers version 4.16.0.dev0.

The ./deepspeed/ folder is copied from https://github.com/huggingface/transformers/tree/master/tests/deepspeed .

The training data of ATOMIC2020 can be downloaded at https://allenai.org/data/atomic-2020. You need to convert the .tsv file to .csv to be compatible with the dataloader in transformers.

Dependencies

python

torch==1.7.1
cudatoolkit=11.0
transformers==4.15.0
deepspeed==0.5.10

others

GCC/G++ 5.2.0 (to complie deepspeed ops)

Usage

1. Normal training without memory optimization:

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 

2. Train with gradient_checkpointing=True. Smaller memory usage, meanwhile lower training speed.

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --gradient_checkpointing

3. Train with DeepSpeed (Either zero-stage2 or zero-stage3)

# google/t5-3B training, on 2080Ti (11GB)
deepspeed --include localhost:0,1 --master_port 30000 models/comet_seq2seq.py \
    --deepspeed deepspeed/ds_config_zero2.json \
    --model_name_or_path google/t5-xl-lm-adapt \
    --do_train \
    --train_file data/kg/atomic2020_data-feb2021/train.csv \
    --source_prefix "" \
    --output_dir data/models/comet/t5_xl_s2_bs32_fp16 \
    --overwrite_output_dir \
    --gradient_accumulation_steps=1 \
    --per_device_train_batch_size=16 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16

4. Comparison of memory usage of different memory optimization methods

Compare the memory usage on NVIDIA RTX A6000 (48685MB memory) and Nvidia GeForce 3090 (24268MB memory).

1. fp16

T5-3B: effects of fp16. A 20% reduce of memory size.

Device fp16 Batch Size x Grad-Accum x Num-GPU Memory Usage Time to Train a Batch
vanilla A6000 False 8x4x1 47.5k M 1.5s/32ex
vanilla A6000 True 8x4x1 31k M 1.0s/32ex
vanilla 3090 False 1x32x1 -
vanilla 3090 True 1x32x1 -

2. gradient_checkpointing

T5-3B: Effects of gradient_checkpointing.

Device fp16 Batch Size x Grad-Accum x Num-GPU Memory Usage Time to Train a Batch
vanilla A6000 False 8x4x1 47k M 1.5s/32ex
vanilla A6000 True 8x4x1 31k M 1.0s/32ex
grad-ckpt A6000 False 8x4x1 46.4k M 1.3s/32ex
grad-ckpt A6000 True 8x4x1 23.9k M 1.1/32ex
vanilla 3090 True 1x32x1 -
grad-ckpt 3090 True 1x32x1 23.8k M 15s/32ex

3. Deepspeed stage 2

T5-3B: Effects of deepspeed.

Device fp16 Batch Size x Grad-Accum x Num-GPU Memory Usage Time to Train a Batch
vanilla 3090 True 1x32x1 -
grad-ckpt 3090 True 1x32x1 23k M 13.5s/32ex
stage2 3090 True 32x1x1 20.3k M 7.5s/32ex
stage2 3090 True 16x1x2 20.3k M 6.36s/32ex
stage2 3090 True 32x1x2 20.3k M 3.75s/32ex

4. Deepspeed stage 3

stage3 will lead to smaller usage of memory but way smaller training speed.

5. Automatic Evaluation Result on ATOMIC2020 data

BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
T5-3B (no deepspeed), lr1e-5, epoch 3 0.346 0.184 0.12 0.084 0.19 0.422 0.646
T5-3B (no deepspeed), lr1e-5, epoch 2 0.348 0.185 0.121 0.085 0.19 0.424 0.651
T5-3B (no deepspeed), lr1e-5, epoch 1 0.343 0.177 0.113 0.079 0.186 0.416 0.629
T5-3B (ds_stage2, fp16) epoch 3 0.340 0.182 0.118 0.083 0.189 0.418 0.637
T5-3B (ds_stage2, fp16) epoch 2 0.337 0.177 0.114 0.078 0.189 0.419 0.633
T5-3B (ds_stage2, fp16) epoch 1 0.335 0.174 0.112 0.076 0.186 0.415 0.632

Useful discussions regarding environment setups

TODO

DeepSpeed without Trainer(): https://huggingface.co/docs/transformers/main_classes/deepspeed#deepspeed-non-trainer-integration

Owner
tqfang
Ph.D. at HKUST, interested in commonsense in NLP
tqfang
Script and models for clustering LAION-400m CLIP embeddings.

clustering-laion400m Script and models for clustering LAION-400m CLIP embeddings. Models were fit on the first million or so image embeddings. A subje

Peter Baylies 22 Oct 04, 2022
Analyse japanese ebooks using MeCab to determine the difficulty level for japanese learners

japanese-ebook-analysis This aim of this project is to make analysing the contents of a japanese ebook easy and streamline the process for non-technic

Christoffer Aakre 14 Jul 23, 2022
Natural Language Processing with transformers

we want to create a repo to illustrate usage of transformers in chinese

Datawhale 763 Dec 27, 2022
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022
keras implement of transformers for humans

keras implement of transformers for humans

苏剑林(Jianlin Su) 4.8k Jan 03, 2023
DaCy: The State of the Art Danish NLP pipeline using SpaCy

DaCy: A SpaCy NLP Pipeline for Danish DaCy is a Danish preprocessing pipeline trained in SpaCy. At the time of writing it has achieved State-of-the-Ar

Kenneth Enevoldsen 71 Jan 06, 2023
Code for the paper TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks

TestRank in Pytorch Code for the paper TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks by Yu Li, Min Li, Qiuxia Lai, Ya

3 May 19, 2022
Unofficial PyTorch implementation of Google AI's VoiceFilter system

VoiceFilter Note from Seung-won (2020.10.25) Hi everyone! It's Seung-won from MINDs Lab, Inc. It's been a long time since I've released this open-sour

MINDs Lab 881 Jan 03, 2023
The NewSHead dataset is a multi-doc headline dataset used in NHNet for training a headline summarization model.

This repository contains the raw dataset used in NHNet [1] for the task of News Story Headline Generation. The code of data processing and training is available under Tensorflow Models - NHNet.

Google Research Datasets 31 Jul 15, 2022
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
Library for Russian imprecise rhymes generation

TOM RHYMER Library for Russian imprecise rhymes generation. Quick Start Generate rhymes by any given rhyme scheme (aabb, abab, aaccbb, etc ...): from

Alexey Karnachev 6 Oct 18, 2022
Residual2Vec: Debiasing graph embedding using random graphs

Residual2Vec: Debiasing graph embedding using random graphs This repository contains the code for S. Kojaku, J. Yoon, I. Constantino, and Y.-Y. Ahn, R

SADAMORI KOJAKU 5 Oct 12, 2022
CMeEE 数据集医学实体抽取

医学实体抽取_GlobalPointer_torch 介绍 思想来自于苏神 GlobalPointer,原始版本是基于keras实现的,模型结构实现参考现有 pytorch 复现代码【感谢!】,基于torch百分百复现苏神原始效果。 数据集 中文医学命名实体数据集 点这里申请,很简单,共包含九类医学

85 Dec 28, 2022
Optimal Transport Tools (OTT), A toolbox for all things Wasserstein.

Optimal Transport Tools (OTT), A toolbox for all things Wasserstein. See full documentation for detailed info on the toolbox. The goal of OTT is to pr

OTT-JAX 255 Dec 26, 2022
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

Ryuichi Yamamoto 1.8k Dec 30, 2022
Seonghwan Kim 24 Sep 11, 2022
NLP, Machine learning

Netflix-recommendation-system NLP, Machine learning About Recommendation algorithms are at the core of the Netflix product. It provides their members

Harshith VH 6 Jan 12, 2022
End-to-end MLOps pipeline of a BERT model for emotion classification.

image source EmoBERT-MLOps The goal of this repository is to build an end-to-end MLOps pipeline based on the MLOps course from Made with ML, but this

Dimitre Oliveira 4 Nov 06, 2022
Learning to Rewrite for Non-Autoregressive Neural Machine Translation

RewriteNAT This repo provides the code for reproducing our proposed RewriteNAT in EMNLP 2021 paper entitled "Learning to Rewrite for Non-Autoregressiv

Xinwei Geng 20 Dec 25, 2022
This repository will contain the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields Project Page | Paper | Supplementary | Video | Slides | Blog | Talk If

1.1k Dec 27, 2022