Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Last update: Dec 31, 2022

Overview

Data Augmentation using Pre-trained Transformer Models

Code associated with the Data Augmentation using Pre-trained Transformer Models paper

The code contains implementations of the following data augmentation methods:

EDA (Baseline)
Backtranslation (Baseline)
CBERT (Baseline)
BERT Prepend (Our paper)
GPT-2 Prepend (Our paper)
BART Prepend (Our paper)

Datasets

In the paper, we use three datasets from the following resources:

Low-data regime experiment setup

Run the src/utils/download_and_prepare_datasets.sh script to prepare all datasets.
download_and_prepare_datasets.sh performs the following steps:

Download the data from GitHub
Replace numeric labels with text labels for the STSA-2 and TREC datasets
Create 15 random splits of train and dev data for each dataset
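
The label replacement and splitting steps above can be sketched in Python as follows. This is a minimal illustration, not the script's actual implementation: the function names, the example label map, the dev fraction, and the seeding scheme are all assumptions introduced here, and the real script may sample its splits differently.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (label, text)

# Hypothetical label map for illustration only; the real script's mapping may differ.
EXAMPLE_LABEL_NAMES: Dict[str, str] = {"0": "negative", "1": "positive"}


def replace_numeric_labels(
    examples: Sequence[Example], label_names: Dict[str, str]
) -> List[Example]:
    """Swap numeric label ids for text labels; unknown labels are left unchanged."""
    return [(label_names.get(label, label), text) for label, text in examples]


def make_random_splits(
    examples: Sequence[Example],
    num_splits: int = 15,
    dev_fraction: float = 0.2,  # assumed value, not taken from the paper
    seed: int = 0,
) -> List[Tuple[List[Example], List[Example]]]:
    """Return num_splits independent (train, dev) partitions of examples."""
    splits = []
    for split_id in range(num_splits):
        # A separate seeded generator per split keeps every split reproducible.
        rng = random.Random(seed + split_id)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        dev_size = max(1, int(len(shuffled) * dev_fraction))
        splits.append((shuffled[dev_size:], shuffled[:dev_size]))
    return splits
```

Averaging results over several random splits, rather than reporting a single one, reduces the variance that comes from picking a small training set at random.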

Dependencies

To run this code, you need the following dependencies:

PyTorch 1.5
fairseq 0.9
transformers 2.9
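
Because these are older releases, it can help to confirm what is installed before running anything. The sketch below is one possible check, not part of the repository: the helper names are hypothetical, and the distribution names "torch", "fairseq", and "transformers" assume the packages were installed from PyPI under those names.

```python
from importlib import metadata
from typing import Callable, Dict, List

# Version prefixes come from the dependency list above; the distribution
# names are assumptions about how the packages were installed.
REQUIRED: Dict[str, str] = {"torch": "1.5", "fairseq": "0.9", "transformers": "2.9"}


def matches_prefix(installed: str, required: str) -> bool:
    """True when installed equals required or extends it (1.5.1 matches 1.5)."""
    return installed == required or installed.startswith(required + ".")


def find_mismatches(
    required: Dict[str, str] = REQUIRED,
    get_version: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Return one message per package that is missing or has another version."""
    problems = []
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need {wanted})")
            continue
        if not matches_prefix(found, wanted):
            problems.append(f"{name}: found {found}, need {wanted}")
    return problems
```

Calling find_mismatches() with no arguments inspects the current environment and returns an empty list when all three packages match.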

How to run

To run a data augmentation experiment for a given dataset, run the corresponding bash script in the scripts folder. For example, to run data augmentation on the SNIPS dataset:

run scripts/bart_snips_lower.sh for the BART experiment
run scripts/bert_snips_lower.sh for the rest of the data augmentation methods
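
For anyone driving the experiments from Python, the choice between the two SNIPS scripts above can be sketched as below. The function names are hypothetical and not part of the repository; only the two script paths come from this README, and the runner assumes bash is available and that repo_root points at a checkout of the code.

```python
import subprocess
from pathlib import Path


def script_for_snips(method: str) -> Path:
    """Pick the SNIPS script: BART has its own, every other method shares one."""
    is_bart = method.strip().lower().startswith("bart")
    name = "bart_snips_lower.sh" if is_bart else "bert_snips_lower.sh"
    return Path("scripts") / name


def run_snips_experiment(method: str, repo_root: str = ".") -> subprocess.CompletedProcess:
    """Run the selected script from the repository root; raises if it fails."""
    return subprocess.run(
        ["bash", str(script_for_snips(method))], cwd=repo_root, check=True
    )
```

For example, script_for_snips("GPT-2 Prepend") resolves to the shared script, while script_for_snips("BART Prepend") resolves to the BART one.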

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reach out to [email protected] with any questions related to this code.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 license.
