The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Last update: Dec 22, 2022

Overview

Cutoff: A Simple Data Augmentation Approach for Natural Language

This repository contains source code necessary to reproduce the results presented in the following paper:

A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

This project is maintained by Dinghan Shen. Feel free to contact [email protected] for any relevant issues.

Natural Language Understanding (e.g. GLUE tasks, etc.)

Prerequisites:

CUDA, cudnn
Python 3.7
PyTorch 1.4.0
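
Before running anything, it can help to confirm that the local environment roughly matches the versions listed above. The snippet below is a minimal sketch, not part of the repository: `check_prerequisites` is a hypothetical helper name, and it only reports what it finds rather than enforcing the exact versions.

```python
import sys


def check_prerequisites():
    """Report the Python version, PyTorch version and CUDA availability.

    Hypothetical helper for illustration; it is not shipped with this repository.
    """
    report = {"python": "%d.%d" % sys.version_info[:2]}
    try:
        import torch  # optional: only inspected if it is installed
    except ImportError:
        report["torch"] = None
        report["cuda"] = False
    else:
        report["torch"] = torch.__version__
        report["cuda"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(name, value)
```

A `torch` value of `None` means PyTorch is not installed in the active environment; a `cuda` value of `False` means no usable GPU was detected.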

Run

Install Huggingface Transformers according to the instructions here: https://github.com/huggingface/transformers.
Download the datasets from the GLUE benchmark:

python download_glue_data.py --data_dir glue_data --tasks all

Fine-tune the RoBERTa-base or RoBERTa-large model with the Cutoff data augmentation strategies:

chmod +x run_glue.sh
./run_glue.sh

Options: different settings and hyperparameters can be selected and specified in the run_glue.sh script:

do_aug: whether augmented examples are used for training.
aug_type: the specific strategy to synthesize Cutoff samples, which can be chosen from: 'span_cutoff', 'token_cutoff' and 'dim_cutoff'.
aug_cutoff_ratio: the fraction of the input to cut, i.e. the span length, the number of tokens or the number of embedding dimensions, depending on aug_type.
aug_ce_loss: the coefficient for the cross-entropy loss over the cutoff examples.
aug_js_loss: the coefficient for the Jensen-Shannon (JS) Divergence consistency loss over the cutoff examples.
TASK_NAME: the downstream GLUE task for fine-tuning.
model_name_or_path: the pre-trained model used for initialization (both RoBERTa-base and RoBERTa-large are supported).
output_dir: the folder where results are saved.
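
To make the options above more concrete, here is a minimal, dependency-free sketch of how the three aug_type strategies and the two loss coefficients could fit together. It is an illustration under stated assumptions, not the repository's implementation: plain Python lists stand in for the (sequence length x hidden size) embedding tensor, all function names are hypothetical, and the real code may differ in details such as how attention masks or special tokens are handled. The loss assumes the usual form of original cross-entropy, plus aug_ce_loss times the cross-entropy on cutoff views, plus aug_js_loss times a JS-style consistency term computed as the mean KL divergence of each view from the average prediction.

```python
import math
import random


def span_cutoff_mask(seq_len, dim, ratio, rng):
    """Mask that zeroes one contiguous span of int(seq_len * ratio) tokens."""
    span = int(seq_len * ratio)
    start = rng.randint(0, seq_len - span)  # randint bounds are inclusive
    return [[0.0 if start <= t < start + span else 1.0] * dim
            for t in range(seq_len)]


def token_cutoff_mask(seq_len, dim, ratio, rng):
    """Mask that zeroes int(seq_len * ratio) randomly chosen tokens."""
    dropped = set(rng.sample(range(seq_len), int(seq_len * ratio)))
    return [[0.0 if t in dropped else 1.0] * dim for t in range(seq_len)]


def dim_cutoff_mask(seq_len, dim, ratio, rng):
    """Mask that zeroes int(dim * ratio) embedding dimensions for every token."""
    dropped = set(rng.sample(range(dim), int(dim * ratio)))
    row = [0.0 if d in dropped else 1.0 for d in range(dim)]
    return [list(row) for _ in range(seq_len)]


def apply_mask(embeddings, mask):
    """Element-wise product of an embedding matrix and a 0/1 mask."""
    return [[e * m for e, m in zip(e_row, m_row)]
            for e_row, m_row in zip(embeddings, mask)]


def js_consistency(prob_views):
    """Mean KL(p_i || p_avg) over the predicted distributions of all views."""
    n = len(prob_views)
    avg = [sum(p[k] for p in prob_views) / n for k in range(len(prob_views[0]))]

    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    return sum(kl(p, avg) for p in prob_views) / n


def total_loss(ce_orig, ce_aug, prob_views, aug_ce_loss, aug_js_loss):
    """Assumed objective: original CE + weighted cutoff CE + weighted consistency."""
    return (ce_orig
            + aug_ce_loss * sum(ce_aug)
            + aug_js_loss * js_consistency(prob_views))


if __name__ == "__main__":
    rng = random.Random(0)
    emb = [[1.0] * 4 for _ in range(6)]  # toy 6-token, 4-dim input
    for make_mask in (span_cutoff_mask, token_cutoff_mask, dim_cutoff_mask):
        print(make_mask.__name__, apply_mask(emb, make_mask(6, 4, 0.5, rng)))
```

In this sketch, span and token cutoff remove whole rows (tokens) while dim cutoff removes whole columns (features), which is why a single aug_cutoff_ratio can drive all three strategies.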

Natural Language Generation (e.g. Translation, etc.)

Please refer to Neural Machine Translation with Data Augmentation for more details.

IWSLT'14 German to English (Transformers)

Task	Setting	Approach	BLEU
iwslt14 de-en	transformer-small	w/o cutoff	36.2
iwslt14 de-en	transformer-small	w/ cutoff	37.6

WMT'14 English to German (Transformers)

Task	Setting	Approach	BLEU
wmt14 en-de	transformer-base	w/o cutoff	28.6
wmt14 en-de	transformer-base	w/ cutoff	29.1
wmt14 en-de	transformer-big	w/o cutoff	29.5
wmt14 en-de	transformer-big	w/ cutoff	30.3

Citation

Please cite our paper in your publications if it helps your research:

@article{shen2020simple,
  title={A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation},
  author={Shen, Dinghan and Zheng, Mingzhi and Shen, Yelong and Qu, Yanru and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.13818},
  year={2020}
}
