Code for text augmentation method leveraging large-scale language models

Overview

HyperMix

Code for our paper GPT3Mix, and for running classification experiments that use GPT-3 prompt-based data augmentation.

Getting Started

Installing Packages

The main dependencies can be installed via pip install -r requirements.txt.

Usage

The main code is run through main.py. Use --help for the full list of options.

python main.py --help

The code will automatically use the first GPU device, if detected.

A typical command to run BERT-base on the 1% subsample of the SST-2 dataset and compute the average over all runs is shown below. As written it passes --num-trials 1; for 10 runs, --num-trials would presumably be set to 10.

python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter none \
    --save-dir out
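As a rough illustration of what "run on a 1% subsample and average over runs" means, the following minimal Python sketch draws a seeded random subset and averages a per-trial score. It is not the code in main.py; the names subsample, average_over_trials, and run_trial are hypothetical and exist only for this example.

```python
import random
import statistics


def subsample(examples, ratio, seed):
    """Return a reproducible random subset holding `ratio` of the examples."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    k = max(1, round(len(examples) * ratio))
    return rng.sample(examples, k)


def average_over_trials(run_trial, seeds):
    """Call the (hypothetical) run_trial once per seed and average the scores."""
    scores = [run_trial(seed) for seed in seeds]
    return statistics.mean(scores)


if __name__ == "__main__":
    # Toy stand-in for a labeled dataset: (text, label) pairs.
    data = [(f"sentence {i}", i % 2) for i in range(1000)]
    subset = subsample(data, ratio=0.01, seed=0)
    print(len(subset), "examples in the 1% subsample")
```

Using a dedicated random.Random instance per call keeps each subsample reproducible from its seed alone, independent of any other random draws in the program.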

The script will create a directory named out in the current working directory and save the script log as out/run.log. It will also save any augmentations created during the experiments (if any augmentation is enabled).

To test GPT3Mix, prepare an OpenAI API key as described in the OpenAI Key section below, then use the following command:

python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter gpt3-mix \
    --save-dir out

Managing Seeds

In the commands above, the script automatically generates the seeds used for sampling data and optimizing models. The seed from which each individual seed is generated is called the "master seed" and can be set with the --master-data-seed and --master-exp-seed options. As the option names suggest, they govern data sampling and the optimization of a freshly initialized model, respectively.
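The idea of a master seed can be sketched as follows: a single seed initializes a random generator, which then produces one seed per trial. This is only an assumption about the general technique, not the repository's actual implementation; derive_seeds and the example seed values are hypothetical.

```python
import random


def derive_seeds(master_seed, num_trials):
    """Derive one 32-bit seed per trial from a single master seed."""
    rng = random.Random(master_seed)
    return [rng.randrange(2**32) for _ in range(num_trials)]


if __name__ == "__main__":
    # Hypothetical values standing in for --master-data-seed / --master-exp-seed.
    data_seeds = derive_seeds(master_seed=42, num_trials=3)
    exp_seeds = derive_seeds(master_seed=1337, num_trials=3)
    print("data seeds:", data_seeds)
    print("experiment seeds:", exp_seeds)
```

Because the derived seeds depend only on the master seed, rerunning with the same master seed reproduces the same sequence of per-trial seeds.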

Sometimes the seeds need to be set manually rather than derived from the master seeds. Manual seeding is done via the --data-seeds option. If this option is given, the master data seed is ignored. Only manual data seeding is supported for now.
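The precedence described above (explicit data seeds override the master data seed) might look like the sketch below. The function resolve_data_seeds is hypothetical, and the check that the number of manual seeds matches the number of trials is an assumption of this example, not documented behavior of main.py.

```python
import random


def resolve_data_seeds(num_trials, master_data_seed=None, data_seeds=None):
    """Return the per-trial data seeds, preferring manually supplied ones."""
    if data_seeds is not None:
        # Manual seeds win; the master data seed is ignored entirely.
        # Assumption for this sketch: one manual seed is expected per trial.
        if len(data_seeds) != num_trials:
            raise ValueError("expected one data seed per trial")
        return list(data_seeds)
    # Otherwise fall back to deriving seeds from the master data seed.
    rng = random.Random(master_data_seed)
    return [rng.randrange(2**32) for _ in range(num_trials)]


if __name__ == "__main__":
    print(resolve_data_seeds(2, master_data_seed=7, data_seeds=[11, 22]))
    print(resolve_data_seeds(2, master_data_seed=7))
```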

OpenAI Key

Store your OpenAI API key in a file named openai-key in the current working directory. The main script detects the key automatically when it runs. Alternatively, the key can be passed with the --api-key option (not recommended).
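A minimal sketch of this lookup is shown below. It assumes that a key given on the command line takes precedence over the openai-key file, which the text above does not state; resolve_api_key and its parameters are hypothetical names for illustration only.

```python
from pathlib import Path


def resolve_api_key(cli_key=None, key_file="openai-key"):
    """Return the API key from the CLI value or from a file in the working directory."""
    if cli_key:
        # Assumption: an explicitly passed key overrides the key file.
        return cli_key.strip()
    path = Path.cwd() / key_file
    if path.is_file():
        key = path.read_text(encoding="utf-8").strip()
        if key:
            return key
    raise FileNotFoundError(f"no API key given and no usable key file at {path}")


if __name__ == "__main__":
    try:
        key = resolve_api_key()
        print("API key loaded,", len(key), "characters")
    except FileNotFoundError as err:
        print(err)
```

Keeping the key in a file rather than on the command line avoids leaving it in shell history, which is one likely reason the --api-key option is marked as not recommended.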

Other Notes

At the moment we only support data augmentation leveraging OpenAI GPT-3 (GPT3Mix), but we will release an update that supports HyperCLOVA as soon as it becomes available to the public (HyperMix).

Citation

To cite our code or work, please use the following bibtex:

@inproceedings{yoo2021gpt3mix,
	title = "GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation",
	author = "Yoo, Kang Min  and
	  Park, Dongju  and
	  Kang, Jaewook  and
	  Lee, Sang-Woo  and
	  Park, Woomyoung",
	booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
	month = nov,
	year = "2021",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2021.findings-emnlp.192",
	pages = "2225--2239",
}