The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Last update: Dec 12, 2022

Overview

SuperGen

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Requirements

Before running, you need to first install the required packages by typing following commands (Using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

SuperGen is a Supervision Generation method for zero-shot learning on NLU tasks. Instead of training on task-specific data, SuperGen generates training data guided by label-descriptive prompts with a unidirectional language model and fine-tunes another language model on the generated data.

Training and Test Data: Our method does not use any task-specific data (e.g., original training set). We provide our generated training set and original dev set (used as the test set) of each GLUE task under the data directory: train.json files are the generated training set (after data selection); test.tsv files are the original GLUE dev set (used as the test set for evaluation purpose).
Pretraining Corpus: We provide the processed pretraining corpus (Wikipedia and OpenWebText) for generating training data for sequence-pair tasks under the pretrain_corpus directory; see the README file there for details.

Generating Training Data

The generated training set used in the paper are provided as train.json files under each task directory; you should be able to obtain very similar generated data by following the steps below:

Data Generation: The entry script for generating training data for GLUE tasks is gen_train_data.py. The basic usage is

python gen_train_data.py --task $TASK --label $LABEL --save_dir $SAVE_DIR --num_gen $NUM_GEN

You can generate training data of each label either by setting individual label name $LABEL one at a time or by setting $LABEL=all to generate data for all labels (this will still be done sequentially). You may want to set $NUM_GEN to be larger than the desired training set size, as only those texts with the highest generated probability will be used to form the final training set.

Data Selection: After generating the training data, the final training set can be constructed by running the following:

python src/gen_utils.py --task $TASK --num_select_samples $NUM_SELECT \
                        --read_dir $SAVE_DIR --save_dir $DATA_DIR

Example: We provide an example script run_gen.sh that includes the entire generation process for all GLUE tasks under the setting described in the paper.

Fine-Tuning

The entry script for fine-tuning on generated data is finetune.py. The basic usage is

python finetune.py \
    --task_name $TASK \
    --data_dir data/$TASK \
    --overwrite_output_dir \
    --do_train \
    --do_predict \
    --smooth $SM \
    --momentum $MOMENT \
    --eval_steps $INTERVAL \
    --threshold $TH \
    --reg_weight $REG \
    --temp_ensemble_rampup $RAMP \
    --model_name_or_path $MODEL \
    --max_seq_length 128 \
    --first_sent_limit 100 \
    --per_device_train_batch_size $BS \
    --learning_rate $LR \
    --num_train_epochs 3 \
    --output_dir $OUT_DIR \
    --template $TEMPLATE \
    --mapping $MAPPING \
    --warmup_ratio 0.1 \
    --save_at_last \

Example: We provide an example script run_finetune.sh with command line arguments set up for all GLUE tasks under the setting described in the paper.

Results: When using the same prompt-based fine-tuning pipeline (with the same manual prompts and label words), zero-shot SuperGen even achieves better performance than few-shot LM-BFF using 32 annotated samples per class across seven GLUE classification tasks:

Method	MNLI-m/mm	QQP	QNLI	SST-2	CoLA	RTE	MRPC	AVG
LM-BFF 32-Sample Few-Shot	68.3/70.5	65.5	64.5	92.7	9.3	69.1	74.5	63.6
SuperGen Zero-Shot	72.3/73.8	66.1	73.3	92.8	32.7	65.3	82.2	69.4

Acknowledgement

Some scripts in this repository are adapted from COCO-LM (for COCO-LM model), LM-BFF (for prompt-based fine-tuning) and huggingface transformers (for text generation and GLUE processor/trainer).

Citations

Please cite the following paper if you find the code helpful for your research.

@article{meng2022generating,
  title={Generating Training Data with Language Models: Towards Zero-Shot Language Understanding},
  author={Meng, Yu and Huang, Jiaxin and Zhang, Yu and Han, Jiawei},
  journal={arXiv preprint arXiv:2202.04538},
  year={2022}
}

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Related tags

Overview

SuperGen

Requirements

Overview

Generating Training Data

Fine-Tuning

Acknowledgement

Citations

Owner

Yu Meng

a pytorch implementation of auto-punctuation learned character by character

Simple cross-platform application for DaVinci surgical video frame annotation

The software associated with a paper accepted at EMNLP 2021 titled "Open Knowledge Graphs Canonicalization using Variational Autoencoders".

Pytorch implementation of Implicit Behavior Cloning.

CHERRY is a python library for predicting the interactions between viral and prokaryotic genomes

Expand human face editing via Global Direction of StyleCLIP, especially to maintain similarity during editing.

This code provides various models combining dilated convolutions with residual networks

PECOS - Prediction for Enormous and Correlated Spaces

Analyzing basic network responses to novel classes

Official Pytorch and JAX implementation of "Efficient-VDVAE: Less is more"

A collection of scripts I developed for personal and working projects.

Este conversor criará a medida exata para sua receita de capuccino gelado da grandiosa Rafaella Ballerini!

Training Cifar-10 Classifier Using VGG16

Libraries, tools and tasks created and used at DeepMind Robotics.

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.

Official PyTorch code for Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling (HCFlow, ICCV2021)

Pairwise learning neural link prediction for ogb link prediction

A style-based Quantum Generative Adversarial Network

Official Pytorch implementation of the paper "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV 2021

Metric learning algorithms in Python