ZeroGen: Efficient Zero-shot Learning via Dataset Generation

Overview

ZEROGEN

This repository contains the code for our paper “ZeroGen: Efficient Zero-shot Learning via Dataset Generation”. Our implementation is built on the source code from dino. Thanks for their work.

If you use this code, please cite our paper:

@article{ye2022zerogen,
      title={ZeroGen: Efficient Zero-shot Learning via Dataset Generation}, 
      author={Jiacheng Ye and Jiahui Gao and Qintong Li and Hang Xu and Jiangtao Feng and Zhiyong Wu and Tao Yu and Lingpeng Kong},
      year={2022},
      eprint={2202.07922},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Setup

All requirements for ZEROGEN can be found in requirements.txt. You can install all required packages in a new environment with pip install -r requirements.txt.

Usage

The scripts/run_cls.sh and scripts/run_qa.sh scripts contain the running commands for the following settings:

  • supervised learning with human annotations (SUPERVISED)
  • prompt-based zero-shot learning (PROMPTING)
  • efficient zero-shot learning via dataset generation (ZEROGEN)

For text classification (TC) tasks (e.g., SST-2 and IMDb) and natural language inference (NLI) tasks (e.g., QNLI and RTE), run with bash scripts/run_cls.sh. For question answering (QA) tasks, run with bash scripts/run_qa.sh

When generating X (i.e., denotes text in TC, hypothesis in NLI and question in QA) in the final stage of the scripts, we also train the small model and evaluate it on human annotations. Specifically, after generating log_every number of examples, we perform training on the synthetic dataset and evaluation on the gold validation set. This gives as a trend graph similar to Figure 2 in the paper, which is shown by wandb, a powerful toolkit to track experiments.

Before running, you need to reset the following parameters to yours:

  • home_dir: path to ZeroGen
  • gpu: gpu id
  • batch_size: the batch size for generating with PLM. For SST-2, it costs ~16G when using a batch size of 32 with gpt2-xl. While for SQuAD, it costs ~60G using the same batch size and PLM because of the longer contexts. So decrease the batch size if needed.
  • WANDB_PROJECT: project name, by default ZeroGen
  • WANDB_ENTITY: your wandb username
  • WANDB_API_KEY: your api-key

By default we use GPT2-XL as pre-trained language model (PLM) and DistilBERT as tiny-task model (TAM), to modify the size of PLM and TAM, you can change model_name and small_model_name in run_xxx.sh scripts.

Run with a synthesized dataset

After dataset generation, we save the synthetic dataset at:

  • For TC and NLI: out-${task_name}-x2/${dataset}/${task_name}-dataset.jsonl (e.g., out-sst-2-x2/gpt2-xl_topk0_topp0.9_sst-2-x2/sst-2-dataset.jsonl). The file is in json line format (e.g., {"C": "The Book of Mormon Musical", "X": "The Book of Mormon Musical brings all the drama and excitement of a real revival of the Broadway production to the big screen.", "Y": 0}).
  • For QA: out-${task_name}-x2/${dataset}. We save the dataset in huggingface Dataset format.

To run DistilBERT given a generated dataset, you can use the scripts/run_distilbert.sh script.

To run a LSTM-based model given a generated dataset, you can use the scripts/run_cls_lstm.sh script. Before that, you have to download the datasets from google drive link, which contain the standard test files.

Diversity and Correctness of a synthesized dataset

Divesity

We use Self-BLEU to measure the diversity of a synthesized dataset. To calculate the Self-BLEU for a given dataset, you can see the example in scripts/run_self_bleu.sh script.

Correctness

To calculate the Correctness, you can take the following steps:

  1. Replace the following parameters in scripts/run_distilbert.sh script with:

    • small_model_name=roberta-large
    • dataset=: empty means using standard training set
    • limit=: empty means using full standard training set

    This will give you a RoBERTa-Large trained with full human annotations, which can be used as an evaluator.

  2. Replace the following parameters in scripts/run_distilbert.sh script with:

    • small_model_ckpt=tmp/checkpoint-xxx: the final RoBERTa-Large checkpoint saved in step 1.
    • limit=10000: the number of samples to use, by default 10000
    • dataset=xxx: the name of synthetic dataset (e.g., gpt2-xl_topk0_topp0.9_sst-2-x2)
    • no_train=true: disable training

    Run the script, and you will get Metric on standard dataset and Metric on synthetic dataset, which represents the Correctness of standard dataset and synthetic dataset, respectively.

Resources

We provide some synthetic datasets and standard datasets for training LSTM in this google drive link. When training DistilBERT, the standard dataset is directly downloaded by huggingface Dataset package. Note we use the same prompt for IMDb/SST-2, and SQuAD/AdversarialQA, therefore the synthetic datasets are also the same.

Submission to Twitter's algorithmic bias bounty challenge

Twitter Ethics Challenge: Pixel Perfect Submission to Twitter's algorithmic bias bounty challenge, by Travis Hoppe (@metasemantic). Abstract We build

Travis Hoppe 4 Aug 19, 2022
This repository contains the implementation of the following paper: Cross-Descriptor Visual Localization and Mapping

Cross-Descriptor Visual Localization and Mapping This repository contains the implementation of the following paper: "Cross-Descriptor Visual Localiza

Mihai Dusmanu 81 Oct 06, 2022
Local trajectory planner based on a multilayer graph framework for autonomous race vehicles.

Graph-Based Local Trajectory Planner The graph-based local trajectory planner is python-based and comes with open interfaces as well as debug, visuali

TUM - Institute of Automotive Technology 160 Jan 04, 2023
Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System Authors: Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai

Amazon Web Services - Labs 123 Dec 23, 2022
Unofficial pytorch implementation of the paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution"

DFSA Unofficial pytorch implementation of the ICCV 2021 paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution" (p

2 Nov 15, 2021
Repository for GNSS-based position estimation using a Deep Neural Network

Code repository accompanying our work on 'Improving GNSS Positioning using Neural Network-based Corrections'. In this paper, we present a Deep Neural

32 Dec 13, 2022
Implementation of Uformer, Attention-based Unet, in Pytorch

Uformer - Pytorch Implementation of Uformer, Attention-based Unet, in Pytorch. It will only offer the concat-cross-skip connection. This repository wi

Phil Wang 72 Dec 19, 2022
PyTorch Live is an easy to use library of tools for creating on-device ML demos on Android and iOS.

PyTorch Live is an easy to use library of tools for creating on-device ML demos on Android and iOS. With Live, you can build a working mobile app ML demo in minutes.

559 Jan 01, 2023
Reinforcement Learning for Automated Trading

Reinforcement Learning for Automated Trading This thesis has been realized for the obtention of the Master's in Mathematical Engineering at the Polite

Pierpaolo Necchi 80 Jun 19, 2022
Air Pollution Prediction System using Linear Regression and ANN

AirPollution Pollution Weather Prediction System: Smart Outdoor Pollution Monitoring and Prediction for Healthy Breathing and Living Publication Link:

Dr Sharnil Pandya, Associate Professor, Symbiosis International University 19 Feb 07, 2022
Lightweight Salient Object Detection in Optical Remote Sensing Images via Feature Correlation

CorrNet This project provides the code and results for 'Lightweight Salient Object Detection in Optical Remote Sensing Images via Feature Correlation'

Gongyang Li 13 Nov 03, 2022
Datasets and source code for our paper Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach

Introduction Datasets and source code for our paper Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach Datasets: WebFG-496

21 Sep 30, 2022
PyBullet CartPole and Quadrotor environments—with CasADi symbolic a priori dynamics—for learning-based control and reinforcement learning

safe-control-gym Physics-based CartPole and Quadrotor Gym environments (using PyBullet) with symbolic a priori dynamics (using CasADi) for learning-ba

Dynamic Systems Lab 300 Dec 28, 2022
Neural Message Passing for Computer Vision

Neural Message Passing for Quantum Chemistry Implementation of different models of Neural Networks on graphs as explained in the article proposed by G

Pau Riba 310 Nov 07, 2022
Unofficial PyTorch implementation of Fastformer based on paper "Fastformer: Additive Attention Can Be All You Need"."

Fastformer-PyTorch Unofficial PyTorch implementation of Fastformer based on paper Fastformer: Additive Attention Can Be All You Need. Usage : import t

Hong-Jia Chen 126 Dec 06, 2022
magiCARP: Contrastive Authoring+Reviewing Pretraining

magiCARP: Contrastive Authoring+Reviewing Pretraining Welcome to the magiCARP API, the test bed used by EleutherAI for performing text/text bi-encoder

EleutherAI 43 Dec 29, 2022
Code for "Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation" ICCV'21

Skeletal-GNN Code for "Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation" ICCV'21 Various deep learning techniques have been propose

37 Oct 23, 2022
Header-only library for using Keras models in C++.

frugally-deep Use Keras models in C++ with ease Table of contents Introduction Usage Performance Requirements and Installation FAQ Introduction Would

Tobias Hermann 927 Jan 05, 2023
we propose EfficientDerain for high-efficiency single-image deraining

EfficientDerain we propose EfficientDerain for high-efficiency single-image deraining Requirements python 3.6 pytorch 1.6.0 opencv-python 4.4.0.44 sci

Qing Guo 126 Dec 07, 2022
Code for "NeRS: Neural Reflectance Surfaces for Sparse-View 3D Reconstruction in the Wild," in NeurIPS 2021

Code for Neural Reflectance Surfaces (NeRS) [arXiv] [Project Page] [Colab Demo] [Bibtex] This repo contains the code for NeRS: Neural Reflectance Surf

Jason Y. Zhang 234 Dec 30, 2022