CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

License: MIT

This repo contains the data and the source code of the baseline models for the NeurIPS 2021 benchmark paper on the Constrained Language Understanding Evaluation Standard (CLUES), released under the MIT License.

Overview

The benchmark data is located in the data directory. We also release source code for two fine-tuning strategies on CLUES: classic fine-tuning and prompt-based fine-tuning.

Classic finetuning

Setup Environment

  1. > git clone git@github.com:microsoft/CLUES.git
  2. > git clone git@github.com:namisan/mt-dnn.git
  3. > cp -rf CLUES/classic_finetuning/ mt-dnn/
  4. > cd mt-dnn/

Run Experiments

  1. Preprocess data
    > bash run_clues_data_process.sh

  2. Train/test Models
    > bash run_clues_batch.sh

Prompt fine-tuning

Setup

  1. cd prompt_finetuning
  2. Run sh setup.sh to automatically fetch the dependency codebase and apply our patch for CLUES

Run Experiments

All run commands for the prompt-based fine-tuning baselines are in experiments.sh; to reproduce them, simply run sh experiments.sh.

Leaderboard

Here we maintain a leaderboard, allowing researchers to submit their results as entries.

Submission Instructions

  • Each submission must be submitted as a pull request modifying the markdown file underlying the leaderboard.
  • The submission must attach an accompanying public paper and public source code for reproducing their results on our dataset.
  • A submission can be toward any subset of tasks in our benchmark, or toward the aggregate leaderboard.
  • For any task targeted by the submission, we require evaluation on (1) 10, 20, and 30 shots, and (2) all 5 splits of the corresponding dataset and a report of their mean and standard deviation.
  • Each leaderboard will be sorted by the 30-shot mean S1 score (where S1 score is a variant of F1 score defined in our paper).
  • The submission should not use data from the 4 other splits during few-shot finetuning of any 1 split, either as extra training set or as validation set for hyperparameter tuning.
  • However, we allow external data, labeled or unlabeled, to be used for such purposes. Each submission using external data must mark the corresponding columns "external labeled" and/or "external unlabeled". Note, in this context, "external data" refers to data used after pretraining (e.g., for task-specific tuning); in particular, methods using existing pretrained models only, without extra data, should not mark either column. For obvious reasons, models cannot be trained on the original labeled datasets from where we sampled the few-shot CLUES data.
  • In the table entry, the submission should include a method name and a citation, hyperlinking to their publicly released source code reproducing the results. See the last entry of the table below for an example.
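The reporting requirements above (10, 20, and 30 shots, all 5 splits, mean and standard deviation, sorted by the 30-shot mean) can be sketched as follows. This is a minimal illustration, not part of the released code: the function names, the dictionary layout, and the example scores are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the instructions do not say whether the population or sample formula is expected. The scores are assumed to be S1 scores computed elsewhere.

```python
from statistics import mean, pstdev

REQUIRED_SHOTS = (10, 20, 30)  # shot counts every submission must report
NUM_SPLITS = 5                 # number of splits per task in the benchmark


def summarize(scores_by_shot):
    """Map {shots: [one score per split]} to {shots: (mean, std)}."""
    summary = {}
    for shots in REQUIRED_SHOTS:
        scores = scores_by_shot.get(shots, [])
        if len(scores) != NUM_SPLITS:
            raise ValueError(
                f"need {NUM_SPLITS} split scores for {shots} shots, got {len(scores)}"
            )
        # Assumption: population standard deviation; swap in stdev() if the
        # sample formula is wanted.
        summary[shots] = (mean(scores), pstdev(scores))
    return summary


def format_cell(mu, sigma):
    """Render a table cell in the leaderboard's mean±std style."""
    return f"{mu:.1f}±{sigma:.1f}"


def sort_by_30_shot_mean(entries):
    """Order leaderboard entries by 30-shot mean, best first."""
    return sorted(entries, key=lambda e: e["summary"][30][0], reverse=True)


# Hypothetical per-split scores for two made-up methods on one task.
entries = [
    {"name": "method-a", "summary": summarize({
        10: [40.0, 42.0, 41.0, 39.0, 43.0],
        20: [45.0, 47.0, 46.0, 44.0, 48.0],
        30: [50.0, 52.0, 51.0, 49.0, 53.0],
    })},
    {"name": "method-b", "summary": summarize({
        10: [55.0, 57.0, 56.0, 54.0, 58.0],
        20: [60.0, 62.0, 61.0, 59.0, 63.0],
        30: [65.0, 67.0, 66.0, 64.0, 68.0],
    })},
]

for entry in sort_by_30_shot_mean(entries):
    cells = [format_cell(*entry["summary"][k]) for k in REQUIRED_SHOTS]
    print(entry["name"], *cells)
```

Raising an error when a shot count or split is missing mirrors the rule that every targeted task must be evaluated on all three shot settings and all 5 splits.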

Abbreviations

  • FT = (classic) finetuning
  • PT = prompt-based tuning
  • ICL = in-context learning, in the style of GPT-3
  • μ±σ = mean μ and standard deviation σ across our 5 splits. Aggregate standard deviation is calculated using the sum-of-variance formula from individual tasks' standard deviations.
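As a rough sketch of how an aggregate cell can be derived from per-task cells: the code below takes the arithmetic mean of the task means and the square root of the summed per-task variances. That reading of "sum-of-variance" is an assumption, but it reproduces the published T5-Large-770M-FT aggregate entry (43.1±6.7) from its six 30-shot task entries. The function name `aggregate` is illustrative only.

```python
import math


def aggregate(task_results):
    """Combine per-task (mean, std) pairs into one aggregate (mean, std).

    Assumption: the aggregate mean is the plain average of task means, and
    the aggregate std is sqrt(sum of per-task variances).
    """
    means = [mu for mu, _ in task_results]
    stds = [sigma for _, sigma in task_results]
    agg_mean = sum(means) / len(means)
    agg_std = math.sqrt(sum(sigma * sigma for sigma in stds))
    return agg_mean, agg_std


# 30-shot T5-Large-770M-FT results copied from the leaderboard:
# SST-2, MNLI, CoNLL03, WikiANN, SQuAD-v2, ReCoRD.
t5_large_ft = [
    (52.3, 2.9),
    (36.8, 3.8),
    (51.2, 0.1),
    (62.4, 0.6),
    (43.7, 2.7),
    (12.0, 3.8),
]

mu, sigma = aggregate(t5_large_ft)
print(f"{mu:.1f}±{sigma:.1f}")  # prints 43.1±6.7
```

Because the variances are summed rather than averaged, the aggregate standard deviation is always at least as large as the largest per-task standard deviation.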

Benchmarking CLUES for Aggregate 30-shot Evaluation

| Shots (K=30) | external labeled | external unlabeled | Average ▼ | SST-2 | MNLI | CoNLL03 | WikiANN | SQuAD-v2 | ReCoRD |
|---|---|---|---|---|---|---|---|---|---|
| Human | N | N | 81.4 | 83.7 | 69.4 | 87.4 | 82.6 | 73.5 | 91.9 |
| T5-Large-770M-FT | N | N | 43.1±6.7 | 52.3±2.9 | 36.8±3.8 | 51.2±0.1 | 62.4±0.6 | 43.7±2.7 | 12±3.8 |
| BERT-Large-336M-FT | N | N | 42.1±7.8 | 55.4±2.5 | 33.3±1.4 | 51.3±0 | 62.5±0.6 | 35.3±6.4 | 14.9±3.4 |
| BERT-Base-110M-FT | N | N | 41.5±9.2 | 53.6±5.5 | 35.4±3.2 | 51.3±0 | 62.8±0 | 32.6±5.8 | 13.1±3.3 |
| DeBERTa-Large-400M-FT | N | N | 40.1±17.8 | 47.7±9.0 | 26.7±11 | 48.2±2.9 | 58.3±6.2 | 38.7±7.4 | 21.1±3.6 |
| RoBERTa-Large-355M-FT | N | N | 40.0±10.6 | 53.2±5.6 | 34.0±1.1 | 44.7±2.6 | 48.4±6.7 | 43.5±4.4 | 16±2.8 |
| RoBERTa-Large-355M-PT | N | N | | 90.2±1.8 | 61.6±3.5 | | | | |
| DeBERTa-Large-400M-PT | N | N | | 88.4±3.3 | 62.9±3.1 | | | | |
| BERT-Large-336M-PT | N | N | | 82.7±4.1 | 45.3±2.0 | | | | |
| GPT3-175B-ICL | N | N | | 91.0±1.6 | 33.2±0.2 | | | | |
| BERT-Base-110M-PT | N | N | | 79.4±5.6 | 42.5±3.2 | | | | |
| LiST (Wang et al.) | N | Y | | 91.3±0.7 | 67.9±3.0 | | | | |
| Example (lastname et al.) | Y/N | Y/N | 0±0 | 0±0 | 0±0 | 0±0 | 0±0 | 0±0 | 0±0 |

Individual Task Performance over Multiple Shots

SST-2

| Shots (K) | external labeled | external unlabeled | 10 | 20 | 30 ▼ | All |
|---|---|---|---|---|---|---|
| GPT-3 (175B) ICL | N | N | 85.9±3.7 | 92.0±0.7 | 91.0±1.6 | - |
| RoBERTa-Large PT | N | N | 88.8±3.9 | 89.0±1.1 | 90.2±1.8 | 93.8 |
| DeBERTa-Large PT | N | N | 83.4±5.3 | 87.8±3.5 | 88.4±3.3 | 91.9 |
| Human | N | N | 79.8 | 83 | 83.7 | - |
| BERT-Large PT | N | N | 63.2±11.3 | 78.2±9.9 | 82.7±4.1 | 91 |
| BERT-Base PT | N | N | 63.9±10.0 | 76.7±6.6 | 79.4±5.6 | 91.9 |
| BERT-Large FT | N | N | 46.3±5.5 | 55.5±3.4 | 55.4±2.5 | 99.1 |
| BERT-Base FT | N | N | 46.2±5.6 | 54.0±2.8 | 53.6±5.5 | 98.1 |
| RoBERTa-Large FT | N | N | 38.4±21.7 | 52.3±5.6 | 53.2±5.6 | 98.6 |
| T5-Large FT | N | N | 51.2±1.8 | 53.4±3.2 | 52.3±2.9 | 97.6 |
| DeBERTa-Large FT | N | N | 43.0±11.9 | 40.8±22.6 | 47.7±9.0 | 100 |
| Example (lastname et al.) | Y/N | Y/N | 0±0 | 0±0 | 0±0 | - |

MNLI

| Shots (K) | external labeled | external unlabeled | 10 | 20 | 30 ▼ | All |
|---|---|---|---|---|---|---|
| Human | N | N | 78.1 | 78.6 | 69.4 | - |
| LiST (Wang et al.) | N | Y | 60.5±8.3 | 67.2±4.5 | 67.9±3.0 | - |
| DeBERTa-Large PT | N | N | 44.5±8.2 | 60.7±5.3 | 62.9±3.1 | 88.1 |
| RoBERTa-Large PT | N | N | 57.7±3.6 | 58.6±2.9 | 61.6±3.5 | 87.1 |
| BERT-Large PT | N | N | 41.7±1.0 | 43.7±2.1 | 45.3±2.0 | 81.9 |
| BERT-Base PT | N | N | 40.4±1.8 | 42.1±4.4 | 42.5±3.2 | 81 |
| T5-Large FT | N | N | 39.8±3.3 | 37.9±4.3 | 36.8±3.8 | 85.9 |
| BERT-Base FT | N | N | 37.0±5.2 | 35.2±2.7 | 35.4±3.2 | 81.6 |
| RoBERTa-Large FT | N | N | 34.3±2.8 | 33.4±0.9 | 34.0±1.1 | 85.5 |
| BERT-Large FT | N | N | 33.7±0.4 | 28.2±14.8 | 33.3±1.4 | 80.9 |
| GPT-3 (175B) ICL | N | N | 33.5±0.7 | 33.1±0.3 | 33.2±0.2 | - |
| DeBERTa-Large FT | N | N | 27.4±14.1 | 33.6±2.5 | 26.7±11.0 | 87.6 |

CoNLL03

| Shots (K) | external labeled | external unlabeled | 10 | 20 | 30 ▼ | All |
|---|---|---|---|---|---|---|
| Human | N | N | 87.7 | 89.7 | 87.4 | - |
| BERT-Base FT | N | N | 51.3±0 | 51.3±0 | 51.3±0 | - |
| BERT-Large FT | N | N | 51.3±0 | 51.3±0 | 51.3±0 | 89.3 |
| T5-Large FT | N | N | 46.3±6.9 | 50.0±0.7 | 51.2±0.1 | 92.2 |
| DeBERTa-Large FT | N | N | 50.1±1.2 | 47.8±2.5 | 48.2±2.9 | 93.6 |
| RoBERTa-Large FT | N | N | 50.8±0.5 | 44.6±5.1 | 44.7±2.6 | 93.2 |

WikiANN

| Shots (K) | external labeled | external unlabeled | 10 | 20 | 30 ▼ | All |
|---|---|---|---|---|---|---|
| Human | N | N | 81.4 | 83.5 | 82.6 | - |
| BERT-Base FT | N | N | 62.8±0 | 62.8±0 | 62.8±0 | 88.8 |
| BERT-Large FT | N | N | 62.8±0 | 62.6±0.4 | 62.5±0.6 | 91 |
| T5-Large FT | N | N | 61.7±0.7 | 62.1±0.2 | 62.4±0.6 | 87.4 |
| DeBERTa-Large FT | N | N | 58.5±3.3 | 57.9±5.8 | 58.3±6.2 | 91.1 |
| RoBERTa-Large FT | N | N | 58.5±8.8 | 56.9±3.4 | 48.4±6.7 | 91.2 |

SQuAD v2

| Shots (K) | external labeled | external unlabeled | 10 | 20 | 30 ▼ | All |
|---|---|---|---|---|---|---|
| Human | N | N | 71.9 | 76.4 | 73.5 | - |
| T5-Large FT | N | N | 43.6±3.5 | 28.7±13.0 | 43.7±2.7 | 87.2 |
| RoBERTa-Large FT | N | N | 38.1±7.2 | 40.1±6.4 | 43.5±4.4 | 89.4 |
| DeBERTa-Large FT | N | N | 41.4±7.3 | 44.4±4.5 | 38.7±7.4 | 90 |
| BERT-Large FT | N | N | 42.3±5.6 | 35.8±9.7 | 35.3±6.4 | 81.8 |
| BERT-Base FT | N | N | 46.0±2.4 | 34.9±9.0 | 32.6±5.8 | 76.3 |

ReCoRD

| Shots (K) | external labeled | external unlabeled | 10 | 20 | 30 ▼ | All |
|---|---|---|---|---|---|---|
| Human | N | N | 94.1 | 94.2 | 91.9 | - |
| DeBERTa-Large FT | N | N | 15.7±5.0 | 16.8±5.7 | 21.1±3.6 | 80.7 |
| RoBERTa-Large FT | N | N | 12.0±1.9 | 9.9±6.2 | 16.0±2.8 | 80.3 |
| BERT-Large FT | N | N | 9.9±5.2 | 11.8±4.9 | 14.9±3.4 | 66 |
| BERT-Base FT | N | N | 10.3±1.8 | 11.7±2.4 | 13.1±3.3 | 54.4 |
| T5-Large FT | N | N | 11.9±2.7 | 11.7±1.5 | 12.0±3.8 | 77.3 |

How do I cite CLUES?

@article{cluesteam2021,
  title={Few-Shot Learning Evaluation in Natural Language Understanding},
  author={Mukherjee, Subhabrata and Liu, Xiaodong and Zheng, Guoqing and Hosseini, Saghar and Cheng, Hao and Yang, Greg and Meek, Christopher and Awadallah, Ahmed Hassan and Gao, Jianfeng},
  year={2021}
}

Acknowledgments

MT-DNN: https://github.com/namisan/mt-dnn
LM-BFF: https://github.com/princeton-nlp/LM-BFF

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
