Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Last update: Dec 03, 2022

Related tags

Overview

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Introduction

We propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously drives progress in language generation tasks and their evaluation. We accept two types of submissions:

Generator developers submit output text. A Billboard computes all metric scores.
Metric developers submit an executable program. A Billboard computes correlations with the human judgments, updates the ensemble metric, and measures how much it overrates machine over human generations.

Anonymous submissions are allowed!!

Submit

Submission guides and examples are available here.

Scoring Results

Scoring results for all past public submissions are available here. We have generator-name||metric-name.csv files from the Cartesian product between the generators and metrics: each contains instance-level scores.

Citations

Bidimesional Leaderboards

@misc{kasai2021billboard,
    title   = {Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Ronan Le Bras and Lavinia Dunagan and Jacob Morrison and Alexander R. Fabbri and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2112.04139}, 
}

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

@misc{kasai2021thumb,
    title   = {Transparent Human Evaluation for Image Captioning},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Lavinia Dunagan and Jacob Morrison and Ronan Le Bras and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2111.08940}, 
}

CNNDM Summarization Evaluations

@article{fabbri2021summeval,
    title   = {{SummEval}: Re-evaluating Summarization Evaluation},
    author  = {Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
    journal = {TACL},
    year    = {2021},
    url     = {https://arxiv.org/abs/2007.12626},
}

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

@misc{freitag2021experts,
      title={Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation}, 
      author={Markus Freitag and George Foster and David Grangier and Viresh Ratnakar and Qijun Tan and Wolfgang Macherey},
      year={2021},
      url={https://arxiv.org/abs/2104.14478},
}

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Related tags

Overview

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Introduction

Submit

Scoring Results

Citations

Bidimesional Leaderboards

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

CNNDM Summarization Evaluations

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

Owner

AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation

Why Are You Weird? Infusing Interpretability in Isolation Forest for Anomaly Detection

Top #1 Submission code for the first https://alphamev.ai MEV competition with best AUC (0.9893) and MSE (0.0982).

Advancing Self-supervised Monocular Depth Learning with Sparse LiDAR

A deep learning network built with TensorFlow and Keras to classify gender and estimate age.

Mesh TensorFlow: Model Parallelism Made Easier

Image classification for projects and researches

👨‍💻 run nanosaur in simulation with Gazebo/Ingnition

QICK: Quantum Instrumentation Control Kit

Learning Continuous Image Representation with Local Implicit Image Function

Multi-Modal Machine Learning toolkit based on PaddlePaddle.

Product-based-recommendation-system - A product based recommendation system which uses Machine learning algorithm such as KNN and cosine similarity

A Fast Sequence Transducer Implementation with PyTorch Bindings

Vehicle Detection Using Deep Learning and YOLO Algorithm

Segmentation models with pretrained backbones. PyTorch.

Official implementation of the NeurIPS'21 paper 'Conditional Generation Using Polynomial Expansions'.

GPU-Accelerated Deep Learning Library in Python

NIMA: Neural IMage Assessment

MlTr: Multi-label Classification with Transformer

The official implementation of NeurIPS 2021 paper: Finding Optimal Tangent Points for Reducing Distortions of Hard-label Attacks