Source code for the paper: Variance-Aware Machine Translation Test Sets (NeurIPS 2021 Datasets and Benchmarks Track)

Overview

Variance-Aware-MT-Test-Sets

Variance-Aware Machine Translation Test Sets

License

See LICENSE. We follow the data licensing plan as the same as the WMT benchmark.

VAT Data

We release 70 lightweight and discriminative test sets for machine translation evaluation, covering 35 translation directions from WMT16 to WMT20 competitions. See VAT_data folder for detailed information.

For each translation direction of a specific year, both source and reference are provided for different types of evaluation metrics. For example,

VAT_data/
├── wmt20
    ├── ...
    ├── vat_newstest2020-zhen-ref.en.txt
    └── vat_newstest2020-zhen-src.zh.txt

Meta-Information of VAT

We also provide the meta-inforamtion of reserved data. Each json file contains the IDs of retained data in the original test set. For instance, file wmt20/bert-r_filter-std60.json describes:

{
	...
	"en-de": [4, 6, 10, 13, 14, 15, ...],
	"de-en": [0, 3, 4, 5, 7, 9, ...],
	...
}

Reproduce & Create VAT

The reported results in the paper were produced by single NVIDIA GeForce 1080Ti card.

We will keep updating the code and related documentation after the response.

Requirements

  • sacreBLEU version >= 1.4.14
  • BLEURT version >= 0.0.2
  • COMET version >= 0.1.0
  • BERTScore version >= 0.3.7 (hug_trans==4.2.1)
  • PyTorch version >= 1.5.1
  • Python version >= 3.8
  • CUDA & cudatoolkit >= 10.1

Note: the minimal version is for reproducing the results

Pipeline

  1. Use score_xxx.py to generate the CSV files that stores the sentence-level scores evaluated by the corresponding metrics. For example, evaluating all the WMT20 submissions of all the language pairs using BERTScore:
    CUDA_VISIBLE_DEVICES=0 python score_bert.py -b 128 -s -r dummy -c dummy --rescale_with_baseline \
    	--hypos-dir ${WMT_DATA_PATH}/system-outputs \
    	--refs-dir ${WMT_DATA_PATH}/references \
    	--scores-dir ${WMT_DATA_PATH}/results/system-level/scores_ALL \
    	--testset-name newstest2020 --score-dump wmt20-bertscore.csv
    (Alternative Option) You can use your implementation for dumping the scores given by the metrics. But the CSV header should contain:
    ,TESTSET,LP,ID,METRIC,SYS,SCORE
    
  2. Use cal_filtering.py to filter the test set based on the score warehouse calculated in the last step. For example,
    python cal_filtering.py --score-dump wmt20-bertscore.csv --output VAT_meta/wmt20-test/ --filter-per 60
    It will produce the json files which contain the IDs of reserved sentences.

Statistics of VAT (References)

Benchmark Translation Direction # Sentences # Words # Vocabulary
wmt20 km-en 928 17170 3645
wmt20 cs-en 266 12568 3502
wmt20 en-de 567 21336 5945
wmt20 ja-en 397 10526 3063
wmt20 ps-en 1088 20296 4303
wmt20 en-zh 567 18224 5019
wmt20 en-ta 400 7809 4028
wmt20 de-en 314 16083 4046
wmt20 zh-en 800 35132 6457
wmt20 en-ja 400 12718 2969
wmt20 en-cs 567 16579 6391
wmt20 en-pl 400 8423 3834
wmt20 en-ru 801 17446 6877
wmt20 pl-en 400 7394 2399
wmt20 iu-en 1188 23494 3876
wmt20 ru-en 396 6966 2330
wmt20 ta-en 399 7427 2148
wmt19 zh-en 800 36739 6168
wmt19 en-cs 799 15433 6111
wmt19 de-en 800 15219 4222
wmt19 en-gu 399 8494 3548
wmt19 fr-de 680 12616 3698
wmt19 en-zh 799 20230 5547
wmt19 fi-en 798 13759 3555
wmt19 en-fi 799 13303 6149
wmt19 kk-en 400 9283 2584
wmt19 de-cs 799 15080 6166
wmt19 lt-en 400 10474 2874
wmt19 en-lt 399 7251 3364
wmt19 ru-en 800 14693 3817
wmt19 en-kk 399 6411 3252
wmt19 en-ru 799 16393 6125
wmt19 gu-en 406 8061 2434
wmt19 de-fr 680 16181 3517
wmt19 en-de 799 18946 5340
wmt18 en-cs 1193 19552 7926
wmt18 cs-en 1193 23439 5453
wmt18 en-fi 1200 16239 7696
wmt18 en-tr 1200 19621 8613
wmt18 en-et 800 13034 6001
wmt18 ru-en 1200 26747 6045
wmt18 et-en 800 20045 5045
wmt18 tr-en 1200 25689 5955
wmt18 fi-en 1200 24912 5834
wmt18 zh-en 1592 42983 7985
wmt18 en-zh 1592 34796 8579
wmt18 en-ru 1200 22830 8679
wmt18 de-en 1199 28275 6487
wmt18 en-de 1199 25473 7130
wmt17 en-lv 800 14453 6161
wmt17 zh-en 800 20590 5149
wmt17 en-tr 1203 17612 7714
wmt17 lv-en 800 18653 4747
wmt17 en-de 1202 22055 6463
wmt17 ru-en 1200 24807 5790
wmt17 en-fi 1201 17284 7763
wmt17 tr-en 1203 23037 5387
wmt17 en-zh 800 18001 5629
wmt17 en-ru 1200 22251 8761
wmt17 fi-en 1201 23791 5300
wmt17 en-cs 1202 21278 8256
wmt17 de-en 1202 23838 5487
wmt17 cs-en 1202 22707 5310
wmt16 tr-en 1200 19225 4823
wmt16 ru-en 1199 23010 5442
wmt16 ro-en 800 16200 3968
wmt16 de-en 1200 22612 5511
wmt16 en-ru 1199 20233 7872
wmt16 fi-en 1200 20744 5176
wmt16 cs-en 1200 23235 5324
Owner
NLP2CT Lab, University of Macau
Natural Language Processing & Portuguese - Chinese Machine Translation Laboratory
NLP2CT Lab, University of Macau
Haze Removal can remove slight to extreme cases of haze affecting an image

Haze Removal can remove slight to extreme cases of haze affecting an image. Its most typical use is for landscape photography where the haze causes low contrast and low saturation, but it can also be

Grace Ugochi Nneji 3 Feb 15, 2022
A pytorch reprelication of the model-based reinforcement learning algorithm MBPO

Overview This is a re-implementation of the model-based RL algorithm MBPO in pytorch as described in the following paper: When to Trust Your Model: Mo

Xingyu Lin 93 Jan 05, 2023
A quantum game modeling of pandemic (QHack 2022)

Contributors: @JongheumJung, @YoonjaeChung, @GyunghunKim Abstract In the regime of a global pandemic, leaders around the world need to consider variou

Yoonjae Chung 8 Apr 03, 2022
Voice assistant - Voice assistant with python

🌐 Python Voice Assistant 🌵 - User's greeting 🌵 - Writing tasks to todo-list ?

PythonToday 10 Dec 26, 2022
Clustering is a popular approach to detect patterns in unlabeled data

Visual Clustering Clustering is a popular approach to detect patterns in unlabeled data. Existing clustering methods typically treat samples in a data

Tarek Naous 24 Nov 11, 2022
Lightweight stereo matching network based on MobileNetV1 and MobileNetV2

MobileStereoNet: Towards Lightweight Deep Networks for Stereo Matching

Cognitive Systems Research Group 139 Nov 30, 2022
The hippynn python package - a modular library for atomistic machine learning with pytorch.

The hippynn python package - a modular library for atomistic machine learning with pytorch. We aim to provide a powerful library for the training of a

Los Alamos National Laboratory 37 Dec 29, 2022
Drone Task1 - Drone Task1 With Python

Drone_Task1 Matching Results 3.mp4 1.mp4

MLV Lab (Machine Learning and Vision Lab at Korea University) 11 Nov 14, 2022
This repo includes the CUB-GHA (Gaze-based Human Attention) dataset and code of the paper "Human Attention in Fine-grained Classification".

HA-in-Fine-Grained-Classification This repo includes the CUB-GHA (Gaze-based Human Attention) dataset and code of the paper "Human Attention in Fine-g

16 Oct 29, 2022
UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning

UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning This is the official PyTorch implementation for UniMoCo pape

dddzg 49 Jan 02, 2023
A more easy-to-use implementation of KPConv based on PyTorch.

A more easy-to-use implementation of KPConv This repo contains a more easy-to-use implementation of KPConv based on PyTorch. Introduction KPConv is a

Zheng Qin 36 Dec 29, 2022
A series of Jupyter notebooks with Chinese comment that walk you through the fundamentals of Machine Learning and Deep Learning in python using Scikit-Learn and TensorFlow.

Hands-on-Machine-Learning 目的 这份笔记旨在帮助中文学习者以一种较快较系统的方式入门机器学习, 是在学习Hands-on Machine Learning with Scikit-Learn and TensorFlow这本书的 时候做的个人笔记: 此项目的可取之处 原书的

Baymax 1.5k Dec 21, 2022
A repo to show how to use custom dataset to train s2anet, and change backbone to resnext101

A repo to show how to use custom dataset to train s2anet, and change backbone to resnext101

jedibobo 3 Dec 28, 2022
Learn about Spice.ai with in-depth samples

Samples Learn about Spice.ai with in-depth samples ServerOps - Learn when to run server maintainance during periods of low load Gardener - Intelligent

Spice.ai 16 Mar 23, 2022
Pytorch implementation of PCT: Point Cloud Transformer

PCT: Point Cloud Transformer This is a Pytorch implementation of PCT: Point Cloud Transformer.

Yi_Zhang 265 Dec 22, 2022
A simple rest api serving a deep learning model that classifies human gender based on their faces. (vgg16 transfare learning)

this is a simple rest api serving a deep learning model that classifies human gender based on their faces. (vgg16 transfare learning)

crispengari 5 Dec 09, 2021
Course content and resources for the AIAIART course.

AIAIART course This repo will house the notebooks used for the AIAIART course. Part 1 (first four lessons) ran via Discord in September/October 2021.

Jonathan Whitaker 492 Jan 06, 2023
An Api for Emotion recognition.

PLAYEMO Playemo was built from the ground-up with Flask, a python tool that makes it easy for developers to build APIs. Use Cases Is Python your langu

greek geek 2 Jul 16, 2022
Delta Conformity Sociopatterns Analysis - Delta Conformity Sociopatterns Analysis

Delta_Conformity_Sociopatterns_Analysis ∆-Conformity is a local homophily measur

2 Jan 09, 2022
Benchmark spaces - Benchmarks of how well different two dimensional spaces work for clustering algorithms

benchmark_spaces Benchmarks of how well different two dimensional spaces work fo

Bram Cohen 6 May 07, 2022