Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation.

Understanding Minimum Bayes Risk Decoding

This repo provides code and documentation for the following paper:

Müller and Sennrich (2021): Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation.

@inproceedings{muller2021understanding,
      title = {Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation},
      author = {M{\"u}ller, Mathias and Sennrich, Rico},
      year = {2021},
      eprint = {2105.08504},
      booktitle = {Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)}
}

Basic Setup

Clone this repo in the desired place:

git clone https://github.com/ZurichNLP/understanding-mbr
cd understanding-mbr

then proceed to install software before running any experiments.

Install required software

Create a new virtualenv that uses Python 3. Please make sure to run this command outside of any virtual Python environment:

./scripts/create_venv.sh

Important: afterwards, activate the environment by executing the source command that the shell script above prints.

Download and install required software:

./scripts/download.sh

The download script makes several important assumptions, for example that your OS is Linux, that you have CUDA 10.2 installed, that you have access to a GPU for training and translation, and that your folder for temp files is /var/tmp. Edit the script to fit your needs before running it.
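The assumptions listed above can be checked before editing or running the download script. The following is only a minimal sketch and is not part of this repository: the function names (check_os, check_tmp, check_gpu, check_cuda, report) are hypothetical, and the CUDA check assumes that nvcc is on the PATH and prints a line containing "release 10.2".

```bash
#!/bin/bash
# Hypothetical pre-flight check; not part of the repository.

check_os() {
    # The download script assumes Linux.
    [ "$(uname -s)" = "Linux" ]
}

check_tmp() {
    # The download script assumes temp files go to /var/tmp.
    local dir="${1:-/var/tmp}"
    [ -d "$dir" ] && [ -w "$dir" ]
}

check_gpu() {
    # nvidia-smi is only available when an NVIDIA driver is installed.
    command -v nvidia-smi >/dev/null 2>&1
}

check_cuda() {
    # Assumption: nvcc reports its version as "release 10.2".
    command -v nvcc >/dev/null 2>&1 && nvcc --version | grep -q "release 10.2"
}

report() {
    # Print the outcome of one named check without aborting the script.
    local name="$1"
    shift
    if "$@"; then
        echo "ok: $name"
    else
        echo "MISSING: $name"
    fi
}

report "Linux" check_os
report "writable /var/tmp" check_tmp /var/tmp
report "GPU driver" check_gpu
report "CUDA 10.2" check_cuda
```

Any line that starts with MISSING points at a part of the download script that probably needs to be adapted.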

Running experiments in general

Definition of "run"

We define a "run" as one complete experiment, in the sense that a run executes a pipeline of steps. Every run is completely self-contained: it does everything from downloading the data to evaluating a trained model.

The series of steps executed in a run is defined in

scripts/tatoeba/run_tatoeba_generic.sh

This script is generic and will never be called on its own (many variables would be undefined), but all our scripts eventually call this script.

SLURM jobs

Individual steps in runs are submitted to a SLURM system. The generic run script:

scripts/tatoeba/run_tatoeba_generic.sh

will submit each individual step (such as translation, or model training) as a separate SLURM job. Depending on the nature of the task, the script submits to a different cluster, or asks for different resources.
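To illustrate the idea of one SLURM job per step with step-specific resources, here is a minimal sketch. It is not taken from the generic run script: the step names, the resource values and the steps/ paths are hypothetical. Only the sbatch options themselves (--parsable, --gres, --cpus-per-task, --time) are standard SLURM options. The sketch prints the commands instead of submitting them, so it can be tried without a cluster.

```bash
#!/bin/bash
# Hypothetical sketch; step names, resources and paths are made up.

# Map a step name to the sbatch resource flags it would request.
resources_for() {
    case "$1" in
        train|translate) echo "--gres=gpu:1 --time=24:00:00" ;;
        *)               echo "--cpus-per-task=1 --time=01:00:00" ;;
    esac
}

# Print the sbatch command for one step instead of running it.
submit_cmd() {
    local step="$1"
    echo "sbatch --parsable $(resources_for "$step") steps/${step}.sh"
}

for step in download preprocess train translate evaluate; do
    submit_cmd "$step"
done
```

If your cluster uses different partitions or resource names, a mapping like resources_for is the kind of place where such differences would have to be encoded.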

IMPORTANT: if

  • you do not work on a cluster that uses SLURM for job management,
  • your cluster layout, resource naming etc. is different

you absolutely need to modify or replace the generic script scripts/tatoeba/run_tatoeba_generic.sh before running anything. If you do not use SLURM at all, it might be possible to just replace calls to scripts/tatoeba/run_tatoeba_generic.sh with scripts/tatoeba/run_tatoeba_generic_no_slurm.sh.

scripts/tatoeba/run_tatoeba_generic_no_slurm.sh is a script we provide for convenience, but we have not tested it ourselves. We cannot guarantee that it runs without error.

Dry run

Before you run actual experiments, it can be useful to perform a dry run. Dry runs attempt to run all commands, create all files etc. but are finished within minutes and use CPU only. Dry runs help to catch some bugs (such as file permissions) early.

To dry-run a baseline system for the language pair DAN-EPO, run:

./scripts/tatoeba/dry_run_baseline.sh

Single (non-dry!) example run

To run the entire pipeline (downloading data until evaluation of trained model) for a single language pair from Tatoeba, run

./scripts/tatoeba/run_baseline.sh

This will train a model for the language pair DAN-EPO, but also execute all steps before and after model training.

Start a certain group of runs

It is possible to submit several runs at the same time, using the same shell script. For instance, to run all required steps for a number of medium-resource language pairs, run

./scripts/tatoeba/run_mediums.sh
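Conceptually, a script that starts a group of runs loops over language pairs and configures one run per pair. The sketch below only illustrates that pattern. The function configure_run, the variable names src and trg, and the chosen language pairs are assumptions and do not reflect the actual contents of run_mediums.sh.

```bash
#!/bin/bash
# Hypothetical wrapper sketch; variable names and pairs are illustrative only.

# Print the settings one run would use for a given language pair.
configure_run() {
    local src="$1" trg="$2"
    echo "src=$src trg=$trg"
}

# One run per language pair. A real wrapper would export these settings
# and then call the generic run script instead of echoing them.
for pair in dan-epo ara-deu eng-mar; do
    configure_run "${pair%-*}" "${pair#*-}"
done
```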

Recovering partial runs

Steps within a run pipeline depend on each other (in most cases through a SLURM afterok dependency, i.e. sbatch --dependency=afterok:<job id>). This means that if a job X fails, subsequent jobs that depend on X will never start. Completed steps exit immediately when they are re-run, so you can always re-run an entire pipeline if any step fails.
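The two mechanisms described here, job dependencies and steps that skip themselves once completed, can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the helper names run_step and dependency_flag and the use of ".done" marker files are hypothetical. The option --dependency=afterok:<job id> is standard sbatch syntax.

```bash
#!/bin/bash
# Hypothetical sketch; marker files and helper names are assumptions.

# Run a step only if its marker file does not exist yet.
run_step() {
    local name="$1" workdir="$2"
    shift 2
    local marker="$workdir/$name.done"
    if [ -e "$marker" ]; then
        echo "skip $name"
        return 0
    fi
    # The marker is written only if the step command succeeds.
    "$@" && touch "$marker" && echo "done $name"
}

# Build the sbatch dependency flag for a job that must wait for an
# earlier job; prints nothing when there is no earlier job.
dependency_flag() {
    local prev_id="$1"
    if [ -n "$prev_id" ]; then
        echo "--dependency=afterok:$prev_id"
    fi
}

workdir="$(mktemp -d)"
run_step preprocess "$workdir" true   # prints "done preprocess"
run_step preprocess "$workdir" true   # prints "skip preprocess"
dependency_flag 12345                 # prints "--dependency=afterok:12345"
```

With a guard of this kind, re-submitting a whole pipeline after a failure only repeats the steps that have not finished successfully.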

Reproducing the results presented in our paper in particular

Training and evaluating the models

To create all models and statistics necessary to compare MBR with different utility functions:

scripts/tatoeba/run_compare_risk_functions.sh

To reproduce experiments on domain robustness:

scripts/tatoeba/run_robustness_data.sh

To reproduce experiments on copy noise in the training data:

scripts/tatoeba/run_copy_noise.sh

Creating visualizations and result tables

To reproduce exactly the tables and figures we show in the paper, use our Google Colab here:

https://colab.research.google.com/drive/1GYZvxRB1aebOThGllgb0teY8A4suH5j-?usp=sharing

This is possible only because we have hosted the results of our experiments on our servers and Colab can retrieve files from there.

Browse MBR samples

We also provide examples for pools of MBR samples for your perusal, as HTML files that can be viewed in any browser. The example HTML files are created by running the following script:

./scripts/tatoeba/local_html.sh

and are available at the following URLs (Markdown does not support clickable links, sorry!):

Domain robustness

| language pair | domain test set | link |
| --- | --- | --- |
| DEU-ENG | it | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/deu-eng.domain_robustness.it.html |
| DEU-ENG | koran | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/deu-eng.domain_robustness.koran.html |
| DEU-ENG | law | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/deu-eng.domain_robustness.law.html |
| DEU-ENG | medical | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/deu-eng.domain_robustness.medical.html |
| DEU-ENG | subtitles | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/deu-eng.domain_robustness.subtitles.html |

Copy noise in training data

| language pair | amount of copy noise | link |
| --- | --- | --- |
| ARA-DEU | 0.001 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/ara-deu.copy_noise.0.001.slice-test.html |
| ARA-DEU | 0.005 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/ara-deu.copy_noise.0.005.slice-test.html |
| ARA-DEU | 0.01 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/ara-deu.copy_noise.0.01.slice-test.html |
| ARA-DEU | 0.05 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/ara-deu.copy_noise.0.05.slice-test.html |
| ARA-DEU | 0.075 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/ara-deu.copy_noise.0.075.slice-test.html |
| ARA-DEU | 0.1 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/ara-deu.copy_noise.0.1.slice-test.html |
| ARA-DEU | 0.25 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/ara-deu.copy_noise.0.25.slice-test.html |
| ARA-DEU | 0.5 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/ara-deu.copy_noise.0.5.slice-test.html |

| language pair | amount of copy noise | link |
| --- | --- | --- |
| ENG-MAR | 0.001 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/eng-mar.copy_noise.0.001.slice-test.html |
| ENG-MAR | 0.005 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/eng-mar.copy_noise.0.005.slice-test.html |
| ENG-MAR | 0.01 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/eng-mar.copy_noise.0.01.slice-test.html |
| ENG-MAR | 0.05 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/eng-mar.copy_noise.0.05.slice-test.html |
| ENG-MAR | 0.075 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/eng-mar.copy_noise.0.075.slice-test.html |
| ENG-MAR | 0.1 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/eng-mar.copy_noise.0.1.slice-test.html |
| ENG-MAR | 0.25 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/eng-mar.copy_noise.0.25.slice-test.html |
| ENG-MAR | 0.5 | https://files.ifi.uzh.ch/cl/archiv/2020/clcontra/eng-mar.copy_noise.0.5.slice-test.html |