Repository for the paper titled: "When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer"

Overview

When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer

This repository contains code for our paper titled "When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer". [arXiv]

Table of contents

  1. Paper in a nutshell
  2. Installation
  3. Data and models
  4. Repository usage
  5. Links to experiments and results
  6. Citation

Paper in a nutshell

While recent work on multilingual language models has demonstrated their capacity for cross-lingual zero-shot transfer on downstream tasks, there is a lack of consensus in the community as to what shared properties between languages enable such transfer. Analyses involving pairs of natural languages are often inconclusive and contradictory since languages simultaneously differ in many linguistic aspects. In this paper, we perform a large-scale empirical study to isolate the effects of various linguistic properties by measuring zero-shot transfer between four diverse natural languages and their counterparts constructed by modifying aspects such as the script, word order, and syntax. Among other things, our experiments show that the absence of sub-word overlap significantly affects zero-shot transfer when languages differ in their word order, and there is a strong correlation between transfer performance and word embedding alignment between languages (e.g., Spearman's R=0.94 on the task of NLI). Our results call for focus in multilingual models on explicitly improving word embedding alignment between languages rather than relying on its implicit emergence.

Installation instructions

  1. Step 1: Install from the conda .yml file.
conda env create -f installation/multilingual.yml
  1. Step 2: Install transformers in an editable way.
pip install -e transformers/
pip install -r transformers/examples/language-modeling/requirements.txt
pip install -r transformers/examples/token-classification/requirements.txt

Repository usage

For the commands we used to get the reported numbers in the paper, click here. This file contains common instructions used. This file can automatically generate commands for your use case.

Bilingual pre-training

  1. For bilingual pre-training on original and derived language pairs, use the flag --invert_word_order for the Inversion transformation, --permute_words for Permutation and --one_to_one_mapping for Transliteration. Example command for bilingual pre-training for English with Inversion transformation to create the derived language pair.
nohup  python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm_synthetic.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps -1 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_type roberta --config_name config/en/roberta_8/config.json --tokenizer_name config/en/roberta_8/ --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/train.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/inverted_order_500K/mlm --run_name inverted_en_500K_mlm --invert_word_order --word_modification add &
  1. For Syntax transformations, the train file used in the following command ([email protected][email protected]) means that it is the concatenation of French corpus with French modified to English verb and noun order ([email protected][email protected]).
nohup python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm_synthetic.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps -1 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_type roberta --config_name config/fr/roberta_8/config.json --tokenizer_name config/fr/roberta_8/ --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/fr/synthetic/[email protected][email protected] --validation_file ../../bucket/pretrain_data/fr/synthetic/[email protected][email protected] --output_dir ../../bucket/model_outputs/fr/syntax_modif_en/mlm --run_name fr_syntax_modif_en_500K_mlm &
  1. For composed transformations, apply multiple transformations by using multiple flags, e.g., --one_to_one_mapping --invert_word_order.
nohup python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm_synthetic.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps -1 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_type roberta --config_name config/en/roberta_8/config.json --tokenizer_name config/en/roberta_8/ --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/train.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/one_to_one_inverted/mlm --run_name en_one_to_one_inverted --one_to_one_mapping --invert_word_order --word_modification add &
  1. Using different domains for the original and derived language.
nohup python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm_synthetic_transitive.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps -1 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_type roberta --config_name config/en/roberta_8/config.json --tokenizer_name config/en/roberta_8/ --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/train_split_1.txt --transitive_file ../../bucket/pretrain_data/en/train_split_2.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/one_to_one_diff_source_100_more_steps/mlm --run_name en_one_to_one_diff_source_100_more_steps --one_to_one_mapping --word_modification add &

Fine-tuning and evaluation

This directory contains scripts used for downstream fine-tuning and evaluation.

  1. Transliteration, Inversion, and Permutation
  2. Syntax
  3. Composed transformations
  4. Using different domains for original and derived languages

Embedding alignment

Use this script to calculate embedding alignment for any model which uses Transliteration as one of the transformations.

Data and models

All the data used for our experiments, hosted on Google Cloud Bucket.

  1. Pre-training data - pretrain_data
  2. Downstream data - supervised_data
  3. Model files - model_outputs

Links to experiments and results

  1. Spreadsheets with run descriptions, commands, and weights and biases link
  2. Spreadsheet with all results
  3. Links to pre-training runs
  4. Link to fine-tuning and analysis

Citation

Please consider citing if you used our paper in your work!

To be updated soon!
Owner
Princeton Natural Language Processing
Princeton Natural Language Processing
Introducing neural networks to predict stock prices

IntroNeuralNetworks in Python: A Template Project IntroNeuralNetworks is a project that introduces neural networks and illustrates an example of how o

Vivek Palaniappan 637 Jan 04, 2023
Pytorch implementation of various High Dynamic Range (HDR) Imaging algorithms

Deep High Dynamic Range Imaging Benchmark This repository is the pytorch impleme

Tianhong Dai 5 Nov 16, 2022
CUP-DNN is a deep neural network model used to predict tissues of origin for cancers of unknown of primary.

CUP-DNN CUP-DNN is a deep neural network model used to predict tissues of origin for cancers of unknown of primary. The model was trained on the expre

1 Oct 27, 2021
Aerial Imagery dataset for fire detection: classification and segmentation (Unmanned Aerial Vehicle (UAV))

Aerial Imagery dataset for fire detection: classification and segmentation using Unmanned Aerial Vehicle (UAV) Title FLAME (Fire Luminosity Airborne-b

79 Jan 06, 2023
PyTorch implementation of EigenGAN

PyTorch Implementation of EigenGAN Train python train.py [image_folder_path] --name [experiment name] Test python test.py [ckpt path] --traverse FFH

62 Nov 12, 2022
Reimplementation of Dynamic Multi-scale filters for Semantic Segmentation.

Paddle implementation of Dynamic Multi-scale filters for Semantic Segmentation.

Hongqiang.Wang 2 Nov 01, 2021
A TikTok-like recommender system for GitHub repositories based on Gorse

GitRec GitRec is the missing recommender system for GitHub repositories based on Gorse. Architecture The trending crawler crawls trending repositories

337 Jan 04, 2023
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2020 Links Doc

Sebastian Raschka 4.2k Jan 02, 2023
Official repository for Automated Learning Rate Scheduler for Large-Batch Training (8th ICML Workshop on AutoML)

Automated Learning Rate Scheduler for Large-Batch Training The official repository for Automated Learning Rate Scheduler for Large-Batch Training (8th

Kakao Brain 35 Jan 04, 2023
Towards uncontrained hand-object reconstruction from RGB videos

Towards uncontrained hand-object reconstruction from RGB videos Yana Hasson, Gül Varol, Ivan Laptev and Cordelia Schmid Project page Paper Table of Co

Yana 69 Dec 27, 2022
NLG evaluation via Statistical Measures of Similarity: BaryScore, DepthScore, InfoLM

NLG evaluation via Statistical Measures of Similarity: BaryScore, DepthScore, InfoLM Automatic Evaluation Metric described in the papers BaryScore (EM

Pierre Colombo 28 Dec 28, 2022
Libraries, tools and tasks created and used at DeepMind Robotics.

Libraries, tools and tasks created and used at DeepMind Robotics.

DeepMind 270 Nov 30, 2022
DeepMReye: magnetic resonance-based eye tracking using deep neural networks

DeepMReye: magnetic resonance-based eye tracking using deep neural networks

73 Dec 21, 2022
PyTorch implementation of Super SloMo by Jiang et al.

Super-SloMo PyTorch implementation of "Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation" by Jiang H., Sun

Avinash Paliwal 2.9k Jan 03, 2023
magiCARP: Contrastive Authoring+Reviewing Pretraining

magiCARP: Contrastive Authoring+Reviewing Pretraining Welcome to the magiCARP API, the test bed used by EleutherAI for performing text/text bi-encoder

EleutherAI 43 Dec 29, 2022
Human pose estimation from video plays a critical role in various applications such as quantifying physical exercises, sign language recognition, and full-body gesture control.

Pose Detection Project Description: Human pose estimation from video plays a critical role in various applications such as quantifying physical exerci

Hassan Shahzad 2 Jan 17, 2022
Official repository for ABC-GAN

ABC-GAN The work represented in this repository is the result of a 14 week semesterthesis on photo-realistic image generation using generative adversa

IgorSusmelj 10 Jun 23, 2022
PyTorch implementation of PSPNet

PSPNet with PyTorch Unofficial implementation of "Pyramid Scene Parsing Network" (https://arxiv.org/abs/1612.01105). This repository is just for caffe

Kazuto Nakashima 52 Nov 16, 2022
Supercharging Imbalanced Data Learning WithCausal Representation Transfer

ECRT: Energy-based Causal Representation Transfer Code for Supercharging Imbalanced Data Learning With Energy-basedContrastive Representation Transfer

Zidi Xiu 11 May 02, 2022
Official PyTorch Implementation of paper "Deep 3D Mask Volume for View Synthesis of Dynamic Scenes", ICCV 2021.

Deep 3D Mask Volume for View Synthesis of Dynamic Scenes Official PyTorch Implementation of paper "Deep 3D Mask Volume for View Synthesis of Dynamic S

Ken Lin 17 Oct 12, 2022