CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY

Overview

M-BERT-Study

CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY

Motivation

Multilingual BERT (M-BERT) has shown surprising cross lingual abilities --- even when it is trained without cross lingual objectives. In this work, we analyze what causes this multilinguality from three factors: linguistic properties of the languages, the architecture of the model, and the learning objectives.

Results

Linguistic properties:

  • Code switching text (word-piece overlap) is not the main cause of multilinguality.
  • Word ordering is crucial, when words in sentences are randomly permuted, multilinguality is low, however, still significantly better than random.
  • (Unigram) word frequency is not enough, as we resampled all words with the same frequency, and found almost random performance. Combining the second and the third property infers that there is language similarity other than ordering of words between two languages, and which unigram frequency does not capture. We hypothesize that it may be similarity of n-gram occurrences.

Architecture:

  • Depth of the transformer is the most important.
  • Number of attention heads effects the absolute performance on individual languages, but the gap between in-language supervision and cross-language zero-shot learning didn't change much.
  • Total number of parameters, like depth, effects multilinguality.

Learning Objectives:

  • Next Sentence Prediction objective, when removed, leads to slight increase in performance.
  • Even marking sentences in languages with language-ids, allowing BERT to know exactly which language its learning on, did not hurt performance
  • Using word-pieces leads to strong improvements on both source and target language (likely to depend on tasks) and slight improvement cross-lingually comparing to word or character based models.

Please refer to our paper for more details.

Scripts

Creating pre-training data

If you would like to pre-train a BERT with Fake language/permuted sentences, see preprocessing-scripts for how to create the tfrecords for BERT training.

Pre-training BERT

Once you have uploaded the tfrecords to google cloud, you can set up an instance and start BERT training via bert-running-scripts.

Evaluating

With models we provide or just trained, we provide code for evaluating on two tasks, NER and entailment. See evaluating-scripts.

BERT Models

We release the following bert models (in a few days):

  • Word-piece Experiments
  • Word Order Experiments
  • Word Frequency Experiments
  • Model Structure Experiments

See data for detailed paths to download (in a few days).

Requirements

  • allennlp: 0.9.0
  • ccg_nlpy

Citation

Please cite the following paper if you find our paper useful. Thanks!

Karthikeyan K, Zihan Wang, Stephen Mayhew, Dan Roth. "Cross-Lingual Ability of Multilingual BERT: An Empirical Study" arXiv preprint arXiv:1912.07840 (2019).

@article{wang2019cross,
  title={Cross-Lingual Ability of Multilingual BERT: An Empirical Study},
  author={K, Karthikeyan and Wang, Zihan and Mayhew, Stephen and Roth, Dan},
  journal={arXiv preprint arXiv:1912.07840},
  year={2019}
}
Owner
CogComp
Cognitive Computation Group, led by Prof. Dan Roth
CogComp
Collective Multi-type Entity Alignment Between Knowledge Graphs (WWW'20)

CG-MuAlign A reference implementation for "Collective Multi-type Entity Alignment Between Knowledge Graphs", published in WWW 2020. If you find our pa

Bran Zhu 28 Dec 11, 2022
Convert onnx models to pytorch.

onnx2torch onnx2torch is an ONNX to PyTorch converter. Our converter: Is easy to use – Convert the ONNX model with the function call convert; Is easy

ENOT 264 Dec 30, 2022
Codes for NeurIPS 2021 paper "Adversarial Neuron Pruning Purifies Backdoored Deep Models"

Adversarial Neuron Pruning Purifies Backdoored Deep Models Code for NeurIPS 2021 "Adversarial Neuron Pruning Purifies Backdoored Deep Models" by Dongx

Dongxian Wu 31 Dec 11, 2022
Official implementation for the paper: "Multi-label Classification with Partial Annotations using Class-aware Selective Loss"

Multi-label Classification with Partial Annotations using Class-aware Selective Loss Paper | Pretrained models Official PyTorch Implementation Emanuel

99 Dec 27, 2022
StellarGraph - Machine Learning on Graphs

StellarGraph Machine Learning Library StellarGraph is a Python library for machine learning on graphs and networks. Table of Contents Introduction Get

S T E L L A R 2.6k Jan 05, 2023
Enhancing Knowledge Tracing via Adversarial Training

Enhancing Knowledge Tracing via Adversarial Training This repository contains source code for the paper "Enhancing Knowledge Tracing via Adversarial T

Xiaopeng Guo 14 Oct 24, 2022
Scalable Optical Flow-based Image Montaging and Alignment

SOFIMA SOFIMA (Scalable Optical Flow-based Image Montaging and Alignment) is a tool for stitching, aligning and warping large 2d, 3d and 4d microscopy

Google Research 16 Dec 21, 2022
Joint detection and tracking model named DEFT, or ``Detection Embeddings for Tracking.

DEFT: Detection Embeddings for Tracking DEFT: Detection Embeddings for Tracking, Mohamed Chaabane, Peter Zhang, J. Ross Beveridge, Stephen O'Hara

Mohamed Chaabane 253 Dec 18, 2022
A repository that finds a person who looks like you by using face recognition technology.

Find Your Twin Hello everyone, I've always wondered how casting agencies do the casting for a scene where a certain actor is young or old for a movie

Cengizhan Yurdakul 3 Jan 29, 2022
MediaPipe is a an open-source framework from Google for building multimodal

MediaPipe is a an open-source framework from Google for building multimodal (eg. video, audio, any time series data), cross platform (i.e Android, iOS, web, edge devices) applied ML pipelines. It is

Bhavishya Pandit 3 Sep 30, 2022
(CVPR 2021) Back-tracing Representative Points for Voting-based 3D Object Detection in Point Clouds

BRNet Introduction This is a release of the code of our paper Back-tracing Representative Points for Voting-based 3D Object Detection in Point Clouds,

86 Oct 05, 2022
Western-3DSlicer-Modules - Point-Set Registrations for Ultrasound Probe Calibrations

Point-Set Registrations for Ultrasound Probe Calibrations -Undergraduate Thesis-

Matteo Tanzi 0 May 04, 2022
No-reference Image Quality Assessment(NIQA) Algorithms (BRISQUE, NIQE, PIQE, RankIQA, MetaIQA)

No-Reference Image Quality Assessment Algorithms No-reference Image Quality Assessment(NIQA) is a task of evaluating an image without a reference imag

Dae-Young Song 26 Jan 04, 2023
Data and code for ICCV 2021 paper Distant Supervision for Scene Graph Generation.

Distant Supervision for Scene Graph Generation Data and code for ICCV 2021 paper Distant Supervision for Scene Graph Generation. Introduction The pape

THUNLP 23 Dec 31, 2022
Rax is a Learning-to-Rank library written in JAX

🦖 Rax: Composable Learning to Rank using JAX Rax is a Learning-to-Rank library written in JAX. Rax provides off-the-shelf implementations of ranking

Google 247 Dec 27, 2022
Implementation of accepted AAAI 2021 paper: Deep Unsupervised Image Hashing by Maximizing Bit Entropy

Deep Unsupervised Image Hashing by Maximizing Bit Entropy This is the PyTorch implementation of accepted AAAI 2021 paper: Deep Unsupervised Image Hash

62 Dec 30, 2022
HyperCube: Implicit Field Representations of Voxelized 3D Models

HyperCube: Implicit Field Representations of Voxelized 3D Models Authors: Magdalena Proszewska, Marcin Mazur, Tomasz Trzcinski, Przemysław Spurek [Pap

Magdalena Proszewska 3 Mar 09, 2022
Curriculum Domain Adaptation for Semantic Segmentation of Urban Scenes, ICCV 2017

AdaptationSeg This is the Python reference implementation of AdaptionSeg proposed in "Curriculum Domain Adaptation for Semantic Segmentation of Urban

Yang Zhang 128 Oct 19, 2022
[NeurIPS 2021] Garment4D: Garment Reconstruction from Point Cloud Sequences

Garment4D [PDF] | [OpenReview] | [Project Page] Overview This is the codebase for our NeurIPS 2021 paper Garment4D: Garment Reconstruction from Point

Fangzhou Hong 112 Dec 23, 2022
Sequence lineage information extracted from RKI sequence data repo

Pango lineage information for German SARS-CoV-2 sequences This repository contains a join of the metadata and pango lineage tables of all German SARS-

Cornelius Roemer 24 Oct 26, 2022