Evaluating Cross-lingual Sentence Representations

Related tags

Deep LearningXNLI
Overview

XNLI: The Cross-Lingual NLI Corpus

XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages.

New: XLM and Multilingual BERT use XNLI to evaluate the quality of the cross-lingual representations.

Introduction

Many NLP systems (e.g. sentiment analysis, topic classification, feed ranking) rely on training data in one high-resource language, but cannot be directly used to make predictions for other languages at test time. This problem happens in almost any industrial application that involves cross-lingual data.

Machine translation can be used to translate any sample into the high-resource language to alleviate this issue. However, having an MT system in every direction is costly and may not the best solution for cross-lingual classification. Cross-lingual encoders are a cheaper and more elegant alternative.

To evaluate such cross-lingual sentence understanding methods, we built XNLI, an extension of the SNLI/MultiNLI corpus in 15 languages. XNLI raises the following research question: how can we make predictions in any language at test time, when we only have English training data for that task?

While industrial applications may not include NLI in their routine tasks, we believe that NLI is a good testbed for evaluating cross-lingual sentence representations, and that better approaches for XNLI will result in better XLU methods.

The XNLI corpus

The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations.

Examples

Download

XNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt). The English training data can be found on the MultiNLI website.

Download: XNLI 1.0 (17MB, ZIP)

XNLI can also be used as a 15way parallel corpus of 10,000 sentences, for building or evaluating Machine Translation systems. XNLI provides additional open parallel data for low-resource languages such as Swahili or Urdu.

Download XNLI-15way (12MB, ZIP)

Useful XLU resources

Parallel data for aligning sentence encoders: OPUS: the open parallel corpus

Monolingual word embeddings in 157 languages: fastText CommonCrawl vectors

Aligning word embeddings: MUSE

Data description paper and citation

A description of the data can be found here or in the corpus package zip. If you use the corpus in an academic paper, please consider citing:

@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis
        and Rinott, Ruty
        and Lample, Guillaume
        and Williams, Adina
        and Bowman, Samuel R.
        and Schwenk, Holger
        and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
               in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}

Baselines and MT data

The XNLI paper presents several baselines for language adaptation.

We also release the machine translated data for reproducing the TRANSLATE-TRAIN and TRANSLATE-TEST:

Download: XNLI-MT 1.0 (445MB, ZIP)

License

XNLI is licensed under the license found in the LICENSE file. See details in the XNLI paper.

Owner
Meta Research
Meta Research
[NeurIPS 2021] A weak-shot object detection approach by transferring semantic similarity and mask prior.

TransMaS This repository is the official pytorch implementation of the following paper: NIPS2021 Mixed Supervised Object Detection by TransferringMask

BCMI 49 Jul 27, 2022
Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning"

Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning" This is the code for the paper Solving Graph-based Public Goo

Victor-Alexandru Darvariu 3 Dec 05, 2022
Research into Forex price prediction from price history using Deep Sequence Modeling with Stacked LSTMs.

Forex Data Prediction via Recurrent Neural Network Deep Sequence Modeling Research Paper Our research paper can be viewed here Installation Clone the

Alex Taradachuk 2 Aug 07, 2022
Pytorch implementation of ICASSP 2022 paper Attention Probe: Vision Transformer Distillation in the Wild

Attention Probe: Vision Transformer Distillation in the Wild Jiahao Wang, Mingdeng Cao, Shuwei Shi, Baoyuan Wu, Yujiu Yang In ICASSP 2022 This code is

IIGROUP 6 Sep 21, 2022
Official code for our ICCV paper: "From Continuity to Editability: Inverting GANs with Consecutive Images"

GANInversion_with_ConsecutiveImgs Official code for our ICCV paper: "From Continuity to Editability: Inverting GANs with Consecutive Images" https://a

QingyangXu 38 Dec 07, 2022
Automatically download the cwru data set, and then divide it into training data set and test data set

Automatically download the cwru data set, and then divide it into training data set and test data set.自动下载cwru数据集,然后分训练数据集和测试数据集

6 Jun 27, 2022
An expansion for RDKit to read all types of files in one line

RDMolReader An expansion for RDKit to read all types of files in one line How to use? Add this single .py file to your project and import MolFromFile(

Ali Khodabandehlou 1 Dec 18, 2021
Code for pre-training CharacterBERT models (as well as BERT models).

Pre-training CharacterBERT (and BERT) This is a repository for pre-training BERT and CharacterBERT. DISCLAIMER: The code was largely adapted from an o

Hicham EL BOUKKOURI 31 Dec 05, 2022
Github Traffic Insights as Prometheus metrics.

github-traffic Github Traffic collects your repository's traffic data and exposes it as Prometheus metrics. Grafana dashboard that displays the metric

Grafana Labs 34 Oct 27, 2022
Imitating Deep Learning Dynamics via Locally Elastic Stochastic Differential Equations

Imitating Deep Learning Dynamics via Locally Elastic Stochastic Differential Equations This repo contains official code for the NeurIPS 2021 paper Imi

Jiayao Zhang 2 Oct 18, 2021
The code for Expectation-Maximization Attention Networks for Semantic Segmentation (ICCV'2019 Oral)

EMANet News The bug in loading the pretrained model is now fixed. I have updated the .pth. To use it, download it again. EMANet-101 gets 80.99 on the

Xia Li 李夏 663 Nov 30, 2022
Official Pytorch Implementation of: "Semantic Diversity Learning for Zero-Shot Multi-label Classification"(2021) paper

Semantic Diversity Learning for Zero-Shot Multi-label Classification Paper Official PyTorch Implementation Avi Ben-Cohen, Nadav Zamir, Emanuel Ben Bar

28 Aug 29, 2022
Code for the Paper "Diffusion Models for Handwriting Generation"

Code for the Paper "Diffusion Models for Handwriting Generation"

62 Dec 21, 2022
Pose estimation for iOS and android using TensorFlow 2.0

💃 Mobile 2D Single Person (Or Your Own Object) Pose Estimation for TensorFlow 2.0 This repository is forked from edvardHua/PoseEstimationForMobile wh

tucan9389 165 Nov 16, 2022
Pytorch implementation of FlowNet by Dosovitskiy et al.

FlowNetPytorch Pytorch implementation of FlowNet by Dosovitskiy et al. This repository is a torch implementation of FlowNet, by Alexey Dosovitskiy et

Clément Pinard 762 Jan 02, 2023
🔮 A refreshing functional take on deep learning, compatible with your favorite libraries

Thinc: A refreshing functional take on deep learning, compatible with your favorite libraries From the makers of spaCy, Prodigy and FastAPI Thinc is a

Explosion 2.6k Dec 30, 2022
A commany has recently introduced a new type of bidding, the average bidding, as an alternative to the bid given to the current maximum bidding

Business Problem A commany has recently introduced a new type of bidding, the average bidding, as an alternative to the bid given to the current maxim

Kübra Bilinmiş 1 Jan 15, 2022
Process text, including tokenizing and representing sentences as vectors and Applying some concepts like RNN, LSTM and GRU to create a classifier can detect the language in which a sentence is written from among 17 languages.

Language Identifier What is this ? The goal of this project is to create a model that is able to predict a given sentence language through text proces

Hossam Asaad 9 Dec 15, 2022
Data-driven reduced order modeling for nonlinear dynamical systems

SSMLearn Data-driven Reduced Order Models for Nonlinear Dynamical Systems This package perform data-driven identification of reduced order model based

Haller Group, Nonlinear Dynamics 27 Dec 13, 2022
BarcodeRattler - A Raspberry Pi Powered Barcode Reader to load a game on the Mister FPGA using MBC

Barcode Rattler A Raspberry Pi Powered Barcode Reader to load a game on the Mist

Chrissy 29 Oct 31, 2022