A multilingual version of MS MARCO passage ranking dataset

Related tags

Deep LearningmMARCO
Overview

mMARCO

A multilingual version of MS MARCO passage ranking dataset

This repository presents a neural machine translation-based method for translating the MS MARCO passage ranking dataset. The code available here is the same used in our paper mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset.

Translated Datasets

As described in our work, we made available 8 translated versions of MS MARCO passage ranking dataset. The translated passages collection and the queries set (training and validation) are available at:

Released Model Checkpoints

Our available fine-tuned models are:

Model Description [email protected]*
ptT5-base-pt-msmarco a PTT5 model fine-tuned on Portuguese MS MARCO 0.188
ptT5-base-en-pt-msmarco a PTT5 model fine-tuned on English and Portuguese MS MARCO 0.343
mT5-base-en-pt-msmarco a mT5 model fine-tuned on both English and Portuguese MS MARCO 0.375
mT5-base-multi-msmarco a mT5 model fine-tuned on mMARCO 0.366
mMiniLM-pt-msmarco a mMiniLM model fine-tuned on Portuguese MS MARCO -
mMiniLM-en-pt-msmarco a mMiniLM model fine-tuned on both English and Portuguese MS MARCO 0.375
mMiniLM-multi-msmarco a mMiniLM model fine-tuned on mMARCO 0.363

* [email protected] on English MS MARCO

Dataset

We translate MS MARCO passage ranking dataset, a large-scale IR dataset comprising more than half million anonymized questions that were sampled from Bing's search query logs.

Translation Model

To translate the MS MARCO dataset, we use MarianNMT an open-source neural machine translation framework originally written in C++ for fast training and translation. The Language Technology Research Group at the University of Helsinki made available more than a thousand language pairs for translation, supported by HuggingFace framework.

How To Translate

In order to allow other users to translate the MS MARCO passage ranking dataset to other languages (or a dataset of your own will), we provide the translate.py script. This script expects a .tsv file, in which each line follows a document_id \t document_text format.

python translate.py --model_name_or_path Helsinki-NLP/opus-mt-{src}-{tgt} --target_language tgt_code--input_file collection.tsv --output_dir translated_data/

After translating, it is necessary to reassemble the file, as the documents were split into sentences.

python create_translated_collection.py --input_file translated_data/translated_file --output_file translated_{tgt}_collection

Translating the entire passages collection of MS MARCO took about 80 hours using a Tesla V100.

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@misc{bonifacio2021mmarco,
      title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset}, 
      author={Luiz Henrique Bonifacio and Israel Campiotti and Roberto Lotufo and Rodrigo Nogueira},
      year={2021},
      eprint={2108.13897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers.

ViTGAN: Training GANs with Vision Transformers A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers. Refer

Hong-Jia Chen 127 Dec 23, 2022
Framework that uses artificial intelligence applied to mathematical models to make predictions

LiconIA Framework that uses artificial intelligence applied to mathematical models to make predictions Interface Overview Table of contents [TOC] 1 Ar

4 Jun 20, 2021
Official Pytorch Implementation of 3DV2021 paper: SAFA: Structure Aware Face Animation.

SAFA: Structure Aware Face Animation (3DV2021) Official Pytorch Implementation of 3DV2021 paper: SAFA: Structure Aware Face Animation. Getting Started

QiulinW 122 Dec 23, 2022
Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP

Wav2CLIP 🚧 WIP 🚧 Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP 📄 🔗 Ho-Hsiang Wu, Prem Seetharaman

Descript 240 Dec 13, 2022
Code for the paper Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration

IMAGINE: Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration This repo contains the code base of the paper Language as a Cog

Flowers Team 26 Dec 22, 2022
Real-Time SLAM for Monocular, Stereo and RGB-D Cameras, with Loop Detection and Relocalization Capabilities

ORB-SLAM2 Authors: Raul Mur-Artal, Juan D. Tardos, J. M. M. Montiel and Dorian Galvez-Lopez (DBoW2) 13 Jan 2017: OpenCV 3 and Eigen 3.3 are now suppor

Raul Mur-Artal 7.8k Dec 30, 2022
This is the official code of our paper "Diversity-based Trajectory and Goal Selection with Hindsight Experience Relay" (PRICAI 2021)

Diversity-based Trajectory and Goal Selection with Hindsight Experience Replay This is the official implementation of our paper "Diversity-based Traje

Tianhong Dai 6 Jul 18, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect. It handles Algerian

117 Jan 07, 2023
Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

CNTK Chat Windows build status Linux build status The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes

Microsoft 17.3k Dec 29, 2022
Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes

Neural Scene Flow Fields PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 2021 [Projec

Zhengqi Li 583 Dec 30, 2022
Fast Neural Style for Image Style Transform by Pytorch

FastNeuralStyle by Pytorch Fast Neural Style for Image Style Transform by Pytorch This is famous Fast Neural Style of Paper Perceptual Losses for Real

Bengxy 81 Sep 03, 2022
traiNNer is an open source image and video restoration (super-resolution, denoising, deblurring and others) and image to image translation toolbox based on PyTorch.

traiNNer traiNNer is an open source image and video restoration (super-resolution, denoising, deblurring and others) and image to image translation to

202 Jan 04, 2023
Code of the paper "Part Detector Discovery in Deep Convolutional Neural Networks" by Marcel Simon, Erik Rodner and Joachim Denzler

Part Detector Discovery This is the code used in our paper "Part Detector Discovery in Deep Convolutional Neural Networks" by Marcel Simon, Erik Rodne

Computer Vision Group Jena 17 Feb 22, 2022
This program generates a random 12 digit/character password (upper and lowercase) and stores it in a file along with your username and app/website.

PasswordGeneratorAndVault This program generates a random 12 digit/character password (upper and lowercase) and stores it in a file along with your us

Chris 1 Feb 26, 2022
An implementation of the paper "A Neural Algorithm of Artistic Style"

A Neural Algorithm of Artistic Style implementation - Neural Style Transfer This is an implementation of the research paper "A Neural Algorithm of Art

Srijarko Roy 27 Sep 20, 2022
Pytorch implementation of RED-SDS (NeurIPS 2021).

Recurrent Explicit Duration Switching Dynamical Systems (RED-SDS) This repository contains a reference implementation of RED-SDS, a non-linear state s

Abdul Fatir 10 Dec 02, 2022
Multiple-criteria decision-making (MCDM) with Electre, Promethee, Weighted Sum and Pareto

EasyMCDM - Quick Installation methods Install with PyPI Once you have created your Python environment (Python 3.6+) you can simply type: pip3 install

Labrak Yanis 6 Nov 22, 2022
Agile SVG maker for python

Agile SVG Maker Need to draw hundreds of frames for a GIF? Need to change the style of all pictures in a PPT? Need to draw similar images with differe

SemiWaker 4 Sep 25, 2022
Safe Bayesian Optimization

SafeOpt - Safe Bayesian Optimization This code implements an adapted version of the safe, Bayesian optimization algorithm, SafeOpt [1], [2]. It also p

Felix Berkenkamp 111 Dec 11, 2022
Learning Continuous Image Representation with Local Implicit Image Function

LIIF This repository contains the official implementation for LIIF introduced in the following paper: Learning Continuous Image Representation with Lo

Yinbo Chen 1k Dec 25, 2022