Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Last update: Dec 15, 2022

Overview

Easy-Translate is a script for translating large text files on your machine using the M2M100 models from Facebook/Meta AI. We also provide a script for Easy-Evaluation of your translations 🥳

M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation introduced in this paper and first released in this repository.

M2M100 can directly translate between 9,900 directions of 100 languages.
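
The 9,900 figure follows from counting ordered (source, target) pairs over 100 languages, excluding a language paired with itself. A quick check of that arithmetic, using plain integers as stand-ins for the languages:

```python
from itertools import permutations

NUM_LANGUAGES = 100  # number of languages stated above

# Closed form: each language can be translated into every other language.
directions = NUM_LANGUAGES * (NUM_LANGUAGES - 1)

# Same count obtained by enumerating ordered (source, target) pairs explicitly.
enumerated = sum(1 for _ in permutations(range(NUM_LANGUAGES), 2))

print(directions, enumerated)
```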

Easy-Translate is built on top of 🤗 HuggingFace's Transformers and 🤗 HuggingFace's Accelerate library.

We currently support:

CPU / multi-CPU / GPU / multi-GPU / TPU acceleration
BF16 / FP16 / FP32 precision.
Automatic batch size finder: Forget CUDA OOM errors. Set an initial batch size, if it doesn't fit, we will automatically adjust it.
Sharded Data Parallel to load huge models sharded on multiple GPUs (See: https://huggingface.co/docs/accelerate/fsdp).

Test the 🔌 Online Demo here: https://huggingface.co/spaces/Iker/Translate-100-languages

Supported languages

See the Supported languages table for the supported languages and their ids.

List of supported languages: Afrikaans, Amharic, Arabic, Asturian, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Breton, Bosnian, Catalan, Cebuano, Czech, Welsh, Danish, German, Greek, English, Spanish, Estonian, Persian, Fulah, Finnish, French, WesternFrisian, Irish, Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Iloko, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, CentralKhmer, Kannada, Korean, Luxembourgish, Ganda, Lingala, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Nepali, Dutch, Norwegian, NorthernSotho, Occitan, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Swati, Sundanese, Swedish, Swahili, Tamil, Thai, Tagalog, Tswana, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Wolof, Xhosa, Yiddish, Yoruba, Chinese, Zulu

Supported Models

Facebook/m2m100_418M: https://huggingface.co/facebook/m2m100_418M
Facebook/m2m100_1.2B: https://huggingface.co/facebook/m2m100_1.2B
Facebook/m2m100_12B: https://huggingface.co/facebook/m2m100-12B-avg-5-ckpt
Any other m2m100 model from HuggingFace's Hub: https://huggingface.co/models?search=m2m100

Requirements

PyTorch >= 1.10.0
See: https://pytorch.org/get-started/locally/

Accelerate >= 0.7.1
pip install --upgrade accelerate

HuggingFace Transformers 
pip install --upgrade transformers

Translate a file

Run python translate.py -h for more info.

Using a single CPU / GPU

accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B
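
If you need the same file translated into several target languages, the command above can be assembled programmatically. The following is a minimal sketch, not part of Easy-Translate: `build_translate_command` is a hypothetical helper, the `fr` target is only an example id, and the output file naming simply mirrors the sample above. It uses only the flags shown in the command.

```python
import shlex
from typing import List


def build_translate_command(
    sentences_path: str,
    output_path: str,
    source_lang: str,
    target_lang: str,
    model_name: str = "facebook/m2m100_1.2B",
) -> List[str]:
    """Return the argv list for one translate.py run (hypothetical helper)."""
    return [
        "accelerate", "launch", "translate.py",
        "--sentences_path", sentences_path,
        "--output_path", output_path,
        "--source_lang", source_lang,
        "--target_lang", target_lang,
        "--model_name", model_name,
    ]


# One command per target language; the naming scheme is an assumption.
commands = [
    build_translate_command(
        "sample_text/en.txt",
        f"sample_text/en2{target}.translation.m2m100_1.2B.txt",
        "en",
        target,
    )
    for target in ("es", "fr")
]

for command in commands:
    # shlex.join renders the argv list as a copy-pasteable shell line.
    print(shlex.join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.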

Multi-GPU

See Accelerate documentation for more information (multi-node, TPU, Sharded model...): https://huggingface.co/docs/accelerate/index
You can use the Accelerate CLI to configure the Accelerate environment (Run accelerate config in your terminal) instead of using the --multi_gpu and --num_processes flags.

# Use 2 GPUs
accelerate launch --multi_gpu --num_processes 2 --num_machines 1 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B

Automatic batch size finder

We will automatically find a batch size that fits in your GPU memory. The default initial batch size is 128 (you can change it with the --starting_batch_size flag, for example --starting_batch_size 128). If an out-of-memory error occurs, we automatically decrease the batch size until we find one that works.
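
The retry logic can be pictured with a small simulation. This is an illustrative sketch only, not the code in translate.py: `find_working_batch_size` and `fake_translate_batch` are hypothetical names, `MemoryError` stands in for a CUDA out-of-memory error, and halving the batch size on each failure is an assumption (the text above only says the size is decreased).

```python
from typing import Callable, List


def find_working_batch_size(
    run_batch: Callable[[int], None],
    starting_batch_size: int = 128,
) -> int:
    """Halve the batch size until run_batch succeeds (illustrative sketch)."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            run_batch(batch_size)
            return batch_size
        except MemoryError:
            # Stand-in for a CUDA out-of-memory error; halving is an assumption.
            batch_size //= 2
    raise RuntimeError("No batch size fits in memory.")


MAX_FITTING_BATCH = 40  # hypothetical memory limit for the simulation
attempts: List[int] = []


def fake_translate_batch(batch_size: int) -> None:
    """Simulated workload that fails whenever the batch is too large."""
    attempts.append(batch_size)
    if batch_size > MAX_FITTING_BATCH:
        raise MemoryError(f"batch of {batch_size} does not fit")


chosen = find_working_batch_size(fake_translate_batch, starting_batch_size=128)
print(chosen, attempts)  # 32 [128, 64, 32]
```

Starting high and shrinking on failure means a generous --starting_batch_size costs only a few failed attempts, while a value that is too small leaves GPU memory unused.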

Choose precision

Use the --precision flag to choose the precision of the model. You can choose between: bf16, fp16 and 32.

accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--precision fp16
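
For readers wiring a similar flag into their own scripts, a restricted-choice option can be sketched with argparse. This is a hypothetical parser, not the one in translate.py: only the three accepted values come from the text above, and the default of 32 is an assumption.

```python
import argparse

PRECISION_CHOICES = ("bf16", "fp16", "32")  # values listed above


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real translate.py may define its flags differently."""
    parser = argparse.ArgumentParser(description="Precision flag sketch")
    parser.add_argument(
        "--precision",
        choices=PRECISION_CHOICES,
        default="32",  # assumed default, not confirmed by the text
        help="Numeric precision used to run the model.",
    )
    return parser


# An explicit value is accepted only if it is one of the listed choices.
args = build_parser().parse_args(["--precision", "fp16"])
print(args.precision)

# With no flag given, the assumed default applies.
default_args = build_parser().parse_args([])
print(default_args.precision)
```

With `choices` set, argparse rejects any other value (for example fp8) with a usage error before the model is loaded.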

Evaluate translations

To run the evaluation script you need to install bert_score (pip install bert_score) and 🤗 HuggingFace's Datasets library (pip install datasets).

The evaluation script will calculate the following metrics:

Run the following command to evaluate the translations:

accelerate launch eval.py \
--pred_path sample_text/en2es.translation.m2m100_1.2B.txt \
--gold_path sample_text/es.txt

If you want to save the results to a file use the --output_path flag.

See sample_text/en2es.m2m100_1.2B.json for a sample output.

Owner

Iker García-Ferrero
