Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

Related tags

Deep LearningVOLT
Overview

**Codebase and data are uploaded in progress. **

VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automaticaly generate a vocabulary with suitable granularity for machine translation.

What's New:

  • July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
  • July 2021: Support subword-nmt tokenization.
  • July 2021: Support sentencepiece tokenization.

What's On-going:

  • Add translation training/evaluation codes.
  • Support classification tasks.
  • Support pip usage.

Features:

  • Efficient: CPU learning on one machine.
  • Simple: The core code is no more than 200 lines.
  • Easy-to-use: Support widely-used tokenization toolkits,subword-nmt and sentencepiece.
  • Flexible: User can customize their own tokenization rules.

Requirements and Installation

The required environments:

  • python 3.0
  • tqdm
  • mosedecoder
  • subword-nmt

To use VOLT and develop locally:

git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm 

Usage

  • The first step is to get vocabulary candidates and tokenized texts. The sub-word vocabulary can be generated by subword-nmt and sentencepiece. Here are two examples:

    
    #Assume source_data is the file stroing data in the source language
    #Assume target_data is the file stroing data in the target language
    BPEROOT=subword-nmt
    size=30000 # the size of BPE
    cat source_data > training_data
    cat target_data >> training_data
    
    #subword-nmt style:
    mkdir bpeoutput
    BPE_CODE=code # the path to save vocabulary
    python3 $BPEROOT/learn_bpe.py -s $size  < training_data > $BPE_CODE
    python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_file > bpeoutput/source.file
    python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_file > bpeoutput/source.file
    
    #sentencepiece style:
    mkdir spmout
    python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
    #After this step, you will see spm.vocab and spm.model
    python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
    python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
    
  • The second step is to run VOLT scripts. It accepts the following parameters:

    • --source_file: the file storing data in the source language.
    • --target_file: the file storing data in the target language.
    • --token_candidate_file: the file storing token candidates.
    • --max_number: the maximum size of the vocabulary generated by VOLT.
    • --interval: the search granularity in VOLT.
    • --loop_in_ot: the maximum interation loop in sinkhorn solution.
    • --tokenizer: which toolkit you use to get vocabulary. Only subword-nmt and sentencepiece are supported.
    • --size_file: the file to store the vocabulary size generated by VOLT.
    • --threshold: the threshold to decide which tokens are added into the final vocabulary from the optimal matrix. Less threshold means that less token candidates are dropped.
    #subword-nmt style
    python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
              --token_candidate_file $BPE_CODE \
              --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size 
    #sentencepiece style
    python3 ../ot_run.py --source_file spmoutput/source.file --target_file spmoutput/target.file \
              --token_candidate_file $BPE_CODE \
              --vocab_file spmoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer sentencepiece --size_file spmoutput/size 
    
  • The third step is to use the generated vocabulary to tokenize your texts:

      #for subword-nmt toolkit
      python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_file > bpeoutput/source.file
      python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_file > bpeoutput/source.file
    
      #for sentencepiece toolkit, here we only keep the optimal size
      best_size=$(cat spmoutput/size)
      python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
    
      #After this step, you will see spm.vocab and spm.model
      python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
      python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
    

Examples

We have given several examples in path "examples/".

Datasets

The WMT-14 En-de translation data can be downloaed via the running scripts.

For TED, you can download at TED.

Citation

Please cite as:

@inproceedings{volt,
  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author= {Jingjing Xu and
               Hao Zhou and
               Chun Gan and
               Zaixiang Zheng and
               Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}
Tensorflow-seq2seq-tutorials - Dynamic seq2seq in TensorFlow, step by step

seq2seq with TensorFlow Collection of unfinished tutorials. May be good for educational purposes. 1 - simple sequence-to-sequence model with dynamic u

Matvey Ezhov 1k Dec 17, 2022
Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, Amr Ahmed. KDD 2019.

gHHC Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, D

Nicholas Monath 35 Nov 16, 2022
Collections for the lasted paper about multi-view clustering methods (papers, codes)

Multi-View Clustering Papers Collections for the lasted paper about multi-view clustering methods (papers, codes). There also exists some repositories

Andrew Guan 10 Sep 20, 2022
The official PyTorch implementation for NCSNv2 (NeurIPS 2020)

Improved Techniques for Training Score-Based Generative Models This repo contains the official implementation for the paper Improved Techniques for Tr

174 Dec 26, 2022
The code of "Dependency Learning for Legal Judgment Prediction with a Unified Text-to-Text Transformer".

Code data_preprocess.py: preprocess data for Dependent-T5. parameters.py: define parameters of Dependent-T5. train_tools.py: traning and evaluation co

1 Apr 21, 2022
The repository includes the code for training cell counting applications. (Keras + Tensorflow)

cell_counting_v2 The repository includes the code for training cell counting applications. (Keras + Tensorflow) Dataset can be downloaded here : http:

Weidi 113 Oct 06, 2022
Deep learning models for classification of 15 common weeds in the southern U.S. cotton production systems.

CottonWeeds Deep learning models for classification of 15 common weeds in the southern U.S. cotton production systems. requirements pytorch torchsumma

Dong Chen 8 Jun 07, 2022
Official implementation of the paper 'High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network' in CVPR 2021

LPTN Paper | Supplementary Material | Poster High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network Ji

372 Dec 26, 2022
ICCV2021 - A New Journey from SDRTV to HDRTV.

ICCV2021 - A New Journey from SDRTV to HDRTV.

XyChen 82 Dec 27, 2022
Some simple programs built in Python: webcam with cv2 that detects eyes and face, with grayscale filter

Programas en Python Algunos programas simples creados en Python: 📹 Webcam con c

Madirex 1 Feb 15, 2022
Code and data for "TURL: Table Understanding through Representation Learning"

TURL This Repo contains code and data for "TURL: Table Understanding through Representation Learning". Environment and Setup Data Pretraining Finetuni

SunLab-OSU 63 Nov 23, 2022
Python with OpenCV - MediaPip Framework Hand Detection

Python HandDetection Python with OpenCV - MediaPip Framework Hand Detection Explore the docs » Contact Me About The Project It is a Computer vision pa

2 Jan 07, 2022
Code for NeurIPS 2020 article "Contrastive learning of global and local features for medical image segmentation with limited annotations"

Contrastive learning of global and local features for medical image segmentation with limited annotations The code is for the article "Contrastive lea

Krishna Chaitanya 152 Dec 22, 2022
IEEE-CIS Technical Challenge on Predict+Optimize for Renewable Energy Scheduling

IEEE-CIS Technical Challenge on Predict+Optimize for Renewable Energy Scheduling This is my code, data and approach for the IEEE-CIS Technical Challen

3 Sep 18, 2022
Official code release for: EditGAN: High-Precision Semantic Image Editing

Official code release for: EditGAN: High-Precision Semantic Image Editing

565 Jan 05, 2023
Improving Calibration for Long-Tailed Recognition (CVPR2021)

MiSLAS Improving Calibration for Long-Tailed Recognition Authors: Zhisheng Zhong, Jiequan Cui, Shu Liu, Jiaya Jia [arXiv] [slide] [BibTeX] Introductio

Jia Research Lab 116 Dec 20, 2022
Leveraging Social Influence based on Users Activity Centers for Point-of-Interest Recommendation

SUCP Leveraging Social Influence based on Users Activity Centers for Point-of-Interest Recommendation () Direct Friends (i.e., users who follow each o

Kosar 8 Nov 26, 2022
This repo is to be freely used by ML devs to check the GAN performances without coding from scratch.

GANs for Fun Created because I can! GOAL The goal of this repo is to be freely used by ML devs to check the GAN performances without coding from scrat

Sagnik Roy 13 Jan 26, 2022
My personal code and solution to the Synacor Challenge from 2012 OSCON.

Synacor OSCON Challenge Solution (2012) This repository contains my code and solution to solve the Synacor OSCON 2012 Challenge. If you are interested

2 Mar 20, 2022
MobileNetV1-V2,MobileNeXt,GhostNet,AdderNet,ShuffleNetV1-V2,Mobile+ViT etc.

MobileNetV1-V2,MobileNeXt,GhostNet,AdderNet,ShuffleNetV1-V2,Mobile+ViT etc. ⭐⭐⭐⭐⭐

568 Jan 04, 2023