Crosslingual Segmental Language Model

This repository contains the code for the paper "Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages" (2021; C.M. Downey, Shannon Drizin, Levon Haroutunian, and Shivin Thukral). The code is a modified version of the repository from the original MSLM paper. The mslm package can be used to train and apply Segmental Language Models (SLMs).

In this repository, we additionally make available our preparation of the AmericasNLP 2021 multilingual dataset (see Data/AmericasNLP) and the target K'iche' data (Data/GlobalClassroom).

Paper Results

The results from the accompanying paper can be found in the Output directory. The *.csv files contain statistics from the training run, the *.out files contain the model output for the entire corpus, and the *.score files contain the segmentation scores of the model output.
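To get a quick inventory of the result files described above, something like the following sketch may help. It assumes only the three extensions named in this section; the helper name `group_result_files` is hypothetical and not part of the mslm package, and nothing is assumed about the contents or column names of the files.

```python
from collections import defaultdict
from pathlib import Path

# The three result extensions named in this README, with short descriptions.
RESULT_KINDS = {
    ".csv": "training statistics",
    ".out": "model output for the entire corpus",
    ".score": "segmentation scores",
}


def group_result_files(output_dir):
    """Group files found under output_dir by result extension.

    Hypothetical helper for illustration; not part of the mslm package.
    """
    groups = defaultdict(list)
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file() and path.suffix in RESULT_KINDS:
            groups[path.suffix].append(path)
    return dict(groups)


if __name__ == "__main__":
    # "Output" is the results directory named in this README.
    for suffix, paths in group_result_files("Output").items():
        print(f"{suffix} ({RESULT_KINDS[suffix]}): {len(paths)} file(s)")
```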

The results from the October 2021 pre-print (which we will refer to as Experiment Set A) are reproducible on commit 2b89575. We will consider this the official commit of the October 2021 pre-print.

Usage

The top-level scripts for training and experimentation can be found in RunScripts. Almost all functionality is run through the __main__.py script in the mslm package, which can either train or evaluate/use a model. The PyTorch modules for building SLMs can be found in mslm.segmental_lm, modules for the span-masking Transformer are in mslm.segmental_transformer, and modules for sequence lattice-based computations are in mslm.lattice. The main script takes in a configuration object to set most parameters for model training and use (see mslm.mslm_config). For information on the arguments to the main script:

python -m mslm --help
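As background on the sequence lattice-based computations mentioned above: a segmental language model scores a sequence by summing over all of its possible segmentations, which can be done with a forward recursion over a lattice of segment boundaries. The sketch below is a generic pure-Python illustration of that idea, not the code in mslm.lattice. The function names and the `segment_log_prob` callback are assumptions made for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_log_prob(seq, segment_log_prob, max_seg_len):
    """Log-probability of seq, marginalized over all segmentations.

    segment_log_prob(seq, start, end) is a hypothetical scoring function
    returning the log-probability of the segment seq[start:end]; it may
    condition on the preceding context seq[:start].
    """
    n = len(seq)
    # alpha[t]: log-probability of the first t symbols, summed over
    # every way of splitting them into segments of length <= max_seg_len.
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0
    for end in range(1, n + 1):
        candidates = [
            alpha[start] + segment_log_prob(seq, start, end)
            for start in range(max(0, end - max_seg_len), end)
        ]
        alpha[end] = logsumexp(candidates)
    return alpha[n]


def toy_segment_log_prob(seq, start, end):
    """Toy scorer: each symbol in a segment contributes log(0.5)."""
    return (end - start) * math.log(0.5)


if __name__ == "__main__":
    print(marginal_log_prob("abc", toy_segment_log_prob, max_seg_len=3))
```

The recursion costs on the order of the sequence length times `max_seg_len` segment evaluations, which is why a maximum segment length is usually imposed.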

Environment setup

pip install -r requirements.txt

This code requires Python >= 3.6

Training

./RunScripts/run_mslm.sh

python -m mslm --input_file \
    --model_path \
    --mode train \
    --config_file \
    --dev_file \
    [--preexisting]
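When launching training from another Python program, the argument list can be assembled as in the sketch below. It uses only the flags shown in the command above. The helper name `build_train_command` and all file paths are hypothetical placeholders, and treating `--preexisting` as a valueless switch is an assumption based on how it appears in the command.

```python
import shlex


def build_train_command(input_file, model_path, config_file, dev_file,
                        preexisting=False):
    """Build the argument list for a training run of the mslm main script.

    Hypothetical helper for illustration; not part of the mslm package.
    """
    args = [
        "python", "-m", "mslm",
        "--input_file", input_file,
        "--model_path", model_path,
        "--mode", "train",
        "--config_file", config_file,
        "--dev_file", dev_file,
    ]
    if preexisting:
        # Assumed to be a switch that takes no value.
        args.append("--preexisting")
    return args


if __name__ == "__main__":
    # All paths below are placeholders, not files shipped with the repository.
    cmd = build_train_command(
        "path/to/train_file", "path/to/model", "path/to/config", "path/to/dev_file"
    )
    print(" ".join(shlex.quote(a) for a in cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.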

Evaluation

./RunScripts/eval_mslm.sh

Where is a text file containing all of the words from the training set
