Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

Overview

Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

This repository contains code and data for evaluating model performance in crosslinguistic low-resource settings, using morphological segmentation as the test case. For more information, we refer to the paper Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation, to appear in Transactions of the Association for Computational Linguistics.

Arxiv version here

@misc{liu2022datadriven,
      title={Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation}, 
      author={Zoey Liu and Emily Prud'hommeaux},
      year={2022},
      eprint={2201.01845},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Prerequisites

Install the following:

(1) Python 3

(2) Morfessor

(3) CRFsuite

(4) OpenNMT

Code

The code directory contains the code applied to conduct the experiments.

Collect initial data

Create a resource folder. This folder is supposed to hold the initial data for each language invited to participate in the experiments. The experiments were performed at different stages, therefore the initial data of different languages have different subdirectories within resource (please excuse this).

The data for three Mexican languages came from this paper.

(1) download the data from the public repository

(2) for each language, combine all the data from the training, development, and test set; this applies to both the *src files and the *tgt files.

(3) rename the combined data file as, e.g., Yorem Nokki: mayo_src, mayo_tgt, Nahuatl: nahuatl_src, nahuatl_tgt.

(4) put the data files within resource

The data for Persian came from here.

(1) download the data from the public repository

(2) combine the training, development, and test set to one data file

(3) rename the combined data file as persian

(4) put the single data file within resource

The data for German, Zulu and Indonesian came from this paper.

(1) download the data from the public repository

(2) put the downloaded supplement folder within resource

The data for English, Russian, Turkish and Finnish came from this repo.

(1) download the git repo

(2) put the downloaded NeuralMorphemeSegmentation folder within resource

Summary of (alternative) Language codes and data directories for running experiments

Yorem Nokki: mayo resources/

Nahuatl: nahuatl resources/

Wixarika: wixarika resources/

English: english/eng resources/NeuralMorphemeSegmentation/morphochal10data/

German: german/ger resources/supplement/seg/ger

Persian: persian resources/

Russian: russian/ru resources/NeuralMorphemeSegmentation/data/

Turkish: turkish/tur resources/NeuralMorphemeSegmentation/morphochal10data/

Finnish: finnish/fin resources/NeuralMorphemeSegmentation/morphochal10data/

Zulu: zulu/zul resources/supplement/seg/zul

Indonesian: indonesian/ind resources/supplement/seg/ind

Basic running of the code

Create experiments folder and subfolders for each language; e.g., Zulu

mkdir experiments

mkdir zulu

Generate data (an example)

with replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r with --k 500

without replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r without --k 500

Training models: Morfessor

Train morfessor models

python3 code/morfessor/morfessor.py --input experiments/zulu/500/with/ --lang zul

python3 code/morfessor/morfessor.py --input experiments/zulu/500/without/ --lang zul

Generate evaluation scrips for morfessor model results

python3 code/morf_shell.py --input experiments/zulu/500/ --lang zul

Evaluate morfessor model results

bash zulu_500_morf_eval.sh

Training models: CRF

Generate CRF shell script

e.g., generating 3-CRF shell script

python3 code/crf_order.py --input experiments/zulu/500/ --lang zul --r with --order 3

Training models: Seq2seq

Generate configuration .yaml files

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r with

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r without

Generate pbs file (containing also the code to train Seq2seq model)

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r with

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r without

Gather training results for a given language

Again take Zulu as an example. Make sure that given a data set size (e.g, 500) and a sampling method (e.g., with replacement), there are three subfolders in the folder experiments/zulu/500/with:

(1) morfessor for all *eval* files from Morfessor;

(2) higher_orders for all *eval* files from k-CRF;

(3) seq2seq for all *eval* files from Seq2seq

Then run:

python3 code/gather.py --input experiments/zulu/ --lang zul --short zulu.txt --full zulu_full.txt --long zulu_details.txt

Testing

Testing the best CRF

e.g., 4-CRFs trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_crf.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --order 4 --r with --k 50

Testing the best Seq2seq

e.g., trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_seq2seq.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --r with --k 50

Do the same for every language

Generating alternative splits

Gather features of data sets, as well as generate heuristic/adversarial data splits

python3 code/heuristics.py --input experiments/zulu/ --lang zul --output yayyy/ --split A --generate

Gather features of new unseen test sets

python3 code/new_test_heuristics.py --input experiments/zulu/ --output yayyy/ --lang zul

Yayyy: Full Results

Get them here

Running analyses and making plots

See code/plot.R for analysis and making fun plots

Owner
Zoey Liu
language, computation, music, food
Zoey Liu
DrNAS: Dirichlet Neural Architecture Search

This paper proposes a novel differentiable architecture search method by formulating it into a distribution learning problem. We treat the continuously relaxed architecture mixing weight as random va

Xiangning Chen 37 Jan 03, 2023
PyExplainer: A Local Rule-Based Model-Agnostic Technique (Explainable AI)

PyExplainer PyExplainer is a local rule-based model-agnostic technique for generating explanations (i.e., why a commit is predicted as defective) of J

AI Wizards for Software Management (AWSM) Research Group 14 Nov 13, 2022
A toolkit for making real world machine learning and data analysis applications in C++

dlib C++ library Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real worl

Davis E. King 11.6k Jan 01, 2023
[ICCV 2021] Official Pytorch implementation for Discriminative Region-based Multi-Label Zero-Shot Learning SOTA results on NUS-WIDE and OpenImages

Discriminative Region-based Multi-Label Zero-Shot Learning (ICCV 2021) [arXiv][Project page coming soon] Sanath Narayan*, Akshita Gupta*, Salman Kh

Akshita Gupta 54 Nov 21, 2022
Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection

CP-Cluster Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection, Instance Segme

Yichun Shen 41 Dec 08, 2022
Code for Domain Adaptive Video Segmentation via Temporal Consistency Regularization in ICCV 2021

Domain Adaptive Video Segmentation via Temporal Consistency Regularization Updates 08/2021: check out our domain adaptation for sematic segmentation p

36 Dec 12, 2022
Experiments for distributed optimization algorithms

Network-Distributed Algorithm Experiments -- This repository contains a set of optimization algorithms and objective functions, and all code needed to

Boyue Li 40 Dec 04, 2022
OpenVisionAPI server

🚀 Quick start An instance of ova-server is free and publicly available here: https://api.openvisionapi.com Checkout ova-client for a quick demo. Inst

Open Vision API 93 Nov 24, 2022
Repository for tackling Kaggle Ultrasound Nerve Segmentation challenge using Torchnet.

Ultrasound Nerve Segmentation Challenge using Torchnet This repository acts as a starting point for someone who wants to start with the kaggle ultraso

Qure.ai 46 Jul 18, 2022
Repository of the paper Compressing Sensor Data for Remote Assistance of Autonomous Vehicles using Deep Generative Models at ML4AD @ NeurIPS 2021.

Compressing Sensor Data for Remote Assistance of Autonomous Vehicles using Deep Generative Models Code and supplementary materials Repository of the p

Daniel Bogdoll 4 Jul 13, 2022
Python script that takes an Impulse response .wav and a input .wav to demonstrate audio convolution.

convolver Python script that takes an Impulse response .wav and a input .wav to demonstrate audio convolution. Created by Sean Higley

Sean Higley 1 Feb 23, 2022
You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks.

AllSet This is the repo for our paper: You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks. We prepared all codes and a subse

Jianhao 51 Dec 24, 2022
Bridging Vision and Language Model

BriVL BriVL (Bridging Vision and Language Model) 是首个中文通用图文多模态大规模预训练模型。BriVL模型在图文检索任务上有着优异的效果,超过了同期其他常见的多模态预训练模型(例如UNITER、CLIP)。 BriVL论文:WenLan: Bridgi

235 Dec 27, 2022
A PyTorch Implementation of Single Shot Scale-invariant Face Detector.

S³FD: Single Shot Scale-invariant Face Detector A PyTorch Implementation of Single Shot Scale-invariant Face Detector. Eval python wider_eval_pytorch.

carwin 235 Jan 07, 2023
The official PyTorch code for 'DER: Dynamically Expandable Representation for Class Incremental Learning' accepted by CVPR2021

DER.ClassIL.Pytorch This repo is the official implementation of DER: Dynamically Expandable Representation for Class Incremental Learning (CVPR 2021)

rhyssiyan 108 Jan 01, 2023
Open source annotation tool for machine learning practitioners.

doccano doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequ

7.1k Jan 01, 2023
Density-aware Single Image De-raining using a Multi-stream Dense Network (CVPR 2018)

DID-MDN Density-aware Single Image De-raining using a Multi-stream Dense Network He Zhang, Vishal M. Patel [Paper Link] (CVPR'18) We present a novel d

He Zhang 224 Dec 12, 2022
Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

This repository is the official PyTorch implementation of Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

hippopmonkey 4 Dec 11, 2022
DLL: Direct Lidar Localization

DLL: Direct Lidar Localization Summary This package presents DLL, a direct map-based localization technique using 3D LIDAR for its application to aeri

Service Robotics Lab 127 Dec 16, 2022
PyTorch implementation of Federated Learning with Non-IID Data, and federated learning algorithms, including FedAvg, FedProx.

Federated Learning with Non-IID Data This is an implementation of the following paper: Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, Vik

Youngjoon Lee 48 Dec 29, 2022