A sentence aligner for comparable corpora

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical machine translation relies on parallel corpora (e.g., Europarl) for training translation models. However, these corpora are limited in size and take time to create. Yalign is designed to automate this process by finding sentences in comparable corpora that are close translation matches. This opens up avenues for harvesting parallel corpora from sources such as translated documents and the web.
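To make the idea of "close translation matches" concrete, here is a toy sketch in Python. It is not Yalign's algorithm and does not use Yalign's API: the function names, the tiny bilingual lexicon, and the 0.5 threshold are all assumptions made for illustration. It scores each candidate sentence pair by lexicon overlap and keeps the best match above the threshold.

```python
def overlap_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens whose lexicon translation occurs in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok) in tgt)
    return hits / len(src_tokens)


def extract_pairs(src_sentences, tgt_sentences, lexicon, threshold=0.5):
    """Return (source, target, score) for each source sentence whose best
    target match scores at least `threshold` (a hypothetical cutoff)."""
    pairs = []
    for src in src_sentences:
        src_tokens = src.lower().split()
        best_score, best_tgt = 0.0, None
        for tgt in tgt_sentences:
            score = overlap_score(src_tokens, tgt.lower().split(), lexicon)
            if score > best_score:
                best_score, best_tgt = score, tgt
        # Sentences with no sufficiently close match are simply dropped.
        if best_tgt is not None and best_score >= threshold:
            pairs.append((src, best_tgt, best_score))
    return pairs


# Hypothetical toy data: a three-word English-to-Spanish lexicon.
LEXICON = {"the": "el", "cat": "gato", "sleeps": "duerme"}
ENGLISH = ["The cat sleeps", "Unrelated sentence here"]
SPANISH = ["El perro come", "El gato duerme"]

if __name__ == "__main__":
    for src, tgt, score in extract_pairs(ENGLISH, SPANISH, LEXICON):
        print(src, "<->", tgt, round(score, 2))
```

A real aligner needs a far better similarity measure than word overlap, but the overall shape is the same: score candidate pairs, then keep only the confident ones.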

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz

Now we can use the yalign-align script with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
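If you want to drive the same command from a script, a minimal Python sketch could look like the following. It only wraps the yalign-align invocation shown above with the standard library; the helper names build_align_command and run_align are hypothetical, and the sketch assumes yalign-align is on your PATH and that the model was unpacked into a directory named en-es. It makes no assumption about the format of the script's output and returns it as raw text.

```python
import subprocess


def build_align_command(model_dir, source_url, target_url):
    """Build the argument list for the yalign-align script (hypothetical helper)."""
    return ["yalign-align", model_dir, source_url, target_url]


def run_align(model_dir, source_url, target_url):
    """Run yalign-align and return whatever it prints to standard output.

    Raises subprocess.CalledProcessError if the script exits with an error,
    and FileNotFoundError if yalign-align is not installed or not on PATH.
    """
    cmd = build_align_command(model_dir, source_url, target_url)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling run_align("en-es", "http://en.wikipedia.org/wiki/Antiparticle", "http://es.wikipedia.org/wiki/Antipart%C3%ADcula") would then run the same alignment as the command above. Passing the arguments as a list avoids shell quoting problems with the percent-encoded URL.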

Yalign is not limited to any one language pair. By creating your own models, you can align any two languages. For more details on using Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany

