MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking.

Last update: Oct 19, 2022

Overview

MILES

Multilingual Lexical Simplifier
Explore the docs »

Read LSBert Paper · Report Bug · Request Feature

About The Project

MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking. MILES currently supports 22 languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Indonesian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Ukrainian.

As a result of not using any language-specific resources (WordNets, POS taggers, parallel corpora, etc.), MILES does not always offer synonymous substitutions for complex words. Although almost always simpler than the original, selected substitutions may alter the meaning of the text. Please keep this in mind, and feel free to download and tailor MILES to a language of your choosing!

Prerequisites

FastText Embeddings

It is recommended that fastText embeddings are downloaded for your target language/s. These will be used by MILES to make notably more accurate simplifications. To install fastText embeddings for MILES, download the .vec embeddings for you target language here. Once done, place the .vec file in simplifier/embeddings/ before running the key vector generation script with the ISO 639-1 code for the selected language:

python simplifier/embeddings/gen_keyed_vectors.py <ISO 639-1 code>

Usage

Flask App

MILES simplifications can be done using either a simple Flask app provided or the command line. To start using the Flask app, run app.py with ISO 639-1 language code:

python app.py -l <ISO 639-1 code>

Once running, open 127.0.0.1 in your browser and start simplifying!

Command Line

If you would prefer to use the command line, there are a couple of options available:

Simplifying sentences:

python simplify.py -t <sentence> -l <ISO 639-1 code>

Simplifying text files:

python simplify.py -f <text_file> -l <ISO 639-1 code>

Note: If no language code is provided, text will be simplified assuming it's English. The default language can be changed in simplifier/config.py.

Framework

Roadmap

See the open issues for a list of proposed features (and known issues).

Contact

If you have any questions or concerns, message me on LinkedIn or email me at [email protected].

MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking.

Related tags

Overview

MILES

About The Project

Prerequisites

FastText Embeddings

Usage

Flask App

Command Line

Framework

Roadmap

Contact

Owner

Kane

Crie tokens de autenticação íntegros e seguros com UToken.

Simple and efficient RevNet-Library with DeepSpeed support

This repository describes our reproducible framework for assessing self-supervised representation learning from speech

[NeurIPS 2021] Code for Learning Signal-Agnostic Manifolds of Neural Fields

GVT is a generic translation tool for parts of text on the PC screen with Text to Speak functionality.

Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

CPC-big and k-means clustering for zero-resource speech processing

Shared code for training sentence embeddings with Flax / JAX

OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

This is a general repo that helps you develop fast/effective NLP classifiers using Huggingface

Code for Text Prior Guided Scene Text Image Super-Resolution

Unsupervised text tokenizer for Neural Network-based text generation.

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

Weird Sort-and-Compress Thing

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL 2021.

Words_And_Phrases - Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours

Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

PG-19 Language Modelling Benchmark

A Plover python dictionary allowing for consistent symbol input with specification of attachment and capitalisation in one stroke.

jiant is an NLP toolkit