AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

Overview

LITMUS Predictor

LITMUS Predictor provides support for simulating performance in ~100 languages given training observations of the desired task-model. Each training observation specifies the finetuning-datasize + test-performance in different languages.

Further, the tool provides support for constructing a data-collection strategy to maximize performance in desired targets subject to different constraints.

Installation

pip install -U pip
pip install -r requirements.txt

Usage

litmus/litmus_mixing.py contains the implementation of the LITMUS Predictor which can be trained on observations of different task-model trainings.

usage: LITMUS Tool [-h] [--scores_file SCORES_FILE]
                   [--train_format {json,csv}] [--save_state SAVE_STATE]
                   [--load_state LOAD_STATE]
                   [--precomputed_features PRECOMPUTED_FEATURES]
                   [--pivot_features {none,all,data_only}] [--use_all_langs]
                   [--common_scaling] [--training_algorithm {xgboost,mlp}]
                   [--error_method {LOO,LOTO,split,kfold,manual_split}]
                   [--data_sizes DATA_SIZES] [--mode MODE [MODE ...]]
                   [--output_dir OUTPUT_DIR]
                   [--heatmap_targets HEATMAP_TARGETS]
                   [--suggestions_budget SUGGESTIONS_BUDGET]
                   [--suggestions_langbudget SUGGESTIONS_LANGBUDGET]
                   [--suggestions_targets SUGGESTIONS_TARGETS]
                   [--suggestions_weights SUGGESTIONS_WEIGHTS]
                   [--suggestions_pivots SUGGESTIONS_PIVOTS]
                   [--suggestions_augmentable SUGGESTIONS_AUGMENTABLE]
                   [--suggestions_grid {exponential,linear}]
                   [--suggestions_objective {avg,min}]
                   [--suggestions_minperf SUGGESTIONS_MINPERF]
                   [--suggestions_minlangperf SUGGESTIONS_MINLANGPERF]
                   [--suggestions_verbose]
                   {mbert,xlmr}

positional arguments:
  {mbert,xlmr}          name of model to use

optional arguments:
  -h, --help            show this help message and exit
  --scores_file SCORES_FILE
                        path of json file containing scores to train on
  --train_format {json,csv}
                        Format of the training data
  --save_state SAVE_STATE
                        Save state of training of model to pickle file
  --load_state LOAD_STATE
                        Load trained model from pickle file
  --precomputed_features PRECOMPUTED_FEATURES
                        Path to precomputed-features file.
  --pivot_features {none,all,data_only}
                        What features based on pivot langs to use
  --use_all_langs       Add features based on all langs the tool supports
                        (Needed for transfer)
  --common_scaling      Common min max scaling params that are pvt
                        dependent(data size, type overlap, distance)
  --training_algorithm {xgboost,mlp}
                        which regressor to use
  --error_method {LOO,LOTO,split,kfold,manual_split}

  --data_sizes DATA_SIZES
                        Pivot data-size configs (semi-colon separated configs,
                        each config itself being comma-separated key-value
                        pairs)

  --mode MODE [MODE ...]
                        Output modes (comma-separated). Choose from following:
                        {heatmap, suggestions}.
  --output_dir OUTPUT_DIR
                        Overrride output directory
  --heatmap_targets HEATMAP_TARGETS
                        Targets for heatmap. Overrides suggestions_targets
                        (which is used by deafult)

  --suggestions_budget SUGGESTIONS_BUDGET
                        Budget for finding suggestions of which languages to
                        add data for (0 to disable)
  --suggestions_langbudget SUGGESTIONS_LANGBUDGET
                        Language-specific budget for finding suggestions
                        (overrrides suggestions_budget for these langs, comma-
                        separated list of key:value pairs)
  --suggestions_targets SUGGESTIONS_TARGETS
                        Targets being considered (comma-separated)
  --suggestions_weights SUGGESTIONS_WEIGHTS
                        Target weights for avg perf objective (comma-separated
                        list of key:value pairs, default wt=1)
  --suggestions_pivots SUGGESTIONS_PIVOTS
                        Index of desired row in data_sizes
  --suggestions_augmentable SUGGESTIONS_AUGMENTABLE
                        Set of augmentable languages (comma-separated)
  --suggestions_grid {exponential,linear}
                        Search space grid to use for suggestions
  --suggestions_objective {avg,min}
                        Objective function to be used for finding suggestions
  --suggestions_minperf SUGGESTIONS_MINPERF
                        Minimum acceptable average performance across tgts
  --suggestions_minlangperf SUGGESTIONS_MINLANGPERF
                        Minimum acceptable performance for given tgts (comma-
                        separated list of key:value pairs)
  --suggestions_verbose
                        Verbose logging of search

Examples

From shell

python3 litmus_mixing.py xlmr --scores_file training_observations.json --common_scaling --error_method split --mode heatmap --data_sizes "en:1000,hi:1000;en:1000,ar:1000" --use_all_langs --heatmap_targets en,fr,de,hi,ar,ru

From external scripts

from litmus import litmus_mixing

data_file = "" # Location of train data file
args = litmus_mixing.parse_args([
    "xlmr", data_file,
    "--common_scaling",
    "--error_method", "kfold",
    "--training_algorithm", "xgboost"
])
res = litmus_mixing.litmus_main(args)

WebApp

frontend/ contains the code for hosting the tool as a webapp using Azure Functions. frontend/WebUx implements the client-side as a static website which interacts with a Azure Functions backend which internally runs the litmus/litmus_mixing.py script.

Instructions to self-host

  1. Create an Azure Functions resource on Azure.
  2. Install Azure CLI and Functions Core Tools
  3. cd into the frontend/ directory and deploy to azure functions using func azure functionapp publish .

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural languag

Benjamin Heinzerling 1.1k Jan 03, 2023
PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

760 Jan 03, 2023
A paper list for aspect based sentiment analysis.

Aspect-Based-Sentiment-Analysis A paper list for aspect based sentiment analysis. Survey [IEEE-TAC-20]: Issues and Challenges of Aspect-based Sentimen

jiangqn 419 Dec 20, 2022
Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter-NLP-Analysis Business Problem I got last @turk_politika 3000 tweets with

Çağrı Karadeniz 7 Mar 12, 2022
Pipelines de datos, 2021.

Este repo ilustra un proceso sencillo de automatización de transformación y modelado de datos, a través de un pipeline utilizando Luigi. Stack princip

Rodolfo Ferro 8 May 19, 2022
내부 작업용 django + vue(vuetify) boilerplate. 짠 하면 돌아감.

Pocket Galaxy 아주 간단한 개인용, 혹은 내부용 툴을 만들어야하는데 이왕이면 웹이 편하죠? 그럴때를 위해 만들어둔 django와 vue(vuetify)로 이뤄진 boilerplate 입니다. 각 폴더에 있는 설명서대로 실행을 시키면 일단 당장 뭔가가 돌아갑니

Jamie J. Seol 16 Dec 03, 2021
A Python 3.6+ package to run .many files, where many programs written in many languages may exist in one file.

RunMany Intro | Installation | VSCode Extension | Usage | Syntax | Settings | About A tool to run many programs written in many languages from one fil

6 May 22, 2022
🤖 Basic Financial Chatbot with handoff ability built with Rasa

Financial Services Example Bot This is an example chatbot demonstrating how to build AI assistants for financial services and banking with Rasa. It in

Mohammad Javad Hossieni 4 Aug 10, 2022
Beyond Paragraphs: NLP for Long Sequences

Beyond Paragraphs: NLP for Long Sequences

AI2 338 Dec 02, 2022
Python module (C extension and plain python) implementing Aho-Corasick algorithm

pyahocorasick pyahocorasick is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find mult

Wojciech Muła 763 Dec 27, 2022
Python-zhuyin - An open source Python library that provides a unified interface for converting between Chinese pinyin and Zhuyin (bopomofo)

Python-zhuyin - An open source Python library that provides a unified interface for converting between Chinese pinyin and Zhuyin (bopomofo)

2 Dec 29, 2022
DeepSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.

(简体中文|English) Quick Start | Documents | Models List PaddleSpeech is an open-source toolkit on PaddlePaddle platform for a variety of critical tasks i

5.6k Jan 03, 2023
Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

0 Feb 13, 2022
An easier way to build neural search on the cloud

An easier way to build neural search on the cloud Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g

Jina AI 17.1k Jan 09, 2023
official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Plugin 3 Jan 12, 2022
Production First and Production Ready End-to-End Keyword Spotting Toolkit

Production First and Production Ready End-to-End Keyword Spotting Toolkit

223 Jan 02, 2023
The code for two papers: Feedback Transformer and Expire-Span.

transformer-sequential This repo contains the code for two papers: Feedback Transformer Expire-Span The training code is structured for long sequentia

Meta Research 125 Dec 25, 2022
A raytrace framework using taichi language

ti-raytrace The code use Taichi programming language Current implement acceleration lvbh disney brdf How to run First config your anaconda workspace,

蕉太狼 73 Dec 11, 2022
MMDA - multimodal document analysis

MMDA - multimodal document analysis

AI2 75 Jan 04, 2023