Blazing fast language detection using fastText model

Last update: Dec 20, 2022

Overview

Luga

Blazing fast language detection using fastText's language models

Luga is the Swahili word for language. fastText provides blazing fast language detection, but downloading and loading its models is a bit awkward, and its API is not pleasant to use. This is why luga was born.

Installation

python -m pip install -U luga

Usage:

Note: First usage downloads the model for you. This is done only once.

from luga import language

print(language("the world has ended yesterday"))
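The note above says the model is downloaded on first use and only once. A minimal sketch of that "download once, then reuse" pattern follows. It is not luga's actual code: `ensure_model`, the cache directory, the file name, and the `fetch` callable are all hypothetical names chosen for illustration, and a fake fetcher stands in for the real download.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_model(cache_dir: Path, filename: str, fetch: Callable[[], bytes]) -> Path:
    """Return the cached file path, calling `fetch` only if the file is missing."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        # Write under a temporary name first so an interrupted download
        # never leaves a half-written file under the final name.
        partial = target.with_name(target.name + ".part")
        partial.write_bytes(fetch())
        partial.replace(target)
    return target


# Hypothetical fetcher that records how often it is called.
calls = []


def fake_fetch() -> bytes:
    calls.append(1)
    return b"model-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_model(Path(tmp), "model.bin", fake_fetch)
    second = ensure_model(Path(tmp), "model.bin", fake_fetch)
    fetch_count = len(calls)  # the second call finds the file and skips fetch
    same_path = first == second
```

Passing the fetcher in as a parameter keeps the caching logic testable without network access.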

Coming soon ...

TODO:

refactor artifacts.py
auto checkers with pre-commit | invoke
write more tests
write github actions
create a smart data checker (a fast List[str] check; decide what to do with non-string values)
make it faster with Cython
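One possible shape for the data checker in the list above is sketched here. It is an assumption, not the project's design: `clean_texts` and its `placeholder` parameter are hypothetical. It also flattens newlines, since fastText's `predict` rejects text that contains them.

```python
from typing import Iterable, List, Optional


def clean_texts(
    texts: Iterable[object], placeholder: Optional[str] = None
) -> List[Optional[str]]:
    """Normalise inputs to single-line strings; non-strings become `placeholder`."""
    cleaned: List[Optional[str]] = []
    for item in texts:
        if isinstance(item, str):
            # Collapse every whitespace run (including newlines) to one space.
            cleaned.append(" ".join(item.split()))
        else:
            # Keep the slot so the output stays aligned with the input.
            cleaned.append(placeholder)
    return cleaned


sample = ["hello\nworld", None, 42, "  hej  "]
result = clean_texts(sample)
```

Keeping one output slot per input means results can still be lined up with the original rows; dropping bad entries instead would be an equally valid choice.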


Comments

fix: Fix invalid pytest dependency version
poetry does not want to accept flake8 as a valid version. Fixes issue #13

fix: Fix invalid pytest dependency version

fix: Use fasttext-wheel instead of fasttext
opened by saevarb 1
Installation fails with recent poetry due to `fasttext` issues

Hey!

Trying to install fasttext with a recent poetry version fails, as explained in this issue: https://github.com/python-poetry/poetry/issues/6113

This is because fasttext does some really funky things and tries to run a global pip during install. As a result, building luga, or using any package that depends on it, doesn't work. :/

This means that columbus doesn't build either, since it depends on luga. However, as outlined in the issue, there is a solution: using fasttext-wheel.

I pulled down luga and columbus and updated luga to use fasttext-wheel instead, and managed to get it to install, which also allowed me to build a new version of columbus using the new luga build.

opened by saevarb 1

SSL WRONG_VERSION_NUMBER

Solution from httpx

import httpx
import ssl

ssl_context = httpx.create_ssl_context()
ssl_context.options ^= ssl.OP_NO_TLSv1  # Enable TLS 1.0 back
resp = httpx.get(..., verify=ssl_context)

opened by Proteusiq 0

Return array for compatibility with pandas

This fails because luga returns a list, and comparing a list to "da" gives a single False instead of the per-row boolean array pandas expects

texts.loc[languages(texts["texts"].to_list(), only_language=True) == "da"]

But this works

texts.loc[np.array(languages(texts["texts"].to_list(), only_language=True)) == "da"]

opened by nthomsencph 0
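The difference between the two snippets in this issue can be reproduced without pandas or luga. In the sketch below, `detected` is a made-up stand-in for the list that `languages(..., only_language=True)` is described as returning.

```python
# Hypothetical stand-in for the list of language codes returned by luga.
detected = ["da", "en", "da"]

# Comparing the whole list to a string yields one bool, not one per row.
whole_list = detected == "da"

# An explicit per-item comparison gives a mask with one entry per text.
mask = [code == "da" for code in detected]
```

A mask built this way has the same length as the input, which is what row selection needs; wrapping the list in `np.array` before comparing gives the same elementwise result.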

Releases (latest: v0.2.7)

v0.2.7 (Dec 18, 2022): luga-0.2.7-py3-none-any.whl (5.55 KB), luga-0.2.7.tar.gz (5.34 KB)
v0.2.6 (Sep 28, 2022): luga-0.2.6-py3-none-any.whl (5.51 KB), luga-0.2.6.tar.gz (5.32 KB)
v0.2.5 (Apr 19, 2022): luga-0.2.5-py3-none-any.whl (5.50 KB), luga-0.2.5.tar.gz (5.39 KB)
v0.2.4 (Dec 23, 2021): luga-0.2.4-py3-none-any.whl (4.60 KB), luga-0.2.4.tar.gz (4.52 KB)
v0.2.3 (Dec 22, 2021): luga-0.2.3-py3-none-any.whl (4.56 KB), luga-0.2.3.tar.gz (4.46 KB)
v0.2.2 (Dec 3, 2021): luga-0.2.2-py3-none-any.whl (4.42 KB), luga-0.2.2.tar.gz (4.28 KB)
v0.2.1 (Nov 26, 2021): luga-0.2.1-py3-none-any.whl (4.07 KB), luga-0.2.1.tar.gz (3.95 KB)
v0.2.0 (Nov 26, 2021): luga-0.2.0-py3-none-any.whl (4.07 KB), luga-0.2.0.tar.gz (3.95 KB)
v0.1.8 (Nov 20, 2021): luga-0.1.8-py3-none-any.whl (3.88 KB), luga-0.1.8.tar.gz (3.76 KB)
v0.1.7 (Nov 17, 2021): luga-0.1.7-py3-none-any.whl (3.81 KB), luga-0.1.7.tar.gz (3.66 KB)

Owner

Prayson Wilfred Daniel

🍺 Data Scientist | 🍺 Automating Data Mining & Analysis With Python

