Text language identification using Wikipedia data

The aim of this project is to provide high-quality language detection for all of the web's languages, using Wikipedia as a proxy for them. Currently, we support 156 languages that have a Wikipedia of their own.

Usage

The main function is text-langs, which returns two values:

an alist of language-probability pairs (languages are represented by their ISO 639-1 codes)
a vector of tokens, each with its inferred language

WILD> (text-langs "це тест")
((:UK . 0.5000003) (:RU . 0.4999998))
#(<це - UK:1.00> <тест - RU:1.00>)

Running as a service

Installation

Install SBCL
Get Quicklisp
Clone this project with Git
$ cd wiki-lang-detect; sbcl --load run.lisp

Running as a Docker container

docker build -t wiki-lang-detect:latest .
docker run -it -p 5000:5000 wiki-lang-detect:latest

curl -X POST -H "Content-Type: application/json" -d '{"text": "Несе Галя"}' http://localhost:5000/detect | jq '.'

Alternatively, you can use a prebuilt Docker image, which is maintained outside of this repository.

docker run -it -p 5000:5000 chaliy/wiki-lang-detect:latest
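
With either image running, the same request can be sent from a script instead of curl. The following is a minimal Python sketch, not part of this project: the function names build_detect_request and detect are hypothetical, and the only things taken from the example above are the /detect path, the POST method, the JSON body with a "text" key, and port 5000. The response is returned as parsed JSON without assuming anything about its fields.

```python
import json
import urllib.request

# Assumption: the service from the Docker example is listening on this address.
BASE_URL = "http://localhost:5000"


def build_detect_request(text, base_url=BASE_URL):
    """Build a POST request for /detect with a JSON body of the form {"text": ...}."""
    # json.dumps produces valid JSON (double-quoted keys and strings);
    # ensure_ascii=False keeps non-Latin text as-is, and the body is sent as UTF-8.
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        base_url + "/detect",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect(text, base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON response unchanged."""
    # The response schema is not described here, so no fields are assumed.
    request = build_detect_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling detect("Несе Галя") while the container is running returns whatever JSON the service sends back. Building the body with json.dumps avoids the shell-quoting issues that come with hand-written JSON.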

API

See the Swagger definition.

Owner

Vsevolod Dyomkin
