Simple embedding based text classifier inspired by fastText, implemented in tensorflow

Overview

FastText in Tensorflow

This project is based on the ideas in Facebook's FastText but implemented in Tensorflow. However, it is not an exact replica of fastText.

Classification is done by embedding each word, taking the mean embedding over the full text and classifying that using a linear classifier. The embedding is trained with the classifier. You can also specify to use 2+ character ngrams. These ngrams get hashed then embedded in a similar manner to the orginal words. Note, ngrams make training much slower but only make marginal improvements in performance, at least in English.

I may implement skipgram and cbow training later. Or preloading embedding tables.

<< Still WIP >>

You can use Horovod to distribute training across multiple GPUs, on one or multiple servers. See usage section below.

FastText Language Identification

I have added utilities to train a classifier to detect languages, as described in Fast and Accurate Language Identification using FastText

See usage below. It basically works in the same way as default usage.

Implemented:

  • classification of text using word embeddings
  • char ngrams, hashed to n bins
  • training and prediction program
  • serve models on tensorflow serving
  • preprocess facebook format, or text input into tensorflow records

Not Implemented:

  • separate word vector training (though can export embeddings)
  • heirarchical softmax.
  • quantize models (supported by tensorflow, but I haven't tried it yet)

Usage

The following are examples of how to use the applications. Get full help with --help option on any of the programs.

To transform input data into tensorflow Example format:

process_input.py --facebook_input=queries.txt --output_dir=. --ngrams=2,3,4

Or, using a text file with one example per line with an extra file for labels:

process_input.py --text_input=queries.txt --labels=labels.txt --output_dir=.

To train a text classifier:

classifier.py \
  --train_records=queries.tfrecords \
  --eval_records=queries.tfrecords \
  --label_file=labels.txt \
  --vocab_file=vocab.txt \
  --model_dir=model \
  --export_dir=model

To predict classifications for text, use a saved_model from classifier. classifier.py --export_dir stores a saved model in a numbered directory below export_dir. Pass this directory to the following to use that model for predictions:

predictor.py
  --saved_model=model/12345678
  --text="some text to classify"
  --signature_def=proba

To export the embedding layer you can export from predictor. Note, this will only be the text embedding, not the ngram embeddings.

predictor.py
  --saved_model=model/12345678
  --text="some text to classify"
  --signature_def=embedding

Use the provided script to train easily:

train_classifier.sh path-to-data-directory

Language Identification

To implement something similar to the method described in Fast and Accurate Language Identification using FastText you need to download the data:

lang_dataset.sh [datadir]

You can then process the training and validation data using process_input.py and classifier.py as described above.

There is a utility script to do this for you:

train_langdetect.sh datadir

It reaches about 96% accuracy using word embeddings and this increases to nearly 99% when adding --ngrams=2,3,4

Distributed Training

You can run training across multiple GPUs either on one or multiple servers. To do so you need to install MPI and Horovod then add the --horovod option. It runs very close to the GPU multiple in terms of performance. I.e. if you have 2 GPUs on your server, it should run close to 2x the speed.

NUM_GPUS=2
mpirun -np $NUM_GPUS python classifier.py \
  --horovod \
  --train_records=queries.tfrecords \
  --eval_records=queries.tfrecords \
  --label_file=labels.txt \
  --vocab_file=vocab.txt \
  --model_dir=model \
  --export_dir=model

The training script has this option added: train_classifier.sh.

Tensorflow Serving

As well as using predictor.py to run a saved model to provide predictions, it is easy to serve a saved model using Tensorflow Serving with a client server setup. There is a supplied simple rpc client (predictor_client.py) that provides predictions by using tensorflow server.

First make sure you install the tensorflow serving binaries. Instructions are here.

You then serve the latest saved model by supplying the base export directory where you exported saved models to. This directory will contain the numbered model directories:

tensorflow_model_server --port=9000 --model_base_path=model

Now you can make requests to the server using gRPC calls. An example simple client is provided in predictor_client.py:

predictor_client.py --text="Some text to classify"

Facebook Examples

<< NOT IMPLEMENTED YET >>

You can compare with Facebook's fastText by running similar examples to what's provided in their repository.

./classification_example.sh
./classification_results.sh
Owner
Alan Patterson
Alan Patterson
AQP is a modular pipeline built to enable the comparison and testing of different quality metric configurations.

Audio Quality Platform - AQP An Open Modular Python Platform for Objective Speech and Audio Quality Metrics AQP is a highly modular pipeline designed

Jack Geraghty 24 Oct 01, 2022
The official implementation of the IEEE S&P`22 paper "SoK: How Robust is Deep Neural Network Image Classification Watermarking".

Watermark-Robustness-Toolbox - Official PyTorch Implementation This repository contains the official PyTorch implementation of the following paper to

49 Dec 19, 2022
A tensorflow model that predicts if the image is of a cat or of a dog.

Quick intro Hello and thank you for your interest in my project! This is the backend part of a two-repo application. The other part can be found here

Tudor Matei 0 Mar 08, 2022
SparseML is a libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models

SparseML is a toolkit that includes APIs, CLIs, scripts and libraries that apply state-of-the-art sparsification algorithms such as pruning and quantization to any neural network. General, recipe-dri

Neural Magic 1.5k Dec 30, 2022
Inference code for "StylePeople: A Generative Model of Fullbody Human Avatars" paper. This code is for the part of the paper describing video-based avatars.

NeuralTextures This is repository with inference code for paper "StylePeople: A Generative Model of Fullbody Human Avatars" (CVPR21). This code is for

Visual Understanding Lab @ Samsung AI Center Moscow 18 Oct 06, 2022
A non-linear, non-parametric Machine Learning method capable of modeling complex datasets

Fast Symbolic Regression Symbolic Regression is a non-linear, non-parametric Machine Learning method capable of modeling complex data sets. fastsr aim

VAMSHI CHOWDARY 3 Jun 22, 2022
A simple implementation of Kalman filter in single object tracking

kalman-filter-in-single-object-tracking A simple implementation of Kalman filter in single object tracking https://www.bilibili.com/video/BV1Qf4y1J7D4

130 Dec 26, 2022
Controlling Hill Climb Racing with Hand Tacking

Controlling Hill Climb Racing with Hand Tacking Opened Palm for Gas Closed Palm for Brake

Rohit Ingole 3 Jan 18, 2022
Implementation of "A Deep Learning Loss Function based on Auditory Power Compression for Speech Enhancement" by pytorch

This repository is used to suspend the results of our paper "A Deep Learning Loss Function based on Auditory Power Compression for Speech Enhancement"

ScorpioMiku 19 Sep 30, 2022
atmaCup #11 の Public 4th / Pricvate 5th Solution のリポジトリです。

#11 atmaCup 2021-07-09 ~ 2020-07-21 に行われた #11 [初心者歓迎! / 画像編] atmaCup のリポジトリです。結果は Public 4th / Private 5th でした。 フレームワークは PyTorch で、実装は pytorch-image-m

Tawara 12 Apr 07, 2022
Project for music generation system based on object tracking and CGAN

Project for music generation system based on object tracking and CGAN The project was inspired by MIDINet: A Convolutional Generative Adversarial Netw

1 Nov 21, 2021
🍅🍅🍅YOLOv5-Lite: lighter, faster and easier to deploy. Evolved from yolov5 and the size of model is only 1.7M (int8) and 3.3M (fp16). It can reach 10+ FPS on the Raspberry Pi 4B when the input size is 320×320~

YOLOv5-Lite:lighter, faster and easier to deploy Perform a series of ablation experiments on yolov5 to make it lighter (smaller Flops, lower memory, a

pogg 1.5k Jan 05, 2023
Simple converter for deploying Stable-Baselines3 model to TFLite and/or Coral

Running SB3 developed agents on TFLite or Coral Introduction I've been using Stable-Baselines3 to train agents against some custom Gyms, some of which

Gary Briggs 16 Oct 11, 2022
A general-purpose programming language, focused on simplicity, safety and stability.

The Rivet programming language A general-purpose programming language, focused on simplicity, safety and stability. Rivet's goal is to be a very power

The Rivet programming language 17 Dec 29, 2022
Uni-Fold: Training your own deep protein-folding models

Uni-Fold: Training your own deep protein-folding models. This package provides an implementation of a trainable, Transformer-based deep protein foldin

DP Technology 187 Jan 04, 2023
The source code of "SIDE: Center-based Stereo 3D Detector with Structure-aware Instance Depth Estimation", accepted to WACV 2022.

SIDE: Center-based Stereo 3D Detector with Structure-aware Instance Depth Estimation The source code of our work "SIDE: Center-based Stereo 3D Detecto

10 Dec 18, 2022
Official Implementation of SWAGAN: A Style-based Wavelet-driven Generative Model

Official Implementation of SWAGAN: A Style-based Wavelet-driven Generative Model SWAGAN: A Style-based Wavelet-driven Generative Model Rinon Gal, Dana

55 Dec 06, 2022
Simple Python application to transform Serial data into OSC messages

SerialToOSC-Bridge Simple Python application to transform Serial data into OSC messages. The current purpose is to be a compatibility layer between ha

Division of Applied Acoustics at Chalmers University of Technology 3 Jun 03, 2021
Using deep actor-critic model to learn best strategies in pair trading

Deep-Reinforcement-Learning-in-Stock-Trading Using deep actor-critic model to learn best strategies in pair trading Abstract Partially observed Markov

281 Dec 09, 2022
This is code of book "Learn Deep Learning with PyTorch"

深度学习入门之PyTorch Learn Deep Learning with PyTorch 非常感谢您能够购买此书,这个github repository包含有深度学习入门之PyTorch的实例代码。由于本人水平有限,在写此书的时候参考了一些网上的资料,在这里对他们表示敬意。由于深度学习的技术在

Xingyu Liao 2.5k Jan 04, 2023