Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Overview

Build Status

Yase

Yet Another Sequence Encoder - encode sequences to vector of vectors in python !

Why Yase ?

Yase enable you to encode any sequence which can be represented by string to be encoded into a list of word-vector representation.

When searching over a tool to encode a sentence as a list of word-vector, it was clear that there was no simple tool to use. And so, i decided to create Yase.

Note : If you only want to get the word-vector of a word, or average of word-vector in sentence, you should probably better check Spacy.

Requirements

Yase requirements are :

Mapping file

The mapping should be a columnar file like :

<token> <vector value>
token1 0.1 0.6 -1.2
token2 0.6 -2.3 3.4

All data should be separated by space, thus no space is allowed in token. You should be able to directly use Facebook Fast Text pretrained word vector as mapping.

Input file

Input file should be a list of text, with one sample per line.

hello world
Yase is awesome !

The default separator is a space " " but any regular expression can be provided.

Note that Yase is case insensitive

How to use

yase is command line tool. You can install by with :pip install git+https://github.com/PPACI/yase.git

>> yase
usage: yase [-h] --input input.txt [--input-encoding UTF8] --output
               output.txt --mapping mapping.vec [--mapping-encoding UTF8]
               [--separator \ |\.|\,] [--no-replace]
               [--cleaning-json cleaning.json]

Yet Another Sequence Encoder

optional arguments:
  -h, --help            show this help message and exit
  --input input.txt     Path to file to encode
  --input-encoding UTF8
                        encoding of input file. UTF8 by default
  --output output.txt   Path to output file
  --mapping mapping.vec
                        Path to mapping file
  --mapping-encoding UTF8
                        encoding of mapping file. UTF8 by default
  --separator \ |\.|\,  regular expression used to split the input sequence
  --no-replace          don't clean input data
  --cleaning-json cleaning.json
                        Path to your own json replacement file for cleaning.
                        Will use the included replacement file otherwise.

If you wanted to use the english word vector for an input file like previously described :

yase --input "input.txt" --output "output.csv" --mapping "wiki.en.vec" 

Output format

The idea behind yase is to be as easy as possible to integrate it in all data science processing.

Yase output it's your data as CSV.

The only problem with CSV is that it's difficult to integrate multi-dimensional array. So we had to find a compromise..

Yase encode the vector columns in JSON format, which is easily readable and is very similar to python array representation.

The output file will be similar to :

inputs vectors
hello world [[1,1,1],[2,2,2]]
yase is awesome ! [[3,3,3],[4,4,4]]

Cleaning

Yase will automatically try to clean your input file by applying regex in the right order.

For example : Hello I'm yase.Nice to meet you will magically become Hello I m yase . Nice to meet you.

Remember that yase is case insensitive. So yase will understand as hello i m yase . nice to meet you.

Lastly, if your mapping doesn't include a mapping for ".", you will obtain vectors for hello i m yase nice to meet you

Of course, you can disable this behaviour by providing --no-replace argument.

Providing your own replacement file

You can do this by providing a path to your file with --cleaning-json.

The replacement file is a json like :

{
  "\"": "",
  "'": "",
  ",": " , ",
  "\\.": " . ",
  "  ": " "
}

Input are regex, so remind to escape . or *.

Note that replacement are made in the same order as in the json. So here, the first replacement will be to remove "

How to load a yase output ?

As said previously, the choice made with Yase make it possible to use it as simply as :

import pandas, json

csv = pandas.read_csv("output.csv")
csv.vectors = csv.vectors.apply(json.loads)

csv.head()

Note that Pandas is not mandatory but very recommended for data science.

TODO

  • Optimize Mapping loading time
  • Optional argument to output fixed size vectors for all input sequences
  • Surely lot of thing !

Can i contribute ?

Off course ! If you want to improve Yase, your idea / pull requests / issues are welcomed !

Owner
Pierre PACI
Cloud Engineer @FundsDLT Luxembourg
Pierre PACI
Python package for Turkish Language.

PyTurkce Python package for Turkish Language. Documentation: https://pyturkce.readthedocs.io. Installation pip install pyturkce Usage from pyturkce im

Mert Cobanov 14 Oct 09, 2022
Client library to download and publish models and other files on the huggingface.co hub

huggingface_hub Client library to download and publish models and other files on the huggingface.co hub Do you have an open source ML library? We're l

Hugging Face 644 Jan 01, 2023
Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles

Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles (TASLP 2022)

Zhuosheng Zhang 3 Apr 14, 2022
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
PyTorch Implementation of "Non-Autoregressive Neural Machine Translation"

Non-Autoregressive Transformer Code release for Non-Autoregressive Neural Machine Translation by Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K.

Salesforce 261 Nov 12, 2022
PRAnCER is a web platform that enables the rapid annotation of medical terms within clinical notes.

PRAnCER (Platform enabling Rapid Annotation for Clinical Entity Recognition) is a web platform that enables the rapid annotation of medical terms within clinical notes. A user can highlight spans of

Sontag Lab 39 Nov 14, 2022
Tool to check whether a GCP bucket is public or not.

Tool to check publicly accessible GCP bucket. Blog https://justm0rph3u5.medium.com/gcp-inspector-auditing-publicly-exposed-gcp-bucket-ac6cad55618c Wha

DIVYANSHU SHUKLA 7 Nov 24, 2022
Predict the spans of toxic posts that were responsible for the toxic label of the posts

toxic-spans-detection An attempt at the SemEval 2021 Task 5: Toxic Spans Detection. The Toxic Spans Detection task of SemEval2021 required participant

Ilias Antonopoulos 3 Jul 24, 2022
基于Transformer的单模型、多尺度的VAE模型

UniVAE 基于Transformer的单模型、多尺度的VAE模型 介绍 https://kexue.fm/archives/8475 依赖 需要大于0.10.6版本的bert4keras(当前还没有推到pypi上,可以直接从GitHub上clone最新版)。 引用 @misc{univae,

苏剑林(Jianlin Su) 49 Aug 24, 2022
Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

Status: Archive (code is provided as-is, no updates expected) Update August 2020: For an example repository that achieves state-of-the-art modeling pe

OpenAI 1.3k Dec 28, 2022
ZUNIT - Toward Zero-Shot Unsupervised Image-to-Image Translation

ZUNIT Dependencies you can install all the dependencies by pip install -r requirements.txt Datasets Download CUB dataset. Unzip the birds.zip at ./da

Chen Yuanqi 9 Jun 24, 2022
An IVR Chatbot which can exponentially reduce the burden of companies as well as can improve the consumer/end user experience.

IVR-Chatbot Achievements 🏆 Team Uhtred won the Maverick 2.0 Bot-a-thon 2021 organized by AbInbev India. ❓ Problem Statement As we all know that, lot

ARYAMAAN PANDEY 9 Dec 08, 2022
Blender addon - Scrub timeline from viewport with a shortcut

Viewport scrub timeline Move in the timeline directly in viewport and snap to nearest keyframe Note : This standalone feature will be added in the nat

Samuel Bernou 40 Nov 07, 2022
This is a general repo that helps you develop fast/effective NLP classifiers using Huggingface

NLP Classifier Introduction This project trains a bert model on any NLP classifcation model. And uses the model in make predictions on new data using

Abdullah Tarek 3 Mar 11, 2022
A Chinese to English Neural Model Translation Project

ZH-EN NMT Chinese to English Neural Machine Translation This project is inspired by Stanford's CS224N NMT Project Dataset used in this project: News C

Zhenbang Feng 29 Nov 26, 2022
Text editor on python tkinter to convert english text to other languages with the help of ployglot.

Transliterator Text Editor This is a simple transliteration program which is used to convert english word to phonetically matching word in another lan

Merin Rose Tom 1 Jan 16, 2022
Main repository for the chatbot Bobotinho.

Bobotinho Bot Main repository for the chatbot Bobotinho. ℹ️ Introduction Twitch chatbot with entertainment commands. ‎ 💻 Technologies Concurrent code

Bobotinho 14 Nov 29, 2022
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI 🍣 Online live demos: http://tworld.io/s

Sergio Burdisso 285 Jan 02, 2023
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
Mesh TensorFlow: Model Parallelism Made Easier

Mesh TensorFlow - Model Parallelism Made Easier Introduction Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying

1.3k Dec 26, 2022