Textlesslib - Library for Textless Spoken Language Processing

Overview

Textless NLP is an active area of research that aims to extend NLP techniques to work directly on spoken language. By using discrete speech representations learned in a self-supervised fashion, the area promises to unlock interesting NLP applications on languages without a written form, or on facets of spoken language, such as prosody, that are inaccessible to text-based approaches. To learn more, please check some of the papers.

textlesslib is a library aimed at facilitating research in Textless NLP. The goal of the library is to speed up the research cycle and lower the learning curve for those who want to start. We provide highly configurable, off-the-shelf tools for encoding speech as sequences of discrete values, and tools for decoding such streams back into the audio domain.

Table of Contents

  • Installation
  • Usage examples
  • Encoding speech
  • Dataset helpers
  • Data preprocessing
  • Provided models
  • Testing
  • Licence

Installation

git clone [email protected]:facebookresearch/textlesslib.git
cd textlesslib
pip install -e .
pip install git+git://github.com/pytorch/[email protected]
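
As a quick sanity check after installation, the following minimal snippet only verifies that the package and its fairseq dependency import cleanly:

import textless
import fairseq

print("textlesslib and fairseq imported OK")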

Usage examples

We include a set of examples in the examples folder.

There is also a Jupyter notebook and a Google Colab notebook that combine the discrete resynthesis and speech continuation examples in a step-by-step mini-tutorial.

We believe these examples can serve both as illustrations of the provided components and as a starting point for tinkering in interesting directions.

Encoding speech

Below is an example of loading an audio file and encoding it as a sequence of HuBERT-based discrete tokens (a.k.a. pseudo-units). The required checkpoints are downloaded by textlesslib itself (by default, they are stored in ~/.textless):

import torchaudio
from textless.data.speech_encoder import SpeechEncoder

dense_model_name = "hubert-base-ls960"
quantizer_name, vocab_size = "kmeans", 100
input_file = "input.wav"

# now let's load an audio example
waveform, sample_rate = torchaudio.load(input_file)

# We can build a speech encoder module using names of pre-trained
# dense and quantizer models.  The call below will download
# appropriate checkpoints as needed behind the scenes. We can
# also construct an encoder by directly passing model instances
encoder = SpeechEncoder.by_name(
    dense_model_name=dense_model_name,
    quantizer_model_name=quantizer_name,
    vocab_size=vocab_size,
    deduplicate=True,
).cuda()


# now convert the audio into a stream of deduplicated units (as in GSLM)
encoded = encoder(waveform.cuda())
# encoded is a dict with keys ('dense', 'units', 'durations').
# It can also contain 'f0' if SpeechEncoder was initialized
# with need_f0=True flag.
units = encoded["units"]  # tensor([71, 12, 57, ...], ...)
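
Since the encoder was created with deduplicate=True, runs of identical consecutive units are collapsed (as in GSLM) and their lengths are reported under 'durations'. Below is a minimal sketch, assuming the encoded dict from above, of expanding the stream back to one unit per frame:

import torch

units, durations = encoded["units"], encoded["durations"]

# repeat each deduplicated unit by the number of frames it originally spanned
frame_level_units = torch.repeat_interleave(units, durations)
assert frame_level_units.numel() == durations.sum().item()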

Now the units can be cast back into the audio domain:

# import path assumed here; TacotronVocoder ships with textlesslib
from textless.vocoders.tacotron2.vocoder import TacotronVocoder

# as with the encoder, we can set up the vocoder by passing checkpoints
# directly or by specifying the expected format via the names of the
# dense and quantizer models (these models themselves won't be loaded)
vocoder = TacotronVocoder.by_name(
    dense_model_name,
    quantizer_name,
    vocab_size,
).cuda()

# now we turn those units back into audio
audio = vocoder(units)

# save the resulting audio
output_file = "resynthesized.wav"
torchaudio.save(output_file, audio.cpu().float().unsqueeze(0), vocoder.output_sample_rate)

Dataset helpers

Below is an example of using a textless view of the LibriSpeech dataset:

encoder = SpeechEncoder.by_name(
    dense_model_name=dense_model_name,
    quantizer_model_name=quantizer_name,
    vocab_size=vocab_size,
    deduplicate=True,
).cuda()

quantized_dataset = QuantizedLibriSpeech(
    root=existing_root, speech_encoder=encoder, url=url)

datum = quantized_dataset[0]
sample_rate, utterance, speaker_id, chapter_id, utterance_id = datum['rest']
# datum['units'] = tensor([71, 12, 63, ...])

In the probing example, we illustrate how such a dataset can be used with a standard PyTorch DataLoader in a scalable manner.
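
As a minimal sketch of that pattern (the collate function and PAD_ID below are hypothetical helpers, not part of textlesslib): unit sequences have variable length, so batching requires padding:

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_ID = vocab_size  # hypothetical choice: an id just outside the unit vocabulary

def collate_units(batch):
    # pad variable-length unit sequences to the longest one in the batch
    return pad_sequence([item["units"] for item in batch],
                        batch_first=True, padding_value=PAD_ID)

# the encoder runs on GPU inside the dataset, so we keep loading in the
# main process here; the probing example shows a more scalable setup
loader = DataLoader(quantized_dataset, batch_size=8,
                    collate_fn=collate_units, num_workers=0)
batch = next(iter(loader))  # LongTensor of shape (8, max_len)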

Data preprocessing

We also provide a multi-GPU/multi-node preprocessing tool for cases where on-the-fly processing of audio should be avoided.

Provided models

We provide implementations and pre-trained checkpoints for the following models:

  • Dense representations: HuBERT-base (trained on LibriSpeech 960h) and CPC (trained on a 6K-hour subset of LibriLight);
  • Quantizers: k-means quantizers with vocabulary sizes of 50, 100, and 200, for each of the dense models (trained on LibriSpeech 960h);
  • Decoders: Tacotron2 models for all (dense model x quantizer) combinations (trained on LJSpeech).

Finally, pitch extraction is done via YAAPT.
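
Pitch can be requested at encoding time via the need_f0 flag mentioned earlier. A minimal sketch, reusing the model names and waveform from the encoding example above:

from textless.data.speech_encoder import SpeechEncoder

# need_f0=True makes the encoder also run YAAPT pitch extraction
encoder_f0 = SpeechEncoder.by_name(
    dense_model_name="hubert-base-ls960",
    quantizer_model_name="kmeans",
    vocab_size=100,
    deduplicate=True,
    need_f0=True,
).cuda()

encoded = encoder_f0(waveform.cuda())
f0 = encoded["f0"]  # per-frame pitch track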

Testing

We use pytest (pip install pytest pytest-xdist). Our unit tests are located in the tests directory:

cd tests && pytest -n 8

Licence

textlesslib is licensed under MIT; the text of the license can be found here.
