This repo contains simple-to-use models for speaker diarization that are either pretrained or require no training.

Last update: Jan 20, 2022

PyDiar

Supported Models

Binary Key Speaker Modeling

Based on pyBK by Jose Patino, which implements the diarization system from "The EURECOM submission to the first DIHARD Challenge" by Jose Patino, Héctor Delgado, and Nicholas Evans.

If you have any other models you would like to see added, please open an issue.

Usage

This library seeks to provide a very basic interface. To use the Binary Key model on a file, do something like this:

import numpy as np
from pydiar.models import BinaryKeyDiarizationModel, Segment
from pydiar.util.misc import optimize_segments
from pydub import AudioSegment

INPUT_FILE = "test.wav"

# Load the audio, then resample to 32 kHz and downmix to mono
sample_rate = 32000
audio = AudioSegment.from_wav(INPUT_FILE)
audio = audio.set_frame_rate(sample_rate)
audio = audio.set_channels(1)

# Run diarization on the raw samples
diarization_model = BinaryKeyDiarizationModel()
segments = diarization_model.diarize(
    sample_rate, np.array(audio.get_array_of_samples())
)
# Post-process the raw segments with the library's helper
optimized_segments = optimize_segments(segments)

Now optimized_segments contains a list of segments, each with its start, length, and speaker ID.
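
As a minimal sketch of consuming such a list, the snippet below totals the speaking time per speaker. To stay runnable without pydiar installed, it uses a stand-in DemoSegment dataclass. The attribute names start, length, and speaker_id are assumptions taken from the description above and should be checked against pydiar's actual Segment class before use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DemoSegment:
    """Hypothetical stand-in for pydiar's Segment (field names are assumed)."""
    start: float      # segment start, in seconds
    length: float     # segment duration, in seconds
    speaker_id: int   # label assigned by the diarization model


def speaking_time(segments):
    """Return the total speaking time in seconds for each speaker id."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker_id] += seg.length
    return dict(totals)


# Made-up demo data, not real diarization output
demo = [
    DemoSegment(start=0.0, length=2.5, speaker_id=0),
    DemoSegment(start=2.5, length=1.0, speaker_id=1),
    DemoSegment(start=3.5, length=4.0, speaker_id=0),
]
for speaker, seconds in sorted(speaking_time(demo).items()):
    print(f"speaker {speaker}: {seconds:.1f} s")
```

If the real Segment objects expose the same three attributes, speaking_time should accept optimized_segments directly.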

Example

A simple script that reads an audio file, diarizes it, and transcribes it into the WebVTT format can be found in examples/generate_webvtt.py. To use it, download a vosk model from https://alphacephei.com/vosk/models and then run the script with:

poetry install
poetry run python -m examples.generate_webvtt -i PATH/TO/INPUT.wav -m PATH/TO/VOSK_MODEL
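
To illustrate what diarized output can look like in WebVTT, here is a rough sketch that renders segments with transcript text as cues. It is not the code from examples/generate_webvtt.py. The Cue container, the "Speaker N" naming, and the sample text are all hypothetical, and the sketch assumes the transcript text needs no escaping.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """Hypothetical container: one diarized segment plus its transcript."""
    start: float      # seconds
    length: float     # seconds
    speaker_id: int
    text: str


def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(cues):
    """Render cues as a WebVTT document, tagging each cue with its speaker."""
    lines = ["WEBVTT", ""]
    for cue in cues:
        start = vtt_timestamp(cue.start)
        end = vtt_timestamp(cue.start + cue.length)
        lines.append(f"{start} --> {end}")
        # WebVTT voice tag: <v Name>text
        lines.append(f"<v Speaker {cue.speaker_id}>{cue.text}")
        lines.append("")
    return "\n".join(lines)


print(to_webvtt([Cue(0.0, 2.5, 0, "Hello there."), Cue(2.5, 1.25, 1, "Hi!")]))
```

Text containing "&", "<", or "-->" would have to be escaped or filtered before being written into a cue.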
