Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

Last update: Dec 01, 2022

Related tags

Deep Learning CMST

Overview

Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages

Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

File organization

Preprocessing : contains all files used to preprocess the data (Python 3.6)
Data : contains data required to run this code
Statistics : contains all files that contains statistics of the dataset

Dataset

file name	discription
train/test/dev.csv	This is the dataset for code-mixed Speech Translation.
chopped_audios	This contains all the audios, transcription and translation.

Statistics of Corpora contained

Languages	#types	#tokens	Types per line	Tokens per line	Avg. token length
English[100%]	40,324	601889	10.58	11.27	4.92
French (France)	50510	645651	11.38	12.09	5.08
German[100%]	50748	584575	10.44	10.95	5.57
Gujarati[100%]	41959	584989	10.37	10.95	4.46
Hindi[100%]	29744	716800	12.36	13.42	3.74
Hungarian[100%]	84872	506608	9.13	9.49	5.89
Indonesian[100%]	39365	653374	11.54	12.23	6.14
Italian[100%]	52372	512061	9.23	9.59	5.37
Latvian[100%]	70040	477106	8.69	8.93	5.72
Lithuanian[100%]	75222	491558	8.92	9.2	6.04
Nepali[100%]	52630	570268	10.03	10.68	4.88
Persian (Farsi)[100%]	51722	598096	10.61	11.2	4.1
Polish[100%]	71662	494263	8.99	9.25	5.86
Portuguese (Brazil)[100%]	50087	608432	10.8	11.39	5.12
Russian[100%]	72162	490908	8.96	9.19	5.79
Slovak[100%]	73789	520465	9.39	9.75	5.37
Slovenian[100%]	68619	516649	9.35	9.67	5.3
Spanish[100%]	49806	608868	10.75	11.4	5.07
Swedish[100%]	48233	581751	10.31	10.89	5
Tamil[100%]	84183	460678	8.37	8.63	7.65
Telugu[100%]	72006	464665	8.34	8.7	6.56
Turkish[100%]	78957	453521	8.27	8.49	6.35
Bulgarian[100%]	60712	564150	10.1	10.56	5.24
Croatian[100%]	73075	531326	9.58	9.95	5.28
Danish[100%]	50170	587253	10.4	11	4.98
Dutch[100%]	42716	595464	10.52	11.15	5.05

Code-mixing

All languages in Code-mixing

Language	Total Words	Unique Words	Percentage
English	500136	6312	83.6
Bengali	46933	3907	7.84
Sanskrit	51246	7202	8.56
Total	598315	17421	100

Types of Code-mixing

	English-Sanskrit	Sanskrit-English	English-Bengali	Bengali-English
Inter-Sentential	2356	2366	339	339
Intra-Sentential	2338	851	124	0

Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

Related tags

Overview

Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages

File organization

Dataset

Statistics of Corpora contained

Code-mixing

All languages in Code-mixing

Types of Code-mixing

Owner

Ayush Daksh

PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io

A general python framework for visual object tracking and video object segmentation, based on PyTorch

PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Fast and simple implementation of RL algorithms, designed to run fully on GPU.

Code for the USENIX 2017 paper: kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels

Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering

To provide 100 JAX exercises over different sections structured as a course or tutorials to teach and learn for beginners, intermediates as well as experts

Convex optimization for fun and profit.

Using contrastive learning and OpenAI's CLIP to find good embeddings for images with lossy transformations

QTool: A Low-bit Quantization Toolbox for Deep Neural Networks in Computer Vision

Fully convolutional networks for semantic segmentation

This repository is the offical Pytorch implementation of ContextPose: Context Modeling in 3D Human Pose Estimation: A Unified Perspective (CVPR 2021).

Code of the paper "Multi-Task Meta-Learning Modification with Stochastic Approximation".

A Python training and inference implementation of Yolov5 helmet detection in Jetson Xavier nx and Jetson nano

Semantic Segmentation of images using PixelLib with help of Pascalvoc dataset trained with Deeplabv3+ framework.

Array Camera Ptychography

PyTorch implementation of TSception V2 using DEAP dataset

PyTorch implementations of algorithms for density estimation

Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms

High performance distributed framework for training deep learning recommendation models based on PyTorch.