Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages

File organization

  • Preprocessing : contains all files used to preprocess the data (Python 3.6)
  • Data : contains data required to run this code
  • Statistics : contains all files that contain statistics of the dataset

Dataset

| File name | Description |
| --- | --- |
| train/test/dev.csv | The dataset for code-mixed speech translation. |
| chopped_audios | Contains all the audio files, transcriptions and translations. |
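
The split files listed above could be read along the following lines. This is a minimal sketch, not the repository's own loader: it assumes the splits are stored as `train.csv`, `dev.csv` and `test.csv` in one directory, and the helper name `load_splits` and the demo column names `audio` and `text` are hypothetical, since the actual column layout is not documented here.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def load_splits(data_dir: Path) -> Dict[str, List[dict]]:
    """Read train.csv, dev.csv and test.csv from data_dir, skipping missing files."""
    splits = {}
    for name in ("train", "dev", "test"):
        path = data_dir / f"{name}.csv"
        if not path.exists():
            continue
        # DictReader uses the header row as keys, so no column names are hard-coded.
        with path.open(newline="", encoding="utf-8") as handle:
            splits[name] = list(csv.DictReader(handle))
    return splits


# Demonstration on a throwaway file; "audio" and "text" are made-up column names.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "train.csv").write_text(
        "audio,text\nclip_001.wav,hello\n", encoding="utf-8"
    )
    demo = load_splits(demo_dir)
```

Pointing `load_splits` at the real data directory would return one list of row dictionaries per split that is present.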

Statistics of Corpora contained

| Language | #types | #tokens | Types per line | Tokens per line | Avg. token length |
| --- | --- | --- | --- | --- | --- |
| English[100%] | 40324 | 601889 | 10.58 | 11.27 | 4.92 |
| French (France) | 50510 | 645651 | 11.38 | 12.09 | 5.08 |
| German[100%] | 50748 | 584575 | 10.44 | 10.95 | 5.57 |
| Gujarati[100%] | 41959 | 584989 | 10.37 | 10.95 | 4.46 |
| Hindi[100%] | 29744 | 716800 | 12.36 | 13.42 | 3.74 |
| Hungarian[100%] | 84872 | 506608 | 9.13 | 9.49 | 5.89 |
| Indonesian[100%] | 39365 | 653374 | 11.54 | 12.23 | 6.14 |
| Italian[100%] | 52372 | 512061 | 9.23 | 9.59 | 5.37 |
| Latvian[100%] | 70040 | 477106 | 8.69 | 8.93 | 5.72 |
| Lithuanian[100%] | 75222 | 491558 | 8.92 | 9.2 | 6.04 |
| Nepali[100%] | 52630 | 570268 | 10.03 | 10.68 | 4.88 |
| Persian (Farsi)[100%] | 51722 | 598096 | 10.61 | 11.2 | 4.1 |
| Polish[100%] | 71662 | 494263 | 8.99 | 9.25 | 5.86 |
| Portuguese (Brazil)[100%] | 50087 | 608432 | 10.8 | 11.39 | 5.12 |
| Russian[100%] | 72162 | 490908 | 8.96 | 9.19 | 5.79 |
| Slovak[100%] | 73789 | 520465 | 9.39 | 9.75 | 5.37 |
| Slovenian[100%] | 68619 | 516649 | 9.35 | 9.67 | 5.3 |
| Spanish[100%] | 49806 | 608868 | 10.75 | 11.4 | 5.07 |
| Swedish[100%] | 48233 | 581751 | 10.31 | 10.89 | 5 |
| Tamil[100%] | 84183 | 460678 | 8.37 | 8.63 | 7.65 |
| Telugu[100%] | 72006 | 464665 | 8.34 | 8.7 | 6.56 |
| Turkish[100%] | 78957 | 453521 | 8.27 | 8.49 | 6.35 |
| Bulgarian[100%] | 60712 | 564150 | 10.1 | 10.56 | 5.24 |
| Croatian[100%] | 73075 | 531326 | 9.58 | 9.95 | 5.28 |
| Danish[100%] | 50170 | 587253 | 10.4 | 11 | 4.98 |
| Dutch[100%] | 42716 | 595464 | 10.52 | 11.15 | 5.05 |
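
To make the column definitions concrete, here is a rough sketch of how such statistics could be computed for one language. It is not the script from the Statistics folder. It assumes whitespace tokenization, counts types per line as the number of distinct tokens in each line, and averages token length over all tokens; the function name `corpus_statistics` and the sample sentences are illustrative only.

```python
from typing import Dict, Iterable


def corpus_statistics(lines: Iterable[str]) -> Dict[str, float]:
    """Compute type/token statistics over the lines of one language."""
    vocabulary = set()       # distinct tokens across the whole corpus
    n_lines = 0
    n_tokens = 0
    n_types_in_lines = 0     # sum of per-line distinct token counts
    n_chars = 0              # total characters over all tokens
    for line in lines:
        tokens = line.split()  # assumption: whitespace tokenization
        n_lines += 1
        n_tokens += len(tokens)
        n_types_in_lines += len(set(tokens))
        n_chars += sum(len(token) for token in tokens)
        vocabulary.update(tokens)
    return {
        "types": len(vocabulary),
        "tokens": n_tokens,
        "types_per_line": n_types_in_lines / n_lines if n_lines else 0.0,
        "tokens_per_line": n_tokens / n_lines if n_lines else 0.0,
        "avg_token_length": n_chars / n_tokens if n_tokens else 0.0,
    }


# Toy input, not taken from the dataset.
sample = ["the soul is eternal", "the body is temporary"]
stats = corpus_statistics(sample)
```

A different tokenizer or a different averaging choice would change the numbers, so results from this sketch need not match the table exactly.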

Code-mixing

All languages in Code-mixing

| Language | Total Words | Unique Words | Percentage |
| --- | --- | --- | --- |
| English | 500136 | 6312 | 83.6 |
| Bengali | 46933 | 3907 | 7.84 |
| Sanskrit | 51246 | 7202 | 8.56 |
| Total | 598315 | 17421 | 100 |
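
The Percentage column appears to be each language's share of the total word count. The short check below recomputes those shares from the Total Words column; the variable names are illustrative, and the recomputed values agree with the table only up to rounding in the second decimal place.

```python
# Total word counts per language, copied from the table above.
word_counts = {"English": 500136, "Bengali": 46933, "Sanskrit": 51246}

total_words = sum(word_counts.values())

# Share of each language as a percentage of all words.
shares = {
    language: 100 * count / total_words
    for language, count in word_counts.items()
}

for language, share in shares.items():
    print(f"{language}: {share:.2f}%")
```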

Types of Code-mixing

| Type | English-Sanskrit | Sanskrit-English | English-Bengali | Bengali-English |
| --- | --- | --- | --- | --- |
| Inter-Sentential | 2356 | 2366 | 339 | 339 |
| Intra-Sentential | 2338 | 851 | 124 | 0 |

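
As an illustration of the two categories, the sketch below counts language switches in a sequence of sentences. It is a hypothetical example, not the method used to produce the table: it assumes every token already carries a language tag, treats a switch between the last token of one sentence and the first token of the next as inter-sentential, and treats a switch between adjacent tokens inside a sentence as intra-sentential. The tags `en`, `sa` and `bn` and the function name `count_switches` are made up for this example.

```python
from collections import Counter
from typing import List, Tuple

Sentence = List[str]  # one language tag per token, e.g. ["en", "en", "sa"]


def count_switches(sentences: List[Sentence]) -> Tuple[Counter, Counter]:
    """Return (inter_sentential, intra_sentential) counts keyed by (from, to)."""
    inter = Counter()
    intra = Counter()
    previous_last = None  # language of the last token of the previous sentence
    for tags in sentences:
        if not tags:
            continue
        # Switch across a sentence boundary.
        if previous_last is not None and previous_last != tags[0]:
            inter[(previous_last, tags[0])] += 1
        # Switches between adjacent tokens inside the sentence.
        for current, following in zip(tags, tags[1:]):
            if current != following:
                intra[(current, following)] += 1
        previous_last = tags[-1]
    return inter, intra


# Toy tagged input, not taken from the dataset.
tagged = [["en", "en", "sa", "en"], ["sa", "sa"], ["en", "bn"]]
inter_counts, intra_counts = count_switches(tagged)
```

Each key is a direction such as `("en", "sa")`, which loosely corresponds to a column like English-Sanskrit under the assumptions stated above.
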
Owner
Ayush Daksh
IIT Kharagpur | Mathematics & Computing | 3rd Year | NLP | UG Researcher