Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

MLP Singer

Audio samples are available on our demo page.

Abstract

Recent developments in deep learning have significantly improved the quality of synthesized singing voice audio. However, prominent neural singing voice synthesis systems suffer from slow inference speed due to their autoregressive design. Inspired by MLP-Mixer, a novel architecture introduced in the vision literature for attention-free image classification, we propose MLP Singer, a parallel Korean singing voice synthesis system. To the best of our knowledge, this is the first work that uses an entirely MLP-based architecture for voice synthesis. Listening tests demonstrate that MLP Singer outperforms a larger autoregressive GAN-based system, both in terms of audio quality and synthesis speed. In particular, MLP Singer achieves real-time factors of up to 200 on CPUs and 3400 on GPUs, enabling order-of-magnitude faster generation in both environments.

Citation

Please cite this work as follows.

@misc{tae2021mlp,
      title={MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis}, 
      author={Jaesung Tae and Hyeongju Kim and Younggun Lee},
      year={2021},
}

Quickstart

  1. Clone the repository including the git submodule.

    git clone --recurse-submodules https://github.com/neosapience/mlp-singer.git

  2. Install package requirements.

    cd mlp-singer
    pip install -r requirements.txt

  3. To generate audio files with the trained model checkpoint, download the HiFi-GAN checkpoint along with its configuration file and place them in hifi-gan.

  4. Run inference using the following command. Generated audio samples are saved in the samples directory by default.

    python inference.py --checkpoint_path checkpoints/default/model.pt

Dataset

We used the Children's Song Dataset, an open-source singing voice dataset comprising 100 annotated Korean and English children's songs sung by a single professional singer. We used only the Korean subset of the dataset to train the model.

You can train the model on a custom dataset of your choice, as long as it consists of triplets of lyrics text, MIDI transcription, and monophonic a cappella audio files. The three files of each triplet should share the same file name (apart from the extension) and should be placed in the directory locations shown below.

├── data
│   └── raw
│       ├── mid
│       ├── txt
│       └── wav

The directory names correspond to file extensions. We have included a sample as reference.
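
Before preprocessing a custom dataset, it can help to check that every song has all three files. The following is a minimal sketch of such a check, not part of the repository: the helper name find_triplets is hypothetical, and it assumes that each subdirectory holds files whose extension matches the directory name (for example data/raw/wav/little_star.wav). It compares file names only and does not inspect file contents.

```python
from pathlib import Path

# Subdirectories of data/raw; per the layout above, each name doubles as the file extension.
EXTENSIONS = ("mid", "txt", "wav")


def find_triplets(raw_dir):
    """Return (complete, incomplete) song titles found under raw_dir.

    complete: sorted titles that have a .mid, a .txt, and a .wav file.
    incomplete: dict mapping every other title to the extensions it lacks.
    """
    raw_dir = Path(raw_dir)
    titles = {}
    for ext in EXTENSIONS:
        sub_dir = raw_dir / ext
        # A missing subdirectory is treated as containing no files.
        titles[ext] = {p.stem for p in sub_dir.glob(f"*.{ext}")} if sub_dir.is_dir() else set()

    all_titles = set().union(*titles.values())
    complete = sorted(t for t in all_titles if all(t in titles[e] for e in EXTENSIONS))
    incomplete = {
        t: [e for e in EXTENSIONS if t not in titles[e]]
        for t in sorted(all_titles)
        if t not in complete
    }
    return complete, incomplete


if __name__ == "__main__":
    complete, incomplete = find_triplets("data/raw")
    print("complete triplets:", complete)
    for title, missing in incomplete.items():
        print(f"{title}: missing {', '.join(missing)}")
```

Run from the repository root, the script lists the titles that are ready for preprocessing and names the missing file types for any incomplete song.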

Preprocessing

Once you have prepared the dataset, run

python -m data.serialize

from the root directory. This creates the data/bin directory, which contains the binary files used for training. This repository already contains example binary files created from the sample in data/raw.

Training

To train the model, run

python train.py

This will read the default configuration file located in configs/model.json to initialize the model. Alternatively, you can create a new configuration file and train the model via

python train.py --config_path PATH/TO/CONFIG.json

Running this command will create a folder under the checkpoints directory according to the name field specified in the configuration file.
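
One way to create a new configuration is to copy the default file and change its name field, so that the new run gets its own checkpoint folder. The sketch below is illustrative only: the helper derive_config and the file name configs/my_experiment.json are hypothetical, and it assumes that name is a top-level key of the JSON file. All other fields are copied unchanged.

```python
import json
from pathlib import Path


def derive_config(base_path, new_path, name):
    """Copy a JSON configuration file and set its name field.

    Assumes "name" is a top-level key; every other field is kept as-is.
    Returns the new configuration as a dict.
    """
    config = json.loads(Path(base_path).read_text(encoding="utf-8"))
    config["name"] = name

    new_path = Path(new_path)
    new_path.parent.mkdir(parents=True, exist_ok=True)
    new_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    base = Path("configs/model.json")
    if base.is_file():
        derive_config(base, "configs/my_experiment.json", name="my_experiment")
```

The resulting file can then be passed to train.py with --config_path configs/my_experiment.json.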

You can also continue training from a checkpoint. For example, to resume training from the provided pretrained model checkpoint, run

python train.py --checkpoint_path checkpoints/default/model.pt

Unless a --config_path flag is explicitly provided, the script will read config.json in the checkpoint directory. In both cases, model checkpoints will be saved regularly according to the interval defined in the configuration file.
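
The precedence described in this section can be summarized in a few lines. This is a sketch of the documented behavior, not the code used in train.py; the function resolve_config_path is hypothetical and only mirrors the three cases mentioned above.

```python
from pathlib import Path

# Default configuration used when neither flag is given.
DEFAULT_CONFIG = Path("configs/model.json")


def resolve_config_path(config_path=None, checkpoint_path=None):
    """Pick the configuration file following the precedence described above."""
    if config_path is not None:
        # An explicit --config_path always wins.
        return Path(config_path)
    if checkpoint_path is not None:
        # Otherwise use config.json stored next to the checkpoint.
        return Path(checkpoint_path).parent / "config.json"
    # Fresh training run without any flags.
    return DEFAULT_CONFIG


if __name__ == "__main__":
    print(resolve_config_path(checkpoint_path="checkpoints/default/model.pt"))
```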

Inference

MLP Singer produces mel-spectrograms, which are then fed into a neural vocoder to generate raw waveforms. This repository uses HiFi-GAN as the vocoder backend, but you can also plug in other vocoders such as WaveGlow. To generate samples, run

python inference.py --checkpoint_path PATH/TO/CHECKPOINT.pt --song little_star

This will create .wav samples in the samples directory, and save mel-spectrogram files as .npy files in hifi-gan/test_mel_dirs.

You can also specify any song you want to perform inference on, as long as the song is present in data/raw. The argument to the --song flag should match the title of the song as it is saved in data/raw.
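
To synthesize several songs in one go, the command above can be generated once per title. The following sketch is not part of the repository: the helper build_inference_commands is hypothetical, and it assumes that song titles can be read from the lyrics file names in data/raw/txt. It uses only the --checkpoint_path and --song flags described above and builds the commands without running them.

```python
import sys
from pathlib import Path


def build_inference_commands(checkpoint_path, raw_dir="data/raw"):
    """Return one inference.py command (as an argument list) per song title.

    Titles are assumed to be the stems of the .txt files in raw_dir/txt.
    """
    txt_dir = Path(raw_dir) / "txt"
    titles = sorted(p.stem for p in txt_dir.glob("*.txt")) if txt_dir.is_dir() else []
    return [
        [sys.executable, "inference.py",
         "--checkpoint_path", str(checkpoint_path),
         "--song", title]
        for title in titles
    ]


if __name__ == "__main__":
    for command in build_inference_commands("checkpoints/default/model.pt"):
        print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) from the repository root to run the corresponding inference.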

Note

For demo and internal experiments, we used a variant of HiFi-GAN that used different mel-spectrogram configurations. As such, the provided checkpoint for MLP Singer is different from the one referred to in the paper. Moreover, the vocoder used in the demo was further fine-tuned on the Children's Song Dataset.

Acknowledgements

This implementation was inspired by the following repositories.

License

Released under the MIT License.
