A PyTorch Implementation of the paper - Choi, Woosung, et al. "Investigating u-nets with various intermediate blocks for spectrogram-based singing voice separation." 21th International Society for Music Information Retrieval Conference, ISMIR. 2020.

Overview

Investigating U-NETS With Various Intermediate Blocks For Spectrogram-based Singing Voice Separation

A Pytorch Implementation of the paper "Investigating U-NETS With Various Intermediate Blocks For Spectrogram-based Singing Voice Separation (ISMIR 2020)"

Installation

conda install pytorch=1.6 cudatoolkit=10.2 -c pytorch
conda install -c conda-forge ffmpeg librosa
conda install -c anaconda jupyter
pip install musdb museval pytorch_lightning effortless_config wandb pydub nltk spacy 

Dataset

  1. Download Musdb18
  2. Unzip files
  3. We recommend you to use the wav file mode for the fast data preparation.
    musdbconvert path/to/musdb-stems-root path/to/new/musdb-wav-root

Demonstration: A Pretrained Model (TFC_TDF_Net (large))

Colab Link

Tutorial

1. activate your conda

conda activate yourcondaname

2. Training a default UNet with TFC_TDFs

python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --target_name vocals --mode train --gpus 4 --distributed_backend ddp --sync_batchnorm True --pin_memory True --num_workers 32 --precision 16 --run_id debug --optimizer adam --lr 0.001 --save_top_k 3 --patience 100 --min_epochs 1000 --max_epochs 2000 --n_fft 2048 --hop_length 1024 --num_frame 128  --train_loss spec_mse --val_loss raw_l1 --model tfc_tdf_net  --spec_est_mode mapping --spec_type complex --n_blocks 7 --internal_channels 24  --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --min_bn_units 16 --tfc_tdf_activation relu  --first_conv_activation relu --last_activation identity --seed 2020

3. Evaluation

After training is done, checkpoints are saved in the following directory.

etc/modelname/run_id/*.ckpt

For evaluation,

python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --target_name vocals --mode eval --gpus 1 --pin_memory True --num_workers 64 --precision 32 --run_id debug --batch_size 4 --n_fft 2048 --hop_length 1024 --num_frame 128 --train_loss spec_mse --val_loss raw_l1 --model tfc_tdf_net --spec_est_mode mapping --spec_type complex --n_blocks 7 --internal_channels 24 --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --min_bn_units 16 --tfc_tdf_activation relu --first_conv_activation relu --last_activation identity --log wandb --ckpt vocals_epoch=891.ckpt

Below is the result.

wandb:          test_result/agg/vocals_SDR 6.954695
wandb:   test_result/agg/accompaniment_SAR 14.3738075
wandb:          test_result/agg/vocals_SIR 15.5527
wandb:   test_result/agg/accompaniment_SDR 13.561705
wandb:   test_result/agg/accompaniment_ISR 22.69328
wandb:   test_result/agg/accompaniment_SIR 18.68421
wandb:          test_result/agg/vocals_SAR 6.77698
wandb:          test_result/agg/vocals_ISR 12.45371

4. Interactive Report (wandb)

wandb report

Indermediate Blocks

Please see this document.

How to use

1. Training

1.1. Intermediate Block independent Parameters

1.1.A. General Parameters
  • --musdb_root musdb path
  • --musdb_is_wav whether the path contains wav files or not
  • --filed_mode whether you want to use filed mode or not. recommend to use it for the fast data preparation.
  • --target_name one of vocals, drum, bass, other
1.1.B. Training Environment
  • --mode train or eval
  • --gpus number of gpus
    • (WARN) gpus > 1 might be problematic when evaluating models.
  • distributed_backend use this option only when you are using multi-gpus. distributed backend, one of ddp, dp, ... we recommend you to use ddp.
  • --sync_batchnorm True only when you are using ddp
  • --pin_memory
  • --num_workers
  • --precision 16 or 32
  • --dev_mode whether you want a developement mode or not. dev mode is much faster because it uses only a small subset of the dataset.
  • --run_id (optional) directory path where you want to store logs and etc. if none then the timestamp.
  • --log True for default pytorch lightning log. wandb is also available.
  • --seed random seed for a deterministic result.
1.1.C. Training hyperparmeters
  • --batch_size trivial :)
  • --optimizer adam, rmsprop, etc
  • --lr learning rate
  • --save_top_k how many top-k epochs you want to save the training state (criterion: validation loss)
  • --patience early stop control parameter. see pytorch lightning docs.
  • --min_epochs trivial :)
  • --max_epochs trivial :)
  • --model
    • tfc_tdf_net
    • tfc_net
    • tdc_net
1.1.D. Fourier parameters
  • --n_fft
  • --hop_length
  • num_frame number of frames (time slices)
1.1.F. criterion
  • --train_loss: spec_mse, raw_l1, etc...
  • --val_loss: spec_mse, raw_l1, etc...

1.2. U-net Parameters

  • --n_blocks: number of intermediate blocks. must be an odd integer. (default=7)
  • --input_channels:
    • if you use two-channeled complex-valued spectrogram, then 4
    • if you use two-channeled manginutde spectrogram, then 2
  • --internal_channels: number of internal chennels (default=24)
  • --first_conv_activation: (default='relu')
  • --last_activation: (default='sigmoid')
  • --t_down_layers: list of layer where you want to doubles/halves the time resolution. if None, ds/us applied to every single layer. (default=None)
  • --f_down_layers: list of layer where you want to doubles/halves the frequency resolution. if None, ds/us applied to every single layer. (default=None)

1.3. SVS Framework

  • --spec_type: type of a spectrogram. ['complex', 'magnitude']

  • --spec_est_mode: spectrogram estimation method. ['mapping', 'masking']

  • CaC Framework

    • you can use cac framework [1] by setting
      • --spec_type complex --spec_est_mode mapping --last_activation identity
  • Mag-only Framework

    • if you want to use the traditional magnitude-only estimation with sigmoid, then try
      • --spec_type magnitude --spec_est_mode masking --last_activation sigmoid
    • you can also change the last activation as follows
      • --spec_type magnitude --spec_est_mode masking --last_activation relu
  • Alternatives

    • you can build an svs framework with any combination of these parameters
    • e.g. --spec_type complex --spec_est_mode masking --last_activation tanh

1.4. Block-dependent Parameters

1.4.A. TDF Net
  • --bn_factor: bottleneck factor $bn$ (default=16)
  • --min_bn_units: when target frequency domain size is too small, we just use this value instead of $\frac{f}{bn}$. (default=16)
  • --bias: (default=False)
  • --tdf_activation: activation function of each block (default=relu)

1.4.B. TDC Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tdc_activation: activation function of each block (default=relu)

1.4.C. TFC Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_t: size of kernel of time-dimension (default=3)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tfc_activation: activation function of each block (default=relu)

1.4.D. TFC_TDF Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_t: size of kernel of time-dimension (default=3)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tfc_tdf_activation: activation function of each block (default=relu)
  • --bn_factor: bottleneck factor $bn$ (default=16)
  • --min_bn_units: when target frequency domain size is too small, we just use this value instead of $\frac{f}{bn}$. (default=16)
  • --tfc_tdf_bias: (default=False)

1.4.E. TDC_RNN Net
  • '--n_internal_layers' : number of 1-d CNNs in a block (default=5)

  • '--kernel_size_f' : size of kernel of frequency-dimension (default=3)

  • '--bn_factor_rnn' : (default=16)

  • '--num_layers_rnn' : (default=1)

  • '--bias_rnn' : bool, (default=False)

  • '--min_bn_units_rnn' : (default=16)

  • '--bn_factor_tdf' : (default=16)

  • '--bias_tdf' : bool, (default=False)

  • '--tdc_rnn_activation' : (default='relu')

current bug - cuda error occurs when tdc_rnn net with precision 16

Reproducible Experimental Results

  • TFC_TDF_large
    • parameters
    --musdb_root ../repos/musdb18_wav
    --musdb_is_wav True
    --filed_mode True
    
    --gpus 4
    --distributed_backend ddp
    --sync_batchnorm True
    
    --num_workers 72
    --train_loss spec_mse
    --val_loss raw_l1
    --batch_size 12
    --precision 16
    --pin_memory True
    --num_worker 72         
    --save_top_k 3
    --patience 200
    --run_id debug_large
    --log wandb
    --min_epochs 2000
    --max_epochs 3000
    
    --optimizer adam
    --lr 0.001
    
    --model tfc_tdf_net
    --n_fft 4096
    --hop_length 1024
    --num_frame 128
    --spec_type complex
    --spec_est_mode mapping
    --last_activation identity
    --n_blocks 9
    --internal_channels 24
    --n_internal_layers 5
    --kernel_size_t 3 
    --kernel_size_f 3 
    --tfc_tdf_bias True
    --seed 2020
    
    
    • training
    python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --gpus 4 --distributed_backend ddp --sync_batchnorm True --num_workers 72 --train_loss spec_mse --val_loss raw_l1 --batch_size 24 --precision 16 --pin_memory True --num_worker 72 --save_top_k 3 --patience 200 --run_id debug_large --log wandb --min_epochs 2000 --max_epochs 3000 --optimizer adam --lr 0.001 --model tfc_tdf_net --n_fft 4096 --hop_length 1024 --num_frame 128 --spec_type complex --spec_est_mode mapping --last_activation identity --n_blocks 9 --internal_channels 24 --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --tfc_tdf_bias True --seed 2020
    • evaluation result (epoch 2007)
      • SDR 8.029
      • ISR 13.708
      • SIR 16.409
      • SAR 7.533

Interactive Report (wandb)

wandb report

You can cite this paper as follows:

@inproceedings{choi_2020, Author = {Choi, Woosung and Kim, Minseok and Chung, Jaehwa and Lee, Daewon and Jung, Soonyoung}, Booktitle = {21th International Society for Music Information Retrieval Conference}, Editor = {ISMIR}, Month = {OCTOBER}, Title = {Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation.}, Year = {2020}}

Reference

[1] Woosung Choi, Minseok Kim, Jaehwa Chung, DaewonLee, and Soonyoung Jung, “Investigating u-nets with various intermediate blocks for spectrogram-based singingvoice separation.,” in 21th International Society for Music Information Retrieval Conference, ISMIR, Ed., OCTOBER 2020.

Owner
Woosung Choi
WooSung Choi Ph.d candidate @IELab-AT-KOREA-UNIV Seoul, Korea
Woosung Choi
Source code for CAST - Crisis Domain Adaptation Using Sequence-to-sequence Transformers (Accepted to ISCRAM 2021, CorePaper).

Source code for CAST: Crisis Domain Adaptation UsingSequence-to-sequenceTransformers (Paper, BibTeX, Accepted to ISCRAM 2021, CorePaper) Quick start D

Congcong Wang 0 Jul 14, 2021
Boundary-aware Transformers for Skin Lesion Segmentation

Boundary-aware Transformers for Skin Lesion Segmentation Introduction This is an official release of the paper Boundary-aware Transformers for Skin Le

Jiacheng Wang 79 Dec 16, 2022
Generate image analogies using neural matching and blending

neural image analogies This is basically an implementation of this "Image Analogies" paper, In our case, we use feature maps from VGG16. The patch mat

Adam Wentz 3.5k Jan 08, 2023
Semantic segmentation models, datasets and losses implemented in PyTorch.

Semantic Segmentation in PyTorch Semantic Segmentation in PyTorch Requirements Main Features Models Datasets Losses Learning rate schedulers Data augm

Yassine 1.3k Jan 07, 2023
Official repository for "Deep Recurrent Neural Network with Multi-scale Bi-directional Propagation for Video Deblurring".

RNN-MBP Deep Recurrent Neural Network with Multi-scale Bi-directional Propagation for Video Deblurring (AAAI-2022) by Chao Zhu, Hang Dong, Jinshan Pan

SIV-LAB 22 Aug 31, 2022
Bridging Vision and Language Model

BriVL BriVL (Bridging Vision and Language Model) 是首个中文通用图文多模态大规模预训练模型。BriVL模型在图文检索任务上有着优异的效果,超过了同期其他常见的多模态预训练模型(例如UNITER、CLIP)。 BriVL论文:WenLan: Bridgi

235 Dec 27, 2022
Code for Emergent Translation in Multi-Agent Communication

Emergent Translation in Multi-Agent Communication PyTorch implementation of the models described in the paper Emergent Translation in Multi-Agent Comm

Facebook Research 75 Jul 15, 2022
Code and data of the ACL 2021 paper: Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision

MetaAdaptRank This repository provides the implementation of meta-learning to reweight synthetic weak supervision data described in the paper Few-Shot

THUNLP 5 Jun 16, 2022
The official implementation of "Rethink Dilated Convolution for Real-time Semantic Segmentation"

RegSeg The official implementation of "Rethink Dilated Convolution for Real-time Semantic Segmentation" Paper: arxiv D block Decoder Setup Install the

Roland 61 Dec 27, 2022
Code for "Searching for Efficient Multi-Stage Vision Transformers"

Searching for Efficient Multi-Stage Vision Transformers This repository contains the official Pytorch implementation of "Searching for Efficient Multi

Yi-Lun Liao 62 Oct 25, 2022
Code for Reciprocal Adversarial Learning for Brain Tumor Segmentation: A Solution to BraTS Challenge 2021 Segmentation Task

BRATS 2021 Solution For Segmentation Task This repo contains the supported pytorch code and configuration files to reproduce 3D medical image segmenta

Himashi Amanda Peiris 6 Sep 15, 2022
BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond

BasicVSR BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond Ported from https://github.com/xinntao/BasicSR Dependencie

Holy Wu 8 Jun 07, 2022
This Repostory contains the pretrained DTLN-aec model for real-time acoustic echo cancellation.

This Repostory contains the pretrained DTLN-aec model for real-time acoustic echo cancellation.

Nils L. Westhausen 182 Jan 07, 2023
Multi Agent Path Finding Algorithms

MATP-solver Simulator collision check path step random initial states or given states Traditional method Seperate A* algorithem Confict-based Search S

30 Dec 12, 2022
Training DALL-E with volunteers from all over the Internet using hivemind and dalle-pytorch (NeurIPS 2021 demo)

Training DALL-E with volunteers from all over the Internet This repository is a part of the NeurIPS 2021 demonstration "Training Transformers Together

<a href=[email protected]"> 19 Dec 13, 2022
Using Python to Play Cyberpunk 2077

CyberPython 2077 Using Python to Play Cyberpunk 2077 This repo will contain code from the Cyberpython 2077 video series on Youtube (youtube.

Harrison 118 Oct 18, 2022
Tensorflow implementation and notebooks for Implicit Maximum Likelihood Estimation

tf-imle Tensorflow 2 and PyTorch implementation and Jupyter notebooks for Implicit Maximum Likelihood Estimation (I-MLE) proposed in the NeurIPS 2021

NEC Laboratories Europe 69 Dec 13, 2022
Code for the paper "VisualBERT: A Simple and Performant Baseline for Vision and Language"

This repository contains code for the following two papers: VisualBERT: A Simple and Performant Baseline for Vision and Language (arxiv) with a short

Natural Language Processing @UCLA 463 Dec 09, 2022
COIN the currently largest dataset for comprehensive instruction video analysis.

COIN Dataset COIN is the currently largest dataset for comprehensive instruction video analysis. It contains 11,827 videos of 180 different tasks (i.e

86 Dec 28, 2022
Code for "Multi-Compound Transformer for Accurate Biomedical Image Segmentation"

News The code of MCTrans has been released. if you are interested in contributing to the standardization of the medical image analysis community, plea

97 Jan 05, 2023