Deep Learning: Architectures & Methods Project: Deep Learning for Audio Super-Resolution

Overview

Deep Learning: Architectures & Methods Project: Deep Learning for Audio Super-Resolution

Figure: Example visualization of the method and baseline as a spectogram

This is the implementation of our Project for the course "Deep Learning: Architectures and Methods" by Prof. Christian Kersting from the Artificial Intelligence and Machine Learning Lab at the Technical University of Darmstadt in the summer semester 2021.

In the field of audio signal processing, Super-Resolution is one of the most relevant topics. The motivation is to reconstruct high- quality audio from low-quality signals. From a practical perspective, the technique has applications in telephony or generally in applications in which audio is transmitted and has to be compressed accordingly. Other applications are the processing of ancient recordings, for example old sound recordings of music, speech or videos. First approaches of the combination of machine learning and audio signal processing lead to promising results and outperform standard techniques. Accordingly the scope of the project was to reimplement the paper Temporal FiLM: Capturing Long-Range SequenceDependencies with Feature-Wise Modulation by Birnbaum et al. in PyTorch, reproduce the results and extend them further to the music domain.

This repository contains everything needed to prepare the data sets, train the model and create final evaluation and visualization of the results. We also provide the weights of the models to reproduce our reported results.

Installation

This project was originally developed with Python 3.8, PyTorch 1.7, and CUDA 11.0. The training requires at least one NVIDIA GeForce GTX 980 (4GB memory).

  • Create conda environment:
conda create --name audiosr
source activate audiosr
conda install PYTORCH torchvision cudatoolkit=11.0 -c pytorch
  • Install the dependencies:
pip install -r requirements.txt

Dataset preparation

To reproduce the results shown below tha datasets have to be prepared. This repo includes scripts to prepare the following dataset:

VCTK preparation

  • run prep_dataset.py from ./datasets to create a h5 container of a specified input.
  • to reproduce results prepare the following h5 files:
python prep_dataset.py \
  --file-list vctk/speaker1/speaker1-train-files.txt \
  --in-dir ./VCTK-Corpus/wav48/p225/ \
  --out vctk-speaker1-train.4.16000.8192.4096.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 4096 \
  --interpolate \
  --low-pass
python prep_dataset.py \
  --file-list vctk/speaker1/speaker1-val-files.txt \
  --in-dir ./VCTK-Corpus/wav48/p225/ \
  --out vctk-speaker1-val.4.16000.8192.4096.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 4096 \
  --interpolate \
  --low-pass

GTZAN preparation

  • run prep_dataset.py from ./datasets to create a h5 container of a specified input.
  • to reproduce results prepare the following h5 files:
python prep_dataset.py \
  --file-list gtzan/blues_wav_list_train.txt \
  --in-dir gtzan/data/genres/blues/ \
  --out blues-train.4.22000.8192.16384.h5 \
  --scale 4 \
  --sr 22000 \
  --dimension 8192 \
  --stride 16384 \
  --interpolate \
  --low-pass
python prep_dataset.py \
  --file-list gtzan/blues_wav_list_val.txt \
  --in-dir gtzan/data/genres/blues/ \
  --out blues-val.4.22000.8192.16384.h5 \
  --scale 4 \
  --sr 22000 \
  --dimension 8192 \
  --stride 16384 \
  --interpolate \
  --low-pass

Piano dataset preparation

python prep_piano.py \
  --file-list data/music_train.npy \
  --out piano-train.4.16000.8192.131072.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 131072 \
  --interpolate \
  --low-pass
python prep_piano.py \
  --file-list data/music_valid.npy \
  --out piano-val.4.16000.8192.131072.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 131072 \
  --interpolate \
  --low-pass

Notes:

  • the --in-dir argument has to be adapted to the respective dataset location
  • The dimension parameter and sampling rate define the absolute length of a patch (dim/sr = length patch)

Model

Generally, there are three main models in this implementation.

Baseline

On the one hand the b-spline interpolation which serves as the baseline and can be found in the data loader in prep_dataset.py.

Model

On the other hand two neural networks whose implementation can be found in the /models/ folder. In a first step a model was implemented which uses a batchnorm layer instead of the later used TFILM layer. This is implemented in audiounet.py. The final model, which is also used in the paper, can be found in tfilmunet.py.

Train Model

To run the trainings use the following commands and change the dataset root the corresponding domain.

python train.py \
  --dataset-root hereroottodataset! \
  --epochs 50 \
  --lr 3*10e-4 \
  --batch-size 16 

Evaluation

Save examples from inference

It is possible to evaluate any given wav-file with the inference.py script by invoking the --save-example flag and saving the results as wav-files and spectrogram plots. The script performs the following steps:

  • prepares all files in a provided list (--wave-file-list) and creates a low-res version and the baseline reconstruction
  • runs inference on the prepared files to create a super-resolution output
  • saves all results to the "examples" folder with the respective file names
  • saves spectrogram plots of all versions as pdf-files

Notes:

It is important to adapt the sampling parameter (--sr) which is set to 16000 by default. The sampling rate has to be the one of the original wav file. The scale (--scale) defines the down sampling factor which is set to 4 by default. Depending on which trained model is used for the inference the parameters --checkpoints-root and --checkpoint have to be specified accordingly.

To reproduce an example from our plot run the following command from the repo root directory (modify --checkpoints-root if necessary):

python inference.py \
  --save-example \
  --wave-file-list assets/save_wav_list.txt \
  --scale 4 \
  --sr 16000 \
  --checkpoint pretrained/vctk_speaker1_pretrained.pth

Results

Training Dataset Ratio BASELINE SNR (dB) BASELINE LSD (dB) METHOD SNR (dB) METHOD LSD (dB) Checkpoint
VTCK SingleSpeaker r = 4 15.6 5.4 16.6 3.2 Checkpoint
Piano r = 4 19.7 2.9 20.4 2.2 Checkpoint
GTZAN (Genre: Blues) r = 4 13.3 7.8 13.8 3.8 Checkpoint

Qualitative Examples

Here we provide a qualitative example per Dataset. These can be generated using inference.py

VTCK SingleSpeaker Piano GTZAN (Genre: Blues)
Low Resolution Low Resolution Low Resolution
Baseline Baseline Baseline
Method Method Method
High Resolution High Resolution High Resolution
Owner
Oliver Hahn
Master Thesis @VIsual Inference Lab | Grad Student @Technical University of Darmstadt
Oliver Hahn
A complete, self-contained example for training ImageNet at state-of-the-art speed with FFCV

ffcv ImageNet Training A minimal, single-file PyTorch ImageNet training script designed for hackability. Run train_imagenet.py to get... ...high accur

FFCV 92 Dec 31, 2022
Pytorch implementation of XRD spectral identification from COD database

XRDidentifier Pytorch implementation of XRD spectral identification from COD database. Details will be explained in the paper to be submitted to NeurI

Masaki Adachi 4 Jan 07, 2023
[CoRL 21'] TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo

TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo Lukas Koestler1*    Nan Yang1,2*,†    Niclas Zeller2,3    Daniel Cremers1

TUM Computer Vision Group 744 Jan 04, 2023
Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples

Welcome to the cuQuantum repository! This public repository contains two sets of files related to the NVIDIA cuQuantum SDK: samples: All C/C++ sample

NVIDIA Corporation 147 Dec 27, 2022
Official implementation of MSR-GCN (ICCV 2021 paper)

MSR-GCN Official implementation of MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction (ICCV 2021 paper) [Paper] [Sup

LevonDang 42 Nov 07, 2022
[CVPR 2022] Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions" paper

template-pose Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions

Van Nguyen Nguyen 92 Dec 28, 2022
Code for the tech report Toward Training at ImageNet Scale with Differential Privacy

Differentially private Imagenet training Code for the tech report Toward Training at ImageNet Scale with Differential Privacy by Alexey Kurakin, Steve

Google Research 29 Nov 03, 2022
Code for the paper "How Attentive are Graph Attention Networks?"

How Attentive are Graph Attention Networks? This repository is the official implementation of How Attentive are Graph Attention Networks?. The PyTorch

175 Dec 29, 2022
U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection

The code for our newly accepted paper in Pattern Recognition 2020: "U^2-Net: Going Deeper with Nested U-Structure for Salient Object Detection."

Xuebin Qin 6.5k Jan 09, 2023
Model serving at scale

Run inference at scale Cortex is an open source platform for large-scale machine learning inference workloads. Workloads Realtime APIs - respond to pr

Cortex Labs 7.9k Jan 06, 2023
Based on Yolo's low-power, ultra-lightweight universal target detection algorithm, the parameter is only 250k, and the speed of the smart phone mobile terminal can reach ~300fps+

Based on Yolo's low-power, ultra-lightweight universal target detection algorithm, the parameter is only 250k, and the speed of the smart phone mobile terminal can reach ~300fps+

567 Dec 26, 2022
This is a deep learning-based method to segment deep brain structures and a brain mask from T1 weighted MRI.

DBSegment This tool generates 30 deep brain structures segmentation, as well as a brain mask from T1-Weighted MRI. The whole procedure should take ~1

Luxembourg Neuroimaging (Platform OpNeuroImg) 2 Oct 25, 2022
Pytorch and Torch testing code of CartoonGAN

CartoonGAN-Test-Pytorch-Torch Pytorch and Torch testing code of CartoonGAN [Chen et al., CVPR18]. With the released pretrained models by the authors,

Yijun Li 642 Dec 27, 2022
Equivariant CNNs for the sphere and SO(3) implemented in PyTorch

Equivariant CNNs for the sphere and SO(3) implemented in PyTorch

Jonas Köhler 893 Dec 28, 2022
An official PyTorch Implementation of Boundary-aware Self-supervised Learning for Video Scene Segmentation (BaSSL)

An official PyTorch Implementation of Boundary-aware Self-supervised Learning for Video Scene Segmentation (BaSSL)

Kakao Brain 72 Dec 28, 2022
A state-of-the-art semi-supervised method for image recognition

Mean teachers are better role models Paper ---- NIPS 2017 poster ---- NIPS 2017 spotlight slides ---- Blog post By Antti Tarvainen, Harri Valpola (The

Curious AI 1.4k Jan 06, 2023
Deep metric learning methods implemented in Chainer

Deep Metric Learning Implementation of several methods for deep metric learning in Chainer v4.2.0. Proxy-NCA: No Fuss Distance Metric Learning using P

ronekko 156 Nov 28, 2022
Interactive Terraform visualization. State and configuration explorer.

Rover - Terraform Visualizer Rover is a Terraform visualizer. In order to do this, Rover: generates a plan file and parses the configuration in the ro

Tu Nguyen 2.3k Jan 07, 2023
Much faster than SORT(Simple Online and Realtime Tracking), a little worse than SORT

QSORT QSORT(Quick + Simple Online and Realtime Tracking) is a simple online and realtime tracking algorithm for 2D multiple object tracking in video s

Yonghye Kwon 8 Jul 27, 2022
Repository for benchmarking graph neural networks

Benchmarking Graph Neural Networks Updates Nov 2, 2020 Project based on DGL 0.4.2. See the relevant dependencies defined in the environment yml files

NTU Graph Deep Learning Lab 2k Jan 03, 2023