Official PyTorch Implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning (NeurIPS 2021 Spotlight)


[NeurIPS 2021 Spotlight] HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning [Paper]

This is the official PyTorch implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning.

@inproceedings{lee2021help,
    title     = {HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning},
    author    = {Lee, Hayeon and Lee, Sewoong and Chong, Song and Hwang, Sung Ju},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year      = {2021}
} 

Overview

For deployment, neural architecture search (NAS) should be hardware-aware in order to satisfy device-specific constraints (e.g., memory usage, latency, and energy consumption) and enhance model efficiency. Existing hardware-aware NAS methods collect a large number of samples (e.g., accuracy and latency) from a target device and then build either a lookup table or a latency estimator. However, such an approach is impractical in real-world scenarios, as there exist numerous devices with different hardware specifications, and collecting samples from such a large number of devices would incur prohibitive computational and monetary cost. To overcome these limitations, we propose the Hardware-adaptive Efficient Latency Predictor (HELP), which formulates device-specific latency estimation as a meta-learning problem, such that we can estimate the latency of a model for a given task on an unseen device with only a few samples. To this end, we introduce novel hardware embeddings that can embed any device by treating it as a black-box function that outputs latencies, and we meta-learn the hardware-adaptive latency predictor in a device-dependent manner using these hardware embeddings. We validate HELP's latency estimation performance on unseen platforms, where it achieves high estimation performance with as few as 10 measurement samples, outperforming all relevant baselines. We also validate end-to-end NAS frameworks that use HELP against ones that do not, and show that HELP largely reduces the total time cost of the base NAS method in latency-constrained settings.
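
As a rough illustration of the hardware-embedding idea described above, the sketch below treats a device as a black-box function from an architecture to a latency and builds an embedding from the latencies it returns on a fixed set of reference architectures. The names `hardware_embedding` and `fake_device`, the string encoding of architectures, and the min-max normalization are all assumptions made for this example; they are not the repository's API.

```python
from typing import Callable, List, Sequence


def hardware_embedding(
    measure_latency: Callable[[str], float],
    reference_archs: Sequence[str],
) -> List[float]:
    """Embed a device as its (min-max normalized) latencies on reference architectures.

    `measure_latency` is a hypothetical black-box callable standing in for a
    real on-device measurement.
    """
    latencies = [float(measure_latency(arch)) for arch in reference_archs]
    low, high = min(latencies), max(latencies)
    if high == low:
        # All reference architectures are equally fast: no ordering information.
        return [0.0] * len(latencies)
    return [(lat - low) / (high - low) for lat in latencies]


def fake_device(arch: str) -> float:
    """Toy device (assumption): latency grows with the number of 'conv' ops."""
    return 1.0 + 2.0 * arch.count("conv")


reference_archs = ["skip|skip", "conv|skip", "conv|conv"]
embedding = hardware_embedding(fake_device, reference_archs)
print(embedding)
```

Because the embedding only needs latency measurements, the same function can be applied to any device for which a handful of reference architectures can be timed.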

Prerequisites

  • Python 3.8 (Anaconda)
  • PyTorch 1.8.1
  • CUDA 10.2

Hardware spec used for meta-training the proposed HELP model

  • GPU: A single Nvidia GeForce RTX 2080Ti
  • CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz

Installation

$ conda create --name help python=3.8
$ conda activate help
$ conda install pytorch==1.8.1 torchvision cudatoolkit=10.2 -c pytorch
$ pip install nas-bench-201
$ pip install tqdm
$ conda install scipy
$ conda install pyyaml
$ conda install tensorboard

Contents

1. Experiments on NAS-Bench-201 Search Space

2. Experiments on FBNet Search Space

3. Experiments on OFA Search Space

4. Experiments on HAT Search Space

1. Reproduce Main Results on NAS-Bench-201 Search Space

We provide the code to reproduce the main results on NAS-Bench-201 search space as follows:

  • Computing architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices (Table 3).
  • Latency-constrained NAS Results with MetaD2A + HELP on unseen devices (Table 4).
  • Meta-Training HELP model.

1.1. Data Preparation and Model Checkpoint

We include all required datasets and checkpoints in this GitHub repository.

1.2. [Meta-Test] Architecture ranking correlation

You can compute architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices on NAS-Bench-201 search space (Table 3):

$ python main.py --search_space nasbench201 \
                 --mode 'meta-test' \
                 --num_samples 10 \
                 --num_meta_train_sample 900 \
                 --load_path [Path of Checkpoint File] \
                 --meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
                 --meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
                 --meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss'

You can use the checkpoint file provided in this repository, ./data/nasbench201/checkpoint/help_max_corr.pt, as follows:

$ python main.py --search_space nasbench201 \
                 --mode 'meta-test' \
                 --num_samples 10 \
                 --num_meta_train_sample 900 \
                 --load_path './data/nasbench201/checkpoint/help_max_corr.pt' \
                 --meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
                 --meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
                 --meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss'

Alternatively, use the provided script:

$ bash script/run_meta_test_nasbench201.sh [GPU_NUM]

Architecture Ranking Correlation Results (Table 3)

| Method | # of Training Samples From Target Device | Desktop GPU (Titan RTX Batch 256) | Desktop CPU (Intel Gold 6226) | Mobile Pixel2 | Raspi4 | ASIC | FPGA | Mean |
|---|---|---|---|---|---|---|---|---|
| FLOPS | - | 0.950 | 0.826 | 0.765 | 0.846 | 0.437 | 0.900 | 0.787 |
| Layer-wise Predictor | - | 0.667 | 0.866 | - | - | - | - | 0.767 |
| BRP-NAS | 900 | 0.814 | 0.796 | 0.666 | 0.847 | 0.811 | 0.801 | 0.789 |
| BRP-NAS (+extra samples) | 3200 | 0.822 | 0.805 | 0.693 | 0.853 | 0.830 | 0.828 | 0.805 |
| HELP (Ours) | 10 | 0.987 | 0.989 | 0.802 | 0.890 | 0.940 | 0.985 | 0.932 |
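
To make the notion of architecture ranking correlation concrete, here is a minimal pure-Python sketch that scores how well predicted latencies preserve the ordering of measured latencies. It assumes Spearman's rank correlation as the metric; the helper names and the toy latency values are illustrative and are not taken from the repository.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy values (assumptions): predicted vs. measured latency in ms for four architectures.
predicted = [3.1, 5.0, 4.2, 9.7]
measured = [2.8, 5.5, 4.0, 11.0]
print(spearman(predicted, measured))
```

With SciPy installed (it is part of the installation steps above), `scipy.stats.spearmanr` should give the same value and is the more convenient choice in practice.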

1.3. [Meta-Test] Efficient Latency-constrained NAS combined with MetaD2A

You can reproduce latency-constrained NAS results with MetaD2A + HELP on unseen devices on NAS-Bench-201 search space (Table 4):

$ python main.py --search_space nasbench201 --mode 'nas' \
                 --load_path [Path of Checkpoint File] \
                 --sampled_arch_path 'data/nasbench201/arch_generated_by_metad2a.txt' \
                 --nas_target_device [Device] \
                 --latency_constraint [Latency Constraint]

For example, if you use the checkpoint file provided in this repository, the checkpoint path is ./data/nasbench201/checkpoint/help_max_corr.pt. With the target device set to CPU Intel Gold 6226 (gold_6226) with batch size 256 and the latency constraint set to 11.0 (ms), the command is as follows:

$ python main.py --search_space nasbench201 --mode 'nas' \
                 --load_path './data/nasbench201/checkpoint/help_max_corr.pt' \
                 --sampled_arch_path 'data/nasbench201/arch_generated_by_metad2a.txt' \
                 --nas_target_device gold_6226 \
                 --latency_constraint 11.0

Alternatively, use the provided script:

$ bash script/run_nas_metad2a.sh [GPU_NUM]

Efficient Latency-constrained NAS Results (Table 4)

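| Device | # of Training Samples from Target Device | Latency Constraint (ms) | Latency (ms) | Accuracy (%) | Neural Architecture Config |
|---|---|---|---|---|---|
| GPU Titan RTX (Batch 256), titan_rtx_256 | 10 | 18.0 | 17.8 | 69.7 | link |
| GPU Titan RTX (Batch 256), titan_rtx_256 | 10 | 21.0 | 18.9 | 71.5 | link |
| GPU Titan RTX (Batch 256), titan_rtx_256 | 10 | 25.0 | 24.2 | 71.8 | link |
| CPU Intel Gold 6226, gold_6226 | 10 | 8.0 | 8.0 | 67.3 | link |
| CPU Intel Gold 6226, gold_6226 | 10 | 11.0 | 10.7 | 70.2 | link |
| CPU Intel Gold 6226, gold_6226 | 10 | 14.0 | 14.3 | 72.1 | link |
| Mobile Pixel2, pixel2 | 10 | 14.0 | 13.0 | 69.7 | link |
| Mobile Pixel2, pixel2 | 10 | 18.0 | 19.0 | 71.8 | link |
| Mobile Pixel2, pixel2 | 10 | 22.0 | 25.0 | 73.2 | link |
| ASIC-Eyeriss, eyeriss | 10 | 5.0 | 3.9 | 71.5 | link |
| ASIC-Eyeriss, eyeriss | 10 | 7.0 | 5.1 | 71.8 | link |
| ASIC-Eyeriss, eyeriss | 10 | 9.0 | 9.1 | 73.5 | link |
| FPGA, fpga | 10 | 4.0 | 3.8 | 70.2 | link |
| FPGA, fpga | 10 | 5.0 | 4.7 | 71.8 | link |
| FPGA, fpga | 10 | 6.0 | 7.4 | 73.5 | link |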
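
The selection step behind latency-constrained NAS can be sketched as follows: among candidate architectures, keep those whose predicted latency meets the constraint and return the one with the highest predicted accuracy. This is a simplified illustration under stated assumptions, not the repository's search code; the `Candidate` type, the hard filtering rule, and all numeric values are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """A candidate architecture with predictor outputs (hypothetical structure)."""
    name: str
    predicted_latency_ms: float
    predicted_accuracy: float


def select_architecture(
    candidates: List[Candidate],
    latency_constraint_ms: float,
) -> Optional[Candidate]:
    """Return the most accurate candidate whose predicted latency fits the budget."""
    feasible = [
        c for c in candidates if c.predicted_latency_ms <= latency_constraint_ms
    ]
    if not feasible:
        return None  # nothing satisfies the constraint
    return max(feasible, key=lambda c: c.predicted_accuracy)


# Toy candidates (assumptions, not results from the paper).
candidates = [
    Candidate("arch_a", 7.5, 66.0),
    Candidate("arch_b", 10.2, 70.0),
    Candidate("arch_c", 13.9, 72.0),
]
best = select_architecture(candidates, latency_constraint_ms=11.0)
print(best.name if best is not None else "no feasible architecture")
```

Since the filter acts on predicted rather than measured latency, the measured latency of the selected architecture can land slightly above the constraint, as some rows of Table 4 show.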

1.4. Meta-Training HELP model

Note that this process is performed only once for all NAS results.

$ python main.py --search_space nasbench201 \
                 --mode 'meta-train' \
                 --num_samples 10 \
                 --num_meta_train_sample 900 \
                 --meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
                 --meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
                 --meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss' \
                 --exp_name [EXP_NAME] \
                 --seed 3 # e.g.) 1, 2, 3

Alternatively, use the provided script:

$ bash script/run_meta_training_nasbench201.sh [GPU_NUM]

The results (checkpoint file, log file, etc.) are saved in

./results/nasbench201/[EXP_NAME]
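
The device lists passed to the commands above are comma-separated strings, and the evaluation on unseen devices is only meaningful if the meta-train, meta-valid, and meta-test lists do not overlap. The sketch below shows one way to split such strings and check that the three lists are disjoint. The helper names are hypothetical and this is not how `main.py` is known to parse its arguments.

```python
from typing import Dict, List


def parse_devices(arg: str) -> List[str]:
    """Split a comma-separated device string, dropping empty entries."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def check_disjoint(splits: Dict[str, List[str]]) -> None:
    """Raise ValueError if any device appears in more than one split (or twice)."""
    seen: Dict[str, str] = {}
    for split_name, devices in splits.items():
        for device in devices:
            if device in seen:
                raise ValueError(
                    f"{device!r} appears in both {seen[device]} and {split_name}"
                )
            seen[device] = split_name


# Device strings copied from the NAS-Bench-201 meta-training command.
splits = {
    "meta_train": parse_devices(
        "1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,"
        "samsung_a50,pixel3,essential_ph_1,samsung_s7"
    ),
    "meta_valid": parse_devices("titanx_1,titanx_32,titanx_256,gold_6240"),
    "meta_test": parse_devices("titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss"),
}
check_disjoint(splits)
print({name: len(devices) for name, devices in splits.items()})
```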

2. Reproduce Main Results on FBNet Search Space

We provide the code to reproduce the main results on FBNet search space as follows:

  • Computing architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices (Table 2).
  • Meta-Training HELP model.

2.1. Data Preparation and Model Checkpoint

We include all required datasets and checkpoints in this GitHub repository.

2.2. [Meta-Test] Architecture ranking correlation

You can compute architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices on FBNet search space (Table 2):

$ python main.py --search_space fbnet \
	--mode 'meta-test' \
	--num_samples 10 \
	--num_episodes 4000 \
	--num_meta_train_sample 4000 \
	--load_path './data/fbnet/checkpoint/help_max_corr.pt' \
	--meta_train_devices '1080ti_1,1080ti_32,1080ti_64,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
	--meta_valid_devices 'titanx_1,titanx_32,titanx_64,gold_6240' \
	--meta_test_devices 'fpga,raspi4,eyeriss'

Alternatively, use the provided script:

$ bash script/run_meta_test_fbnet.sh [GPU_NUM]

Architecture Ranking Correlation Results (Table 2)

| Method | Raspi4 | ASIC | FPGA | Mean |
|---|---|---|---|---|
| MAML | 0.718 | 0.763 | 0.727 | 0.736 |
| Meta-SGD | 0.821 | 0.822 | 0.776 | 0.806 |
| HELP (Ours) | 0.887 | 0.943 | 0.892 | 0.910 |

2.3. Meta-Training HELP model

Note that this process is performed only once for all results.

$ python main.py --search_space fbnet \
	--mode 'meta-train' \
	--num_samples 10 \
	--num_episodes 4000 \
	--num_meta_train_sample 4000 \
	--exp_name [EXP_NAME] \
	--meta_train_devices '1080ti_1,1080ti_32,1080ti_64,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
	--meta_valid_devices 'titanx_1,titanx_32,titanx_64,gold_6240' \
	--meta_test_devices 'fpga,raspi4,eyeriss' \
	--seed 3 # e.g.) 1, 2, 3

Alternatively, use the provided script:

$ bash script/run_meta_training_fbnet.sh [GPU_NUM]

The results (checkpoint file, log file, etc.) are saved in

./results/fbnet/[EXP_NAME]

3. Reproduce Main Results on OFA Search Space

We provide the code to reproduce the main results on OFA search space as follows:

  • Latency-constrained NAS Results with accuracy predictor of OFA + HELP on unseen devices (Table 5).
  • Validating obtained neural architecture on ImageNet-1K.
  • Meta-Training HELP model.

3.1. Data Preparation and Model Checkpoint

We include the required datasets (except ImageNet-1K) and checkpoints in this GitHub repository. To validate the obtained neural architecture on ImageNet-1K, you need to download ImageNet-1K (2012 ver.) yourself.

3.2. [Meta-Test] Efficient Latency-constrained NAS combined with accuracy predictor of OFA

You can reproduce latency-constrained NAS results with OFA + HELP on unseen devices on OFA search space (Table 5):

$ python main.py \
    --search_space ofa \
    --mode nas \
    --num_samples 10 \
    --seed 3 \
    --num_meta_train_sample 4000 \
    --load_path './data/ofa/checkpoint/help_max_corr.pt' \
    --nas_target_device [DEVICE_NAME] \
    --latency_constraint [LATENCY_CONSTRAINT] \
    --exp_name 'nas' \
    --meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
    --meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
    --meta_test_devices 'titan_rtx_64'

For example,

$ python main.py \
	--search_space ofa \
	--mode nas \
	--num_samples 10 \
	--seed 3 \
	--num_meta_train_sample 4000 \
	--load_path './data/ofa/checkpoint/help_max_corr.pt' \
	--nas_target_device titan_rtx_64 \
	--latency_constraint 20 \
	--exp_name 'nas' \
	--meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
	--meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
	--meta_test_devices 'titan_rtx_64' 

Alternatively, use the provided script:

$ bash script/run_nas_ofa.sh [GPU_NUM]

Efficient Latency-constrained NAS Results (Table 5)

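| Device | Samples from Target Device | Latency Constraint (ms) | Latency (ms) | Accuracy (%) | Architecture Config |
|---|---|---|---|---|---|
| GPU Titan RTX (Batch 64) | 10 | 20 | 20.3 | 76.0 | link |
| GPU Titan RTX (Batch 64) | 10 | 23 | 23.1 | 76.8 | link |
| GPU Titan RTX (Batch 64) | 10 | 28 | 28.6 | 77.9 | link |
| CPU Intel Gold 6226 | 20 | 170 | 147 | 77.6 | link |
| CPU Intel Gold 6226 | 20 | 190 | 171 | 78.1 | link |
| Jetson AGX Xavier | 10 | 65 | 67.4 | 75.9 | link |
| Jetson AGX Xavier | 10 | 70 | 76.4 | 76.4 | link |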

3.3. Validating obtained neural architecture on ImageNet-1K

$ python validate_imagenet.py \
		--config_path [Path of neural architecture config file] \
		--imagenet_save_path [Path of ImageNet 1k]

For example:

$ python validate_imagenet.py \
		--config_path 'data/ofa/architecture_config/gpu_titan_rtx_64/latency_28.6ms_accuracy_77.9.json' \
		--imagenet_save_path './ILSVRC2012'
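
The example config path above encodes the architecture's latency and accuracy in its file name. A small sketch for recovering those two numbers from such a name is given below. It assumes that other config files follow the same `latency_<x>ms_accuracy_<y>.json` pattern, which is inferred from this single example; `parse_config_name` is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import PurePosixPath
from typing import Tuple

# Pattern inferred from the one file name shown in this README (assumption).
_CONFIG_NAME = re.compile(
    r"latency_(?P<latency>\d+(?:\.\d+)?)ms_accuracy_(?P<accuracy>\d+(?:\.\d+)?)\.json"
)


def parse_config_name(path: str) -> Tuple[float, float]:
    """Extract (latency in ms, accuracy in %) from an architecture config path."""
    file_name = PurePosixPath(path).name
    match = _CONFIG_NAME.fullmatch(file_name)
    if match is None:
        raise ValueError(f"unexpected config file name: {file_name!r}")
    return float(match.group("latency")), float(match.group("accuracy"))


latency_ms, accuracy = parse_config_name(
    "data/ofa/architecture_config/gpu_titan_rtx_64/latency_28.6ms_accuracy_77.9.json"
)
print(latency_ms, accuracy)
```

This can be handy for cross-checking a config file against the corresponding row of Table 5 before launching a full ImageNet-1K validation run.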

3.4. Meta-training HELP model

Note that this process is performed only once for all results.

$ python main.py --search_space ofa \
                 --mode 'meta-train' \
                 --num_samples 10 \
                 --num_meta_train_sample 4000 \
                 --exp_name [EXP_NAME] \
                 --meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
                 --meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
                 --meta_test_devices 'titan_rtx_64' \
                 --seed 3 # e.g.) 1, 2, 3

Alternatively, use the provided script:

$ bash script/run_meta_training_ofa.sh [GPU_NUM]

4. Main Results on HAT Search Space

We provide the neural architecture configurations to reproduce the results of machine translation (WMT'14 En-De Task) on HAT search space.

Efficient Latency-constrained NAS Results

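| Task | Device | Samples from Target Device | Latency | BLEU score | Architecture Config |
|---|---|---|---|---|---|
| WMT'14 En-De | GPU NVIDIA Titan RTX | 10 | 74.0ms | 27.19 | link |
| WMT'14 En-De | GPU NVIDIA Titan RTX | 10 | 106.5ms | 27.44 | link |
| WMT'14 En-De | CPU Intel Xeon Gold 6240 | 10 | 159.6ms | 27.20 | link |
| WMT'14 En-De | CPU Intel Xeon Gold 6240 | 10 | 343.2ms | 27.52 | link |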

You can test the models by BLEU score and by computing latency.

Reference

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (ICML17)

Meta-SGD: Learning to Learn Quickly for Few-Shot Learning

Once-for-All: Train One Network and Specialize it for Efficient Deployment (ICLR20)

NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search (ICLR20)

BRP-NAS: Prediction-based NAS using GCNs (NeurIPS20)

HAT: Hardware Aware Transformers for Efficient Natural Language Processing (ACL20)

Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets (ICLR21)

HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark (ICLR21)
