Implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

Overview

Selection via Proxy: Efficient Data Selection for Deep Learning

This repository contains a refactored implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

If you use this code in your research, please use the following BibTeX entry.

@inproceedings{
    coleman2020selection,
    title={Selection via Proxy: Efficient Data Selection for Deep Learning},
    author={Cody Coleman and Christopher Yeh and Stephen Mussmann and Baharan Mirzasoleiman and Peter Bailis and Percy Liang and Jure Leskovec and Matei Zaharia},
    booktitle={International Conference on Learning Representations},
    year={2020},
    url={https://openreview.net/forum?id=HJg2b0VYDr}
}

The original code is also available as a zip file, but lacks documentation, uses outdated packages, and won't be maintained. Please use this repository instead and report issues here.

Setup

Prerequisites

Installation

git clone https://github.com/stanford-futuredata/selection-via-proxy.git
cd selection-via-proxy
pip install -e .

or simply

pip install git+https://github.com/stanford-futuredata/selection-via-proxy.git

Quickstart

Perform active learning on CIFAR10 from the command line:

python -m svp.cifar active

Or from the Python interpreter:

from svp.cifar.active import active
active()

"Selection via proxy" happens when --proxy-arch doesn't match --arch:

# ResNet20 selecting data for a ResNet164
python -m svp.cifar active --proxy-arch preact20 --arch preact164

For help, see python -m svp.cifar active --help or active()'s docstring.
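The rule above can be pictured as a simple comparison. The snippet below is only a minimal sketch, not the repository's code: uses_proxy and the schedule tuples are hypothetical names, and it also folds in the learning rate schedule check described in the Active learning section further down.

```python
from typing import Optional, Tuple

# A schedule is a tuple of (learning rate, epochs) pairs.
Schedule = Tuple[Tuple[float, int], ...]


def uses_proxy(arch: str, schedule: Schedule,
               proxy_arch: Optional[str] = None,
               proxy_schedule: Optional[Schedule] = None) -> bool:
    """Return True when a separate proxy model would be trained."""
    # Assumption: unset proxy options fall back to the target's settings.
    proxy_arch = arch if proxy_arch is None else proxy_arch
    proxy_schedule = schedule if proxy_schedule is None else proxy_schedule
    return proxy_arch != arch or proxy_schedule != schedule


# Hypothetical target schedule, used purely for illustration.
target_schedule = ((0.1, 100),)

print(uses_proxy("preact164", target_schedule))                         # False
print(uses_proxy("preact164", target_schedule, proxy_arch="preact20"))  # True
```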

Example Usage

Below are more examples of the command line interface that cover different datasets (e.g., CIFAR100, ImageNet, Amazon Review Polarity) and commands (e.g., train, coreset).

Basic Training

CIFAR10 and CIFAR100

Preliminaries

None. The CIFAR10 and CIFAR100 datasets are downloaded automatically if they don't exist in ./data/cifar10 and ./data/cifar100, respectively.

Examples
# Train ResNet164 with pre-activation (https://arxiv.org/abs/1603.05027) on CIFAR10.
python -m svp.cifar train --dataset cifar10 --arch preact164

Replace --dataset cifar10 with --dataset cifar100 to run on CIFAR100 rather than CIFAR10.

# Train ResNet164 with pre-activation (https://arxiv.org/abs/1603.05027) on CIFAR100.
python -m svp.cifar train --dataset cifar100 --arch preact164

The same is true for all the python -m svp.cifar commands below.

ImageNet

Preliminaries
  • Download the ImageNet dataset into a directory called imagenet.
  • Extract the images.
# Extract train data.
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
# Extract validation data.
cd ../ && mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
  • Replace /path/to/data in all the python -m svp.imagenet commands below with the path to the directory that contains the imagenet directory you created. Do not include imagenet itself in the path; the script appends it automatically.
Examples
# Train ResNet50 (https://arxiv.org/abs/1512.03385).
python -m svp.imagenet train --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20

For convenience, you can use larger batch sizes and scale learning rates according to "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" with --scale-learning-rates:

# Train ResNet50 with a batch size of 1048 and learning rates scaled accordingly.
python -m svp.imagenet train --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --batch-size 1048 --scale-learning-rates
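The cited paper's linear scaling rule multiplies the learning rate by the ratio of the new batch size to a reference batch size (256 in the paper). The snippet below is a minimal sketch of that rule only. How --scale-learning-rates applies it inside this repository is an assumption, scale_learning_rates is a hypothetical helper, and the gradual warmup the paper also recommends is left out.

```python
from typing import List


def scale_learning_rates(learning_rates: List[float], batch_size: int,
                         base_batch_size: int = 256) -> List[float]:
    """Linear scaling rule: multiply each rate by batch_size / base_batch_size."""
    factor = batch_size / base_batch_size
    return [lr * factor for lr in learning_rates]


# Doubling the batch size from 256 to 512 doubles every learning rate.
print(scale_learning_rates([0.1, 0.01], batch_size=512))  # [0.2, 0.02]
```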

Mixed precision training is also supported using apex. The pip install commands above do not install apex, so follow the installation instructions in the apex repository before running the command below.

# Use mixed precision training to train ResNet50 with a batch size of 1048 and scale learning rates accordingly.
python -m svp.imagenet train --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --batch-size 1048 --scale-learning-rates --fp16

Amazon Review Polarity and Full

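Preliminaries
  • Download amazon_review_full_csv.tar.gz and amazon_review_polarity_csv.tar.gz into the same root directory.
  • Extract the archives.
tar -xvzf amazon_review_full_csv.tar.gz
tar -xvzf amazon_review_polarity_csv.tar.gz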
  • Replace /path/to/data in all the python -m svp.amazon commands below with the path to the root directory you created. Do not include amazon_review_full_csv or amazon_review_polarity_csv in the path; the script appends the right one automatically.
Examples
# Train VDCNN29 (https://arxiv.org/abs/1606.01781) on Amazon Review Polarity.
python -m svp.amazon train --datasets-dir '/path/to/data' --dataset amazon_review_polarity --arch vdcnn29-conv \
    --num-workers 4 --eval-num-workers 8

Replace --dataset amazon_review_polarity with --dataset amazon_review_full to run on Amazon Review Full rather than Amazon Review Polarity. The example below also uses the vdcnn29-maxpool variant instead of vdcnn29-conv.

# Train VDCNN29 (https://arxiv.org/abs/1606.01781) on Amazon Review Full.
python -m svp.amazon train --datasets-dir '/path/to/data' --dataset amazon_review_full --arch vdcnn29-maxpool \
    --num-workers 4 --eval-num-workers 8

The same is true for all the python -m svp.amazon commands below.

Active learning

Active learning selects points to label from a large pool of unlabeled data by repeatedly training a model on a small pool of labeled data and selecting additional examples to label based on the model’s uncertainty (e.g., the entropy of predicted class probabilities) or other heuristics. The commands below demonstrate how to perform active learning on CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity and Amazon Review Full with a variety of models and selection methods.
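To make the uncertainty heuristics above concrete, here is a minimal, standard-library sketch of two common scores, least confidence and entropy, plus a top-k selection step. It is not the repository's implementation: the function names and the toy probabilities are assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Higher means less confident: 1 minus the top class probability."""
    return 1.0 - max(probs)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs: Sequence[Sequence[float]], k: int,
                          score: Callable[[Sequence[float]], float] = entropy) -> List[int]:
    """Return the indices of the k highest-scoring unlabeled examples."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]), reverse=True)
    return ranked[:k]


# Toy predicted class probabilities for three unlabeled examples.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # very uncertain
    [0.70, 0.20, 0.10],  # somewhat uncertain
]
print(select_most_uncertain(pool, k=2))                          # [1, 2]
print(select_most_uncertain(pool, k=2, score=least_confidence))  # [1, 2]
```

In an active learning loop, the selected indices would be sent for labeling and added to the labeled pool before the next round of training.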

CIFAR10 and CIFAR100

Baseline Approach
# Perform active learning with ResNet164 for both selection and the final predictions.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000
Selection via Proxy

If the model architectures (arch vs. proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy selects which examples to label, while the target is used only to evaluate the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior, set eval_target_at to evaluate at one or more specific labeling budgets, or set train_target to False to skip training and evaluating the target model.

# Perform active learning with ResNet20 for selection and ResNet164 for the final predictions.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence --proxy-arch preact20 \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--eval-target-at 25000
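Reading the command above, the --initial-subset and --round values appear to accumulate into labeling budgets, with --eval-target-at 25000 matching the final one. The snippet below is a small sketch of that arithmetic. It assumes each --round adds that many newly labeled examples, and labeling_budgets is a hypothetical helper, not part of svp.

```python
from itertools import accumulate
from typing import List, Sequence


def labeling_budgets(initial_subset: int, rounds: Sequence[int]) -> List[int]:
    """Cumulative number of labeled examples after the initial subset and each round."""
    return list(accumulate([initial_subset, *rounds]))


budgets = labeling_budgets(1000, [4000, 5000, 5000, 5000, 5000])
print(budgets)      # [1000, 5000, 10000, 15000, 20000, 25000]
print(budgets[-1])  # 25000, the budget passed to --eval-target-at
```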

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform active learning with ResNet20 after only 50 epochs for selection.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence --proxy-arch preact20 \
	--proxy-learning-rate 0.01 --proxy-epochs 1 \
	--proxy-learning-rate 0.1 --proxy-epochs 45 \
	--proxy-learning-rate 0.01 --proxy-epochs 4 \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--eval-target-at 25000
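The repeated --proxy-learning-rate and --proxy-epochs flags above read as a piecewise-constant schedule. As a sketch, assuming the flags are paired in the order given (expand_schedule is a hypothetical helper, not part of svp), the schedule expands to one learning rate per epoch:

```python
from typing import List, Sequence


def expand_schedule(learning_rates: Sequence[float], epochs: Sequence[int]) -> List[float]:
    """Expand paired (learning rate, epochs) segments into one rate per epoch."""
    if len(learning_rates) != len(epochs):
        raise ValueError("each learning rate needs a matching epoch count")
    per_epoch: List[float] = []
    for lr, n in zip(learning_rates, epochs):
        per_epoch.extend([lr] * n)
    return per_epoch


schedule = expand_schedule([0.01, 0.1, 0.01], [1, 45, 4])
print(len(schedule))  # 50 epochs in total: 1 + 45 + 4
print(schedule[:3])   # [0.01, 0.1, 0.1]
```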

ImageNet

Baseline Approach
# Perform active learning with ResNet50 for both selection and the final predictions.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20
Selection via Proxy

If the model architectures (arch vs. proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy selects which examples to label, while the target is used only to evaluate the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior, set eval_target_at to evaluate at one or more specific labeling budgets, or set train_target to False to skip training and evaluating the target model.

# Perform active learning with ResNet18 for selection and ResNet50 for the final predictions.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates \
    --eval-target-at 512467

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform active learning with ResNet18 after only 45 epochs for selection.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates \
    --eval-target-at 512467 \
    --proxy-learning-rate 0.0167 --proxy-epochs 1 \
    --proxy-learning-rate 0.0333 --proxy-epochs 1 \
    --proxy-learning-rate 0.05 --proxy-epochs 1 \
    --proxy-learning-rate 0.0667 --proxy-epochs 1 \
    --proxy-learning-rate 0.0833 --proxy-epochs 1 \
    --proxy-learning-rate 0.1 --proxy-epochs 25 \
    --proxy-learning-rate 0.01 --proxy-epochs 15

Amazon Review Polarity and Full

Baseline Approach
# Perform active learning with VDCNN29 for both selection and the final predictions.
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence
Selection via Proxy

If the model architectures (arch vs. proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy selects which examples to label, while the target is used only to evaluate the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior, set eval_target_at to evaluate at one or more specific labeling budgets, or set train_target to False to skip training and evaluating the target model. You can evaluate a series of selections later using the precomputed_selection option.

# Perform active learning with VDCNN9 for selection and VDCNN29 for the final predictions.
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence \
    --proxy-arch vdcnn9-maxpool --eval-target-at 1440000

To use fastText as a proxy, install fastText 0.1.0 and replace /path/to/fastText/fasttext in the python -m svp.amazon fasttext commands below with the path to the fastText binary you built.

# For convenience, save fastText results in a separate directory
mkdir fasttext
# Perform active learning with fastText.
python -m svp.amazon fasttext '/path/to/fastText/fasttext' --run-dir fasttext \
    --datasets-dir '/path/to/data' --dataset amazon_review_polarity --selection-method least_confidence \
    --size 72000 --size 360000 --size 720000 --size 1080000 --size 1440000
# Get the most recent timestamp from the fasttext directory.
fasttext_path="fasttext/$(ls fasttext | sort -nr | head -n 1)"
# Use selected labeled data from fastText to train VDCNN29
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence \
    --precomputed-selection "$fasttext_path" --eval-target-at 1440000

Core-set Selection

Core-set selection techniques start with a large labeled or unlabeled dataset and aim to find a small subset that accurately approximates the full dataset by selecting representative examples. The commands below demonstrate how to perform core-set selection on CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity and Amazon Review Full with a variety of models and selection methods.
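One of the selection methods used in the commands of this section is forgetting_events. As a rough illustration of the idea, a forgetting event is counted each time an example goes from correctly classified to misclassified between consecutive epochs, and the examples forgotten most often are kept. The snippet below is a simplified sketch with hypothetical function names and toy data. It is not the repository's code, and it does not give special treatment to examples that are never learned.

```python
from typing import List, Sequence


def count_forgetting_events(correct_by_epoch: Sequence[bool]) -> int:
    """Count transitions from correctly classified to misclassified."""
    return sum(1 for prev, cur in zip(correct_by_epoch, correct_by_epoch[1:])
               if prev and not cur)


def select_most_forgotten(history: Sequence[Sequence[bool]], k: int) -> List[int]:
    """Return indices of the k examples with the most forgetting events."""
    counts = [count_forgetting_events(h) for h in history]
    # sorted() is stable, so ties keep their original dataset order.
    ranked = sorted(range(len(history)), key=lambda i: counts[i], reverse=True)
    return ranked[:k]


# Per-example correctness across four epochs (toy data).
history = [
    [False, True, True, True],   # learned once, never forgotten: 0 events
    [True, False, True, False],  # forgotten twice: 2 events
    [True, True, False, True],   # forgotten once: 1 event
]
print(select_most_forgotten(history, k=2))  # [1, 2]
```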

CIFAR10 and CIFAR100

Baseline Approach
# Perform core-set selection with an oracle that uses ResNet164 for both selection and the final predictions.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events
Selection via Proxy
# Perform core-set selection with ResNet20 selecting for ResNet164.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events \
    --proxy-arch preact20

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform core-set selection with ResNet20 after only 50 epochs.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events \
    --proxy-arch preact20 \
	--proxy-learning-rate 0.01 --proxy-epochs 1 \
	--proxy-learning-rate 0.1 --proxy-epochs 45 \
	--proxy-learning-rate 0.01 --proxy-epochs 4

ImageNet

Baseline Approach
# Perform core-set selection with an oracle that uses ResNet50 for both selection and the final predictions.
python -m svp.imagenet coreset --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --subset 768700 --selection-method forgetting_events
Selection via Proxy
# Perform core-set selection with ResNet18 selecting for ResNet50.
python -m svp.imagenet coreset --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --subset 768700 --selection-method forgetting_events \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates

Amazon Review Polarity and Full

Baseline Approach
# Perform core-set selection with an oracle that uses VDCNN29 for both selection and the final predictions.
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --subset 2160000 --selection-method entropy
Selection via Proxy
# Perform core-set selection with VDCNN9 selecting for VDCNN29.
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --subset 2160000 --selection-method entropy \
    --proxy-arch vdcnn9-maxpool

To use fastText as a proxy, install fastText 0.1.0 and replace /path/to/fastText/fasttext in the python -m svp.amazon fasttext commands below with the path to the fastText binary you built.

# For convenience, save fastText results in a separate directory
mkdir fasttext
# Perform core-set selection with fastText.
python -m svp.amazon fasttext '/path/to/fastText/fasttext' --run-dir fasttext \
    --datasets-dir '/path/to/data' --dataset amazon_review_polarity \
    --selection-method entropy --size 3600000 --size 2160000
# Get the most recent timestamp from the fasttext directory.
fasttext_path="fasttext/$(ls fasttext | sort -nr | head -n 1)"
# Use selected labeled data from fastText to train VDCNN29
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --precomputed-selection "$fasttext_path"