Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.

Overview


            

COResets and Data Subset selection

GitHub Decile Documentation GitHub Stars GitHub Forks

Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.

In this README

What is CORDS?

CORDS is COReset and Data Selection library for making machine learning time, energy, cost, and compute efficient. CORDS is built on top of pytorch. Deep Learning systems are extremely compute intensive today with large turn around times, energy inefficiencies, higher costs and resourse requirements [1,2]. CORDS is an effort to make deep learning more energy, cost, resource and time efficient while not sacrificing accuracy. The following are the goals CORDS tries to achieve:

Data Efficiency

Reducing End to End Training Time

Reducing Energy Requirement

Faster Hyper-parameter tuning

Reducing Resource (GPU) Requirement and Costs

The primary purpose of CORDS is to select the right representative data subsets from massive datasets, and it does so iteratively. CORDS uses some recent advances in data subset selection and particularly, ideas of coresets and submodularity select such subsets. CORDS implements a number of state of the art data subset selection algorithms and coreset algorithms. Some of the algorithms currently implemented with CORDS include:

We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS includes:

  • Reproducability of SOTA in Data Selection and Coresets: Enable easy reproducability of SOTA described above. We are trying to also add more algorithms so if you have an algorithm you would like us to include, please let us know,.
  • Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets including CIFAR-10, CIFAR-100, MNIST, SVHN and ImageNet.
  • Ease of Use: One of the main goals of CORDS is that it is easy to use and add to CORDS. Feel free to contribute to CORDS!
  • Modular design: The data selection algorithms are separate from the training loop, thereby enabling modular design and also varied scenarios of utility.
  • Broad number of usecases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating a number of additional use cases like object detection, speech recognition, semi-supervised learning, Auto-ML, etc.

Installation

  1. To install latest version of CORDS package using PyPI:

    pip install -i https://test.pypi.org/simple/ cords
  2. To install using source:

    git clone https://github.com/decile-team/cords.git
    cd cords
    pip install -r requirements/requirements.txt

Next Steps

Tutorials

Documentation

The documentation for the latest version of CORDS can always be found here.

Comments
  • Logistic Regression support for Gradmatch

    Logistic Regression support for Gradmatch

    Logistic Regression model throws errors when we do back propagation. The fix for this is perhaps making freeze=False in forward function of utils/models/logreg_net.py

    opened by nlokeshiisc 4
  • [Bug] Got weight with same value when running examples.

    [Bug] Got weight with same value when running examples.

    Hi, I tested the example with Supervised learning and Glister strategy. https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/python_notebooks/CORDS_SL_CIFAR10_Custom_Train.ipynb But when I print the weight of the train loader, they are all 1.0. I believe that by using Glister strategy, we will get different weights.

    tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
            1., 1.], device='cuda:0')
    

    Is that a bug or something special? Thanks.

    opened by HaoKang-Timmy 3
  • Segmentation fault (core dumped)

    Segmentation fault (core dumped)

    Hi,

    I was trying to deploy CORDS selection to my training, but this error popped out Segmentation fault (core dumped).

    I imitated code from https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/python_notebooks/CORDS_SL_CIFAR10_Custom_Train.ipynb.

    So basically I put my training and testing loader into GLISTERDataLoader, and switched this part into my code

    for _, (inputs, targets, weights) in enumerate(dataloader): inputs = inputs.to(device) targets = targets.to(device, non_blocking=True) weights = weights.to(device) optimizer.zero_grad() outputs = model(inputs) losses = criterion_nored(outputs, targets) loss = torch.dot(losses, weights/(weights.sum())) loss.backward()

    before modifying my code was running fine, so I believe there is an error inside the CORDS, my dataset is CIFAR10.

    Thanks

    opened by chengwuxinlin 2
  • Replace apricot with submodlib

    Replace apricot with submodlib

    Fixes #16
    submodlib is now used for the CRAIG strategy/dataloader as well as the submodular strategy/dataloader. Please let me know if you have any feedback!

    Notes:

    • I am not sure if sum redundancy (a submodular function implemented in apricot) has an analogue in submodlib, so it is disabled as an option for now.
    • It doesn't seem like submodularselectionstrategy.py is used in the corresponding dataloader. This may be a good opportunity to refactor, so that behavior is consistent between the two.
    • Any existing code that specifies the "optimizer" (greedy algorithm) used by apricot will break, since the names used by submodlib are different than those used by apricot (e.g. 'LazyGreedy' instead of 'Lazy'). This includes configs that use this option.
    opened by ghost 1
  • Typo in cords_cifar10_glister_train.ipynb

    Typo in cords_cifar10_glister_train.ipynb

    There is a typo in the cords_cifar10_glister_train.ipynb notebook : https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/cords_cifar10_glister_train.ipynb

    glister_trn.configdata.train_args.print_every = 1
    glister_trn.configdata.train_args.device = 'cuda'
    glister_trn.configdata.dss_args.fraction = fraction
    

    instead of

    glister_trn.cfg.train_args.print_every = 1
    glister_trn.cfg.train_args.device = 'cuda'
    glister_trn.cfg.dss_args.fraction = fraction
    
    opened by eendee 1
  • Evaluation on ImageNet

    Evaluation on ImageNet

    Hello, thanks for a very interesting and useful project.

    Could you mind providing an evaluation method for ImageNet? I tried to, adding loader for ImageNet to custom_dataset.py, but failed due to a GPU memory issue during subset selection.

    Many thanks!

    opened by Hayoung93 1
  • For GRAD_MATCH method, the weights associated with each data point in X(subset of training set)

    For GRAD_MATCH method, the weights associated with each data point in X(subset of training set)

    1. For GRAD-MATCH method, there are weights associated with each data point in X(subset of training set). Do the weights have physical significance? for example, if the value of the weight is higher, the relevant selected data has the greater contribution to the residual?
    2. During the iteration, the selective index is in the selected indices, so the iteration break. why this happen? [email protected]
    opened by lishaguo 1
  • Questions about accuracy logging

    Questions about accuracy logging

    Hello! Thanks for your great work.

    I'm currently working on this code and I want to ask a question about accuracy logging.

    https://github.com/decile-team/cords/blob/ff629ff15fac911cd3b82394ffd278c42dacd874/train.py#L530-L541

    In line 541 of train.py, val_acc contains cumulative accuracies over input batches. For example, if the loader contains 4500 examples and the batch size is 1000, then tst_acc has 5 accuracies per each evaluation. (the first element of tst_acc will be the accuracy over the first 1000 examples)

    https://github.com/decile-team/cords/blob/ff629ff15fac911cd3b82394ffd278c42dacd874/train.py#L631-L633

    In line 633, it prints the best value in tst_acc. In this case, the resulted best accuracies over different algorithms and seeds might be the values evaluated on different test samples.

    Is this what you intended? In my experience, I think evaluating algorithms on an identical test dataset is a convention. In addition, is the reported test accuracies in the GRAD-MATCH paper the best values as above or the last test accuracy?

    Best, Jang-Hyun

    opened by Janghyun1230 1
  • CORDS gradient calculations for different loss functions

    CORDS gradient calculations for different loss functions

    a) Implement gradient calculation for Squared Loss, Negative logistic loss, General loss function gradient computation, Hinge loss.

    b) Integrate the new gradient calculation with different selection strategies

    enhancement 
    opened by krishnatejakk 1
  • Refactor the folders in the repo

    Refactor the folders in the repo

    • Add a folder called benchmarks which has all the results/benchmarks for the various cases. We should remove the results from the main readme and point them to that folder. Also, add the notebooks to reproduce the benchmark results
    • Rename notebooks to tutorials. Add different tutorials based on use-cases (NLP, Vision, SSL, Hyper-parameter tunings, NAS, etc.)
    opened by rishabhk108 0
  • Inquiry about performance of gradmatch

    Inquiry about performance of gradmatch

    Hello, I ran some experiments with gradmatch and randomonline, and find these two actually reach similar performances after 300 epochs, which is around 93, is there something important to note for reproducing the results? Thanks for your help!

    opened by pipilurj 0
  • Implement faster version of OMP

    Implement faster version of OMP

    Implement the following versions of OMP:

    1. FNNOMP (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7012095)
    2. SNNOMP (https://hal.univ-lorraine.fr/hal-01585253/document)
    high priority in progress 
    opened by krishnatejakk 0
  • Gradmatch Data subset selection method making training slow

    Gradmatch Data subset selection method making training slow

    I tried to run some experiments as follows:

    • Ran full cifar10 without any subset selection method to train resnet50 which took around 32m 31s.
    • Ran Gradmatch cifar10 subset selection with 0.1 fractions taking longer time than full cifar10 i.e 22h 48m 40s.
    • Ran Gradmatch cifar10 subset selection with 0.3 fractions taking longer time than 0.1 Gradmatch selection method.

    I am using scaled resolution images of cifar10 i.e 224x224 resolution and accordingly defined resnet50 architecture. Can you let me know how to speed up experiments 2 and 3? In general subset selection method should faster the whole training process right?

    opened by animesh-007 9
  • Implement CRUST Algorithm

    Implement CRUST Algorithm

    1. Implement the CRUST strategy in the supervised learning setting.
    2. Create the CRUST data loader class building it on top of adaptive_dataloader class.
    enhancement 
    opened by krishnatejakk 0
Releases(v0.0.1)
  • v0.0.1(Mar 24, 2022)

    What's Changed

    • Selcon sahasra by @sahasrarjn in https://github.com/decile-team/cords/pull/73
    • Selcon sahasra by @sahasrarjn in https://github.com/decile-team/cords/pull/74

    New Contributors

    • @sahasrarjn made their first contribution in https://github.com/decile-team/cords/pull/73

    Full Changelog: https://github.com/decile-team/cords/compare/v0.0.0...v0.0.1

    Source code(tar.gz)
    Source code(zip)
  • v0.0.0(Mar 4, 2022)

    Pre-release of CORDS

    What's Changed

    • Dev by @krishnatejakk in https://github.com/decile-team/cords/pull/9
    • CONFIG Files Pull by @krishnatejakk in https://github.com/decile-team/cords/pull/10
    • New Gradient Computation Code by @krishnatejakk in https://github.com/decile-team/cords/pull/11
    • Feature: add support for hyperparameter tuning with subset selection by @savan77 in https://github.com/decile-team/cords/pull/12
    • Added checkpoints to save the model and updated documentation by @dheerajnbhat in https://github.com/decile-team/cords/pull/15
    • test CI and dual tests by @noilreed in https://github.com/decile-team/cords/pull/29
    • Dual CI flow merge to main by @noilreed in https://github.com/decile-team/cords/pull/30
    • Refactor/data loader by @krishnatejakk in https://github.com/decile-team/cords/pull/36
    • Refactor/data loader by @krishnatejakk in https://github.com/decile-team/cords/pull/40
    • Refactor/data loader by @krishnatejakk in https://github.com/decile-team/cords/pull/66

    New Contributors

    • @krishnatejakk made their first contribution in https://github.com/decile-team/cords/pull/9
    • @savan77 made their first contribution in https://github.com/decile-team/cords/pull/12
    • @dheerajnbhat made their first contribution in https://github.com/decile-team/cords/pull/15
    • @noilreed made their first contribution in https://github.com/decile-team/cords/pull/29

    Full Changelog: https://github.com/decile-team/cords/commits/v0.0.0

    Source code(tar.gz)
    Source code(zip)
Owner
decile-team
DECILE: Data EffiCient machIne LEarning
decile-team
Implementation of "RaScaNet: Learning Tiny Models by Raster-Scanning Image" from CVPR 2021.

RaScaNet: Learning Tiny Models by Raster-Scanning Images Deploying deep convolutional neural networks on ultra-low power systems is challenging, becau

SAIT (Samsung Advanced Institute of Technology) 5 Dec 26, 2022
Linear Variational State Space Filters

Linear Variational State Space Filters To set up the environment, use the provided scripts in the docker/ folder to build and run the codebase inside

0 Dec 13, 2021
3D Human Pose Machines with Self-supervised Learning

3D Human Pose Machines with Self-supervised Learning Keze Wang, Liang Lin, Chenhan Jiang, Chen Qian, and Pengxu Wei, “3D Human Pose Machines with Self

Chenhan Jiang 398 Dec 20, 2022
Continuous Diffusion Graph Neural Network

We present Graph Neural Diffusion (GRAND) that approaches deep learning on graphs as a continuous diffusion process and treats Graph Neural Networks (GNNs) as discretisations of an underlying PDE.

Twitter Research 227 Jan 05, 2023
Code of paper: "DropAttack: A Masked Weight Adversarial Training Method to Improve Generalization of Neural Networks"

DropAttack: A Masked Weight Adversarial Training Method to Improve Generalization of Neural Networks Abstract: Adversarial training has been proven to

倪仕文 (Shiwen Ni) 58 Nov 10, 2022
A pure PyTorch batched computation implementation of "CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition"

A pure PyTorch batched computation implementation of "CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition"

張致強 14 Dec 02, 2022
Diffusion Probabilistic Models for 3D Point Cloud Generation (CVPR 2021)

Diffusion Probabilistic Models for 3D Point Cloud Generation [Paper] [Code] The official code repository for our CVPR 2021 paper "Diffusion Probabilis

Shitong Luo 323 Jan 05, 2023
Fast and simple implementation of RL algorithms, designed to run fully on GPU.

RSL RL Fast and simple implementation of RL algorithms, designed to run fully on GPU. This code is an evolution of rl-pytorch provided with NVIDIA's I

Robotic Systems Lab - Legged Robotics at ETH Zürich 68 Dec 29, 2022
a curated list of docker-compose files prepared for testing data engineering tools, databases and open source libraries.

data-services A repository for storing various Data Engineering docker-compose files in one place. How to use it ? Set the required settings in .env f

BigData.IR 525 Dec 03, 2022
masscan + nmap + Finger

说明 个人根据使用习惯修改masnmap而来的一个小工具。调用masscan做全端口扫描,再调用nmap做服务识别,最后调用Finger做Web指纹识别。工具使用场景适合风险探测排查、众测等。 使用方法 安装依赖 pip3 install -r requirements.txt -i https:/

Ryan 3 Mar 25, 2022
A Python library for unevenly-spaced time series analysis

traces A Python library for unevenly-spaced time series analysis. Why? Taking measurements at irregular intervals is common, but most tools are primar

Datascope Analytics 516 Dec 29, 2022
Repository for the Bias Benchmark for QA dataset.

BBQ Repository for the Bias Benchmark for QA dataset. Authors: Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Tho

ML² AT CILVR 18 Nov 18, 2022
[CVPR 2021] Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach

Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach This is the repo to host the dataset TextSeg and code for TexRNe

SHI Lab 174 Dec 19, 2022
multimodal transformer

This repo holds the code to perform experiments with the multimodal autoregressive probabilistic model Transflower. Overview of the repo It is structu

Guillermo Valle 68 Dec 13, 2022
A general and strong 3D object detection codebase that supports more methods, datasets and tools (debugging, recording and analysis).

ALLINONE-Det ALLINONE-Det is a general and strong 3D object detection codebase built on OpenPCDet, which supports more methods, datasets and tools (de

Michael.CV 5 Nov 03, 2022
MGFN: Multi-Graph Fusion Networks for Urban Region Embedding was accepted by IJCAI-2022.

Multi-Graph Fusion Networks for Urban Region Embedding (IJCAI-22) This is the implementation of Multi-Graph Fusion Networks for Urban Region Embedding

202 Nov 18, 2022
Active window border replacement for window managers.

xborder Active window border replacement for window managers. Usage git clone https://github.com/deter0/xborder cd xborder chmod +x xborders ./xborder

deter 250 Dec 30, 2022
SOFT: Softmax-free Transformer with Linear Complexity, NeurIPS 2021 Spotlight

SOFT: Softmax-free Transformer with Linear Complexity SOFT: Softmax-free Transformer with Linear Complexity, Jiachen Lu, Jinghan Yao, Junge Zhang, Xia

Fudan Zhang Vision Group 272 Dec 25, 2022
Code for ICCV 2021 paper: ARAPReg: An As-Rigid-As Possible Regularization Loss for Learning Deformable Shape Generators..

ARAPReg Code for ICCV 2021 paper: ARAPReg: An As-Rigid-As Possible Regularization Loss for Learning Deformable Shape Generators.. Installation The cod

Bo Sun 132 Nov 28, 2022
Differentiable Quantum Chemistry (only Differentiable Density Functional Theory and Hartree Fock at the moment)

DQC: Differentiable Quantum Chemistry Differentiable quantum chemistry package. Currently only support differentiable density functional theory (DFT)

75 Dec 02, 2022