Keras implementation of AdaBound

Overview

AdaBound for Keras

Keras port of AdaBound Optimizer for PyTorch, from the paper Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Usage

Add the adabound.py script to your project and import it. AdaBound can be used as a drop-in replacement for the Adam optimizer.

It also supports the AMSBound variant, which is analogous to the AMSGrad variant of Adam.

from adabound import AdaBound

optm = AdaBound(lr=1e-03,
                final_lr=0.1,
                gamma=1e-03,
                weight_decay=0.,
                amsbound=False)
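
To get a feel for what final_lr and gamma control, here is a minimal pure-Python sketch of the bound schedule. It uses the same formulas as the get_updates code quoted further down this page. The helper name adabound_bounds is hypothetical and is not part of adabound.py, and the sketch assumes no learning rate decay, so final_lr is not rescaled.

```python
def adabound_bounds(t, final_lr=0.1, gamma=1e-3):
    """Return (lower, upper) clipping bounds for the step size at step t >= 1.

    Hypothetical helper for illustration; it is not part of adabound.py.
    """
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    return lower, upper


# Print how the clipping interval narrows as training progresses.
for t in (1, 100, 10_000, 1_000_000):
    lower, upper = adabound_bounds(t)
    print(f"t={t:>9}: lower={lower:.6f}, upper={upper:.6f}")
```

Early in training the interval is wide, so the update behaves much like Adam. As t grows, both bounds approach final_lr and the update behaves more like SGD with that step size.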

Results

With a wide ResNet 34, horizontal-flip data augmentation, and 100 epochs of training at batch size 128, it reaches 92.16% (called v1).

Weights are available in the Releases tab.

NOTE

  • The smaller ResNet 20 models have been removed because they did not perform as expected and relied on a flaw in the initial implementation. The ResNet 32 shows the actual performance of this optimizer.

With a small ResNet 20, width, height, and horizontal-flip data augmentation, and 100 epochs of training at batch size 1024, it reaches 89.5% (called v1).

On a small ResNet 20 with only width and height data augmentation, trained for 100 epochs at batch size 1024, the model gets close to 86% on the test set (called v3 below).

[Figures: Train Set Accuracy, Train Set Loss, Test Set Accuracy, Test Set Loss]

Requirements

  • Keras 2.2.4+ and TensorFlow 1.12+ (only the TF backend is supported for now).
  • NumPy
Comments
  • suggestion: allow to train x2 or x3 bigger networks on same vram with TF backend

    This is the same as my PR https://github.com/keras-team/keras-contrib/pull/478. It works only with the TF backend.

    from keras import backend as K
    from keras.optimizers import Optimizer


    class AdaBound(Optimizer):
        """AdaBound optimizer.
        Default parameters follow those provided in the original paper.
        # Arguments
            lr: float >= 0. Learning rate.
            final_lr: float >= 0. Final learning rate.
            beta_1: float, 0 < beta < 1. Generally close to 1.
            beta_2: float, 0 < beta < 1. Generally close to 1.
            gamma: float >= 0. Convergence speed of the bound function.
            epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
            decay: float >= 0. Learning rate decay over each update.
            weight_decay: Weight decay weight.
            amsbound: boolean. Whether to apply the AMSBound variant of this
                algorithm.
            tf_cpu_mode: int, TensorFlow backend only.
                  0 - default, no changes.
                  1 - keeps the optimizer state (m, v, vhat) in CPU RAM, allowing
                      an x2 bigger network in the same VRAM.
                  2 - additionally computes the moment updates on the CPU, allowing
                      an x3 bigger network in the same VRAM at the cost of RAM*2
                      and CPU power.
        # References
            - [Adaptive Gradient Methods with Dynamic Bound of Learning Rate]
              (https://openreview.net/forum?id=Bkg3g2R9FX)
            - [Adam - A Method for Stochastic Optimization]
              (https://arxiv.org/abs/1412.6980v8)
            - [On the Convergence of Adam and Beyond]
              (https://openreview.net/forum?id=ryQu7f-RZ)
        """
    
        def __init__(self, lr=0.001, final_lr=0.1, beta_1=0.9, beta_2=0.999, gamma=1e-3,
                     epsilon=None, decay=0., amsbound=False, weight_decay=0.0, tf_cpu_mode=0, **kwargs):
            super(AdaBound, self).__init__(**kwargs)
    
            if not 0. <= gamma <= 1.:
                raise ValueError("Invalid `gamma` parameter. Must lie in [0, 1] range.")
    
            with K.name_scope(self.__class__.__name__):
                self.iterations = K.variable(0, dtype='int64', name='iterations')
                self.lr = K.variable(lr, name='lr')
                self.beta_1 = K.variable(beta_1, name='beta_1')
                self.beta_2 = K.variable(beta_2, name='beta_2')
                self.decay = K.variable(decay, name='decay')
    
            self.final_lr = final_lr
            self.gamma = gamma
    
            if epsilon is None:
                epsilon = K.epsilon()
            self.epsilon = epsilon
            self.initial_decay = decay
            self.amsbound = amsbound
    
            self.weight_decay = float(weight_decay)
            self.base_lr = float(lr)
            self.tf_cpu_mode = tf_cpu_mode
    
        def get_updates(self, loss, params):
            grads = self.get_gradients(loss, params)
            self.updates = [K.update_add(self.iterations, 1)]
    
            lr = self.lr
            if self.initial_decay > 0:
                lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                          K.dtype(self.decay))))
    
            t = K.cast(self.iterations, K.floatx()) + 1
    
            # Applies bounds on actual learning rate
            step_size = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                              (1. - K.pow(self.beta_1, t)))
    
            final_lr = self.final_lr * lr / self.base_lr
            lower_bound = final_lr * (1. - 1. / (self.gamma * t + 1.))
            upper_bound = final_lr * (1. + 1. / (self.gamma * t))
    
            e = K.tf.device("/cpu:0") if self.tf_cpu_mode > 0 else None
            if e: e.__enter__()
            ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            if self.amsbound:
                vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            else:
                vhats = [K.zeros(1) for _ in params]
            if e: e.__exit__(None, None, None)
            
            self.weights = [self.iterations] + ms + vs + vhats
    
            for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
                # apply weight decay
                if self.weight_decay != 0.:
                    g += self.weight_decay * K.stop_gradient(p)
    
                e = K.tf.device("/cpu:0") if self.tf_cpu_mode == 2 else None
                if e: e.__enter__()                    
                m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
                v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
                if self.amsbound:
                    vhat_t = K.maximum(vhat, v_t)
                    self.updates.append(K.update(vhat, vhat_t))
                if e: e.__exit__(None, None, None)
                
                if self.amsbound:
                    denom = (K.sqrt(vhat_t) + self.epsilon)
                else:
                    denom = (K.sqrt(v_t) + self.epsilon)                        
    
                # Compute the bounds
                step_size_p = step_size * K.ones_like(denom)
                step_size_p_bound = step_size_p / denom
                bounded_lr_t = m_t * K.minimum(K.maximum(step_size_p_bound,
                                                         lower_bound), upper_bound)
    
                p_t = p - bounded_lr_t
    
                self.updates.append(K.update(m, m_t))
                self.updates.append(K.update(v, v_t))
                new_p = p_t
    
                # Apply constraints.
                if getattr(p, 'constraint', None) is not None:
                    new_p = p.constraint(new_p)
    
                self.updates.append(K.update(p, new_p))
            return self.updates
    
        def get_config(self):
            config = {'lr': float(K.get_value(self.lr)),
                      'final_lr': float(self.final_lr),
                      'beta_1': float(K.get_value(self.beta_1)),
                      'beta_2': float(K.get_value(self.beta_2)),
                      'gamma': float(self.gamma),
                      'decay': float(K.get_value(self.decay)),
                      'epsilon': self.epsilon,
                      'weight_decay': self.weight_decay,
                      'amsbound': self.amsbound,
                      'tf_cpu_mode': self.tf_cpu_mode}
            base_config = super(AdaBound, self).get_config()
            return dict(list(base_config.items()) + list(config.items()))
    
    opened by iperov 13
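
For readers who want to trace the update rule in get_updates without a TensorFlow session, the following is a scalar, pure-Python sketch of one step. It mirrors the code above (moment estimates, bias-corrected step size, clipping, optional AMSBound maximum) but leaves out learning rate decay, constraints, and device placement. The function name adabound_step and the state dictionary are hypothetical, for illustration only.

```python
import math


def adabound_step(p, g, state, lr=1e-3, final_lr=0.1, beta_1=0.9, beta_2=0.999,
                  gamma=1e-3, epsilon=1e-7, weight_decay=0.0, amsbound=False):
    """Apply one scalar AdaBound update; `state` holds t, m, v and vhat.

    Hypothetical helper: a simplified scalar version of get_updates.
    """
    state["t"] += 1
    t = state["t"]

    # Weight decay is folded into the gradient, as in the quoted code.
    if weight_decay != 0.0:
        g = g + weight_decay * p

    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta_1 * state["m"] + (1.0 - beta_1) * g
    state["v"] = beta_2 * state["v"] + (1.0 - beta_2) * g * g
    if amsbound:
        # AMSBound keeps the running maximum of v.
        state["vhat"] = max(state["vhat"], state["v"])
        denom = math.sqrt(state["vhat"]) + epsilon
    else:
        denom = math.sqrt(state["v"]) + epsilon

    # Bias-corrected step size, then clipping into [lower, upper].
    step_size = lr * math.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    bounded = min(max(step_size / denom, lower), upper)
    return p - state["m"] * bounded


# Minimize f(p) = p**2, whose gradient is 2 * p.
state = {"t": 0, "m": 0.0, "v": 0.0, "vhat": 0.0}
p = 5.0
for _ in range(2000):
    p = adabound_step(p, 2.0 * p, state)
print(p)
```

The scalar form makes it easy to check by hand that the clipped quantity is the per-parameter step size (step_size / denom), not the gradient or the final update.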
  • AdaBound.iterations

    This parameter is not saved.

    I looked at the official PyTorch implementation from the original paper: https://github.com/Luolc/AdaBound/blob/master/adabound/adabound.py

    It has:

    # State initialization
    if len(state) == 0:
        state['step'] = 0
    

    The state is saved with the optimizer.

    It also has:

    # Exponential moving average of gradient values
    state['exp_avg'] = torch.zeros_like(p.data)
    # Exponential moving average of squared gradient values
    state['exp_avg_sq'] = torch.zeros_like(p.data)
    

    These values should also be saved.

    So your Keras implementation is wrong.

    opened by iperov 10
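
Whether or not the counter is persisted in a given setup, a small sketch can show what is at stake when resuming training. The bounds depend on the step count, so a restored counter keeps the narrow late-training interval, while a counter reset to 1 reopens the wide early-training interval. The state values below are made up for illustration, the helper name bounds is hypothetical, and JSON stands in for whatever checkpoint format is actually used.

```python
import json


def bounds(step, final_lr=0.1, gamma=1e-3):
    """Clipping interval at a given step (hypothetical helper)."""
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * step))
    return lower, upper


# Made-up optimizer state after a long training run.
trained_state = {"step": 50_000, "exp_avg": 0.02, "exp_avg_sq": 0.0004}

# Round-trip through a serialized form, as a checkpoint would.
restored_state = json.loads(json.dumps(trained_state))

# What a resumed run sees if the state is not restored.
reset_state = {"step": 1, "exp_avg": 0.0, "exp_avg_sq": 0.0}

print("restored:", bounds(restored_state["step"]))
print("reset:   ", bounds(reset_state["step"]))
```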
  • Using SGDM with lr=0.1 leads to not learning

    Thanks for sharing your Keras version of AdaBound. I found that when changing the optimizer from AdaBound to SGDM (lr=0.1), the ResNet doesn't learn at all, as in the figure below. [image]

    I remember that in the original paper it uses SGDM (lr=0.1) for comparisons and I'm wondering how this could be.

    opened by syorami 10
  • clip by value

    https://github.com/CyberZHG/keras-adabound/blob/master/keras_adabound/optimizers.py

    K.minimum(K.maximum(step, lower_bound), upper_bound)

    will not work?

    opened by iperov 2
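
As a scalar analogue of the expression in the question, the sketch below checks that nesting a maximum inside a minimum gives the same result as an explicit clamp, provided lower <= upper. Python's built-in min and max stand in for the element-wise K.minimum and K.maximum, and both function names are hypothetical.

```python
def clip_by_value(x, lower, upper):
    """Reference clamp written with explicit branches."""
    if x < lower:
        return lower
    if x > upper:
        return upper
    return x


def min_max_clip(x, lower, upper):
    """Clamp written as min(max(x, lower), upper)."""
    return min(max(x, lower), upper)


samples = [-1.0, 0.0, 0.05, 0.1, 0.5, 100.0]
lower, upper = 0.05, 0.5

# Both forms agree on every sample, including the boundary values.
for x in samples:
    assert clip_by_value(x, lower, upper) == min_max_clip(x, lower, upper)

print([min_max_clip(x, lower, upper) for x in samples])
# prints [0.05, 0.05, 0.05, 0.1, 0.5, 0.5]
```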
  • Unexpected keyword argument passed to optimizer: amsbound

    I installed with pip install keras-adabound, imported with from keras_adabound import AdaBound, and declared the optimizer as:

        opt = AdaBound(lr=1e-03, final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False)

    Then I get the error:

        TypeError: Unexpected keyword argument passed to optimizer: amsbound

    After changing the pip install to adabound (instead of keras-adabound) and the import to from adabound import AdaBound, the keyword amsbound is recognized, but then I get the error:

        TypeError: __init__() missing 1 required positional argument: 'params'

    Am I mixing something up here or missing something?

    opened by stabilus 0
  • Unclear how to import and use tf.keras version

    I have downloaded the files and placed them in a folder in the site packages for my virtual environment but I can't get this to work. I have added the folder path to sys.path and verified it is listed. I'm running Tensorflow 2.1.0. What am I doing wrong?

    opened by mnweaver1 0
  • about lr

    Thanks for a good optimizer. According to the usage example,

        optm = AdaBound(lr=1e-03, final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False)

    does the learning rate gradually increase with the number of steps?

    final_lr is described as the final learning rate, but is it actually a learning rate relative to the base lr and the current learning rate? https://github.com/titu1994/keras-adabound/blob/5ce819b6ca1cd95e32d62e268bd2e0c99c069fe8/adabound.py#L72

    opened by tanakataiki 1
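
One reading of the question, based on the get_updates code quoted earlier on this page (final_lr = self.final_lr * lr / self.base_lr), is that final_lr is rescaled by the ratio of the current learning rate to the learning rate given at construction. The sketch below shows that arithmetic with the time-based decay from the same code. The function name scaled_final_lr is hypothetical, and it assumes lr is changed only by decay.

```python
def scaled_final_lr(iterations, lr=1e-3, final_lr=0.1, decay=0.0):
    """Effective final_lr after `iterations` updates (hypothetical helper).

    Assumes the base learning rate equals `lr` and that only `decay`
    changes the current learning rate.
    """
    current_lr = lr / (1.0 + decay * iterations)
    return final_lr * current_lr / lr


# Without decay the ratio is 1, so final_lr is used unchanged.
print(scaled_final_lr(0))

# With decay, final_lr shrinks by the same factor as the learning rate.
print(scaled_final_lr(1000, decay=1e-3))
```

Under this reading, final_lr is an absolute target when decay is 0 and is scaled down together with lr otherwise.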
Releases(0.1)
Owner
Somshubra Majumdar
Interested in Machine Learning, Deep Learning and Data Science in general