Keras implementation of AdaBound

Overview

AdaBound for Keras

Keras port of AdaBound Optimizer for PyTorch, from the paper Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Usage

Add the adabound.py script to your project and import it. It can be used as a drop-in replacement for the Adam optimizer.

Also supports the AMSBound variant, which is to AdaBound what AMSGrad is to Adam (enabled via amsbound=True).

from adabound import AdaBound

optm = AdaBound(lr=1e-03,
                final_lr=0.1,
                gamma=1e-03,
                weight_decay=0.,
                amsbound=False)
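
For example, a minimal sketch of compiling a model with AdaBound in place of Adam (the tiny one-layer model below is an arbitrary placeholder):

from keras.models import Sequential
from keras.layers import Dense
from adabound import AdaBound

# Any Keras model works; this one-layer classifier is only for illustration.
model = Sequential([Dense(10, activation='softmax', input_shape=(784,))])
model.compile(optimizer=AdaBound(lr=1e-3, final_lr=0.1),
              loss='categorical_crossentropy',
              metrics=['accuracy'])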

Results

With a Wide ResNet 34, horizontal-flip data augmentation, and 100 epochs of training at batch size 128, it reaches 92.16% (called v1).

Weights are available inside the Releases tab

NOTE

  • The smaller ResNet 20 models have been removed, as they did not perform as expected and depended on a flaw in the initial implementation. The ResNet 32 results show the actual performance of this optimizer.

With a small ResNet 20, width-shift, height-shift, and horizontal-flip data augmentation, and 100 epochs of training at batch size 1024, it reaches 89.5% (called v1).

On a small ResNet 20 with only width-shift and height-shift augmentation, trained for 100 epochs at batch size 1024, the model gets close to 86% on the test set (called v3 below).

Training curves (train/test set accuracy and loss) are shown in the plots in the repository.

Requirements

  • Keras 2.2.4+ & TensorFlow 1.12+ (only the TF backend is supported for now).
  • NumPy
Comments
  • suggestion: allow training 2x or 3x bigger networks on the same VRAM with the TF backend

    Same as my PR https://github.com/keras-team/keras-contrib/pull/478; this works only with the TF backend.

    from keras import backend as K
    from keras.optimizers import Optimizer

    class AdaBound(Optimizer):
        """AdaBound optimizer.
        Default parameters follow those provided in the original paper.
        # Arguments
            lr: float >= 0. Learning rate.
            final_lr: float >= 0. Final learning rate.
            beta_1: float, 0 < beta < 1. Generally close to 1.
            beta_2: float, 0 < beta < 1. Generally close to 1.
            gamma: float >= 0. Convergence speed of the bound function.
            epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
            decay: float >= 0. Learning rate decay over each update.
            weight_decay: Weight decay weight.
            amsbound: boolean. Whether to apply the AMSBound variant of this
                algorithm.
            tf_cpu_mode: only for tensorflow backend
                  0 - default, no changes.
                  1 - allows to train x2 bigger network on same VRAM consuming RAM
                  2 - allows to train x3 bigger network on same VRAM consuming RAM*2
                      and CPU power.
        # References
            - [Adaptive Gradient Methods with Dynamic Bound of Learning Rate]
              (https://openreview.net/forum?id=Bkg3g2R9FX)
            - [Adam - A Method for Stochastic Optimization]
              (https://arxiv.org/abs/1412.6980v8)
            - [On the Convergence of Adam and Beyond]
              (https://openreview.net/forum?id=ryQu7f-RZ)
        """
    
        def __init__(self, lr=0.001, final_lr=0.1, beta_1=0.9, beta_2=0.999, gamma=1e-3,
                     epsilon=None, decay=0., amsbound=False, weight_decay=0.0, tf_cpu_mode=0, **kwargs):
            super(AdaBound, self).__init__(**kwargs)
    
            if not 0. <= gamma <= 1.:
                raise ValueError("Invalid `gamma` parameter. Must lie in [0, 1] range.")
    
            with K.name_scope(self.__class__.__name__):
                self.iterations = K.variable(0, dtype='int64', name='iterations')
                self.lr = K.variable(lr, name='lr')
                self.beta_1 = K.variable(beta_1, name='beta_1')
                self.beta_2 = K.variable(beta_2, name='beta_2')
                self.decay = K.variable(decay, name='decay')
    
            self.final_lr = final_lr
            self.gamma = gamma
    
            if epsilon is None:
                epsilon = K.epsilon()
            self.epsilon = epsilon
            self.initial_decay = decay
            self.amsbound = amsbound
    
            self.weight_decay = float(weight_decay)
            self.base_lr = float(lr)
            self.tf_cpu_mode = tf_cpu_mode
    
        def get_updates(self, loss, params):
            grads = self.get_gradients(loss, params)
            self.updates = [K.update_add(self.iterations, 1)]
    
            lr = self.lr
            if self.initial_decay > 0:
                lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                          K.dtype(self.decay))))
    
            t = K.cast(self.iterations, K.floatx()) + 1
    
            # Applies bounds on actual learning rate
            step_size = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                              (1. - K.pow(self.beta_1, t)))
    
            final_lr = self.final_lr * lr / self.base_lr
            lower_bound = final_lr * (1. - 1. / (self.gamma * t + 1.))
            upper_bound = final_lr * (1. + 1. / (self.gamma * t))
    
            e = K.tf.device("/cpu:0") if self.tf_cpu_mode > 0 else None
            if e: e.__enter__()
            ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            if self.amsbound:
                vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            else:
                vhats = [K.zeros(1) for _ in params]
            if e: e.__exit__(None, None, None)
            
            self.weights = [self.iterations] + ms + vs + vhats
    
            for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
                # apply weight decay
                if self.weight_decay != 0.:
                    g += self.weight_decay * K.stop_gradient(p)
    
                e = K.tf.device("/cpu:0") if self.tf_cpu_mode == 2 else None
                if e: e.__enter__()                    
                m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
                v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
                if self.amsbound:
                    vhat_t = K.maximum(vhat, v_t)
                    self.updates.append(K.update(vhat, vhat_t))
                if e: e.__exit__(None, None, None)
                
                if self.amsbound:
                    denom = (K.sqrt(vhat_t) + self.epsilon)
                else:
                    denom = (K.sqrt(v_t) + self.epsilon)                        
    
                # Compute the bounds
                step_size_p = step_size * K.ones_like(denom)
                step_size_p_bound = step_size_p / denom
                bounded_lr_t = m_t * K.minimum(K.maximum(step_size_p_bound,
                                                         lower_bound), upper_bound)
    
                p_t = p - bounded_lr_t
    
                self.updates.append(K.update(m, m_t))
                self.updates.append(K.update(v, v_t))
                new_p = p_t
    
                # Apply constraints.
                if getattr(p, 'constraint', None) is not None:
                    new_p = p.constraint(new_p)
    
                self.updates.append(K.update(p, new_p))
            return self.updates
    
        def get_config(self):
            config = {'lr': float(K.get_value(self.lr)),
                      'final_lr': float(self.final_lr),
                      'beta_1': float(K.get_value(self.beta_1)),
                      'beta_2': float(K.get_value(self.beta_2)),
                      'gamma': float(self.gamma),
                      'decay': float(K.get_value(self.decay)),
                      'epsilon': self.epsilon,
                      'weight_decay': self.weight_decay,
                      'amsbound': self.amsbound}
            base_config = super(AdaBound, self).get_config()
            return dict(list(base_config.items()) + list(config.items()))
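
    For reference, the VRAM saving comes from creating the optimizer's slot variables under tf.device("/cpu:0") so they live in host RAM. A minimal standalone sketch of the idea, assuming the TF 1.x graph-mode backend:

    import tensorflow as tf

    # Variables created under this scope are placed in host memory, so only
    # the model's parameters and activations occupy GPU VRAM.
    with tf.device("/cpu:0"):
        m = tf.Variable(tf.zeros([1024, 1024]), name="m")  # an optimizer slot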
    
    opened by iperov 13
  • AdaBound.iterations

    This parameter is not saved.

    I looked at the official PyTorch implementation from the original paper: https://github.com/Luolc/AdaBound/blob/master/adabound/adabound.py

    It has

    # State initialization
    if len(state) == 0:
        state['step'] = 0
    

    The state is saved with the optimizer.

    It also has

    # Exponential moving average of gradient values
    state['exp_avg'] = torch.zeros_like(p.data)
    # Exponential moving average of squared gradient values
    state['exp_avg_sq'] = torch.zeros_like(p.data)
    

    These values should also be saved.

    So your Keras implementation is wrong.
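
    For context, in Keras 2.x every variable listed in an optimizer's weights attribute is serialized by model.save() and restored by load_model(), and the implementation above does register iterations, ms, vs, and vhats there. A hypothetical smoke test of the save/load round trip:

    import numpy as np
    from keras.models import Sequential, load_model
    from keras.layers import Dense
    from adabound import AdaBound

    # Train briefly, save, and reload; the optimizer's slot variables
    # (iterations, m, v, vhat) travel inside the HDF5 file.
    model = Sequential([Dense(4, input_shape=(8,))])
    model.compile(optimizer=AdaBound(lr=1e-3), loss='mse')
    model.fit(np.random.rand(16, 8), np.random.rand(16, 4), epochs=1, verbose=0)
    model.save('tmp.h5')
    restored = load_model('tmp.h5', custom_objects={'AdaBound': AdaBound})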

    opened by iperov 10
  • Using SGDM with lr=0.1 leads to not learning

    Thanks for sharing your Keras version of AdaBound. I found that when changing the optimizer from AdaBound to SGDM (lr=0.1), the ResNet doesn't learn at all, as shown in the training-curve figure (image omitted).

    I remember that the original paper uses SGDM (lr=0.1) for its comparisons, and I'm wondering how this could be.

    opened by syorami 10
  • clip by value

    As in https://github.com/CyberZHG/keras-adabound/blob/master/keras_adabound/optimizers.py:

    K.minimum(K.maximum(step, lower_bound), upper_bound)

    Will this not work?
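
    For what it's worth, the min/max composition is exactly a clip-by-value; a quick NumPy check of the equivalence (standing in for the symbolic tensors):

    import numpy as np

    step, lower, upper = np.array([0.01, 0.5, 2.0]), 0.1, 1.0
    clipped = np.minimum(np.maximum(step, lower), upper)
    assert np.allclose(clipped, np.clip(step, lower, upper))  # [0.1, 0.5, 1.0]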

    opened by iperov 2
  • Unexpected keyword argument passed to optimizer: amsbound

    I installed with pip install keras-adabound, imported with from keras_adabound import AdaBound, and declared the optimizer as opt = AdaBound(lr=1e-03, final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False). Then I get the error: TypeError: Unexpected keyword argument passed to optimizer: amsbound

    Changing the pip install to adabound (instead of keras-adabound) and the import to from adabound import AdaBound, the keyword amsbound is recognized, but then I get the error: TypeError: __init__() missing 1 required positional argument: 'params'

    Am I mixing something up here or missing something?
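
    For context, those are two different projects: keras-adabound is CyberZHG's port (whose installed release, judging by the error, does not accept an amsbound keyword), while adabound is the official PyTorch package, whose optimizer requires a params argument like any torch.optim optimizer. To use this repository instead, copy its adabound.py next to your script:

    # assumes adabound.py from this repository sits on your import path
    from adabound import AdaBound

    opt = AdaBound(lr=1e-3, final_lr=0.1, gamma=1e-3,
                   weight_decay=0., amsbound=False)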

    opened by stabilus 0
  • Unclear how to import and use tf.keras version

    I have downloaded the files and placed them in a folder in the site-packages of my virtual environment, but I can't get this to work. I have added the folder path to sys.path and verified it is listed. I'm running TensorFlow 2.1.0. What am I doing wrong?

    opened by mnweaver1 0
  • about lr

    Thanks for a good optimizer. According to the usage, optm = AdaBound(lr=1e-03, final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False). Does the learning rate gradually increase with the number of steps?


    final_lr is described as "Final learning rate", but is it actually a learning rate relative to the base lr and the current learning rate? https://github.com/titu1994/keras-adabound/blob/5ce819b6ca1cd95e32d62e268bd2e0c99c069fe8/adabound.py#L72
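
    For reference, with decay=0 the internal lr stays at base_lr, so final_lr is used as-is; the lower and upper bounds computed in the code then close in on final_lr as t grows, annealing the effective step size from Adam-like toward SGD-like behavior. A small numeric sketch using the repository's bound formulas:

    # bounds from adabound.py: final_lr * (1 - 1/(gamma*t + 1)) and
    # final_lr * (1 + 1/(gamma*t)), with final_lr=0.1 and gamma=1e-3
    final_lr, gamma = 0.1, 1e-3
    for t in (1.0, 100.0, 10000.0):
        lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
        upper = final_lr * (1.0 + 1.0 / (gamma * t))
        print(t, lower, upper)
    # t=1:     lower ~ 1e-4,   upper ~ 100.1 (nearly unbounded -> Adam-like)
    # t=10000: lower ~ 0.0909, upper ~ 0.11  (tight around final_lr -> SGD-like)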

    opened by tanakataiki 1
Releases (0.1)
Owner
Somshubra Majumdar
Interested in Machine Learning, Deep Learning and Data Science in general