SAGE: Sensitivity-guided Adaptive Learning Rate for Transformers

Overview

This repo contains the code for our paper "No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models" (ICLR 2022).


Getting Started

  1. Pull and run the Docker image (a run command is sketched after this list)
    pytorch/pytorch:1.5.1-cuda10.1-cudnn7-devel
  2. Install requirements
    pip install -r requirements.txt
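
For reference, here is a minimal sketch of pulling and starting the container. The mount path and GPU flag are assumptions and may need to be adapted to your local Docker/NVIDIA setup:

docker pull pytorch/pytorch:1.5.1-cuda10.1-cudnn7-devel
# Mount this repo into the container's default workdir; adjust --gpus to your Docker/NVIDIA runtime.
docker run -it --gpus all -v $(pwd):/workspace pytorch/pytorch:1.5.1-cuda10.1-cudnn7-devel bash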

Data and Model

  1. Download data and pre-trained models
    ./download.sh
    Please refer to the GLUE benchmark (https://gluebenchmark.com/) for details on the GLUE datasets.
  2. Preprocess data
    ./experiments/glue/prepro.sh
    For the most up-to-date data-processing details, please refer to the mt-dnn repo.

Fine-tuning Pre-trained Models using SAGE

We provide an example script for fine-tuning a pre-trained BERT-base model on MNLI using Adamax-SAGE:

./scripts/train_mnli_usadamax.sh GPUID

A few notes:

  • learning_rate and beta3 are two of the most important hyper-parameters. A learning_rate that works well for Adamax/AdamW-SAGE is usually 2 to 5 times larger than one that works well for Adamax/AdamW, depending on the task. A beta3 that works well for Adamax/AdamW-SAGE usually lies between 0.6 and 0.9, depending on the task.

  • To use AdamW-SAGE, set the argument --optim=usadamw. The current codebase only contains the implementations of Adamax-SAGE and AdamW-SAGE; please refer to module/bert_optim.py for details, and to our paper for how to integrate SAGE with other optimizers.

  • To fine-tune a pre-trained RoBERTa-base model, set the argument --init_checkpoint to the model path and --encoder_type to 2. Other supported models are listed in pretrained_models.py.

  • To fine-tune on other tasks, set the arguments --train_datasets and --test_datasets to the corresponding task names. An illustrative command combining these arguments is sketched below.
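
Putting these notes together, a fine-tuning command might look like the sketch below. The entry-point script and the argument names for the learning rate and beta3 (here --learning_rate and --beta3) are assumptions for illustration; please check the provided scripts under ./scripts/ for the arguments this codebase actually uses, and treat the values as placeholders rather than recommendations.

# Hypothetical invocation; the script name, argument names, and values are illustrative only.
python train.py \
  --optim=usadamw \
  --init_checkpoint path/to/roberta_base_checkpoint \
  --encoder_type 2 \
  --train_datasets mnli \
  --test_datasets mnli_matched,mnli_mismatched \
  --learning_rate 1e-4 \
  --beta3 0.7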


Citation

@inproceedings{liang2022no,
  title={No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models},
  author={Chen Liang and Haoming Jiang and Simiao Zuo and Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen and Tuo Zhao},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=cuvga_CiVND}
}

Contact Information

For help or issues related to this package, please submit a GitHub issue. For personal questions related to this paper, please contact Chen Liang ([email protected]).
