Benchmark for Tuning Accuracy and Efficiency

Overview

The benchmark includes our efforts in using Colossal-AI to train different tasks to achieve SOTA results. We are interested in both validataion accuracy and training speed, and prefer larger batch size to take advantage of more GPU devices. For example, we trained vision transformer with batch size 512 on CIFAR10 and 4096 on ImageNet1k, which are basically not used in existing works. Some of the results in the benchmark trained with 8x A100 are shown below.

Task	Model	Training Time	Top-1 Accuracy
CIFAR10	ViT-Lite-7/4	~ 16 min	~ 90.5%
ImageNet1k	ViT-S/16	~ 16.5 h	~ 74.5%

The train.py script in each task runs training with the specific configuration script in configs/ for different parallelisms. Supported parallelisms include data parallel only (ends with vanilla), 1D (ends with 1d), 2D (ends with 2d), 2.5D (ends with 2p5d), 3D (ends with 3d).

Each configuration scripts basically includes the following elements, taking ImageNet1k task as example:

TOTAL_BATCH_SIZE = 4096
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3

NUM_EPOCHS = 300
WARMUP_EPOCHS = 32

# data parallel only
TENSOR_PARALLEL_SIZE = 1    
TENSOR_PARALLEL_MODE = None

# parallelism setting
parallel = dict(
    pipeline=1,
    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
)

fp16 = dict(mode=AMP_TYPE.TORCH, ) # amp setting

gradient_accumulation = 2 # accumulate 2 steps for gradient update

BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation # actual batch size for dataloader

clip_grad_norm = 1.0 # clip gradient with norm 1.0

Upper case elements are basically what train.py needs, and lower case elements are what Colossal-AI needs to initialize the training.

Usage

To start training, use the following command to run each worker:

$ DATA=/path/to/dataset python train.py --world_size=WORLD_SIZE \
                                        --rank=RANK \
                                        --local_rank=LOCAL_RANK \
                                        --host=MASTER_IP_ADDRESS \
                                        --port=MASTER_PORT \
                                        --config=CONFIG_FILE

It is also recommended to start training with torchrun as:

$ DATA=/path/to/dataset torchrun --nproc_per_node=NUM_GPUS_PER_NODE \
                                 --nnodes=NUM_NODES \
                                 --node_rank=NODE_RANK \
                                 --master_addr=MASTER_IP_ADDRESS \
                                 --master_port=MASTER_PORT \
                                 train.py --config=CONFIG_FILE

ColossalAI-Benchmark - Performance benchmarking with ColossalAI

Related tags

Overview

Benchmark for Tuning Accuracy and Efficiency

Overview

Usage

Owner

HPC-AI Tech

LightSeq is a high performance training and inference library for sequence processing and generation implemented in CUDA

Running Google MoveNet Multipose Tracking models on OpenVINO.

Converting CPT to bert form for use

QuALITY: Question Answering with Long Input Texts, Yes!

Project page of the paper 'Analyzing Perception-Distortion Tradeoff using Enhanced Perceptual Super-resolution Network' (ECCVW 2018)

Buffon’s needle: one of the oldest problems in geometric probability

A PyTorch implementation of "ANEMONE: Graph Anomaly Detection with Multi-Scale Contrastive Learning", CIKM-21

🤗 Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX.

Dewarping Document Image By Displacement Flow Estimation with Fully Convolutional Network.

A real world application of a Recurrent Neural Network on a binary classification of time series data

Repository for scripts and notebooks from the book: Programming PyTorch for Deep Learning

[NeurIPS 2021] PyTorch Code for Accelerating Robotic Reinforcement Learning with Parameterized Action Primitives

A python implementation of Physics-informed Spline Learning for nonlinear dynamics discovery

Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

시각 장애인을 위한 스마트 지팡이에 활용될 딥러닝 모델 (DL Model Repo)

CTF challenges from redpwnCTF 2021

A pytorch-based deep learning framework for multi-modal 2D/3D medical image segmentation

Vertical Federated Principal Component Analysis and Its Kernel Extension on Feature-wise Distributed Data based on Pytorch Framework

SEC'21: Sparse Bitmap Compression for Memory-Efficient Training onthe Edge