The pure and clear PyTorch Distributed Training Framework.

Last update: Dec 20, 2022

Overview

The pure and clear PyTorch Distributed Training Framework.

Introduction
Requirements and Usage
Baselines
Zombie processes problem
Acknowledgments
Citation

Introduction

Distribuuuu is a Distributed Classification Training Framework powered by native PyTorch.

Please check tutorial for detailed Distributed Training tutorials:

Single Node Single GPU Card Training [snsc.py]
Single Node Multi-GPU Cards Training (with DataParallel) [snmc_dp.py]
Multiple Nodes Multi-GPU Cards Training (with DistributedDataParallel)
- torch.distributed.launch [mnmc_ddp_launch.py]
- torch.multiprocessing [mnmc_ddp_mp.py]
- Slurm Workload Manager [mnmc_ddp_slurm.py]
ImageNet training example [imagenet.py]

For the complete training framework, please see distribuuuu.

Requirements and Usage

Dependency

Install PyTorch>= 1.6 (has been tested on 1.6, 1.7.1, 1.8 and 1.8.1)
Install other dependencies: pip install -r requirements.txt

Dataset

Download the ImageNet dataset and move validation images to labeled subfolders, using the script valprep.sh.

Expected datasets structure for ILSVRC

ILSVRC
|_ train
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ val
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ ...

Create a directory containing symlinks:

mkdir -p /path/to/distribuuuu/data

Symlink ILSVRC:

ln -s /path/to/ILSVRC /path/to/distribuuuu/data/ILSVRC

Basic Usage

Single Node with one task

# 1 node, 8 GPUs
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Distribuuuu use yacs, a elegant and lightweight package to define and manage system configurations. You can setup config via a yaml file, and overwrite by other opts. If the yaml is not provided, the default configuration file will be used, please check distribuuuu/config.py.

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml \
    OUT_DIR /tmp \
    MODEL.SYNCBN True \
    TRAIN.BATCH_SIZE 256

# --cfg config/resnet18.yaml parse config from file
# OUT_DIR /tmp            overwrite OUT_DIR
# MODEL.SYNCBN True       overwrite MODEL.SYNCBN
# TRAIN.BATCH_SIZE 256    overwrite TRAIN.BATCH_SIZE

Single Node with two tasks

# 1 node, 2 task, 4 GPUs per task (8GPUs)
# task 1:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# task 2:
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Multiple Nodes Training

# 2 node, 8 GPUs per node (16GPUs)
# node 1:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# node 2:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Slurm Cluster Usage

# see srun --help 
# and https://slurm.schedmd.com/ for details

# example: 64 GPUs
# batch size = 64 * 128 = 8192
# itertaion = 128k / 8192 = 156 
# lr = 64 * 0.1 = 6.4

srun --partition=openai-a100 \
     -n 64 \
     --gres=gpu:8 \
     --ntasks-per-node=8 \
     --job-name=Distribuuuu \
     python -u train_net.py --cfg config/resnet18.yaml \
     TRAIN.BATCH_SIZE 128 \
     OUT_DIR ./resnet18_8192bs \
     OPTIM.BASE_LR 6.4

Baselines

Baseline models trained by Distribuuuu:

We use SGD with momentum of 0.9, a half-period cosine schedule, and train for 100 epochs.
We use a reference learning rate of 0.1 and a weight decay of 5e-5 (1e-5 For EfficientNet).
The actual learning rate(Base LR) for each model is computed as (batch-size / 128) * reference-lr.
Only standard data augmentation techniques(RandomResizedCrop and RandomHorizontalFlip) are used.

PS: use other robust tricks(more epochs, efficient data augmentation, etc.) to get better performance.

Arch	Params(M)	Total batch	Base LR	[email protected]	[email protected]	model / config
resnet18	11.690	256 (32*8GPUs)	0.2	70.902	89.894	Drive / cfg
resnet18	11.690	1024 (128*8GPUs)	0.8	70.994	89.892
resnet18	11.690	8192 (128*64GPUs)	6.4	70.165	89.374
resnet18	11.690	16384 (256*64GPUs)	12.8	68.766	88.381
efficientnet_b0	5.289	512 (64*8GPUs)	0.4	74.540	91.744	Drive / cfg
resnet50	25.557	256 (32*8GPUs)	0.2	77.252	93.430	Drive / cfg
botnet50	20.859	256 (32*8GPUs)	0.2	77.604	93.682	Drive / cfg
regnetx_160	54.279	512 (64*8GPUs)	0.4	79.992	95.118	Drive / cfg
regnety_160	83.590	512 (64*8GPUs)	0.4	80.598	95.090	Drive / cfg
regnety_320	145.047	512 (64*8GPUs)	0.4	80.824	95.276	Drive / cfg

Zombie processes problem

Before PyTorch1.8, torch.distributed.launch will leave some zombie processes after using Ctrl + C, try to use the following cmd to kill the zombie processes. (fairseq/issues/487):

kill $(ps aux | grep YOUR_SCRIPT.py | grep -v grep | awk '{print $2}')

PyTorch >= 1.8 is suggested, which fixed the issue about zombie process. (pytorch/pull/49305)

Acknowledgments

Provided codes were adapted from:

I strongly recommend you to choose pycls, a brilliant image classification codebase and adopted by a number of projects at Facebook AI Research.

Citation

@misc{bigballon2021distribuuuu,
  author = {Wei Li},
  title = {Distribuuuu: The pure and clear PyTorch Distributed Training Framework},
  howpublished = {\url{https://github.com/BIGBALLON/distribuuuu}},
  year = {2021}
}

Feel free to contact me if you have any suggestions or questions, issues are welcome, create a PR if you find any bugs or you want to contribute. 🍰

The pure and clear PyTorch Distributed Training Framework.

Related tags

Overview

Introduction

Requirements and Usage

Dependency

Dataset

Basic Usage

Slurm Cluster Usage

Baselines

Zombie processes problem

Acknowledgments

Citation

Owner

WILL LEE

A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

Controlling the MicriSpotAI robot from scratch

Implementation of Neural Distance Embeddings for Biological Sequences (NeuroSEED) in PyTorch

PyTorch implementation for OCT-GAN Neural ODE-based Conditional Tabular GANs (WWW 2021)

Satellite labelling tool for manual labelling of storm top features such as overshooting tops, above-anvil plumes, cold U/Vs, rings etc.

PyTorch implementation of neural style transfer algorithm

This repo is a C++ version of yolov5_deepsort_tensorrt. Packing all C++ programs into .so files, using Python script to call C++ programs further.

A set of tools for converting a darknet dataset to COCO format working with YOLOX

TensorFlow implementation of ENet

The official implementation of Autoregressive Image Generation using Residual Quantization (CVPR '22)

ByteTrack(Multi-Object Tracking by Associating Every Detection Box)のPythonでのONNX推論サンプル

SMORE: Knowledge Graph Completion and Multi-hop Reasoning in Massive Knowledge Graphs

Model Quantization Benchmark

Official repository for "Orthogonal Projection Loss" (ICCV'21)

Twin-deep neural network for semi-supervised learning of materials properties

simple_pytorch_example project is a toy example of a python script that instantiates and trains a PyTorch neural network on the FashionMNIST dataset

A Peer-to-peer Platform for Secure, Privacy-preserving, Decentralized Data Science

A lightweight deep network for fast and accurate optical flow estimation.

Code for reproducing our paper: LMSOC: An Approach for Socially Sensitive Pretraining

Instance-wise Feature Importance in Time (FIT)