PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

Last update: Jan 08, 2023

Related tags

Deep Learning moco-v3

Overview

MoCo v3 for Self-supervised ResNet and ViT

Introduction

This is a PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

The original MoCo v3 was implemented in Tensorflow and run in TPUs. This repo re-implements in PyTorch and GPUs. Despite the library and numerical differences, this repo reproduces the results and observations in the paper.

Main Results

The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning. All results in these tables are based on a batch size of 4096.

ResNet-50, linear classification

pretrain epochs	pretrain crops	linear acc
100	2x224	68.9
300	2x224	72.8
1000	2x224	74.6

ViT, linear classification

model	pretrain epochs	pretrain crops	linear acc
ViT-Small	300	2x224	73.2
ViT-Base	300	2x224	76.7

ViT, end-to-end fine-tuning

model	pretrain epochs	pretrain crops	e2e acc
ViT-Small	300	2x224	81.4
ViT-Base	300	2x224	83.2

The end-to-end fine-tuning results are obtained using the DeiT repo, using all the default DeiT configs. ViT-B is fine-tuned for 150 epochs (vs DeiT-B's 300ep, which has 81.8% accuracy).

Usage: Preparation

Install PyTorch and download the ImageNet dataset following the official PyTorch ImageNet training code. Similar to MoCo v1/2, this repo contains minimal modifications on the official PyTorch ImageNet code. We assume the user can successfully run the official PyTorch ImageNet code. For ViT models, install timm (timm==0.4.9).

The code has been tested with CUDA 10.2/CuDNN 7.6.5, PyTorch 1.9.0 and timm 0.4.9.

Usage: Self-supervised Pre-Training

Below are three examples for MoCo v3 pre-training.

ResNet-50 with 2-node (16-GPU) training, batch 4096

On the first node, run:

python main_moco.py \
  --moco-m-cos --crop-min=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On the second node, run the same command with --rank 1. With a batch size of 4096, the training can fit into 2 nodes with a total of 16 Volta 32G GPUs.

ViT-Small with 1-node (8-GPU) training, batch 1024

python main_moco.py \
  -a vit_small -b 1024 \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

ViT-Base with 8-node training, batch 4096

With a batch size of 4096, ViT-Base is trained with 8 nodes:

python main_moco.py \
  -a vit_base \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 8 --rank 0 \
  [your imagenet-folder with train and val folders]

On other nodes, run the same command with --rank 1, ..., --rank 7 respectively.

Notes:

The batch size specified by -b is the total batch size across all GPUs.
The learning rate specified by --lr is the base lr, and is adjusted by the linear lr scaling rule in this line.
Using a smaller batch size has a more stable result (see paper), but has lower speed. Using a large batch size is critical for good speed in TPUs (as we did in the paper).
In this repo, only multi-gpu, DistributedDataParallel training is supported; single-gpu or DataParallel training is not supported. This code is improved to better suit the multi-node setting, and by default uses automatic mixed-precision for pre-training.

Usage: Linear Classification

By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]

Usage: End-to-End Fine-tuning ViT

To perform end-to-end fine-tuning for ViT, use our script to convert the pre-trained ViT checkpoint to DEiT format:

python convert_to_deit.py \
  --input [your checkpoint path]/[your checkpoint file].pth.tar \
  --output [target checkpoint file].pth

Then run the training (in the DeiT repo) with the converted checkpoint:

python $DEIT_DIR/main.py \
  --resume [target checkpoint file].pth \
  --epochs 150

This gives us 83.2% accuracy for ViT-Base with 150-epoch fine-tuning.

Note:

We use --resume rather than --finetune in the DeiT repo, as its --finetune option trains under eval mode. When loading the pre-trained model, revise model_without_ddp.load_state_dict(checkpoint['model']) with strict=False.
Our ViT-Small is with heads=12 in the Transformer block, while by default in DeiT it is heads=6. Please modify the DeiT code accordingly when fine-tuning our ViT-Small model.

Model Configs

See the commands listed in CONFIG.md.

Transfer Learning

See the instruction in the transfer dir.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@Article{chen2021mocov3,
  author  = {Xinlei Chen* and Saining Xie* and Kaiming He},
  title   = {An Empirical Study of Training Self-Supervised Vision Transformers},
  journal = {arXiv preprint arXiv:2104.02057},
  year    = {2021},
}

PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

Related tags

Overview

MoCo v3 for Self-supervised ResNet and ViT

Introduction

Main Results

ResNet-50, linear classification

ViT, linear classification

ViT, end-to-end fine-tuning

Usage: Preparation

Usage: Self-supervised Pre-Training

ResNet-50 with 2-node (16-GPU) training, batch 4096

ViT-Small with 1-node (8-GPU) training, batch 1024

ViT-Base with 8-node training, batch 4096

Notes:

Usage: Linear Classification

Usage: End-to-End Fine-tuning ViT

Model Configs

Transfer Learning

License

Citation

Owner

Facebook Research

Paper list of log-based anomaly detection

🔥 TensorFlow Code for technical report: "YOLOv3: An Incremental Improvement"

A PyTorch Lightning Callback for pushing models to the Hugging Face Hub 🤗⚡️

Code and hyperparameters for the paper "Generative Adversarial Networks"

Official implementation of Sparse Transformer-based Action Recognition

Light-Head R-CNN

Semantic Image Synthesis with SPADE

The personal repository of the work: *DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer*.

Lab Materials for MIT 6.S191: Introduction to Deep Learning

Hierarchical Time Series Forecasting with a familiar API

A collection of resources on GAN Inversion.

This project deploys a yolo fastest model in the form of tflite on raspberry 3b+. The model is from another repository of mine called -Trash-Classification-Car

Official Repository for our ICCV2021 paper: Continual Learning on Noisy Data Streams via Self-Purified Replay

Code for reproducing experiments in "Improved Training of Wasserstein GANs"

PyKale is a PyTorch library for multimodal learning and transfer learning as well as deep learning and dimensionality reduction on graphs, images, texts, and videos

The code for 'Deep Residual Fourier Transformation for Single Image Deblurring'

Public scripts, services, and configuration for running a smart home K3S network cluster

Speech Enhancement Generative Adversarial Network Based on Asymmetric AutoEncoder

As-ViT: Auto-scaling Vision Transformers without Training

To model the probability of a soccer coach leave his/her team during Campeonato Brasileiro for 10 chosen teams and considering years 2018, 2019 and 2020.

The personal repository of the work: DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer.