TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A good teacher is patient and consistent by Beyer et al.

Last update: Dec 20, 2022

Overview

FunMatch-Distillation

The techniques have been demonstrated using three datasets: Flowers102, Pet37, and Food101.

This repository provides Kaggle Kernel notebooks so that we can leverage the free TPU v3-8 to run the long training schedules. Please refer to the "About the notebooks" section.

Importance

The importance of knowledge distillation lies in its practical usefulness. With the recipes from "function matching", we can now perform knowledge distillation in a principled way, yielding student models that can actually match the performance of their teacher models. This essentially lets us compress bigger models into (much) smaller ones, reducing storage costs and improving inference speed.

Key ingredients

No use of ground-truth labels during distillation.
Teacher and student should see the same images during distillation, as opposed to differently augmented views of the same images.
An aggressive form of MixUp as the key augmentation recipe. MixUp is paired with "Inception-style" cropping (implemented in this script).
A long training schedule for distillation: at least 1000 epochs to get good results without overfitting. The paper studies this and finds the long schedule to be of paramount importance.
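The ingredients above can be sketched in a few lines. This is a minimal illustration and not the code from the notebooks: it uses NumPy instead of TF2 so that it runs anywhere, the teacher and student are toy linear maps, the objective is assumed to be the KL divergence between teacher and student class probabilities, and the MixUp coefficient is assumed to be drawn uniformly from [0, 1]. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mixup(images, rng):
    """Mix every image with its neighbour in the batch (roll by one).

    Assumption: the coefficient is sampled uniformly from [0, 1], one way to
    realise an "aggressive" MixUp.
    """
    lam = rng.uniform(0.0, 1.0)
    return lam * images + (1.0 - lam) * np.roll(images, 1, axis=0)


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over the batch; no ground-truth labels."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean())


rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32)).astype(np.float32)  # 8 flattened toy "images"
w_teacher = rng.normal(size=(32, 5))                   # toy teacher (assumption)
w_student = rng.normal(size=(32, 5))                   # toy student (assumption)

mixed = mixup(images, rng)           # augment once ...
teacher_logits = mixed @ w_teacher   # ... and feed the same batch to the teacher
student_logits = mixed @ w_student   # ... and to the student
loss = distillation_loss(teacher_logits, student_logits)
print(f"distillation loss: {loss:.4f}")
```

The point to notice is that `mixed` is computed once and passed to both models, so the student is trained to match the teacher's function on exactly the same inputs, and no label appears anywhere in the loss.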

Results

The table below summarizes the results of my experiments. In all cases, the teacher is a BiT-ResNet101x3 model and the student is a BiT-ResNet50x1 model. For fun, you can also try distilling into other model families. BiT stands for "Big Transfer" and was proposed in this paper.

Dataset	Teacher/Student	Top-1 Acc on Test	Location
Flowers102	Teacher	98.18%	Link
Flowers102	Student (1000 epochs)	81.02%	Link
Pet37	Teacher	90.92%	Link
Pet37	Student (300 epochs)	81.3%	Link
Pet37	Student (1000 epochs)	86%	Link
Food101	Teacher	85.52%	Link
Food101	Student (100 epochs)	76.06%	Link

(Location denotes the trained model location.)

These results are consistent with Table 4 of the original paper.

It should be noted that none of the above student training regimes showed signs of overfitting, so further improvements can likely be obtained by training for longer. The authors also showed that Shampoo can reach similar performance much quicker than Adam during distillation, so it may well be possible to get this performance in fewer epochs with Shampoo.
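To make the effect of schedule length concrete, the short script below computes the teacher-student gap in percentage points. The accuracies are copied from the results table; the variable names are only illustrative.

```python
# Top-1 test accuracies (%) copied from the results table.
teacher_acc = {"Flowers102": 98.18, "Pet37": 90.92, "Food101": 85.52}
student_acc = [
    ("Flowers102", 1000, 81.02),
    ("Pet37", 300, 81.3),
    ("Pet37", 1000, 86.0),
    ("Food101", 100, 76.06),
]

# Gap = teacher accuracy minus student accuracy, in percentage points.
gaps = {
    (dataset, epochs): round(teacher_acc[dataset] - acc, 2)
    for dataset, epochs, acc in student_acc
}

for (dataset, epochs), gap in gaps.items():
    print(f"{dataset:<11} {epochs:>4} epochs: gap = {gap:.2f} points")
```

On Pet37, going from 300 to 1000 epochs narrows the gap from 9.62 to 4.92 points.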

A few differences from the original implementation:

The authors use BiT-ResNet152x2 as a teacher.
The mixup() variant I used produces a pair of duplicate images when the number of images is even. With 8 workers, this becomes 8 pairs, which may have led to the reduced performance. We can overcome this by using tf.roll(images, 1, axis=0) instead of tf.reverse in the mixup() function. Thanks to Lucas Beyer for pointing this out.
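The pairing behind the second point can be inspected with plain index arithmetic. The sketch below uses NumPy stand-ins (`idx[::-1]` for tf.reverse along the batch axis, `np.roll` for tf.roll) and a hypothetical helper, `mixing_partners`, that is not part of the repository; it only lists which batch positions get mixed together.

```python
import numpy as np


def mixing_partners(batch_size, mode):
    """Return the set of unordered index pairs mixed together in one batch."""
    idx = np.arange(batch_size)
    if mode == "reverse":      # stand-in for tf.reverse(images, axis=[0])
        partner = idx[::-1]
    elif mode == "roll":       # stand-in for tf.roll(images, 1, axis=0)
        partner = np.roll(idx, 1, axis=0)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {frozenset((int(i), int(j))) for i, j in zip(idx, partner)}


for mode in ("reverse", "roll"):
    pairs = mixing_partners(8, mode)
    print(mode, len(pairs), sorted(tuple(sorted(p)) for p in pairs))
```

With a batch of 8, reversal yields only 4 distinct pairings (position i is always matched with position 7 - i, so each pairing occurs twice), while rolling by one yields 8.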

About the notebooks

All the notebooks are fully runnable on Kaggle Kernel. The only requirement is a billing-enabled GCP account, which is needed to use GCS buckets for storing data.

Notebook	Description	Kaggle Kernel
`train_bit.ipynb`	Shows how to train the teacher model.	Link
`train_bit_keras_tuner.ipynb`	Shows how to run hyperparameter tuning using Keras Tuner for the teacher model.	Link
`funmatch_distillation.ipynb`	Shows an implementation of the recipes from "function matching".	Link

These are only demonstrated on the Pet37 dataset but will work out-of-the-box for the other datasets too.

TFRecords

For convenience, TFRecords of different datasets are provided:

Dataset	TFRecords
Flowers102	Link
Pet37	Link
Food101	Link

Paper citation

@misc{beyer2021knowledge,
      title={Knowledge distillation: A good teacher is patient and consistent}, 
      author={Lucas Beyer and Xiaohua Zhai and Amélie Royer and Larisa Markeeva and Rohan Anil and Alexander Kolesnikov},
      year={2021},
      eprint={2106.05237},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

Huge thanks to Lucas Beyer (first author of the paper) for providing suggestions on the initial version of the implementation.

Thanks to the ML-GDE program for providing GCP credits.

Thanks to TRC for providing Cloud TPU access.

Releases (v4.0.0)

v4.0.0 (Jul 28, 2021)

v3.0.0 (Jul 28, 2021)

v2.0.0 (Jul 27, 2021)

v1.0.0 (Jul 25, 2021)

Owner

Sayak Paul
