[AAAI2021] The source code for our paper 《Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion》.

Last update: Oct 16, 2022

Overview

DSM

The source code for the paper "Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion".

Project Website;

The dataset lists, some visualizations, and the pretrained weights are still being prepared.

1. Introduction (scene-dominated to motion-dominated)

Video datasets are usually scene-dominated. We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to motion information.

The generated triplet is shown below:

What does DSM learn?

With DSM pretraining, the model learns to focus on the motion region (not necessarily the actor) without a single label.
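To make the triplet idea concrete, here is a minimal sketch of a margin-based triplet objective on clip embeddings. It is not the loss shipped in src/loss: the Euclidean distance, the margin value, and the function names are illustrative assumptions, as is the reading that the positive clip keeps the anchor's motion while the negative clip keeps its scene.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the positive is closer than the negative by `margin`.

    anchor, positive, negative: embeddings of the three clips in one triplet.
    margin: hypothetical value chosen for illustration only.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

The loss only penalizes triplets where the negative embedding is not yet farther from the anchor than the positive by at least the margin, which is what pushes the representation toward motion rather than scene cues under the assumption above.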

2. Installation

Dataset

Please refer to dataset.md for details.

Requirements

Python3
pytorch1.1+
PIL
Intel (on the fly decode)

3. Structure

datasets
- list
  - hmdb51: the train/val lists of HMDB51
  - ucf101: the train/val lists of UCF101
  - kinetics-400: the train/val lists of kinetics-400
  - diving48: the train/val lists of diving48
experiments
- logs: detailed experiment records
- gradientes: gradient checks
- visualization:
src
- data: data loading
- loss: the losses evaluated in this paper
- model: network architectures
- scripts: train/eval scripts
- augment: detailed implementation of the spatio-temporal augmentation
- utils
- feature_extract.py: feature extractor for a given pretrained model
- main.py: the entry point for finetuning
- trainer.py
- option.py
- pt.py: self-supervised pretrain
- ft.py: supervised finetune

DSM(Triplet)/DSM/Random

Self-supervised Pretrain

Kinetics

bash scripts/kinetics/pt.sh

UCF101

bash scripts/ucf101/pt.sh

Supervised Finetune (Clip-level)

HMDB51

bash scripts/hmdb51/ft.sh

UCF101

bash scripts/ucf101/ft.sh

Kinetics

bash scripts/kinetics/ft.sh

Video-level Evaluation

Following the common practice of TSN and Non-local, the final video-level result is the average over 10 temporally sampled windows with corner crops, which gives better results than clip-level evaluation. Refer to test.py for details.
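The averaging step can be sketched as follows. This is an illustration under stated assumptions, not the code in test.py: the evenly spaced window starts, the averaging of softmax probabilities rather than raw logits, and the function names are all hypothetical. It uses only the standard library so it runs as is.

```python
import math


def window_starts(num_frames, clip_len, num_windows=10):
    """Evenly spaced start indices for `num_windows` clips of `clip_len` frames."""
    last = max(num_frames - clip_len, 0)  # latest valid start index
    if num_windows == 1:
        return [last // 2]
    return [round(i * last / (num_windows - 1)) for i in range(num_windows)]


def softmax(logits):
    """Numerically stable softmax over one clip's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_prediction(clip_logits):
    """Average per-clip class probabilities; return (predicted_class, mean_probs).

    clip_logits: one row of class logits per sampled clip (window and crop).
    """
    if not clip_logits:
        raise ValueError("need at least one clip")
    probs = [softmax(row) for row in clip_logits]
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    best = max(range(num_classes), key=mean.__getitem__)
    return best, mean
```

Each window and crop contributes one row to clip_logits, so a single ambiguous clip is outvoted by the others, which is why the video-level number is typically higher than the clip-level one.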

Pretrain and Evaluate in One Step

bash scripts/hmdb51/pt_and_ft_hmdb51.sh

Note: more training options and ablation studies can be found in scripts.

Video Retrieval and Other Visualization

(1). Feature Extractor

As STCR can be easily extended to other video representation tasks, we provide scripts for feature extraction.

python feature_extractor.py

The features are saved as a single NumPy file of shape [video_nums, features_dim] for further visualization.

(2). Retrieval Evaluation

Modify lines 60-62 in reterival.py.

python reterival.py
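For readers who want to check the retrieval numbers on their own features, the top-k metric can be sketched as below. This is not the code in reterival.py: the cosine similarity, the query/gallery split, and all function names are assumptions. A query counts as a hit at k if any of its k nearest gallery videos shares its label. Plain lists are used so the sketch runs without extra packages; rows loaded from the saved feature file can be passed in the same way.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels,
                ks=(1, 5, 10, 20, 50)):
    """Percentage of queries with a same-label video among the top-k neighbors."""
    gallery = [l2_normalize(g) for g in gallery_feats]
    hits = {k: 0 for k in ks}
    for feat, label in zip(query_feats, query_labels):
        q = l2_normalize(feat)
        # Cosine similarity to every gallery video.
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        ranked = sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)
        for k in ks:
            if any(gallery_labels[i] == label for i in ranked[:k]):
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(query_feats) for k in ks}
```

With features of shape [video_nums, features_dim], the gallery would typically hold the training-split rows and the queries the test-split rows, though the split used for the tables in this README is not stated here.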

Results

Action Recognition

UCF101 Pretrained (I3D)

Method	UCF101	HMDB51
Random Initialization	47.9	29.6
MoCo Baseline	62.3	36.5
DSM(Triplet)	70.7	48.5
DSM	74.8	52.5

Kinetics Pretrained

Video Retrieval (UCF101-C3D)

Method	@1	@5	@10	@20	@50
DSM	16.8	33.4	43.4	54.6	70.7

Video Retrieval (HMDB51-C3D)

Method	@1	@5	@10	@20	@50
DSM	8.2	25.9	38.1	52.0	75.0

More Visualization

Acknowledgement

This work is partly based on STN, UEL and MoCo.

License

Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@inproceedings{wang2020enhancing,
  author    = {Wang, Jinpeng and Gao, Yuting and Li, Ke and Hu, Jianguo and Jiang, Xinyang and Guo, Xiaowei and Ji, Rongrong and Sun, Xing},
  title     = {Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion},
  booktitle = {AAAI},
  year      = {2021},
}


Owner

Jinpeng Wang
