In this project we investigate the performance of the SetCon model on realistic video footage. Therefore, we implemented the model in PyTorch and tested the model on two example videos.

Overview

Contrastive Learning of Object Representations

Supervisor:

Institutions:

Project Description

Contrastive Learning is an unsupervised method for learning similarities or differences in a dataset, without the need of labels. The main idea is to provide the machine with similar (so called positive samples) and with very different data (negative or corrupted samples). The task of the machine then is to leverage this information and to pull the positive examples in the embedded space together, while pushing the negative examples further apart. Next to being unsupervised, another major advantage is that the loss is applied on the latent space rather than being pixel-base. This saves computation and memory, because there is no need for a decoder and also delivers more accurate results.

eval_3_obj

In this work, we will investigate the SetCon model from 'Learning Object-Centric Video Models by Contrasting Sets' by Löwe et al. [1] (Paper) The SetCon model has been published in November 2020 by the Google Brain Team and introduces an attention-based object extraction in combination with contrastive learning. It incorporates a novel slot-attention module [2](Paper), which is an iterative attention mechanism to map the feature maps from the CNN-Encoder to a predefined number of object slots and has been inspired by the transformer models from the NLP world.

We investigate the utility of this architecture when used together with realistic video footage. Therefore, we implemented the SetCon with PyTorch according to its description and build upon it to meet our requirements. We then created two different datasets, in which we film given objects from different angles and distances, similar to Pirk [3] (Github, Paper). However, they relied on a faster-RCNN for the object detection, whereas the goal of the SetCon is to extract the objects solely by leveraging the contrastive loss and the slot attention module. By training a decoder on top of the learned representations, we found that in many cases the model can successfully extract objects from a scene.

This repository contains our PyTorch-implementation of the SetCon-Model from 'Learning Object-Centric Video Models by Contrasting Sets' by Löwe et al. Implementation is based on the description in the article. Note, this is not the official implementation. If you have questions, feel free to reach out to me.

Results

For our work, we have taken two videos, a Three-Object video and a Seven-Object video. In these videos we interacted with the given objects and moved them to different places and constantly changed the view perspective. Both are 30mins long, such that each contains about 54.000 frames.

eval_3_obj
Figure 1: An example of the object extraction on the test set of the Three-Object dataset.

We trained the contrastive pretext model (SetCon) on the first 80% and then evaluated the learned representations on the remaining 20%. Therefore, we trained a decoder, similar to the evaluation within the SetCon paper and looked into the specialisation of each slot. Figures 1 and 2 display two evaluation examples, from the test-set of the Three-Object Dataset and the Seven-Object Dataset. Bot figures start with the ground truth for three timestamps. During evaluation only the ground truth at t will be used to obtain the reconstructed object slots as well as their alpha masks. The Seven-Object video is itended to be more complex and one can perceive in figure 2 that the model struggles more than on the Three-Obejct dataset to route the objects to slots. On the Three-Object dataset, we achieved 0.0043 ± 0.0029 MSE and on the Seven-Object dataset 0.0154 ± 0.0043 MSE.

eval_7_obj
Figure 2: An example of the object extraction on the test set of the Seven-Object dataset.

How to use

For our work, we have taken two videos, a Three-Object video and Seven-Object video. Both datasets are saved as frames and are then encoded in a h5-files. To use a different dataset, we further provide a python routine process frames.py, which converts frames to h5 files.

For the contrastive pretext-task, the training can be started by:

python3 train_pretext.py --end 300000 --num-slots 7
        --name pretext_model_1 --batch-size 512
        --hidden-dim=1024 --learning-rate 1e-5
        --feature-dim 512 --data-path ’path/to/h5file’

Further arguments, like the size of the encoder or for an augmentation pipeline, use the flag -h for help. Afterwards, we froze the weights from the encoder and the slot-attention-module and trained a downstream decoder on top of it. The following command will train the decoder upon the checkpoint file from the pretext task:

python3 train_decoder.py --end 250000 --num-slots 7
        --name downstream_model_1 --batch-size 64
        --hidden-dim=1024 --feature-dim 512
        --data-path ’path/to/h5file’
        --pretext-path "path/to/pretext.pth.tar"
        --learning-rate 1e-5

For MSE evaluation on the test-set, use both checkpoints, from the pretext- model for the encoder- and slot-attention-weights and from the downstream- model for the decoder-weights and run:

python3 eval.py --num-slots 7 --name evaluation_1
        --batch-size 64 --hidden-dim=1024
        --feature-dim 512 --data-path ’path/to/h5file’
        --pretext-path "path/to/pretext.pth.tar"
        --decoder-path "path/to/decoder.pth.tar"

Implementation Adjustments

Instead of many small sequences of artificially created frames, we need to deal with a long video-sequence. Therefore, each element in our batch mirrors a single frame at a given time t, not a sequence. For this single frame at time t, we load its two predecessors, which are then used to predict the frame at t, and thereby create a positive example. Further, we found, that the infoNCE-loss to be numerically unstable in our case, hence we opted for the almost identical but more stable NT-Xent in our implementation.

References

[1] Löwe, Sindy et al. (2020). Learning object-centric video models by contrasting sets. Google Brain team.

[2] Locatello, Francesco et al. Object-centric learning with slot attention.

[3] Pirk, Sören et al. (2019). Online object representations with contrastive learning. Google Brain team.

Owner
Dirk Neuhäuser
Dirk Neuhäuser
Official PyTorch Implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning (NeurIPS 2021 Spotlight)

[NeurIPS 2021 Spotlight] HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning [Paper] This is Official PyTorch implementatio

42 Nov 01, 2022
Efficient Lottery Ticket Finding: Less Data is More

The lottery ticket hypothesis (LTH) reveals the existence of winning tickets (sparse but critical subnetworks) for dense networks, that can be trained in isolation from random initialization to match

VITA 20 Sep 04, 2022
A machine learning malware analysis framework for Android apps.

🕵️ A machine learning malware analysis framework for Android apps. ☢️ DroidDetective is a Python tool for analysing Android applications (APKs) for p

James Stevenson 77 Dec 27, 2022
Fight Recognition from Still Images in the Wild @ WACVW2022, Real-world Surveillance Workshop

Fight Detection from Still Images in the Wild Detecting fights from still images is an important task required to limit the distribution of social med

Şeymanur Aktı 10 Nov 09, 2022
Semantic segmentation models, datasets and losses implemented in PyTorch.

Semantic Segmentation in PyTorch Semantic Segmentation in PyTorch Requirements Main Features Models Datasets Losses Learning rate schedulers Data augm

Yassine 1.3k Jan 07, 2023
This repository contains project created during the Data Challenge module at London School of Hygiene & Tropical Medicine

LSHTM_RCS This repository contains project created during the Data Challenge module at London School of Hygiene & Tropical Medicine (LSHTM) in collabo

Lukas Kopecky 3 Jan 30, 2022
StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking

StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking Datasets You can download datasets that have been pre-pr

25 May 29, 2022
End-to-End Object Detection with Fully Convolutional Network

This project provides an implementation for "End-to-End Object Detection with Fully Convolutional Network" on PyTorch.

472 Dec 22, 2022
A web application that provides real time temperature and humidity readings of a house.

About A web application which provides real time temperature and humidity readings of a house. If you're interested in the data collected so far click

Ben Thompson 3 Jan 28, 2022
A pytorch reproduction of { Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation }.

A PyTorch Reproduction of HCN Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation. Ch

Guyue Hu 210 Dec 31, 2022
Unofficial pytorch implementation of 'Image Inpainting for Irregular Holes Using Partial Convolutions'

pytorch-inpainting-with-partial-conv Official implementation is released by the authors. Note that this is an ongoing re-implementation and I cannot f

Naoto Inoue 525 Jan 01, 2023
Turn based roguelike in python

pyTB Turn based roguelike in python Documentation can be found here: http://mcgillij.github.io/pyTB/index.html Screenshot Dependencies Written in Pyth

Jason McGillivray 4 Sep 29, 2022
Code for the paper "Next Generation Reservoir Computing"

Next Generation Reservoir Computing This is the code for the results and figures in our paper "Next Generation Reservoir Computing". They are written

OSU QuantInfo Lab 105 Dec 20, 2022
YoloAll is a collection of yolo all versions. you you use YoloAll to test yolov3/yolov5/yolox/yolo_fastest

官方讨论群 QQ群:552703875 微信群:15158106211(先加作者微信,再邀请入群) YoloAll项目简介 YoloAll是一个将当前主流Yolo版本集成到同一个UI界面下的推理预测工具。可以迅速切换不同的yolo版本,并且可以针对图片,视频,摄像头码流进行实时推理,可以很方便,直观

DL-Practise 244 Jan 01, 2023
WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking

WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking [Paper Link] Abstract In this work, we contribute a new million-scale Un

25 Jan 01, 2023
Latex code for making neural networks diagrams

PlotNeuralNet Latex code for drawing neural networks for reports and presentation. Have a look into examples to see how they are made. Additionally, l

Haris Iqbal 18.6k Jan 01, 2023
A highly efficient and modular implementation of Gaussian Processes in PyTorch

GPyTorch GPyTorch is a Gaussian process library implemented using PyTorch. GPyTorch is designed for creating scalable, flexible, and modular Gaussian

3k Jan 02, 2023
Selfplay In MultiPlayer Environments

This project allows you to train AI agents on custom-built multiplayer environments, through self-play reinforcement learning.

200 Jan 08, 2023
Updated for TTS(CE) = Also Known as TTN V3. The code requires the first server to be 'ttn' protocol.

Updated Updated for TTS(CE) = Also Known as TTN V3. The code requires the first server to be 'ttn' protocol. Introduction This balenaCloud (previously

Remko 1 Oct 17, 2021
PyTorch reimplementation of the paper Involution: Inverting the Inherence of Convolution for Visual Recognition [CVPR 2021].

Involution: Inverting the Inherence of Convolution for Visual Recognition Unofficial PyTorch reimplementation of the paper Involution: Inverting the I

Christoph Reich 100 Dec 01, 2022