[CVPR'22] Official PyTorch Implementation of Collaborative Transformers for Grounded Situation Recognition

Overview

[CVPR'22] Collaborative Transformers for Grounded Situation Recognition

Paper | Model Checkpoint

  • This is the official PyTorch implementation of Collaborative Transformers for Grounded Situation Recognition.
  • CoFormer (Collaborative Glance-Gaze TransFormer) achieves state-of-the-art accuracy in every evaluation metric on the SWiG dataset.
  • This repository contains instructions, code and model checkpoint.

prediction_results


Overview

Grounded situation recognition is the task of predicting the main activity, entities playing certain roles within the activity, and bounding-box groundings of the entities in the given image. To effectively deal with this challenging task, we introduce a novel approach where the two processes for activity classification and entity estimation are interactive and complementary. To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation. Glance transformer predicts the main activity with the help of Gaze transformer that analyzes entities and their relations, while Gaze transformer estimates the grounded entities by focusing only on the entities relevant to the activity predicted by Glance transformer. Our CoFormer achieves the state of the art in all evaluation metrics on the SWiG dataset.

overall_architecture Following conventions in the literature, we call an activity verb and an entity noun. Glance transformer predicts a verb with the help of Gaze-Step1 transformer that analyzes nouns and their relations by leveraging role features, while Gaze-Step2 transformer estimates the grounded nouns for the roles associated with the predicted verb. Prediction results are obtained by feed forward networks (FFNs).

Environment Setup

We provide instructions for environment setup.

# Clone this repository and navigate into the repository
git clone https://github.com/jhcho99/CoFormer.git    
cd CoFormer                                          

# Create a conda environment, activate the environment and install PyTorch via conda
conda create --name CoFormer python=3.9              
conda activate CoFormer                             
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge 

# Install requirements via pip
pip install -r requirements.txt                   

SWiG Dataset

Annotations are given in JSON format, and annotation files are under "SWiG/SWiG_jsons/" directory. Images can be downloaded here. Please download the images and store them in "SWiG/images_512/" directory.

In the SWiG dataset, each image is associated with Verb, Frame and Groundings.

A) Verb: each image is paired with a verb. In the annotation file, "verb" denotes the salient action for an image.

B) Frame: a frame denotes the set of semantic roles for a verb. For example, the frame for verb "Drinking" denotes the set of semantic roles "Agent", "Liquid", "Container" and "Place". In the annotation file, "frames" show the set of semantic roles for a verb, and noun annotations for each role. There are three noun annotations for each role, which are given by three different annotators.

C) Groundings: each grounding is described in [x1, y1, x2, y2] format. In the annotation file, "bb" denotes bounding-box groundings for roles. Note that nouns can be labeled without groundings, e.g., in the case of occluded objects. When there is no grounding for a role, [-1, -1, -1, -1] is given.

# an example of annotation for an image

"drinking_235.jpg": {
    "verb": "drinking",
    "height": 512, 
    "width": 657, 
    "bb": {"agent": [0, 1, 654, 512], 
           "liquid": [128, 273, 293, 382], 
           "container": [111, 189, 324, 408],
           "place": [-1, -1, -1, -1]},
    "frames": [{"agent": "n10787470", "liquid": "n14845743", "container": "n03438257", "place": ""}, 
               {"agent": "n10129825", "liquid": "n14845743", "container": "n03438257", "place": ""}, 
               {"agent": "n10787470", "liquid": "n14845743", "container": "n03438257", "place": ""}]
    }

In imsitu_space.json file, there is additional information for verb and noun.

# an example of additional verb information

"drinking": {
    "framenet": "Ingestion", 
    "abstract": "the AGENT drinks a LIQUID from a CONTAINER at a PLACE", 
    "def": "take (a liquid) into the mouth and swallow", 
    "order": ["agent", "liquid", "container", "place"], 
    "roles": {"agent": {"framenet": "ingestor", "def": "The entity doing the drink action"},
              "liquid": {"framenet": "ingestibles", "def": "The entity that the agent is drinking"}
              "container": {"framenet": "source", "def": "The container in which the liquid is in"}        
              "place": {"framenet": "place", "def": "The location where the drink event is happening"}}
    }
# an example of additional noun information

"n14845743": {
    "gloss": ["water", "H2O"], 
    "def": "binary compound that occurs at room temperature as a clear colorless odorless tasteless liquid; freezes into ice below 0 degrees centigrade and boils above 100 degrees centigrade; widely used as a solvent"
    }

Additional Details

  • All images should be under "SWiG/images_512/" directory.
  • train.json file is for train set.
  • dev.json file is for development set.
  • test.json file is for test set.

Training

To train CoFormer on a single node with 4 GPUs for 40 epochs, run:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py \
           --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
           --num_workers 4 --num_glance_enc_layers 3 --num_gaze_s1_dec_layers 3 \
           --num_gaze_s1_enc_layers 3 --num_gaze_s2_dec_layers 3 --dropout 0.15 --hidden_dim 512 \
           --output_dir CoFormer

To train CoFormer on a Slurm cluster with submitit using 4 RTX 3090 GPUs for 40 epochs, run:

python run_with_submitit.py --ngpus 4 --nodes 1 --job_dir CoFormer \
        --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
        --num_workers 4 --num_glance_enc_layers 3 --num_gaze_s1_dec_layers 3 \
        --num_gaze_s1_enc_layers 3 --num_gaze_s2_dec_layers 3 --dropout 0.15 --hidden_dim 512 \
        --partition rtx3090
  • A single epoch takes about 45 minutes. Training CoFormer for 40 epochs takes around 30 hours on a single machine with 4 RTX 3090 GPUs.
  • We use AdamW optimizer with learning rate 10-4 (10-5 for backbone), weight decay 10-4 and β = (0.9, 0.999).
    • Those learning rates are divided by 10 at epoch 30.
  • Random Color Jittering, Random Gray Scaling, Random Scaling and Random Horizontal Flipping are used for augmentation.

Evaluation

To evaluate CoFormer on the dev set with the saved model, run:

python main.py --saved_model CoFormer_checkpoint.pth --output_dir CoFormer --dev

To evaluate CoFormer on the test set with the saved model, run:

python main.py --saved_model CoFormer_checkpoint.pth --output_dir CoFormer --test
  • Model checkpoint can be downloaded here.

Inference

To run an inference on a custom image, run:

python inference.py --image_path inference/filename.jpg \
                    --saved_model CoFormer_checkpoint.pth \
                    --output_dir inference

Results

We provide several experimental results.

quantitative qualitative_1 qualitative_2

Our Previous Work

We proposed GSRTR for this task using a simple transformer encoder-decoder architecture:

Acknowledgements

Our code is modified and adapted from these amazing repositories:

Contact

Junhyeong Cho ([email protected])

Citation

If you find our work useful for your research, please cite our paper:

@InProceedings{cho2022CoFormer,
    title={Collaborative Transformers for Grounded Situation Recognition},
    author={Junhyeong Cho and Youngseok Yoon and Suha Kwak},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2022}
}

License

CoFormer is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Owner
Junhyeong Cho
Studied @ POSTECH, Stanford, UIUC, UC Berkeley
Junhyeong Cho
Snscrape-jsonl-urls-extractor - Extracts urls from jsonl produced by snscrape

snscrape-jsonl-urls-extractor extracts urls from jsonl produced by snscrape Usag

1 Feb 26, 2022
Perform zero-order Hankel Transform for an 1D array (float or real valued).

perform zero-order Hankel Transform for an 1D array (float or real valued). An discrete form of Parseval theorem is guaranteed. Suit for iterative problems.

1 Jan 17, 2022
[CVPR2021] Look before you leap: learning landmark features for one-stage visual grounding.

LBYL-Net This repo implements paper Look Before You Leap: Learning Landmark Features For One-Stage Visual Grounding CVPR 2021. Getting Started Prerequ

SVIP Lab 45 Dec 12, 2022
O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis

O-CNN This repository contains the implementation of our papers related with O-CNN. The code is released under the MIT license. O-CNN: Octree-based Co

Microsoft 607 Dec 28, 2022
Simple machine learning library / 簡單易用的機器學習套件

FukuML Simple machine learning library / 簡單易用的機器學習套件 Installation $ pip install FukuML Tutorial Lesson 1: Perceptron Binary Classification Learning Al

Fukuball Lin 279 Sep 15, 2022
Async API for controlling Hue Lights

Hue API Async API for controlling Hue Lights Documentation: hue-api.nirantak.com Source: github.com/nirantak/hue-api Installation This is an async cli

Nirantak Raghav 4 Nov 16, 2022
Official repository of OFA. Paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Paper | Blog OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image gene

OFA Sys 1.4k Jan 08, 2023
Reusable constraint types to use with typing.Annotated

annotated-types PEP-593 added typing.Annotated as a way of adding context-specific metadata to existing types, and specifies that Annotated[T, x] shou

125 Dec 26, 2022
This is a re-implementation of TransGAN: Two Pure Transformers Can Make One Strong GAN (CVPR 2021) in PyTorch.

TransGAN: Two Transformers Can Make One Strong GAN [YouTube Video] Paper Authors: Yifan Jiang, Shiyu Chang, Zhangyang Wang CVPR 2021 This is re-implem

Ahmet Sarigun 79 Jan 05, 2023
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intenti

NVIDIA Corporation 6.9k Jan 03, 2023
RL and distillation in CARLA using a factorized world model

World on Rails Learning to drive from a world on rails Dian Chen, Vladlen Koltun, Philipp Krähenbühl, arXiv techical report (arXiv 2105.00636) This re

Dian Chen 131 Dec 16, 2022
Offical code for the paper: "Growing 3D Artefacts and Functional Machines with Neural Cellular Automata" https://arxiv.org/abs/2103.08737

Growing 3D Artefacts and Functional Machines with Neural Cellular Automata Video of more results: https://www.youtube.com/watch?v=-EzztzKoPeo Requirem

Robotics Evolution and Art Lab 51 Jan 01, 2023
DeepMind Alchemy task environment: a meta-reinforcement learning benchmark

The DeepMind Alchemy environment is a meta-reinforcement learning benchmark that presents tasks sampled from a task distribution with deep underlying structure.

DeepMind 188 Dec 25, 2022
A solution to ensure Crowd Management with Contactless and Safe systems.

CovidTrack A Solution to ensure Crowd Management with Contactless and Safe systems. ML Model Mask Detection Social Distancing Detection Analytics Page

Om Khare 1 Nov 10, 2021
Implementation of Online Label Smoothing in PyTorch

Online Label Smoothing Pytorch implementation of Online Label Smoothing (OLS) presented in Delving Deep into Label Smoothing. Introduction As the abst

83 Dec 14, 2022
A Collection of Papers and Codes for ICCV2021 Low Level Vision and Image Generation

A Collection of Papers and Codes for ICCV2021 Low Level Vision and Image Generation

196 Jan 05, 2023
The Fundamental Clustering Problems Suite (FCPS) summaries 54 state-of-the-art clustering algorithms, common cluster challenges and estimations of the number of clusters as well as the testing for cluster tendency.

FCPS Fundamental Clustering Problems Suite The package provides over sixty state-of-the-art clustering algorithms for unsupervised machine learning pu

9 Nov 27, 2022
Internship Assessment Task for BaggageAI.

BaggageAI Internship Task Problem Statement: You are given two sets of images:- background and threat objects. Background images are the background x-

Arya Shah 10 Nov 14, 2022
EmoTag helps you train emotion detection model for Chinese audios

emoTag emoTag helps you train emotion detection model for Chinese audios. Environment pip install -r requirement.txt Data We used Emotional Speech Dat

_zza 4 Sep 07, 2022
Image Segmentation Animation using Quadtree concepts.

QuadTree Image Segmentation Animation using QuadTree concepts. Usage usage: quad.py [-h] [-fps FPS] [-i ITERATIONS] [-ws WRITESTART] [-b] [-img] [-s S

Alex Eidt 29 Dec 25, 2022