Official code for "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR 2022 (Oral).

Bridging Video-text Retrieval with Multiple Choice Questions, CVPR 2022 (Oral)

Paper | Project Page | Pre-trained Model | CLIP-Initialized Pre-trained Model

News

2022-04-17 We release the pre-trained model initialized from CLIP (ViT-B/32) and its usage (text-to-video retrieval and video feature extraction).

2022-04-08 We release the pre-training and downstream evaluation code, and the pre-trained model.

Main Results on Downstream Tasks

Text-to-video Retrieval on MSR-VTT

[Figure: text-to-video retrieval results on MSR-VTT]

Text-to-video Retrieval on MSVD, LSMDC and DiDeMo

[Figure: text-to-video retrieval results on MSVD, LSMDC and DiDeMo]

Visualization

Answer Noun Questions

We visualize cross-modality attention between the text tokens of noun questions and video tokens from BridgeFormer. In the second and fifth columns, the noun phrase marked in blue (Q1) is erased as the question; in the third and sixth columns, the noun phrase marked in green (Q2) is erased as the question. BridgeFormer attends to video patches with specific object information to answer noun questions.

[Figure: cross-modality attention maps for noun questions]

Answer Verb Questions

We visualize cross-modality attention between the text tokens of verb questions and video tokens from BridgeFormer. Three frames sampled from a video are shown, and the verb phrase marked in blue (Q) is erased as the question. BridgeFormer focuses on the object motion captured in the video tokens to answer verb questions.

[Figure: cross-modality attention maps for verb questions]

Dependencies and Installation

Installation

  1. Clone repo

    git clone https://github.com/TencentARC/MCQ.git
    cd MCQ
  2. Install dependent packages

    pip install -r requirements.txt
  3. Download the DistilBERT base model "distilbert-base-uncased" from Hugging Face, and put the "distilbert-base-uncased" folder under the root directory of this repo.
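
One possible way to fetch the DistilBERT files for step 3 is through the `transformers` package, sketched below. This is not part of the repo: the helper names `missing_files` and `download_distilbert` are hypothetical, the list of expected files is an assumption about what a saved model folder usually contains, and the download needs network access.

```python
from pathlib import Path

# Assumption: a usable local model folder contains at least these two files.
EXPECTED_FILES = ("config.json", "vocab.txt")


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


def download_distilbert(target_dir="distilbert-base-uncased"):
    """Download tokenizer and weights and save them into target_dir.

    Requires the `transformers` package and network access.
    """
    from transformers import AutoModel, AutoTokenizer

    AutoTokenizer.from_pretrained("distilbert-base-uncased").save_pretrained(target_dir)
    AutoModel.from_pretrained("distilbert-base-uncased").save_pretrained(target_dir)
```

Run from the repo root, `download_distilbert()` would create the folder, and `missing_files("distilbert-base-uncased")` should then return an empty list.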

Data Preparation

Please refer to DATA.md for pre-training and downstream evaluation datasets.

Pre-training

We adopt curriculum learning to train the model: it is first pre-trained on the image dataset CC3M and the video dataset WebVid-2M using 1 frame, and then on the video dataset WebVid-2M using 4 frames.

  1. For 1-frame pre-training, since a single frame does not contain temporal dynamics to correspond to verb phrases, we train the model to answer only noun questions.

    bash sctripts/train_1frame_mask_noun.sh
    

    When the training loss converges, we get model "MCQ_1frame.pth".

  2. For 4-frame pre-training, to save computation and allow a comparatively large batch size for contrastive learning, we train the model to answer noun and verb questions sequentially. We first train the model to answer noun questions with "MCQ_1frame.pth" loaded in "configs/dist-4frame-mask-noun.json".

    bash sctripts/train_4frame_mask_noun.sh
    

    When the training loss converges, we get model "MCQ_4frame_noun.pth". We then train the model to answer verb questions with "MCQ_4frame_noun.pth" loaded in "configs/dist-4frame-mask-verb.json".

    bash sctripts/train_4frame_mask_verb.sh
    

    When the training loss converges, we get the final model.

  3. Our repo adopts multi-machine, multi-GPU training, with 32 A100 GPUs for 1-frame pre-training and 40 A100 GPUs for 4-frame pre-training.
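
The three-stage curriculum above can be summarized as a small plan, sketched here under assumptions. The `Stage` class and the `check_chain` and `plan` helpers are hypothetical and only restate the scripts, configs and checkpoint names listed in the steps. The final checkpoint is left unnamed because the steps do not name it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    script: str
    init_checkpoint: Optional[str]  # checkpoint loaded before training
    config: Optional[str]           # config file where that checkpoint is set
    output: Optional[str]           # checkpoint obtained after convergence


STAGES = [
    Stage("1-frame, noun questions", "sctripts/train_1frame_mask_noun.sh",
          None, None, "MCQ_1frame.pth"),
    Stage("4-frame, noun questions", "sctripts/train_4frame_mask_noun.sh",
          "MCQ_1frame.pth", "configs/dist-4frame-mask-noun.json",
          "MCQ_4frame_noun.pth"),
    Stage("4-frame, verb questions", "sctripts/train_4frame_mask_verb.sh",
          "MCQ_4frame_noun.pth", "configs/dist-4frame-mask-verb.json",
          None),  # final model; its file name is not given in the steps
]


def check_chain(stages):
    """Return True if each stage starts from the previous stage's output."""
    previous_output = None
    for stage in stages:
        if stage.init_checkpoint != previous_output:
            return False
        previous_output = stage.output
    return True


def plan(stages):
    """Return the commands and manual config edits in execution order."""
    lines = []
    for stage in stages:
        if stage.init_checkpoint is not None:
            lines.append(f"# load {stage.init_checkpoint} in {stage.config}")
        lines.append(f"bash {stage.script}")
    return lines


if __name__ == "__main__":
    print("\n".join(plan(STAGES)))
```

Each stage is meant to run only after the previous stage's training loss has converged, which is a manual judgment the sketch does not automate.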

Pre-trained Model

Our pre-trained model can be downloaded from Pre-trained Model. It contains the weights of VideoFormer, TextFormer and BridgeFormer. For downstream evaluation, you only need to load the weights of VideoFormer and TextFormer; BridgeFormer is removed.
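
Dropping BridgeFormer for downstream evaluation amounts to filtering the checkpoint's parameter names, which the sketch below illustrates on a toy dictionary. The key prefixes ("video_model.", "text_model.", "bridge_model.") are hypothetical; inspect the real checkpoint keys before relying on them. With PyTorch, the dictionary would typically come from `torch.load(path, map_location="cpu")`, possibly nested under a key such as "state_dict" (also an assumption).

```python
def drop_bridgeformer(state_dict, bridge_prefixes=("bridge_model.",)):
    """Return a copy of state_dict without entries under the BridgeFormer prefixes.

    The default prefix is a placeholder, not the repo's confirmed naming.
    """
    prefixes = tuple(bridge_prefixes)
    return {key: value for key, value in state_dict.items()
            if not key.startswith(prefixes)}


if __name__ == "__main__":
    # Toy stand-in for a real checkpoint; the values would be tensors.
    toy_checkpoint = {
        "video_model.patch_embed.weight": "...",
        "text_model.embeddings.weight": "...",
        "bridge_model.blocks.0.attn.weight": "...",
    }
    print(sorted(drop_bridgeformer(toy_checkpoint)))
```

The filtered dictionary could then be passed to a model that only defines VideoFormer and TextFormer.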

Downstream Retrieval (Zero-shot on MSR-VTT)

  1. Download our pre-trained model from Pre-trained Model (or use your own pre-trained model).

  2. Load the pre-trained model in "configs/zero_msrvtt_4f_i21k.json".

    bash sctripts/test_retrieval.sh
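
Text-to-video retrieval is commonly scored with Recall@K, the fraction of text queries whose ground-truth video appears among the top K results. The sketch below shows that metric on a toy similarity matrix. It is an illustration, not the repo's evaluation code, and it assumes text i is paired with video i.

```python
def recall_at_k(sim, k):
    """Compute Recall@K for text-to-video retrieval.

    sim[i][j] is the similarity between text i and video j.
    Assumption: the ground-truth video for text i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


if __name__ == "__main__":
    toy_sim = [
        [0.9, 0.1, 0.2],  # correct video ranked first
        [0.3, 0.2, 0.8],  # correct video ranked third
        [0.1, 0.7, 0.6],  # correct video ranked second
    ]
    for k in (1, 2, 3):
        print(f"R@{k} = {recall_at_k(toy_sim, k):.3f}")
```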
    

CLIP-initialized Pre-trained Model

We also initialize our model from CLIP weights to pre-train a model with MCQ. Specifically, we use the pre-trained CLIP (ViT-B/32) as the backbone of VideoFormer and TextFormer, and randomly initialize BridgeFormer. Our VideoFormer does not incur any additional parameters compared to the ViT of CLIP, with a parameter-free modification to allow for the input of video frames with variable length.

To evaluate the performance of the CLIP-initialized pre-trained model on text-to-video retrieval:

  1. Download the model in CLIP-Initialized Pre-trained Model.

  2. Load the pre-trained model in "configs/zero_msrvtt_4f_i21k_clip.json".

    bash sctripts/test_retrieval_CLIP.sh
    

We also provide a script to extract video features for any given videos with the CLIP-initialized pre-trained model:

    python extract_video_features_clip.py
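
Once features are extracted, a typical use is ranking videos against a query embedding by cosine similarity, as sketched below. The output format of the script is not described here, so the sketch assumes the features have already been loaded as plain lists of floats. The names `l2_normalize` and `rank_videos` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_videos(query_feature, video_features):
    """Return video names sorted by cosine similarity to the query, best first.

    video_features maps a video name to its feature vector.
    """
    query = l2_normalize(query_feature)
    scores = {
        name: sum(a * b for a, b in zip(query, l2_normalize(feature)))
        for name, feature in video_features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy 2-D features; real features have many more dimensions.
    videos = {"a": [2.0, 0.0], "b": [0.0, 3.0], "c": [1.0, 1.0]}
    print(rank_videos([1.0, 0.0], videos))
```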

To Do

  • [x] Release pre-training code
  • [x] Release pre-trained model
  • [x] Release downstream evaluation code
  • [x] Release CLIP-initialized model
  • [x] Release video representation extraction code

License

MCQ is released under the BSD 3-Clause License.

Acknowledgement

Our code is based on the implementation of "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" https://github.com/m-bain/frozen-in-time.git.

Citation

If our code is helpful to your work, please cite:

@article{ge2022bridgeformer,
  title={BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions},
  author={Ge, Yuying and Ge, Yixiao and Liu, Xihui and Li, Dian and Shan, Ying and Qie, Xiaohu and Luo, Ping},
  journal={arXiv preprint arXiv:2201.04850},
  year={2022}
}