Source code of our TPAMI'21 paper Dual Encoding for Video Retrieval by Text and CVPR'19 paper Dual Encoding for Zero-Example Video Retrieval.

Dual Encoding for Video Retrieval by Text

Environments

  • Ubuntu 16.04
  • CUDA 10.1
  • Python 3.8
  • PyTorch 1.5.1

We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install the required packages.

conda create --name ws_dual_py3 python=3.8
conda activate ws_dual_py3
git clone https://github.com/danieljf24/hybrid_space.git
cd hybrid_space
pip install -r requirements.txt
conda deactivate
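
Before training, it can help to confirm that the activated environment matches the versions listed above. The following is a minimal sketch and not part of the repository: the helper name `describe_environment` is hypothetical, and it only reports what it finds rather than enforcing any version.

```python
import importlib.util
import platform


def describe_environment():
    """Report the Python version and whether PyTorch and CUDA are visible."""
    info = {"python": platform.python_version(), "torch": None, "cuda": None}
    # Import torch only if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch
        info["torch"] = torch.__version__
        info["cuda"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it inside the `ws_dual_py3` environment; a `torch` value of `None` suggests the requirements were not installed into that environment.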

Dual Encoding on MSRVTT10K

Required Data

Run the following script to download and extract the MSR-VTT dataset (msrvtt10k-resnext101_resnet152.tar.gz(4.3G)) and a pre-trained word2vec (vec500flickr30m.tar.gz(3.0G)). The data can also be downloaded from Baidu pan (url, password:p3p0) or Google drive (url). For more information about the dataset, please refer to here. The extracted data is placed in $HOME/VisualSearch/.

ROOTPATH=$HOME/VisualSearch
mkdir -p $ROOTPATH && cd $ROOTPATH

# download and extract dataset
wget http://8.210.46.84:8787/msrvtt10k-resnext101_resnet152.tar.gz
tar zxf msrvtt10k-resnext101_resnet152.tar.gz -C $ROOTPATH

# download and extract pre-trained word2vec
wget http://lixirong.net/data/w2vv-tmm2018/word2vec.tar.gz
tar zxf word2vec.tar.gz -C $ROOTPATH
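
If `tar` is not available (for instance on Windows), the extraction step can be reproduced with Python's standard library. This is a sketch under that assumption: `extract_archive` is a hypothetical helper, not a script shipped with the repository, and the archive names are the ones downloaded above.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir (like `tar zxf ... -C dest_dir`)."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Newer Python versions offer a "data" filter that rejects unsafe members.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))


if __name__ == "__main__":
    rootpath = os.path.join(os.path.expanduser("~"), "VisualSearch")
    for name in ("msrvtt10k-resnext101_resnet152.tar.gz", "word2vec.tar.gz"):
        path = os.path.join(rootpath, name)
        if os.path.exists(path):
            print(name, "->", extract_archive(path, rootpath))
        else:
            print(name, "not found; download it first")
```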

Model Training and Evaluation

Run the following script to train and evaluate the Dual Encoding network with hybrid space on the official partition of MSR-VTT. The video features are the concatenation of ResNeXt-101 and ResNet-152 features. The video feature extraction code we used in the paper is available here.

conda activate ws_dual_py3
./do_all.sh msrvtt10k hybrid resnext101-resnet152

Running the script will do the following things:

  1. Train Dual Encoding network with hybrid space and select a checkpoint that performs best on the validation set as the final model. Notice that we only save the best-performing checkpoint on the validation set to save disk space.
  2. Evaluate the final model on the test set. Note that the dataset has already included vocabulary and concept annotations. If you would like to generate vocabulary and concepts by yourself, run the script ./do_vocab_concept.sh msrvtt10k 1 $ROOTPATH.
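
The checkpoint-selection policy in step 1 can be illustrated with a small sketch. It is not the repository's training loop: `select_best_checkpoint` and the per-epoch validation scores are hypothetical, and a higher score is assumed to be better.

```python
def select_best_checkpoint(val_scores):
    """Return (epoch, score) of the best validation score seen so far."""
    best_epoch, best_score = None, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # A real training loop would overwrite the single saved checkpoint here,
            # which is why only one checkpoint file ends up on disk.
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Made-up validation scores for five epochs.
print(select_best_checkpoint([180.2, 195.7, 204.1, 203.9, 201.5]))
```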

If you would like to train the Dual Encoding network with latent space learning (the conference version), please run the following script:

./do_all.sh msrvtt10k latent resnext101-resnet152 $ROOTPATH

To train the model on the Test1k-Miech and Test1k-Yu partitions of MSR-VTT, please run the following scripts:

./do_all.sh msrvtt10kmiech hybrid resnext101-resnet152 $ROOTPATH
./do_all.sh msrvtt10kyu hybrid resnext101-resnet152 $ROOTPATH

Evaluation using Provided Checkpoints

The overview of pre-trained checkpoints on MSR-VTT is as follows.

| Split | Pre-trained Checkpoints |
| --- | --- |
| Official | msrvtt10k_model_best.pth.tar(264M) |
| Test1k-Miech | msrvtt10kmiech_model_best.pth.tar(267M) |
| Test1k-Yu | msrvtt10kyu_model_best.pth.tar(267M) |

Note that to evaluate with our trained checkpoints, you must use the vocabulary and concept annotations provided in msrvtt10k-resnext101_resnet152.tar.gz.

On the official split

Run the following script to download and evaluate our trained checkpoints on the official split of MSR-VTT. The trained checkpoints can also be downloaded from Baidu pan (url, password:p3p0).

MODELDIR=$HOME/VisualSearch/checkpoints
mkdir -p $MODELDIR

# download trained checkpoints
wget -P $MODELDIR http://8.210.46.84:8787/checkpoints/msrvtt10k_model_best.pth.tar

# evaluate on the official split of MSR-VTT
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection msrvtt10k --logger_name $MODELDIR  --checkpoint_name msrvtt10k_model_best.pth.tar

On Test1k-Miech and Test1k-Yu splits

To evaluate on the Test1k-Miech and Test1k-Yu splits, please run the following scripts.

MODELDIR=$HOME/VisualSearch/checkpoints

# download trained checkpoints on Test1k-Miech
wget -P $MODELDIR http://8.210.46.84:8787/checkpoints/msrvtt10kmiech_model_best.pth.tar

# evaluate on Test1k-Miech of MSR-VTT
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection msrvtt10kmiech --logger_name $MODELDIR  --checkpoint_name msrvtt10kmiech_model_best.pth.tar
MODELDIR=$HOME/VisualSearch/checkpoints

# download trained checkpoints on Test1k-Yu
wget -P $MODELDIR http://8.210.46.84:8787/checkpoints/msrvtt10kyu_model_best.pth.tar

# evaluate on Test1k-Yu of MSR-VTT
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection msrvtt10kyu --logger_name $MODELDIR  --checkpoint_name msrvtt10kyu_model_best.pth.tar
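
The three evaluation commands above differ only in the collection name. A small Python sketch can generate them. It assumes the checkpoint file is always named `<collection>_model_best.pth.tar`, as in the commands above, and `build_tester_command` is a hypothetical helper rather than part of the repository.

```python
import os
import shlex


def build_tester_command(collection, model_dir):
    """Build the tester.py argument list for one MSR-VTT split."""
    return [
        "python", "tester.py",
        "--testCollection", collection,
        "--logger_name", model_dir,
        "--checkpoint_name", f"{collection}_model_best.pth.tar",
    ]


if __name__ == "__main__":
    model_dir = os.path.join(os.path.expanduser("~"), "VisualSearch", "checkpoints")
    for collection in ("msrvtt10k", "msrvtt10kmiech", "msrvtt10kyu"):
        # Print the command; pass the list to subprocess.run(...) to execute it.
        command = build_tester_command(collection, model_dir)
        print(" ".join(shlex.quote(arg) for arg in command))
```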

Expected Performance

The expected performance of Dual Encoding on MSR-VTT is as follows. Note that due to random factors in SGD-based training, the numbers differ slightly from those reported in the paper.

T2V: Text-to-Video Retrieval; V2T: Video-to-Text Retrieval.

| Split | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | V2T mAP | SumR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Official | 11.8 | 30.6 | 41.8 | 17 | 21.4 | 21.6 | 45.9 | 58.5 | 7 | 10.3 | 210.2 |
| Test1k-Miech | 22.7 | 50.2 | 63.1 | 5 | 35.6 | 24.7 | 52.3 | 64.2 | 5 | 37.2 | 277.2 |
| Test1k-Yu | 21.5 | 48.8 | 60.2 | 6 | 34.0 | 21.7 | 49.0 | 61.4 | 6 | 34.6 | 262.6 |
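
For readers unfamiliar with the columns: the recall columns (R@1, R@5, R@10) give the percentage of queries whose relevant item is ranked within the top K, MedR is the median of those ranks, and the SumR values in the table match the sum of the six recall numbers in each row. The sketch below illustrates these definitions on made-up ranks. It is not the repository's evaluation code, and the function names are hypothetical.

```python
import statistics


def recall_at_k(ranks, k):
    """Percentage of queries whose relevant item is ranked within the top k (ranks start at 1)."""
    return 100.0 * sum(rank <= k for rank in ranks) / len(ranks)


def median_rank(ranks):
    """Median rank of the relevant item over all queries."""
    return statistics.median(ranks)


# Made-up ranks of the relevant item for eight queries.
ranks = [1, 3, 7, 12, 2, 40, 5, 1]
print(recall_at_k(ranks, 1), recall_at_k(ranks, 5), recall_at_k(ranks, 10), median_rank(ranks))

# SumR for the "Official" row: the three recall values of each retrieval direction.
official = [11.8, 30.6, 41.8, 21.6, 45.9, 58.5]
print(round(sum(official), 1))  # 210.2, as in the table
```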

Dual Encoding on VATEX

Required Data

Download the VATEX dataset (vatex-i3d.tar.gz(3.0G)) and a pre-trained word2vec (vec500flickr30m.tar.gz(3.0G)). The data can also be downloaded from Baidu pan (url, password:p3p0) or Google drive (url). For more information about the dataset, please refer to here. Please extract the data into $HOME/VisualSearch/.

Model Training and Evaluation

Run the following script to train and evaluate the Dual Encoding network with hybrid space on VATEX.

# download and extract dataset
wget http://8.210.46.84:8787/vatex-i3d.tar.gz
tar zxf vatex-i3d.tar.gz -C $ROOTPATH

./do_all.sh vatex hybrid i3d_kinetics $ROOTPATH

Expected Performance

Run the following script to download and evaluate our trained model (vatex_model_best.pth.tar(230M)) on VATEX.

MODELDIR=$HOME/VisualSearch/checkpoints

# download trained checkpoints
wget -P $MODELDIR http://8.210.46.84:8787/checkpoints/vatex_model_best.pth.tar

CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection vatex --logger_name $MODELDIR  --checkpoint_name vatex_model_best.pth.tar

The expected performance of Dual Encoding with hybrid space learning on VATEX is as follows.

T2V: Text-to-Video Retrieval; V2T: Video-to-Text Retrieval.

| Split | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | V2T mAP | SumR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VATEX | 35.8 | 72.8 | 82.9 | 2 | 52.0 | 47.5 | 76.0 | 85.3 | 2 | 39.1 | 400.3 |

Dual Encoding on Ad-hoc Video Search (AVS)

Required Data

The following datasets are used for training, validation and testing: the joint collection of MSR-VTT and TGIF, tv2016train and IACC.3. For more information about these datasets, please refer to here.

Frame-level feature data

Please download the frame-level features from Baidu pan (url, password:qwlc). The filenames of the feature data are summarized as follows.

| Datasets | 2048-dim ResNeXt-101 | 2048-dim ResNet-152 |
| --- | --- | --- |
| MSR-VTT | msrvtt10k_ResNext-101.tar.gz | msrvtt10k_ResNet-152.tar.gz |
| TGIF | tgif_ResNext-101.tar.gz | tgif_ResNet-152.tar.gz |
| tv2016train | tv2016train_ResNext-101.tar.gz | tv2016train_ResNet-152.tar.gz |
| IACC.3 | iacc.3_ResNext-101.tar.gz | iacc.3_ResNet-152.tar.gz |

Note that if you have already downloaded the MSR-VTT data provided above, you do not need to download msrvtt10k_ResNext-101.tar.gz and msrvtt10k_ResNet-152.tar.gz.

Sentence data

Please download the above data, and run the following scripts to extract them into $HOME/VisualSearch/.

ROOTPATH=$HOME/VisualSearch

# extract ResNext-101
tar zxf tgif_ResNext-101.tar.gz -C $ROOTPATH
tar zxf msrvtt10k_ResNext-101.tar.gz -C $ROOTPATH
tar zxf tv2016train_ResNext-101.tar.gz -C $ROOTPATH
tar zxf iacc.3_ResNext-101.tar.gz -C $ROOTPATH

# extract ResNet-152
tar zxf tgif_ResNet-152.tar.gz -C $ROOTPATH
tar zxf msrvtt10k_ResNet-152.tar.gz -C $ROOTPATH
tar zxf tv2016train_ResNet-152.tar.gz -C $ROOTPATH
tar zxf iacc.3_ResNet-152.tar.gz -C $ROOTPATH

# combine feature of tgif and msrvtt10k
./do_combine_features.sh

Train Dual Encoding model from scratch

ROOTPATH=$HOME/VisualSearch
trainCollection=tgif-msrvtt10k
overwrite=0

# Generate a vocabulary on the training set
./util/do_get_vocab.sh $trainCollection $ROOTPATH $overwrite

# Generate concepts according to video captions
./util/do_get_tags.sh $trainCollection $ROOTPATH $overwrite

# Generate video frame info
visual_feature=resnext101-resnet152
./util/do_get_frameInfo.sh $trainCollection $visual_feature $ROOTPATH $overwrite

# training and testing
./do_all_avs.sh $ROOTPATH

How to run Dual Encoding on other datasets?

Our code supports two dataset structures:

  • One-folder structure: the training, validation and test subsets are stored in a single folder.
  • Multiple-folder structure: the training, validation and test subsets are stored in three separate folders.

One-folder structure

Store the training, validation and test subsets in a single folder with the following structure.

${collection}
├── FeatureData
│   └── ${feature_name}
│       ├── feature.bin
│       ├── shape.txt
│       └── id.txt
└── TextData
    ├── ${collection}train.caption.txt
    ├── ${collection}val.caption.txt
    └── ${collection}test.caption.txt

  • FeatureData: video frame features. Use txt2bin.py to convert video frame features into the required binary format.
  • ${collection}train.caption.txt: training caption data.
  • ${collection}val.caption.txt: validation caption data.
  • ${collection}test.caption.txt: test caption data.

Each caption file is structured as follows, where the video and the sentence on the same line are relevant to each other.

video_id_1#1 sentence_1
video_id_1#2 sentence_2
...
video_id_n#1 sentence_k
...
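
To make the caption format concrete, the sketch below groups sentences by video. It assumes, based on the example lines above, that the caption id and the sentence are separated by the first space and that the video id is everything before the last `#`. The function `parse_captions` and the example sentences are hypothetical and not part of the repository.

```python
from collections import defaultdict


def parse_captions(lines):
    """Group sentences by video id from lines of the form 'video_id#n sentence'."""
    captions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The caption id and the sentence are separated by the first space.
        caption_id, sentence = line.split(" ", 1)
        # Everything before the last '#' is taken as the video id.
        video_id = caption_id.rsplit("#", 1)[0]
        captions[video_id].append(sentence)
    return dict(captions)


example = [
    "video_id_1#1 a man is playing a guitar",
    "video_id_1#2 someone plays music on stage",
    "video_id_2#1 a dog runs across a field",
]
print(parse_captions(example))
```

To parse a real file, pass an open file object instead of the list, for example `parse_captions(open(path, encoding="utf-8"))`.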

Please run the script to generate vocabulary and concepts:

./util/do_vocab_concept.sh $collection 0 $ROOTPATH

Run the following script to train and evaluate Dual Encoding on your own dataset:

./do_all.sh ${collection} hybrid ${feature_name} ${rootpath}

Multiple-folder structure

Store the training, validation and test subsets in three separate folders, each with the following structure.

${subset_name}
├── FeatureData
│   └── ${feature_name}
│       ├── feature.bin
│       ├── shape.txt
│       └── id.txt
└── TextData
    └── ${subset_name}.caption.txt

  • FeatureData: video frame features.
  • ${subset_name}.caption.txt: caption data of the corresponding subset.
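
The README does not spell out the contents of the three files under FeatureData. The sketch below reads one feature vector under assumptions that should be checked against txt2bin.py before relying on them: shape.txt is assumed to hold the number of vectors and their dimension, id.txt whitespace-separated ids in storage order, and feature.bin float32 values stored row by row. The helper `read_feature` is hypothetical.

```python
import array
import os


def read_feature(feature_dir, wanted_id):
    """Read one vector from a FeatureData/<feature_name> folder (assumed layout)."""
    # Assumption: shape.txt holds "<number of vectors> <dimension>".
    with open(os.path.join(feature_dir, "shape.txt")) as f:
        count, dim = (int(x) for x in f.readline().split())
    # Assumption: id.txt holds whitespace-separated ids, ordered as in feature.bin.
    with open(os.path.join(feature_dir, "id.txt")) as f:
        ids = f.read().split()
    assert len(ids) == count, "id.txt and shape.txt disagree"
    index = ids.index(wanted_id)
    # Assumption: feature.bin stores float32 values row by row.
    vector = array.array("f")
    with open(os.path.join(feature_dir, "feature.bin"), "rb") as f:
        f.seek(index * dim * vector.itemsize)
        vector.fromfile(f, dim)
    return list(vector)
```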

You can run the following script to check whether the data is ready:

./do_format_check.sh ${train_set} ${val_set} ${test_set} ${rootpath} ${feature_name}

where train_set, val_set and test_set are the names of the training, validation and test sets, respectively, rootpath is the path where the datasets are stored, and feature_name is the name of the video frame feature.
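
As a rough illustration of what "ready" means for the multiple-folder layout, the sketch below lists the files that the structure above requires but that are absent from one subset folder. It only tests that the files exist and makes no claim about what do_format_check.sh itself verifies. `missing_files` is a hypothetical helper.

```python
import os


def missing_files(rootpath, subset_name, feature_name):
    """List the files of one subset folder that the layout requires but are absent."""
    subset_dir = os.path.join(rootpath, subset_name)
    feature_dir = os.path.join(subset_dir, "FeatureData", feature_name)
    expected = [
        os.path.join(feature_dir, name)
        for name in ("feature.bin", "shape.txt", "id.txt")
    ]
    expected.append(os.path.join(subset_dir, "TextData", f"{subset_name}.caption.txt"))
    return [path for path in expected if not os.path.isfile(path)]
```

Calling it once per subset (training, validation and test) and getting three empty lists suggests the folders follow the expected layout.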

Please run the script to generate vocabulary and concepts:

./util/do_vocab_concept.sh ${train_set} 0 $ROOTPATH

If you pass the format check, use the following script to train and evaluate Dual Encoding on your own dataset:

./do_all_multifolder.sh ${train_set} ${val_set} ${test_set} hybrid ${feature_name} ${rootpath}

References

If you find the package useful, please consider citing our TPAMI'21 or CVPR'19 paper:

@article{dong2021dual,
  title={Dual Encoding for Video Retrieval by Text},
  author={Dong, Jianfeng and Li, Xirong and Xu, Chaoxi and Yang, Xun and Yang, Gang and Wang, Xun and Wang, Meng},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  doi = {10.1109/TPAMI.2021.3059295},
  year={2021}
}
@inproceedings{cvpr2019-dual-dong,
  title={Dual Encoding for Zero-Example Video Retrieval},
  author={Jianfeng Dong and Xirong Li and Chaoxi Xu and Shouling Ji and Yuan He and Gang Yang and Xun Wang},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019}
}