CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

Overview

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

The implementation of paper CLIP2Video: Mastering Video-Text Retrieval via Image CLIP.

CLIP2Video is a video-text retrieval model based on CLIP (ViT-B/32), which transfers the image-language pre-training model to video-text retrieval in an end-to-end manner. Our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

Pipeline Blocks

Introduction

This is the source code of CLIP2Video, a method for Video-Text Retrieval based on temporal correlations. It is built on top of the CLIP4Clip by ( Huaishao Luo et al.) in PyTorch.

Requirement

pip install -r requirements.txt 

Download data and Pre-trained Model

Supported public training sets:

  • MSR-VTT(9k)
  • MSR-VTT(full)
  • MSVD
  • VATEX-English Version

Supported public testing protocols:

  • MSR-VTT 1k-A protocol (SOTA)
  • MSR-VTT full protocol (SOTA)
  • MSVD(SOTA
  • VATEX-English version(SOTA

Download official video: Official videos of different data can be found as follows:

Pre-process

To train and test the above datasets: you should use sample_frame.py to transform video into frames.

python sample_frame.py --input_path [raw video path] --output_path [frame path]

(Optional) The splits and captions can be found in the links of used dataset. For the convenience, you can also use the split in data/ directly.

Download CLIP model

To train and test the above datasets based on pre-trained CLIP model, you should visit CLIP and download ViT-B/32.

Test Model

We provide three models trained on MSVD, MSR-VTT and VATEX-English.

Model Name checkpoint
CLIP2Video_MSVD link
CLIP2Video_MSRVTT9k link
CLIP2Video_VATEX link

To test the trained model, please refer test/.

(Optional) If the path of trained model(--checkpoint) doesn't exist, the parameters of basic CLIP (--clip_path) will be loaded.

Main Article Results of CLIP2Video

T2V:

Protocol [email protected] [email protected] [email protected] Median Rank Mean Rank
MSVD 47.0 76.8 85.9 2 9.6
MSRVTT-9k 45.6 72.6 81.7 2 14.6
MSRVTT-Full 29.8 55.5 66.2 4 45.5
Vatex (English) random 1k5 split 57.3 90.0 95.5 1 3.6
Vatex (English) HGR split 61.2 90.9 95.6 1 3.4

V2T:

Protocol [email protected] [email protected] [email protected] Median Rank Mean Rank
MSVD 58.7 85.6 91.6 1 4.3
MSRVTT-9k 43.5 72.3 82.1 2 10.2
MSRVTT-Full 54.6 82.1 90.8 1 5.3
Vatex (English) random 1k5 split 76.0 97.7 99.9 1 1.5
Vatex (English) HGR split 77.9 98.1 99.1 1 1.6

(Optional:) Clarification of different results in VATEX:

  1. In our paper, we do not strictly follow HGR's split, but randomly split the test set by ourselves, which is the split in

    • data/vatex_data/test1k5_sec_list.txt
  2. In HGR split, we adopt the totally same split following HGR, and the split can be seen as:

    • data/vatex_data/test_list.txt
    • data/vatex_data/val_list.txt

We will revise the results strictly following HGR split for fair comparison in the paper later!


Citation

If you find CLIP2Video useful in your work, you can cite the following paper:

@article{fang2021clip2video,
  title={CLIP2Video: Mastering Video-Text Retrieval via Image CLIP},
  author={Fang, Han and Xiong, Pengfei and Xu, Luhui and Chen, Yu},
  journal={arXiv preprint arXiv:2106.11097},
  year={2021}
}

Acknowledgments

Some components of this code implementation are adopted from CLIP and CLIP4Clip. We sincerely appreciate for their contributions.

Technical Analysis library in pandas for backtesting algotrading and quantitative analysis

bta-lib - A pandas based Technical Analysis Library bta-lib is pandas based technical analysis library and part of the backtrader family. Links Main P

DRo 393 Dec 20, 2022
Medical image analysis framework merging ANTsPy and deep learning

ANTsPyNet A collection of deep learning architectures and applications ported to the python language and tools for basic medical image processing. Bas

Advanced Normalization Tools Ecosystem 118 Dec 24, 2022
Dynamical Wasserstein Barycenters for Time Series Modeling

Dynamical Wasserstein Barycenters for Time Series Modeling This is the code related for the Dynamical Wasserstein Barycenter model published in Neurip

8 Sep 09, 2022
Spatial Single-Cell Analysis Toolkit

Single-Cell Image Analysis Package Scimap is a scalable toolkit for analyzing spatial molecular data. The underlying framework is generalizable to spa

Laboratory of Systems Pharmacology @ Harvard 30 Nov 08, 2022
I3-master-layout - Simple master and stack layout script

Simple master and stack layout script | ------ | ----- | | | | | Ma

Tobias S 18 Dec 05, 2022
Motion planning environment for Sampling-based Planners

Sampling-Based Motion Planners' Testing Environment Sampling-based motion planners' testing environment (sbp-env) is a full feature framework to quick

Soraxas 23 Aug 23, 2022
Official implementation for the paper: Generating Smooth Pose Sequences for Diverse Human Motion Prediction

Generating Smooth Pose Sequences for Diverse Human Motion Prediction This is official implementation for the paper Generating Smooth Pose Sequences fo

Wei Mao 28 Dec 10, 2022
Combinatorially Hard Games where the levels are procedurally generated

puzzlegen Implementation of two procedurally simulated environments with gym interfaces. IceSlider: the agent needs to reach and stop on the pink squa

Autonomous Learning Group 3 Jun 26, 2022
Latent Network Models to Account for Noisy, Multiply-Reported Social Network Data

VIMuRe Latent Network Models to Account for Noisy, Multiply-Reported Social Network Data. If you use this code please cite this article (preprint). De

6 Dec 15, 2022
A PyTorch Implementation of "Watch Your Step: Learning Node Embeddings via Graph Attention" (NeurIPS 2018).

Attention Walk ⠀⠀ A PyTorch Implementation of Watch Your Step: Learning Node Embeddings via Graph Attention (NIPS 2018). Abstract Graph embedding meth

Benedek Rozemberczki 303 Dec 09, 2022
Underwater image enhancement

LANet Our work proposes an adaptive learning attention network (LANet) to solve the problem of color casts and low illumination in underwater images.

LiuShiBen 7 Sep 14, 2022
Official Implementation of VAT

Semantic correspondence Few-shot segmentation Cost Aggregation Is All You Need for Few-Shot Segmentation For more information, check out project [Proj

Hamacojr 114 Dec 27, 2022
Privacy as Code for DSAR Orchestration: Privacy Request automation to fulfill GDPR, CCPA, and LGPD data subject requests.

Meet Fidesops: Privacy as Code for DSAR Orchestration A part of the greater Fides ecosystem. ⚡ Overview Fidesops (fee-dez-äps, combination of the Lati

Ethyca 44 Dec 06, 2022
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
The official repository for "Intermediate Layers Matter in Momentum Contrastive Self Supervised Learning" paper.

Intermdiate layer matters - SSL The official repository for "Intermediate Layers Matter in Momentum Contrastive Self Supervised Learning" paper. Downl

Aakash Kaku 35 Sep 19, 2022
BARF: Bundle-Adjusting Neural Radiance Fields 🤮 (ICCV 2021 oral)

BARF 🤮 : Bundle-Adjusting Neural Radiance Fields Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey IEEE International Conference on Comp

Chen-Hsuan Lin 539 Dec 28, 2022
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models This repository is the official implementation of the fol

DistributedML 41 Dec 06, 2022
This repo holds the code of TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation

TransFuse This repo holds the code of TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation Requirements Pytorch=1.6.0, 1.9.0 (=1.

Rayicer 93 Dec 19, 2022
HugsVision is a easy to use huggingface wrapper for state-of-the-art computer vision

HugsVision is an open-source and easy to use all-in-one huggingface wrapper for computer vision. The goal is to create a fast, flexible and user-frien

Labrak Yanis 166 Nov 27, 2022
World Models with TensorFlow 2

World Models This repo reproduces the original implementation of World Models. This implementation uses TensorFlow 2.2. Docker The easiest way to hand

Zac Wellmer 234 Nov 30, 2022