UMT is a unified and flexible framework which can handle different input modality combinations, and output video moment retrieval and/or highlight detection results.

Related tags

Deep LearningUMT
Overview

Unified Multi-modal Transformers

arXiv License

This repository maintains the official implementation of the paper UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection by Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, and Xiaohu Qie, which has been accepted by CVPR 2022.

Installation

Please refer to the following environmental settings that we use. You may install these packages by yourself if you meet any problem during automatic installation.

  • CUDA 11.5.0
  • CUDNN 8.3.2.44
  • Python 3.10.0
  • PyTorch 1.11.0
  • NNCore 0.3.6

Install from source

  1. Clone the repository from GitHub.
git clone https://github.com/TencentARC/UMT.git
cd UMT
  1. Install dependencies.
pip install -r requirements.txt

Getting Started

Download and prepare the datasets

  1. Download and extract the datasets.
  1. Prepare the files in the following structure.
UMT
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── *features
│   │   ├── highlight_{train,val,test}_release.jsonl
│   │   └── subs_train.jsonl
│   ├── charades
│   │   ├── *features
│   │   └── charades_sta_{train,test}.txt
│   ├── youtube
│   │   ├── *features
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── *features
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···

Train a model

Run the following command to train a model using a specified config.

# Single GPU
python tools/launch.py ${path-to-config}

# Multiple GPUs
torchrun --nproc_per_node=${num-gpus} tools/launch.py ${path-to-config}

Test a model and evaluate results

Run the following command to test a model and evaluate results.

python tools/launch.py ${path-to-config} --checkpoint ${path-to-checkpoint} --eval

Pre-train with ASR captions on QVHighlights

Run the following command to pre-train a model using ASR captions on QVHighlights.

torchrun --nproc_per_node=4 tools/launch.py configs/qvhighlights/umt_base_pretrain_100e_asr.py

Model Zoo

We provide multiple pre-trained models and training logs here. All the models are trained with a single NVIDIA Tesla V100-FHHL-16GB GPU and are evaluated using the default metrics of the datasets.

Dataset Model Type MR mAP HD mAP Download
[email protected] [email protected] [email protected] [email protected]
QVHighlights UMT-B 38.59 39.85 model | metrics
UMT-B w/ PT 39.26 40.10 model | metrics
Charades-STA UMT-B V + A 48.31 29.25 88.79 56.08 model | metrics
UMT-B V + O 49.35 26.16 89.41 54.95 model | metrics
YouTube
Highlights
UMT-S Dog 65.93 model | metrics
UMT-S Gymnastics 75.20 model | metrics
UMT-S Parkour 81.64 model | metrics
UMT-S Skating 71.81 model | metrics
UMT-S Skiing 72.27 model | metrics
UMT-S Surfing 82.71 model | metrics
TVSum UMT-S VT 87.54 model | metrics
UMT-S VU 81.51 model | metrics
UMT-S GA 88.22 model | metrics
UMT-S MS 78.81 model | metrics
UMT-S PK 81.42 model | metrics
UMT-S PR 86.96 model | metrics
UMT-S FM 75.96 model | metrics
UMT-S BK 86.89 model | metrics
UMT-S BT 84.42 model | metrics
UMT-S DS 79.63 model | metrics

Here, w/ PT means initializing the model using pre-trained weights on ASR captions. V, A, and O indicate video, audio, and optical flow, respectively.

Citation

If you find this project useful for your research, please kindly cite our paper.

@inproceedings{liu2022umt,
  title={UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection},
  author={Liu, Ye and Li, Siyuan and Wu, Yang and Chen, Chang Wen and Shan, Ying and Qie, Xiaohu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
Owner
Applied Research Center (ARC), Tencent PCG
Applied Research Center (ARC), Tencent PCG
The open source code of SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation.

SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation(ICPR 2020) Overview This code is for the paper: Spatial Attention U-Net for Retinal V

Changlu Guo 151 Dec 28, 2022
Source code and notebooks to reproduce experiments and benchmarks on Bias Faces in the Wild (BFW).

Face Recognition: Too Bias, or Not Too Bias? Robinson, Joseph P., Gennady Livitz, Yann Henon, Can Qin, Yun Fu, and Samson Timoner. "Face recognition:

Joseph P. Robinson 41 Dec 12, 2022
[NIPS 2021] UOTA: Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration.

UOTA: Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration This repository is the official PyTorch implementation of UOT

6 Jun 29, 2022
Dashboard for the COVID19 spread

COVID-19 Data Explorer App A streamlit Dashboard for the COVID-19 spread. The app is live at: [https://covid19.cwerner.ai]. New data is queried from G

Christian Werner 22 Sep 29, 2022
Project page of the paper 'Analyzing Perception-Distortion Tradeoff using Enhanced Perceptual Super-resolution Network' (ECCVW 2018)

EPSR (Enhanced Perceptual Super-resolution Network) paper This repo provides the test code, pretrained models, and results on benchmark datasets of ou

Subeesh Vasu 78 Nov 19, 2022
Chinese Mandarin tts text-to-speech 中文 (普通话) 语音 合成 , by fastspeech 2 , implemented in pytorch, using waveglow as vocoder,

Chinese mandarin text to speech based on Fastspeech2 and Unet This is a modification and adpation of fastspeech2 to mandrin(普通话). Many modifications t

291 Jan 02, 2023
A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks

A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks without the use of any outside machine learning libraries - all from scratch.

Kordel K. France 2 Nov 14, 2022
Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR)

Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR) This is the official implementation of our paper Personalized Tran

Yongchun Zhu 81 Dec 29, 2022
Adversarial Attacks are Reversible via Natural Supervision

Adversarial Attacks are Reversible via Natural Supervision ICCV2021 Citation @InProceedings{Mao_2021_ICCV, author = {Mao, Chengzhi and Chiquier

Computer Vision Lab at Columbia University 20 May 22, 2022
PyTorch implementation of federated learning framework based on the acceleration of global momentum

Federated Learning with Acceleration of Global Momentum PyTorch implementation of federated learning framework based on the acceleration of global mom

0 Dec 23, 2021
CONetV2: Efficient Auto-Channel Size Optimization for CNNs

CONetV2: Efficient Auto-Channel Size Optimization for CNNs Exciting News! CONetV2: Efficient Auto-Channel Size Optimization for CNNs has been accepted

Mahdi S. Hosseini 3 Dec 13, 2021
Stochastic Normalizing Flows

Stochastic Normalizing Flows We introduce stochasticity in Boltzmann-generating flows. Normalizing flows are exact-probability generative models that

AI4Science group, FU Berlin (Frank Noé and co-workers) 50 Dec 16, 2022
audioLIME: Listenable Explanations Using Source Separation

audioLIME This repository contains the Python package audioLIME, a tool for creating listenable explanations for machine learning models in music info

Institute of Computational Perception 27 Dec 01, 2022
Neural machine translation between the writings of Shakespeare and modern English using TensorFlow

Shakespeare translations using TensorFlow This is an example of using the new Google's TensorFlow library on monolingual translation going from modern

Motoki Wu 245 Dec 28, 2022
This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis, accepted at EMNLP 2021.

MultiModal-InfoMax This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Informa

Deep Cognition and Language Research (DeCLaRe) Lab 89 Dec 26, 2022
Cleaned up code for DSTC 10: SIMMC 2.0 track: subtask 2: multimodal coreference resolution

UNITER-Based Situated Coreference Resolution with Rich Multimodal Input: arXiv MMCoref_cleaned Code for the MMCoref task of the SIMMC 2.0 dataset. Pre

Yichen (William) Huang 2 Dec 05, 2022
A machine learning project which can detect and predict the skin disease through image recognition.

ML-Project-2021 A machine learning project which can detect and predict the skin disease through image recognition. The dataset used for this is the H

Debshishu Ghosh 1 Jan 13, 2022
STEM: An approach to Multi-source Domain Adaptation with Guarantees

STEM: An approach to Multi-source Domain Adaptation with Guarantees Introduction This is the official implementation of ``STEM: An approach to Multi-s

5 Dec 19, 2022
People log into different sites every day to get information and browse through these sites one by one

HyperLink People log into different sites every day to get information and browse through these sites one by one. And they are exposed to advertisemen

0 Feb 17, 2022