ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning. In ICCV, 2021.

Last update: Nov 08, 2022

Overview

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

This repository contains the code for our ICCV 2021 paper:

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
Sangho Lee*, Jiwan Chung*, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song (*: equal contribution)
[paper]

@inproceedings{lee2021acav100m,
    title="{ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning}",
    author={Sangho Lee and Jiwan Chung and Youngjae Yu and Gunhee Kim and Thomas Breuel and Gal Chechik and Yale Song},
    booktitle={ICCV},
    year=2021
}

System Requirements

Python >= 3.8.5
FFmpeg 4.3.1

Installation

Install PyTorch 1.6.0, torchvision 0.7.0, and torchaudio 0.6.0 for your environment, following the PyTorch installation instructions for your platform and CUDA version.
Install the other required packages.

pip install -r requirements.txt
python -m nltk.downloader 'punkt'
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/<cuda version>/torch1.6/index.html
pip install git+https://github.com/jiwanchung/slowfast
pip install torch-scatter==2.0.5 -f https://pytorch-geometric.com/whl/torch-1.6.0+<cuda version>.html

Replace <cuda version> with the tag for your CUDA version, e.g. cu102 for CUDA 10.2.

Input File Structure

Create the data directory

mkdir data

Prepare the input file.

data/metadata.tsv should be structured as follows, with one video per line and a tab between the YouTube ID and the JSON metadata. We provide an example input file in examples/metadata.tsv.

YOUTUBE_ID\t{"LatestDAFeature": {"Title": TITLE, "Description": DESCRIPTION, "YouTubeCategory": YOUTUBE_CATEGORY, "VideoLength": VIDEO_LENGTH}, "MediaVersionList": [{"Duration": DURATION}]}
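As a rough illustration of the format above, the following sketch parses one row into an ID and a metadata dictionary. The helper name `parse_metadata_line` and every value in the example row are hypothetical, and the value types (string category, numeric lengths) are assumptions rather than something the repository specifies.

```python
import json


def parse_metadata_line(line):
    """Split one metadata.tsv row into (youtube_id, metadata dict)."""
    # The ID and the JSON payload are separated by the first tab.
    youtube_id, payload = line.rstrip("\n").split("\t", 1)
    return youtube_id, json.loads(payload)


# Hypothetical row; all values are made up for illustration.
example = "abc123XYZ_0\t" + json.dumps({
    "LatestDAFeature": {
        "Title": "Example title",
        "Description": "Example description",
        "YouTubeCategory": "Music",
        "VideoLength": 120,
    },
    "MediaVersionList": [{"Duration": 120}],
})

video_id, meta = parse_metadata_line(example)
print(video_id, meta["LatestDAFeature"]["Title"])
```

Checking a few rows this way before running the pipeline can catch malformed JSON early.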

Data Curation Pipeline

One-Liner

bash ./run.sh

To enable GPU computation, modify the CUDA_VISIBLE_DEVICES environment variable accordingly. For example, run the above command as export CUDA_VISIBLE_DEVICES=2,3; bash ./run.sh.
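If the pipeline is launched from Python rather than from a shell, the same effect can be obtained by passing a modified environment to the subprocess. This is only a sketch: `gpu_env` and `run_pipeline` are hypothetical helper names, not part of the repository.

```python
import os
import subprocess


def gpu_env(device_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


def run_pipeline(device_ids, script="./run.sh"):
    """Run the one-liner script with only the given GPUs visible."""
    return subprocess.run(["bash", script], env=gpu_env(device_ids), check=True)


# Show the variable that would be passed for GPUs 2 and 3.
print(gpu_env([2, 3], base_env={})["CUDA_VISIBLE_DEVICES"])
```

Calling `run_pipeline([2, 3])` from the repository root would then correspond to the shell example above.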

Step-by-Step

Filter the videos using their metadata.

bash ./metadata_filtering/code/run.sh

The above command will build the data/filtered.tsv file.

Download the actual video files from YouTube.

bash ./video_download/code/run.sh

Although we provide a simple download script, we recommend more scalable solutions for downloading large-scale data.

The above command will download the files to the data/videos/raw directory.

Segment the videos into 10-second clips.

bash ./clip_segmentation/code/run.sh

The above command will save the segmented clips to the data/videos directory.
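To make the segmentation step concrete, here is a minimal sketch that computes 10-second clip boundaries for a video of a given duration. It assumes non-overlapping clips and drops any trailing remainder shorter than 10 seconds; the function name `clip_boundaries` is hypothetical, and the actual script may choose start offsets differently.

```python
def clip_boundaries(duration, clip_length=10.0, start=0.0):
    """Return [start, end] pairs (in seconds) for consecutive full-length clips."""
    # Number of whole clips that fit between `start` and `duration`.
    n_clips = int((duration - start) // clip_length)
    return [
        [start + i * clip_length, start + (i + 1) * clip_length]
        for i in range(n_clips)
    ]


print(clip_boundaries(35.0))
```

Each pair has the same [start, end] shape as the SEGMENT column of the output file.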

Extract features from the clips.

bash ./feature_extraction/code/run.sh

The above command will save the extracted features to the data/features directory.

This step requires a GPU for faster computation.

Perform clustering with the extracted features.

bash ./clustering/code/run.sh

The above command will save the clustering results to the data/clusters directory.

This step requires a GPU for faster computation.

Select a subset with high audio-visual correspondence using the clustering results.

bash ./subset_selection/code/run.sh

The above command will save the selected clip indices to the data/datasets directory.

This step requires a GPU for faster computation.
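One common way to score correspondence between audio and visual cluster assignments is their mutual information. The sketch below computes it for two label lists in plain Python. It is an illustration of the measure only, not the repository's selection code, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(audio_labels, visual_labels):
    """Mutual information (in nats) between two cluster assignments."""
    n = len(audio_labels)
    joint = Counter(zip(audio_labels, visual_labels))
    audio_counts = Counter(audio_labels)
    visual_counts = Counter(visual_labels)
    mi = 0.0
    for (a, v), c in joint.items():
        # p(a, v) * log(p(a, v) / (p(a) * p(v))), written with raw counts.
        mi += (c / n) * math.log(c * n / (audio_counts[a] * visual_counts[v]))
    return mi


# Perfectly aligned clusterings score log(2); independent ones score 0.
aligned = mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(aligned, independent)
```

A subset whose audio and visual clusters agree more strongly yields a higher score under this measure.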

The final output is saved to the data/output.csv file.

Output File Structure

output.csv is structured as follows. We provide an example output file at examples/output.csv.

# SHARD_NAME,FILENAME,YOUTUBE_ID,SEGMENT
shard-000009,qpxektwhzra_292.mp4,qpxektwhzra,"[292.3329999997, 302.3329999997]"

Evaluation

Instructions on downstream evaluation are provided in Evaluation.

Correspondence Retrieval

Instructions on correspondence retrieval experiments are provided in Correspondence Retrieval.


Owner

sangho.lee
