Command-line tool for downloading and extending the RedCaps dataset.

Overview

RedCaps Downloader

This repository provides the official command-line tool for downloading and extending the RedCaps dataset. Users can seamlessly download images of officially released annotations as well as download more image-text data from any subreddit over an arbitrary time-span.

Installation

This tool requires Python 3.8 or higher. We recommend using conda for setup. Download Anaconda or Miniconda first. Then follow these steps:

# Clone the repository.
git clone https://github.com/redcaps-dataset/redcaps-downloader
cd redcaps-downloader

# Create a new conda environment.
conda create -n redcaps python=3.8
conda activate redcaps

# Install dependencies along with this code.
pip install -r requirements.txt
python setup.py develop

Basic usage: Download official RedCaps dataset

We expect most users will only require this functionality. Follow these steps to download the official RedCaps annotations and images and arrange all the data in recommended directory structure:

/path/to/redcaps/
├── annotations/
│   ├── abandoned_2017.json
│   ├── abandoned_2017.json
│   ├── ...
│   ├── itookapicture_2019.json
│   ├── itookapicture_2020.json
│   ├── 
   
    _
    
     .json
│   └── ...
│
└── images/
    ├── abandoned/
    │   ├── guli1.jpg
    |   └── ...
    │
    ├── itookapicture/
    │   ├── 1bd79.jpg
    |   └── ...
    │
    ├── 
     
      /
    │   ├── 
      
       .jpg
    │   ├── ...
    └── ...

      
     
    
   
  1. Create an empty directory and symlink it relative to this code directory:

    cd redcaps-downloader
    
    # Edit path here:
    mkdir -p /path/to/redcaps
    ln -s /path/to/redcaps ./datasets/redcaps
  2. Download official RedCaps annotations from Dropbox and unzip them.

    cd datasets/redcaps
    wget https://www.dropbox.com/s/twvv541sbg0qqux/redcaps_v1.0_annotations.zip?dl=1
    unzip redcaps_v1.0_annotations.zip
  3. Download images by using redcaps download-imgs command (for a single annotation file).

    for ann_file in ./datasets/redcaps/annotations/*.json; do
        redcaps download-imgs -a $ann_file --save-to path/to/images --resize 512 -j 4
        # Set --resize -1 to turn off resizing shorter edge (saves disk space).
    done

    Parallelize download by changing -j. RedCaps images are sourced from Reddit, Imgur and Flickr, each have their own request limits. This code contains approximate sleep intervals to manage them. Use multiple machines (= different IP addresses) or a cluster to massively parallelize downloading.

That's it, you are all set to use RedCaps!

Advanced usage: Create your own RedCaps-like dataset

Apart from downloading the officially released dataset, this tool supports downloading image-text data from any subreddit – you can reproduce the entire collection pipeline as well as create your own variant of RedCaps! Here, we show how to collect annotations from r/roses (2020) as an example. Follow these steps for any subreddit and years.

Additional one-time setup instructions

RedCaps annotations are extracted from image post metadata, which are served by the Pushshift API and official Reddit API. These APIs are authentication-based, and one must sign up for developer access to obtain API keys (one-time setup):

  1. Copy ./credentials.template.json to ./credentials.json. Its contents are as follows:

    : " }, "imgur": { "client_id": "Your client ID here", "client_secret": "Your client secret here" } } ">
    {
        "reddit": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here",
            "username": "Your Reddit username here",
            "password": "Your Reddit password here",
            "user_agent": "
          
           : 
           "
          
        },
        "imgur": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here"
        }
    }
  2. Register a new Reddit app here. Reddit will provide a Client ID and Client Secret tokens - fill them in ./credentials.json. For more details, refer to the Reddit OAuth2 wiki. Enter your Reddit account name and password in ./credentials.json. Set User Agent to anything and keep it unchanged (e.g. your name).

  3. Register a new Imgur App by following instructions here. Fill the provided Client ID and Client Secret in ./credentials.json.

  4. Download pre-trained weights of an NSFW detection model.

    wget https://s3.amazonaws.com/nsfwdetector/nsfw.299x299.h5 -P ./datasets/redcaps/models

Data collection from r/roses (2020)

  1. download-anns: Dowload annotations of image posts made in a single month (e.g. January).

    redcaps download-anns --subreddit roses --month 2020-01 -o ./datasets/redcaps/annotations
    
    # Similarly, download annotations for all months of 2020:
    for ((month = 1; month <= 12; month += 1)); do
        redcaps download-anns --subreddit roses --month 2020-$month -o ./datasets/redcaps/annotations
    done
    • NOTE: You may not get all the annotations present in official release as some of them may have disappeared (deleted) over time. After this step, the dataset directory would contain 12 annotation files:
        ./datasets/redcaps/
        └── annotations/
            ├── roses_2020-01.json
            ├── roses_2020-02.json
            ├── ...
            └── roses_2020-12.json
    
  2. merge: Merge all the monthly annotation files into a single file.

    redcaps merge ./datasets/redcaps/annotations/roses_2020-* \
        -o ./datasets/redcaps/annotations/roses_2020.json --delete-old
    • --delete-old will remove individual files after merging. After this step, the merged file will replace individual monthly files:
        ./datasets/redcaps/
        └── annotations/
            └── roses_2020.json
    
  3. download-imgs: Download all images for this annotation file. This step is same as (3) in basic usage.

    redcaps download-imgs --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --resize 512 -j 4 -o ./datasets/redcaps/images --update-annotations
    • --update-annotations removes annotations whose images were not downloaded.
  4. filter-words: Filter all instances whose captions contain potentially harmful language. Any caption containing one of the 400 blocklisted words will be removed. This command modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-words --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images
  5. filter-nsfw: Remove all instances having images that are flagged by an off-the-shelf NSFW detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-nsfw --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images \
        --model ./datasets/redcaps/models/nsfw.299x299.h5
  6. filter-faces: Remove all instances having images with faces detected by an off-the-shelf face detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-faces --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images  # Model weights auto-downloaded
  7. validate: All the above steps create a single annotation file (and downloads images) similar to official RedCaps annotations. To double-check this, run the following command and expect no errors to be printed.

    redcaps validate --annotations ./datasets/redcaps/annotations/roses_2020.json

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021redcaps,
    title={{RedCaps: Web-curated image-text data created by the people, for the people}},
    author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
    booktitle={NeurIPS Datasets and Benchmarks},
    year={2021}
}
Owner
RedCaps dataset
RedCaps dataset
Execute shell command lines in parallel on Slurm, S(on) of Grid Engine (SGE), PBS/Torque clusters

qbatch Execute shell command lines in parallel on Slurm, S(on) of Grid Engine (SGE), PBS/Torque clusters qbatch is a tool for executing commands in pa

Jon Pipitone 26 Dec 12, 2022
A simple automation script that logs into your kra account and files your taxes with one command

EASY_TAX A simple automation script that logs into your kra account and files your taxes with one command Currently works for Chrome users. Will creat

leon koech 13 Sep 23, 2021
A python command line tool to calculate options max pain for a given company symbol and options expiry date.

Options-Max-Pain-Calculator A python command line tool to calculate options max pain for a given company symbol and options expiry date. Overview - Ma

13 Dec 26, 2022
Jupyter notebook client in neovim

🪐 Jupyter-Nvim Read jupyter notebooks in neovim Note: The plugin is still in alpha stage 👾 Usage Just open any *.ipynb file and voila! ✨ Contributin

Ahmed Khalf 85 Dec 29, 2022
A handy command-line utility for generating and sending iCalendar events

A handy command-line utility for generating and sending iCalendar events This simple command-line utility is designed to generate an iCalendar event,

Baochun Li 17 Nov 21, 2022
A **CLI** folder organizer written in Python.

Fsorter Introduction A CLI folder organizer written in Python. Dependencies Before installing, install the following dependencies: Ubuntu/Debain Based

1 Nov 17, 2021
Several tools that can be added to your `PATH` to make your life easier.

CK-CLI Tools Several tools that can be added to your PATH to make your life easier. prettypath Prints the $PATH variable in a human-readable way. It a

Christopher Kumm 2 Apr 21, 2022
Play WORDLE game in your terminal.

Wordle TUI Play WORDLE game in your terminal. The game will be kept the same as the Web version. Prerequisites Python 3.7+ Linux/MacOS (Windows is not

Frost Ming 61 Oct 30, 2022
Command-line interface to PyPI Stats API to get download stats for Python packages

pypistats Python 3.6+ interface to PyPI Stats API to get aggregate download statistics on Python packages on the Python Package Index without having t

Hugo van Kemenade 140 Jan 03, 2023
🔖 Lemnos: A simple, light-weight command-line to-do list manager.

🔖 Lemnos: CLI To-do List Manager This is a simple program that allows one to manage a to-do list via the command-line. Example $ python3 todo.py add

Rohan Sikand 1 Dec 07, 2022
Another (unofficial) Qt CLI Installer on multi-platforms

Another Qt installer(aqt) Release: Documentation: Test status: and Coverage: This is a utility alternative to the official graphical Qt installer, for

Hiroshi Miura 528 Jan 02, 2023
Tmux Based Dropdown Dashboard For Python

sextans It's a private configuration and an ongoing experiment while I use Archlinux. A simple drop down dashboard based on tmux. It includes followin

秋葉 4 Dec 22, 2021
Command line interface for testing internet bandwidth using speedtest.net

speedtest-cli Command line interface for testing internet bandwidth using speedtest.net Versions speedtest-cli works with Python 2.4-3.7 Installation

Matt Martz 12.4k Jan 08, 2023
CLI tool to develop StarkNet projects written in Cairo

OpenZeppelin Nile ⛵ Navigate your StarkNet projects written in Cairo. Getting started Create a folder for your project and cd into it: mkdir myproject

OpenZeppelin 305 Dec 30, 2022
A Python script for finding a food-truck based on latitude and longitude coordinates that you can run in your shell

Food Truck Finder Project Description This repository contains a Python script for finding a food-truck based on latitude and longitude coordinates th

1 Jan 22, 2022
cli simple python script to interact with iphone afc api based on python library( tidevice )

afcclient cli simple python script to interact with iphone afc api based on python library( tidevice ) installation pip3 install -U tidevice cp afccli

fyst_14 2 Jul 15, 2022
Ssl-tool - A simple interactive CLI wrapper around openssl to make creation and installation of self-signed certs easy

What's this? A simple interactive CLI wrapper around openssl to make self-signin

Aniket Teredesai 9 May 17, 2022
git-partial-submodule is a command-line script for setting up and working with submodules while enabling them to use git's partial clone and sparse checkout features.

Partial Submodules for Git git-partial-submodule is a command-line script for setting up and working with submodules while enabling them to use git's

Nathan Reed 15 Sep 22, 2022
iTerm2 Shell integration for Xonsh shell.

iTerm2 Shell Integration iTerm2 Shell integration for Xonsh shell. Installation To install use pip: xpip install xontrib-iterm2 # or: xpip install -U

Noorhteen Raja NJ 6 Dec 29, 2022
🖍️This is a feature-complete clone of the awesome Chalk (JavaScript) library.

Terminal string styling done right This is a feature-complete clone of the awesome Chalk (JavaScript) library. All credits go to Sindre Sorhus. Highli

Fabian Keller 132 Dec 27, 2022