Command-line tool for downloading and extending the RedCaps dataset.

Overview

RedCaps Downloader

This repository provides the official command-line tool for downloading and extending the RedCaps dataset. Users can seamlessly download images of officially released annotations as well as download more image-text data from any subreddit over an arbitrary time-span.

Installation

This tool requires Python 3.8 or higher. We recommend using conda for setup. Download Anaconda or Miniconda first. Then follow these steps:

# Clone the repository.
git clone https://github.com/redcaps-dataset/redcaps-downloader
cd redcaps-downloader

# Create a new conda environment.
conda create -n redcaps python=3.8
conda activate redcaps

# Install dependencies along with this code.
pip install -r requirements.txt
python setup.py develop

Basic usage: Download official RedCaps dataset

We expect most users will only require this functionality. Follow these steps to download the official RedCaps annotations and images and arrange all the data in recommended directory structure:

/path/to/redcaps/
├── annotations/
│   ├── abandoned_2017.json
│   ├── abandoned_2017.json
│   ├── ...
│   ├── itookapicture_2019.json
│   ├── itookapicture_2020.json
│   ├── 
   
    _
    
     .json
│   └── ...
│
└── images/
    ├── abandoned/
    │   ├── guli1.jpg
    |   └── ...
    │
    ├── itookapicture/
    │   ├── 1bd79.jpg
    |   └── ...
    │
    ├── 
     
      /
    │   ├── 
      
       .jpg
    │   ├── ...
    └── ...

      
     
    
   
  1. Create an empty directory and symlink it relative to this code directory:

    cd redcaps-downloader
    
    # Edit path here:
    mkdir -p /path/to/redcaps
    ln -s /path/to/redcaps ./datasets/redcaps
  2. Download official RedCaps annotations from Dropbox and unzip them.

    cd datasets/redcaps
    wget https://www.dropbox.com/s/twvv541sbg0qqux/redcaps_v1.0_annotations.zip?dl=1
    unzip redcaps_v1.0_annotations.zip
  3. Download images by using redcaps download-imgs command (for a single annotation file).

    for ann_file in ./datasets/redcaps/annotations/*.json; do
        redcaps download-imgs -a $ann_file --save-to path/to/images --resize 512 -j 4
        # Set --resize -1 to turn off resizing shorter edge (saves disk space).
    done

    Parallelize download by changing -j. RedCaps images are sourced from Reddit, Imgur and Flickr, each have their own request limits. This code contains approximate sleep intervals to manage them. Use multiple machines (= different IP addresses) or a cluster to massively parallelize downloading.

That's it, you are all set to use RedCaps!

Advanced usage: Create your own RedCaps-like dataset

Apart from downloading the officially released dataset, this tool supports downloading image-text data from any subreddit – you can reproduce the entire collection pipeline as well as create your own variant of RedCaps! Here, we show how to collect annotations from r/roses (2020) as an example. Follow these steps for any subreddit and years.

Additional one-time setup instructions

RedCaps annotations are extracted from image post metadata, which are served by the Pushshift API and official Reddit API. These APIs are authentication-based, and one must sign up for developer access to obtain API keys (one-time setup):

  1. Copy ./credentials.template.json to ./credentials.json. Its contents are as follows:

    : " }, "imgur": { "client_id": "Your client ID here", "client_secret": "Your client secret here" } } ">
    {
        "reddit": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here",
            "username": "Your Reddit username here",
            "password": "Your Reddit password here",
            "user_agent": "
          
           : 
           "
          
        },
        "imgur": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here"
        }
    }
  2. Register a new Reddit app here. Reddit will provide a Client ID and Client Secret tokens - fill them in ./credentials.json. For more details, refer to the Reddit OAuth2 wiki. Enter your Reddit account name and password in ./credentials.json. Set User Agent to anything and keep it unchanged (e.g. your name).

  3. Register a new Imgur App by following instructions here. Fill the provided Client ID and Client Secret in ./credentials.json.

  4. Download pre-trained weights of an NSFW detection model.

    wget https://s3.amazonaws.com/nsfwdetector/nsfw.299x299.h5 -P ./datasets/redcaps/models

Data collection from r/roses (2020)

  1. download-anns: Dowload annotations of image posts made in a single month (e.g. January).

    redcaps download-anns --subreddit roses --month 2020-01 -o ./datasets/redcaps/annotations
    
    # Similarly, download annotations for all months of 2020:
    for ((month = 1; month <= 12; month += 1)); do
        redcaps download-anns --subreddit roses --month 2020-$month -o ./datasets/redcaps/annotations
    done
    • NOTE: You may not get all the annotations present in official release as some of them may have disappeared (deleted) over time. After this step, the dataset directory would contain 12 annotation files:
        ./datasets/redcaps/
        └── annotations/
            ├── roses_2020-01.json
            ├── roses_2020-02.json
            ├── ...
            └── roses_2020-12.json
    
  2. merge: Merge all the monthly annotation files into a single file.

    redcaps merge ./datasets/redcaps/annotations/roses_2020-* \
        -o ./datasets/redcaps/annotations/roses_2020.json --delete-old
    • --delete-old will remove individual files after merging. After this step, the merged file will replace individual monthly files:
        ./datasets/redcaps/
        └── annotations/
            └── roses_2020.json
    
  3. download-imgs: Download all images for this annotation file. This step is same as (3) in basic usage.

    redcaps download-imgs --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --resize 512 -j 4 -o ./datasets/redcaps/images --update-annotations
    • --update-annotations removes annotations whose images were not downloaded.
  4. filter-words: Filter all instances whose captions contain potentially harmful language. Any caption containing one of the 400 blocklisted words will be removed. This command modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-words --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images
  5. filter-nsfw: Remove all instances having images that are flagged by an off-the-shelf NSFW detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-nsfw --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images \
        --model ./datasets/redcaps/models/nsfw.299x299.h5
  6. filter-faces: Remove all instances having images with faces detected by an off-the-shelf face detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-faces --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images  # Model weights auto-downloaded
  7. validate: All the above steps create a single annotation file (and downloads images) similar to official RedCaps annotations. To double-check this, run the following command and expect no errors to be printed.

    redcaps validate --annotations ./datasets/redcaps/annotations/roses_2020.json

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021redcaps,
    title={{RedCaps: Web-curated image-text data created by the people, for the people}},
    author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
    booktitle={NeurIPS Datasets and Benchmarks},
    year={2021}
}
Owner
RedCaps dataset
RedCaps dataset
A CLI Application to detect plagiarism in Source Code Files.

Plag Description A CLI Application to detect plagiarism in Source Code Files. Features Compare source code files for plagiarism. Extract code features

default=dev 2 Nov 10, 2022
Set of scripts & tools for converting between numbers and major system encoded words.

major-system-converter Set of scripts & tools for converting between numbers and major system encoded words. Uses phonetics instead of letters to conv

4 Aug 09, 2022
A CLI Spigot plugin manager that adheres to Unix conventions and Python best practices.

Spud A cross-platform, Spigot plugin manager that adheres to the Unix philosophy and Python best practices. Some focuses of the project are: Easy and

Tommy Dougiamas 9 Dec 02, 2022
Several tools that can be added to your `PATH` to make your life easier.

CK-CLI Tools Several tools that can be added to your PATH to make your life easier. prettypath Prints the $PATH variable in a human-readable way. It a

Christopher Kumm 2 Apr 21, 2022
Command-line tool for downloading and extending the RedCaps dataset.

Command-line tool for downloading and extending the RedCaps dataset.

RedCaps dataset 33 Dec 14, 2022
flora-dev-cli (fd-cli) is command line interface software to interact with flora blockchain.

Install git clone https://github.com/Flora-Network/fd-cli.git cd fd-cli python3 -m venv venv source venv/bin/activate pip install -e . --extra-index-u

14 Sep 11, 2022
🎮 An easy to use tool to change the mapping of your input device buttons.

Input Remapper Formerly Key Mapper An easy to use tool to change the mapping of your input device buttons. Supports mice, keyboards, gamepads, X11, Wa

Tobi 1.9k Jan 05, 2023
CLI tool for typescript tasks & migrations

typed CLI tool for typescript tasks & migrations Installation Usage $ typed --list Subcommands: bootstrap 🔨 Bootstrap your environment for TypeS

Lob 1 Nov 15, 2021
Write Django management command using the click CLI library

Django Click Project information: Automated code metrics: django-click is a library to easily write Django management commands using the click command

Jonathan Stoppani 215 Dec 19, 2022
CLI tool to view your VIT timetable from terminal anytime!

VITime CLI tool to view your timetable from terminal anytime! Table of contents Preview Installation PyPI Source code Updates Setting up Add timetable

16 Oct 04, 2022
Cthulhu is a simple python CLI application that streams torrents directly from 1337x.

Cthulhu is a simple python CLI application that facilitates the streaming of torrents directly from 1337x. It uses webtorrent to stream video

Raiyan 27 Dec 27, 2022
Analysis of a daily word game "Wordle"

Wordle Analysis of a daily word game "Wordle" https://www.powerlanguage.co.uk/wordle/ Description Worlde is a daily word game in which a player attemp

Bartek 1 Feb 07, 2022
Tablicate - Python library for easy table creation and output to terminal

Tablicate Tablicate - Python library for easy table creation and output to terminal Features Column-wise justification alignment (left, right, center)

3 Dec 14, 2022
A set of libraries and functions for simplifying automating Cisco devices through SecureCRT.

This is a set of libraries for automating Cisco devices (and to a lesser extent, bash prompts) over ssh/telnet in SecureCRT.

Matthew Spangler 7 Mar 30, 2022
a-shell: A terminal for iOS, with multiple windows

a-shell: A terminal for iOS, with multiple windows

Nicolas Holzschuch 1.7k Jan 02, 2023
A command line interface to interact with the Hypixel api allowing the user to get stats, leaderboards, etc

HyConsole is a way to get data on players and leaderboards from the Hypixel Minecraft server from the command line. Keep in mind I have no a

1 Feb 14, 2022
This CLI give the possibility to do a queries in Star Wars API and returns a JSON in a terminal.

Star Wars CLI (swcli) This CLI give the possibility to do a queries in Star Wars API and returns a JSON in a terminal. Install $ pip install swcli Qu

Pery Lemke 5 Nov 05, 2021
Microsoft Azure CLI - Azure Command-Line Interface

A great cloud needs great tools; we're excited to introduce Azure CLI, our next generation multi-platform command line experience for Azure.

Microsoft Azure 3.4k Dec 30, 2022
An open source terminal project made in python

Calamity-Terminal An open source terminal project made in python. Calamity Terminal is a free and open source lightweight terminal. Its made 100% off

1 Mar 08, 2022
A startpage configured aesthetically with terminal-esque link formatting

Terminal-y Startpage Setup Clone the repository, then make an unformatted.txt file following the specifications in example.txt. Run format.py Open ind

belkarx 13 May 01, 2022