An unopinionated replacement for PyTorch's Dataset and ImageFolder that handles Tar archives

Last update: Dec 20, 2022

Simple Tar Dataset

An unopinionated replacement for PyTorch's Dataset and ImageFolder classes, for datasets stored as uncompressed Tar archives.

Just Tar it: No particular structure is enforced in the Tar archive. This means that you can just archive your files with no modification, and handle any data/meta-data with your dataset code.

Why? Storing a dataset as millions of small files makes access inefficient, and can create other difficulties in large-scale scenarios (e.g. running out of inodes, or inefficient operations in distributed filesystems, which are optimised for fewer large files). Tar is a simple, uncompressed archive format for which numerous utilities exist, and it allows fast random access into a single archive file.
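To make the random-access claim concrete, here is a minimal sketch using only Python's standard tarfile module. It is not taken from this package's code; the helper names build_index and read_member are hypothetical. The idea is that, because the archive is uncompressed, each file's bytes sit at a fixed offset, so one scan can record those offsets and later reads can seek directly to them.

```python
import io
import os
import tarfile
import tempfile


def build_index(tar_path):
    """Scan the archive once and record where each file's bytes live."""
    index = {}
    # "r:" opens the archive strictly as uncompressed Tar.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Read one file by seeking straight to its data, skipping all others."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)


# Build a tiny demo archive in a temporary folder (illustrative data only).
demo_files = {"a.txt": b"first file", "b.txt": b"second file"}
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tar")
with tarfile.open(demo_path, "w") as tar:
    for name, payload in demo_files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

demo_index = build_index(demo_path)
second = read_member(demo_path, demo_index, "b.txt")
```

The index costs one pass over the archive; after that, each read is a single seek plus a read of exactly the file's size, regardless of how many other files the archive holds.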

Example

The default TarDataset simply loads all PNG, JPG and JPEG images from a Tar file, and allows you to iterate over them.

Images are returned as tensors. Here, some RGB values are printed.

from tardataset import TarDataset

dataset = TarDataset('example-data/colors.tar')

for (idx, image) in enumerate(dataset):
  print(f"Image #{idx}, color: {image[:,0,0]}")

Usage

For image classification datasets, where images are usually stored in one folder per class (e.g. ImageNet), TarImageFolder is a drop-in replacement for torchvision.datasets.ImageFolder.
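As an illustration of the one-folder-per-class convention, the sketch below derives class labels from file paths such as those found inside a Tar archive. It is not TarImageFolder's code; classes_from_paths and the example paths are hypothetical. It follows the convention used by torchvision's ImageFolder, where class folder names are sorted and numbered from zero.

```python
import posixpath


def classes_from_paths(paths):
    """Label each file with the index of its parent folder (folders sorted by name)."""
    def parent(path):
        # Tar member names use forward slashes on every platform.
        return posixpath.basename(posixpath.dirname(path))

    class_names = sorted({parent(p) for p in paths})
    class_to_idx = {name: i for i, name in enumerate(class_names)}
    samples = [(p, class_to_idx[parent(p)]) for p in paths]
    return class_to_idx, samples


# Hypothetical member names, as they might appear in an archive.
member_names = ["train/cat/001.jpg", "train/dog/002.jpg", "train/cat/003.jpg"]
class_to_idx, samples = classes_from_paths(member_names)
```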

For more complex scenarios -- say, you store some data in one or more JSON files, or you have folders with video frames in specific formats -- you can subclass TarDataset, and read the data in any format you like.
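The sketch below shows the general shape of such a custom dataset, using only the standard library so that it runs without this package. It deliberately does not subclass TarDataset, whose methods are not described here; the class name JsonLabelledTar, the labels.json layout and the file names are all assumptions made for illustration.

```python
import io
import json
import os
import tarfile
import tempfile


class JsonLabelledTar:
    """Hypothetical dataset: pairs each file listed in labels.json with its label."""

    def __init__(self, tar_path, labels_name="labels.json"):
        self.tar = tarfile.open(tar_path, "r:")
        # The metadata file lives inside the same archive as the data.
        labels = json.load(self.tar.extractfile(labels_name))
        self.items = sorted(labels.items())

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        name, label = self.items[idx]
        data = self.tar.extractfile(name).read()  # raw bytes; decode as needed
        return data, label


# Build a small demo archive (illustrative data only).
demo = {
    "labels.json": json.dumps({"x.bin": "red", "y.bin": "blue"}).encode(),
    "x.bin": b"\x00\x01",
    "y.bin": b"\x02\x03",
}
json_tar_path = os.path.join(tempfile.mkdtemp(), "labelled.tar")
with tarfile.open(json_tar_path, "w") as tar:
    for name, payload in demo.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

json_dataset = JsonLabelledTar(json_tar_path)
first_item = json_dataset[0]
```

One design caveat: this sketch keeps a single open file handle, which may not be safe to share across multiple data-loading worker processes, so a real dataset would likely need one handle per worker.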

Jupyter notebook tutorial

There is a more comprehensive set of examples as a Jupyter notebook in example.ipynb.

Full "ImageNet in a Tar file" example

A large-scale data loading example is given in imagenet-example.py. Only the section of code responsible for data loading was modified from the official PyTorch ImageNet example.

First, ensure that the data is in the expected format for the original example to work, in a folder named ILSVRC12. Then, create a Tar archive from it (tar cf ILSVRC12.tar ILSVRC12 on Linux or a utility like 7-Zip on Windows). Finally, run our modified imagenet-example.py, passing it the path to the Tar archive instead.
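As a portable alternative to the command-line tools mentioned above, an uncompressed archive can also be created with Python's standard tarfile module. This is a minimal sketch, not part of the package; make_tar is a hypothetical helper, and the small folder built here merely stands in for a real ILSVRC12 directory.

```python
import os
import tarfile
import tempfile


def make_tar(folder, tar_path):
    """Archive `folder` recursively into an uncompressed Tar file."""
    top_level = os.path.basename(os.path.normpath(folder))
    with tarfile.open(tar_path, "w") as tar:  # mode "w" applies no compression
        tar.add(folder, arcname=top_level)
    return tar_path


# Stand-in folder tree; a real dataset folder would be used instead.
work_dir = tempfile.mkdtemp()
class_dir = os.path.join(work_dir, "ILSVRC12", "train", "class_a")
os.makedirs(class_dir)
with open(os.path.join(class_dir, "img.jpg"), "wb") as f:
    f.write(b"not a real image")

archive_path = make_tar(os.path.join(work_dir, "ILSVRC12"),
                        os.path.join(work_dir, "ILSVRC12.tar"))
with tarfile.open(archive_path, "r:") as tar:
    archived_names = tar.getnames()
```

Passing arcname keeps the paths inside the archive relative to the dataset folder, matching what tar cf ILSVRC12.tar ILSVRC12 produces when run from the parent directory.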

Author

João Henriques, Visual Geometry Group (VGG), University of Oxford
