Large dataset storage format for Pytorch

Overview

H5Record

Large dataset ( > 100G, <= 1T) storage format for Pytorch (wip)

Support python 3

pip install h5record

Why?

  • Writing large dataset is still a wild west in pytorch. Approaches seen in the wild include:

    • large directory with lots of small files : slow IO when complex file is fetched, deserialized frequently
    • database approach : depend on what kind of database engine used, usually multi-process read is not supported
    • the above method scale non linear in terms of data - storage size
  • TFRecord solved the above problems well ( multiprocess fetch, (de)compression ), fast serialization ( protobuf )

  • However TFRecord port does not support data size evaluation (used frequently by Dataloader ), no index level access available ( important for data evaluation or verification )

H5Record aim to tackle TFRecord problems by compressing the dataset into HDF5 file with an easy to use interface through predefined interfaces ( String, Image, Sequences, Integer).

Some advantage of using H5Record

  • Support multi-process read

  • Relatively simple to use and low technical debt

  • Support compression/de-compression on the fly

  • Quick load to memory if required

Simple usage

pip install h5record
  1. Sentence Similarity
from h5record import H5Dataset, Float, String

schema = (
    String(name='sentence1'),
    String(name='sentence2'),
    Float(name='label')
)
data = [
    ['Sent 1.', 'Sent 2', 0.1],
    ['Sent 3', 'Sent 4', 0.2],
]

def pair_iter():
    for row in data:
        yield {
            'sentence1': row[0],
            'sentence2': row[1],
            'label': row[2]
        }

dataset = H5Dataset(schema, './question_pair.h5', pair_iter())
for idx in range(len(dataset)):
    print(dataset[idx])

Note

Due to in progress development, this package should be use in care in storage with FAT, FAT-32 format

Comparison between different compression algorithm

No chunking is used

Compression Type File size Read speed row/second
no compression 2.0G 2084.55 it/s
lzf 1.7G 1496.14 it/s
gzip 1.1G 843.78 it/s

benchmarked in i7-9700, 1TB NVMe SSD

If you are interested to learn more feel free to checkout the note as well!

You might also like...
A large-scale video dataset for the training and evaluation of 3D human pose estimation models
A large-scale video dataset for the training and evaluation of 3D human pose estimation models

ASPset-510 ASPset-510 (Australian Sports Pose Dataset) is a large-scale video dataset for the training and evaluation of 3D human pose estimation mode

A large-scale video dataset for the training and evaluation of 3D human pose estimation models
A large-scale video dataset for the training and evaluation of 3D human pose estimation models

ASPset-510 (Australian Sports Pose Dataset) is a large-scale video dataset for the training and evaluation of 3D human pose estimation models. It contains 17 different amateur subjects performing 30 sports-related actions each, for a total of 510 action clips.

A Large-Scale Dataset for Spinal Vertebrae Segmentation in Computed Tomography
A Large-Scale Dataset for Spinal Vertebrae Segmentation in Computed Tomography

A Large-Scale Dataset for Spinal Vertebrae Segmentation in Computed Tomography

Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination
Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination

Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination (ICCV 2021) Dataset License This work is l

LIVECell - A large-scale dataset for label-free live cell segmentation

LIVECell dataset This document contains instructions of how to access the data associated with the submitted manuscript "LIVECell - A large-scale data

A large-scale face dataset for face parsing, recognition, generation and editing.
A large-scale face dataset for face parsing, recognition, generation and editing.

CelebAMask-HQ [Paper] [Demo] CelebAMask-HQ is a large-scale face image dataset that has 30,000 high-resolution face images selected from the CelebA da

N-Omniglot is a large neuromorphic few-shot learning dataset
N-Omniglot is a large neuromorphic few-shot learning dataset

N-Omniglot [Paper] || [Dataset] N-Omniglot is a large neuromorphic few-shot learning dataset. It reconstructs strokes of Omniglot as videos and uses D

PyTorch-LIT is the Lite Inference Toolkit (LIT) for PyTorch which focuses on easy and fast inference of large models on end-devices.

PyTorch-LIT PyTorch-LIT is the Lite Inference Toolkit (LIT) for PyTorch which focuses on easy and fast inference of large models on end-devices. With

This is the dataset and code release of the OpenRooms Dataset.
This is the dataset and code release of the OpenRooms Dataset.

This is the dataset and code release of the OpenRooms Dataset.

Comments
  • Example about Image dataset

    Example about Image dataset

    Thanks for your work. Do you have an end to end example about image dataset which includes creating h5records file similar to tfrecord files and then using it in dataloader mechanism just like tf dataset api loader mechanism?

    documentation question 
    opened by meet-minimalist 1
Releases(1.0.4)
Owner
theblackcat102
¯\_(ツ)_/¯
theblackcat102
Train CPPNs as a Generative Model, using Generative Adversarial Networks and Variational Autoencoder techniques to produce high resolution images.

cppn-gan-vae tensorflow Train Compositional Pattern Producing Network as a Generative Model, using Generative Adversarial Networks and Variational Aut

hardmaru 343 Dec 29, 2022
Audio Visual Emotion Recognition using TDA

Audio Visual Emotion Recognition using TDA RAVDESS database with two datasets analyzed: Video and Audio dataset: Audio-Dataset: https://www.kaggle.com

Combinatorial Image Analysis research group 3 May 11, 2022
A curated list of Machine Learning and Deep Learning tutorials in Jupyter Notebook format ready to run in Google Colaboratory

Awesome Machine Learning Jupyter Notebooks for Google Colaboratory A curated list of Machine Learning and Deep Learning tutorials in Jupyter Notebook

Carlos Toxtli 245 Jan 01, 2023
A simple API wrapper for Discord interactions.

Your ultimate Discord interactions library for discord.py. About | Installation | Examples | Discord | PyPI About What is discord-py-interactions? dis

james 641 Jan 03, 2023
Reproduction process of AlexNet

PaddlePaddle论文复现杂谈 背景 注:该repo基于PaddlePaddle,对AlexNet进行复现。时间仓促,难免有所疏漏,如果问题或者想法,欢迎随时提issue一块交流。 飞桨论文复现赛地址:https://aistudio.baidu.com/aistudio/competitio

19 Nov 29, 2022
Towards Understanding Quality Challenges of the Federated Learning: A First Look from the Lens of Robustness

FL Analysis This repository contains the code and results for the paper "Towards Understanding Quality Challenges of the Federated Learning: A First L

3 Oct 17, 2022
A PyTorch implementation of "CoAtNet: Marrying Convolution and Attention for All Data Sizes".

CoAtNet Overview This is a PyTorch implementation of CoAtNet specified in "CoAtNet: Marrying Convolution and Attention for All Data Sizes", arXiv 2021

Justin Wu 268 Jan 07, 2023
Attentional Focus Modulates Automatic Finger‑tapping Movements

"Attentional Focus Modulates Automatic Finger‑tapping Movements", in Scientific Reports

Xingxun Jiang 1 Dec 02, 2021
Virtual Dance Reality Stage: a feature that offers you to share a stage with another user virtually

Portrait Segmentation using Tensorflow This script removes the background from an input image. You can read more about segmentation here Setup The scr

291 Dec 24, 2022
Single-step adversarial training (AT) has received wide attention as it proved to be both efficient and robust.

Subspace Adversarial Training Single-step adversarial training (AT) has received wide attention as it proved to be both efficient and robust. However,

15 Sep 02, 2022
Code for ACL'2021 paper WARP 🌀 Word-level Adversarial ReProgramming

Code for ACL'2021 paper WARP 🌀 Word-level Adversarial ReProgramming. Outperforming `GPT-3` on SuperGLUE Few-Shot text classification.

YerevaNN 75 Nov 06, 2022
This project hosts the code for implementing the ISAL algorithm for object detection and image classification

Influence Selection for Active Learning (ISAL) This project hosts the code for implementing the ISAL algorithm for object detection and image classifi

25 Sep 11, 2022
Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis

Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis, including human motion imitation, appearance transfer, and novel view synthesis. Currently the paper is under review

2.3k Jan 05, 2023
Codes for CyGen, the novel generative modeling framework proposed in "On the Generative Utility of Cyclic Conditionals" (NeurIPS-21)

On the Generative Utility of Cyclic Conditionals This repository is the official implementation of "On the Generative Utility of Cyclic Conditionals"

Chang Liu 44 Nov 16, 2022
《Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis》(2021)

Image2Reverb Image2Reverb is an end-to-end neural network that generates plausible audio impulse responses from single images of acoustic environments

Nikhil Singh 48 Nov 27, 2022
Hydra Lightning Template for Structured Configs

Hydra Lightning Template for Structured Configs Template for creating projects with pytorch-lightning and hydra. How to use this template? Create your

Model-driven Machine Learning 4 Jul 19, 2022
Official Repo for ICCV2021 Paper: Learning to Regress Bodies from Images using Differentiable Semantic Rendering

[ICCV2021] Learning to Regress Bodies from Images using Differentiable Semantic Rendering Getting Started DSR has been implemented and tested on Ubunt

Sai Kumar Dwivedi 83 Nov 27, 2022
Hi Guys, here I am providing examples, which will help you in Lerarning Python

LearningPython Hi guys, here I am trying to include as many practice examples of Python Language, as i Myself learn, and hope these will help you in t

4 Feb 03, 2022
FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection

FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection This repository contains an implementation of FCAF3D, a 3D object detection method introdu

SamsungLabs 153 Dec 29, 2022
Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".

Detecting Twenty-thousand Classes using Image-level Supervision Detic: A Detector with image classes that can use image-level labels to easily train d

Meta Research 1.3k Jan 04, 2023