A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.

Last update: Dec 30, 2022

Overview

TorchData ( 🚨 Warning: Unstable Prototype 🚨 )

This is a prototype library currently under heavy development. It does not currently have stable releases, and as such will likely be modified significantly in BC-breaking ways until beta release (targeting early 2022), and can only be used with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.

torchdata is a prototype library of common modular data loading primitives for easily constructing flexible and performant data pipelines.

It aims to provide composable iter-style and map-style building blocks called DataPipes that work well out of the box with the PyTorch DataLoader. Right now it only contains basic functionality to reproduce several datasets in TorchVision and TorchText, including loading, parsing, caching, and several other utilities (e.g. hash checking). We plan to expand and harden this set considerably over the coming months.

To understand the basic structure of DataPipes, please see What are DataPipes? below, and to see how DataPipes can be practically composed into datasets, please see our examples/ directory.

Note that because many features of the original DataLoader have been modularized into DataPipes, some now live as standard DataPipes in pytorch/pytorch rather than torchdata to preserve BC functional parity within torch.

Why composable data loading?

Over many years of feedback and organic community usage of the PyTorch DataLoader and DataSets, we've found that:

The original DataLoader bundled too many features together, making them difficult to extend, manipulate, or replace. This has created a proliferation of use-case specific DataLoader variants in the community rather than an ecosystem of interoperable elements.
Many libraries, including each of the PyTorch domain libraries, have rewritten the same data loading utilities over and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these table-stakes elements.

Installation

Colab

Follow the instructions in this Colab notebook

Local pip or conda

First, set up an environment. We will be installing a nightly PyTorch binary as well as torchdata. If you're using conda, create a conda environment:

conda create --name torchdata
conda activate torchdata

If you wish to use venv instead:

python -m venv torchdata-env
source torchdata-env/bin/activate

Next, install one of the following PyTorch nightly binaries.

# For CUDA 10.2
pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
# For CUDA 11.1
pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html
# For CPU-only build
pip install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

If you already have a nightly of PyTorch installed and want to upgrade it (recommended!), append --upgrade to one of those commands.

Install torchdata:

pip install --user "git+https://github.com/pytorch/data.git"

Run a quick sanity check in python:

from torchdata.datapipes.iter import HttpReader
URL = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
ag_news_train = HttpReader([URL]).parse_csv().map(lambda t: (int(t[0]), " ".join(t[1:])))
agn_batches = ag_news_train.batch(2).map(lambda batch: {'labels': [sample[0] for sample in batch],\
                                      'text': [sample[1].split() for sample in batch]})
batch = next(iter(agn_batches))
assert batch['text'][0][0:8] == ['Wall', 'St.', 'Bears', 'Claw', 'Back', 'Into', 'the', 'Black']

From source

$ pip install -e git+https://github.com/pytorch/data#egg=torchdata

What are DataPipes?

Early on, we observed widespread confusion between the PyTorch DataSets which represented reusable loading tooling (e.g. TorchVision's ImageFolder), and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

DataPipe is simply a renaming and repurposing of the PyTorch DataSet for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the filenames and deserialized data:

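import json

from torch.utils.data import IterDataPipe

class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs):
        self.source_datapipe = source_datapipe
        # Extra keyword arguments are forwarded to json.loads
        self.kwargs = kwargs

    def __iter__(self):
        # Each source item is a (file name, readable stream) pair
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            yield file_name, json.loads(data, **self.kwargs)

    def __len__(self):
        # One output per input, so the length matches the source
        return len(self.source_datapipe)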

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.
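As a rough illustration of that chaining, here is a minimal sketch that runs without PyTorch installed. The `IterDataPipe` class below is a hypothetical stand-in for the real base class, and `InMemoryFiles` and `FieldSelector` are made-up names used only for this example; they are not part of torchdata.

```python
import io
import json


class IterDataPipe:
    """Hypothetical stand-in for the real IterDataPipe base class (assumption)."""

    def __iter__(self):
        raise NotImplementedError


class InMemoryFiles(IterDataPipe):
    """Illustrative source: yields (file_name, text stream) pairs."""

    def __init__(self, files):
        self.files = files

    def __iter__(self):
        for name, text in self.files.items():
            yield name, io.StringIO(text)


class JsonParser(IterDataPipe):
    """Same shape as the JsonParser above: parse each stream lazily."""

    def __init__(self, source_datapipe, **kwargs):
        self.source_datapipe = source_datapipe
        self.kwargs = kwargs

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            yield file_name, json.loads(stream.read(), **self.kwargs)


class FieldSelector(IterDataPipe):
    """Illustrative downstream pipe: keep one field of each parsed record."""

    def __init__(self, source_datapipe, field):
        self.source_datapipe = source_datapipe
        self.field = field

    def __iter__(self):
        for file_name, record in self.source_datapipe:
            yield file_name, record[self.field]


files = {"a.json": '{"label": 1}', "b.json": '{"label": 3}'}
pipeline = FieldSelector(JsonParser(InMemoryFiles(files)), "label")
labels = list(pipeline)  # nothing is read or parsed until this line
```

Each pipe only holds a reference to its source, so building `pipeline` is cheap and the work happens one item at a time during iteration.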

Under this naming convention, DataSet simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes. Note that the vast majority of initial support is focused on IterDataPipes, while more MapDataPipes support will come later.

Implementing DataPipes

As a guiding example, let's implement an IterDataPipe that applies a callable to the input iterator. For MapDataPipes, take a look at the map folder for examples, and follow the steps below for the __getitem__ method instead of __iter__.

Naming

The naming convention for DataPipes is "Operation"-er, followed by IterDataPipe or MapDataPipe, as each DataPipe is essentially a container to apply an operation to data yielded from a source DataPipe. For succinctness, we alias to just "Operation-er" in init files. For our IterDataPipe example, we'll name the class MapperIterDataPipe and alias it as iter.Mapper under datapipes.

Constructor

DataSets are now generally constructed as stacks of DataPipes, so each DataPipe typically takes a source DataPipe as its first argument.

class MapperIterDataPipe(IterDataPipe):
    def __init__(self, dp, fn):
        super().__init__()
        self.dp = dp
        self.fn = fn

Note:

Avoid loading data from the source DataPipe in the __init__ function, in order to support lazy data loading and save memory.
If an IterDataPipe instance holds data in memory, beware of in-place modification of that data. When a second iterator is created from the instance, the data may already have changed. See the IterableWrapper class for a reference on how to deep-copy data for each iterator.
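The second point can be sketched in plain Python. `InMemorySource` is a hypothetical stand-in for an in-memory IterDataPipe, written only to show the effect of copying per iterator; it is not the IterableWrapper implementation.

```python
import copy


class InMemorySource:
    """Hypothetical stand-in for an IterDataPipe that holds data in memory."""

    def __init__(self, data, deepcopy=True):
        self.data = data
        self.deepcopy = deepcopy

    def __iter__(self):
        # Copy per iterator so in-place edits downstream never reach self.data.
        source = copy.deepcopy(self.data) if self.deepcopy else self.data
        yield from source


def mark_seen(sample):
    sample["seen"] = True  # in-place modification of the yielded object
    return sample


rows = [{"id": 0}, {"id": 1}]

safe = InMemorySource(rows, deepcopy=True)
first_pass = [mark_seen(s) for s in safe]
second_pass = list(safe)  # still the original, unmodified rows

unsafe = InMemorySource([{"id": 0}], deepcopy=False)
_ = [mark_seen(s) for s in unsafe]
leaked = list(unsafe)  # the second iterator sees the earlier mutation
```

The copy costs memory and time on every new iterator, which is the trade-off for keeping repeated passes over the data independent of each other.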

Iterator

For IterDataPipes, an __iter__ function is needed to consume data from the source IterDataPipe and then apply the operation to the data before yielding it.

class MapperIterDataPipe(IterDataPipe):
    ...

    def __iter__(self):
        for d in self.dp:
            yield self.fn(d)

Length

In many cases, as in our MapperIterDataPipe example, the __len__ method of a DataPipe returns the length of the source DataPipe.

class MapperIterDataPipe(IterDataPipe):
    ...

    def __len__(self):
        return len(self.dp)

However, note that __len__ is optional for IterDataPipe and often inadvisable. For CSVParserIterDataPipe in the Using DataPipes section below, __len__ is not implemented because the number of rows in each file is unknown before loading it. In some special cases, __len__ can be made to either return an integer or raise an Error depending on the input. In those cases, the Error must be a TypeError to support Python's built-in functions like list(dp).
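The TypeError requirement can be checked with a small sketch in plain Python. `OptionalLen` is a hypothetical class for illustration only. In CPython, list() asks for a length hint before iterating and ignores a TypeError raised by __len__, while other exception types propagate.

```python
class OptionalLen:
    """Hypothetical pipe whose length is known only for sized sources."""

    def __init__(self, source, error=TypeError):
        self.source = source
        self.error = error

    def __iter__(self):
        yield from self.source

    def __len__(self):
        if hasattr(self.source, "__len__"):
            return len(self.source)
        raise self.error("length of the source is not known")


def numbers():
    yield from range(3)


# list() asks for a length hint first and ignores a TypeError from __len__.
ok = list(OptionalLen(numbers()))

# Any other exception type escapes the length hint and breaks list().
try:
    list(OptionalLen(numbers(), error=ValueError))
    failed = False
except ValueError:
    failed = True

sized = len(OptionalLen([10, 20]))
```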

Registering DataPipes with the functional API

Each DataPipe can be registered to support functional invocation using the decorator functional_datapipe.

@functional_datapipe("map")
class MapperIterDataPipe(IterDataPipe):
    ...

The stack of DataPipes can then be constructed in functional form:

>>> import torch.utils.data.datapipes as dp
>>> datapipes1 = dp.iter.FileLoader(['a.file', 'b.file']).map(fn=decoder).shuffle().batch(2)

>>> datapipes2 = dp.iter.FileLoader(['a.file', 'b.file'])
>>> datapipes2 = dp.iter.Mapper(datapipes2, fn=decoder)
>>> datapipes2 = dp.iter.Shuffler(datapipes2)
>>> datapipes2 = dp.iter.Batcher(datapipes2, 2)

In the above example, datapipes1 and datapipes2 represent the exact same stack of IterDataPipes.
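To make the equivalence concrete, here is a simplified sketch of how such a registering decorator could work. It is written in plain Python with a stand-in `IterDataPipe` base class and a hypothetical `Wrapper` source, and it is an assumption about the general mechanism rather than the actual PyTorch implementation.

```python
class IterDataPipe:
    """Hypothetical stand-in base class; not the real PyTorch implementation."""

    _functions = {}  # functional name -> DataPipe class

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name in IterDataPipe._functions:
            cls = IterDataPipe._functions[name]
            return lambda *args, **kwargs: cls(self, *args, **kwargs)
        raise AttributeError(name)


def functional_datapipe(name):
    """Sketch of a registering decorator (assumed behaviour, simplified)."""

    def decorator(cls):
        IterDataPipe._functions[name] = cls
        return cls

    return decorator


class Wrapper(IterDataPipe):
    """Illustrative source pipe over any re-iterable object."""

    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        yield from self.iterable


@functional_datapipe("map")
class Mapper(IterDataPipe):
    def __init__(self, dp, fn):
        self.dp = dp
        self.fn = fn

    def __iter__(self):
        for d in self.dp:
            yield self.fn(d)


@functional_datapipe("batch")
class Batcher(IterDataPipe):
    def __init__(self, dp, batch_size):
        self.dp = dp
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        for d in self.dp:
            batch.append(d)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final, possibly shorter batch
            yield batch


functional = Wrapper(range(5)).map(fn=lambda x: x * 2).batch(2)
explicit = Batcher(Mapper(Wrapper(range(5)), fn=lambda x: x * 2), 2)
```

In this sketch the functional call simply passes the current pipe as the first constructor argument, which is why each DataPipe takes its source DataPipe first.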

Using DataPipes

For a complete example, suppose we want to load data from CSV files with the following steps:

List all csv files in a directory
Load csv files
Parse csv file and yield rows

To support the above pipeline, CSVParser is registered as parse_csv_files to consume file streams and expand them as rows.

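import csv

from torch.utils.data import IterDataPipe, functional_datapipe

@functional_datapipe("parse_csv_files")
class CSVParserIterDataPipe(IterDataPipe):
    def __init__(self, dp, **fmtparams):
        self.dp = dp
        # Formatting parameters (e.g. delimiter) are passed to csv.reader
        self.fmtparams = fmtparams

    def __iter__(self):
        for filename, stream in self.dp:
            reader = csv.reader(stream, **self.fmtparams)
            # One file expands into many rows, so no __len__ is defined
            for row in reader:
                yield filename, row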

Then, the pipeline can be assembled as follows:

>>> import torch.utils.data.datapipes as dp

>>> FOLDER = 'path/2/csv/folder'
>>> datapipe = dp.iter.FileLister([FOLDER]).filter(fn=lambda filename: filename.endswith('.csv'))
>>> datapipe = dp.iter.FileLoader(datapipe, mode='rt')
>>> datapipe = datapipe.parse_csv_files(delimiter=' ')

>>> for d in datapipe: # Start loading data
...     pass
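For readers without a PyTorch install at hand, the same three steps can be sketched with plain generators and the standard library. The function names below are illustrative stand-ins chosen for this example, not the DataPipe classes used above.

```python
import csv
import os
import tempfile


def list_files(root):
    """Step 1: yield paths of the CSV files in a directory, sorted by name."""
    for name in sorted(os.listdir(root)):
        if name.endswith(".csv"):
            yield os.path.join(root, name)


def load_files(paths, mode="rt"):
    """Step 2: open each file lazily and yield (path, stream) pairs."""
    for path in paths:
        with open(path, mode, newline="") as stream:
            yield path, stream


def parse_csv_files(streams, **fmtparams):
    """Step 3: expand each stream into (file name, row) pairs."""
    for path, stream in streams:
        for row in csv.reader(stream, **fmtparams):
            yield os.path.basename(path), row


with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "a.csv"), "w", newline="") as f:
        f.write("1 2\n3 4\n")
    with open(os.path.join(folder, "notes.txt"), "w") as f:
        f.write("ignored\n")

    # No file is opened until list() starts pulling rows through the chain.
    rows = list(parse_csv_files(load_files(list_files(folder)), delimiter=" "))
```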

Contributing

We welcome PRs! See the CONTRIBUTING file.

Prototype Usage and Feedback

We'd love to hear from and work with early adopters to shape our designs. Please reach out by raising an issue if you're interested in using this tooling for your project.

Future Plans

We hope to sufficiently expand the library, harden APIs, and gather feedback to enable a beta release at the time of the PyTorch 1.11 release (early 2022).

License

TorchData is BSD licensed, as found in the LICENSE file.

Comments

S3 datapipes
Changes

Added S3FileLister and S3FileLoader IterDataPipes.

Added pybind11 build for s3 io cpp files and python scripts.

TODO

[x] clean up setup files and link pybind11 in CMAKE_PREFIX automatically.

[x] remove aws-cpp-sdk dependency at build with BUILD_S3 env var & pop exceptions when missing dependencies at usage.

[x] new api changes for list_files.

[x] clean up cpp files (naming, new structure, new logic etc.)

[x] expose timeouts, regions.

[x] thorough tests

[x] different correct usage: bucket (with or without / at last), folder (with or without / at last), prefix, item.

[x] different incorrect usage: non-existing files, wrong s3 urls, etc.

[x] region changes

[x] choice of public datasets

[x] benchmarks

[x] performance test

[x] README.md

[x] user guide & recommendations

[x] dependencies

CLA Signed
opened by ydaiming 39
Add list_file() functional API to FSSpecFileLister and IoPathFileLister
Fixes #387

Changes

Adds list_file() method on IoPathFileListerIterDataPipe

Adds list_file() method on FSSpecFileListerIterDataPipe

Add tests for those methods

Additional comments

I feel as if the implementation is quite naive. Would appreciate any feedback on it.
CLA Signed
opened by xiurobert 25

Graph traversal is broken for custom iter datapipes

from torch.utils.data.graph import traverse
from torchdata.datapipes.iter import IterDataPipe, IterableWrapper


class CustomIterDataPipe(IterDataPipe):
    def noop(self, x):
        return x

    def __init__(self):
        self._dp = IterableWrapper([]).map(self.noop)

    def __iter__(self):
        yield from self._dp


traverse(CustomIterDataPipe())

RecursionError: maximum recursion depth exceeded

Without the .map() call it works fine. I don't think this is specific to .map() though. From trying a few datapipes, this always happens if self._dp is composed in some way.

bug high priority

opened by pmeier 24

Refactoring and renaming KeyZipper to IterKeyZipper and MapZipper to MapKeyZipper
Stack from ghstack:

-> #50

Since MapZipper has been added, the name KeyZipper is confusing and should be changed to IterKeyZipper instead. We are also changing MapZipper to MapKeyZipper to ensure the names stay matching.

Note that this renaming is BC breaking for users.

Differential Revision: D31487393
CLA Signed Merged
opened by NivekT 22
[RFC] Disable the multiple Iterators per IterDataPipe (Make Iterator singleton)
This is the initial draft. I will complete it shortly.

State of Iterator is attached to each IterDataPipe instance. This is super useful for:

Determinism

Snapshotting

Benchmarking -> It becomes easier to register each DataPipe since they have different ID in the graph.

Implementation Options:

Each DataPipe has an attribute of _iterator as the place holder for __iter__ calls.

Implement __next__. (My Preference)

It would make the instance picklable. Previously, the generator function (__iter__) was not picklable, so this helps multiprocessing and snapshotting.

__iter__ returns self (Forker(self) may be another option, not 100% sure)

IMO, this is super useful as we can track the number of __next__ calls to do a fast forward. The state of iteration is attached to the DataPipe instance, rather than to a temporary instance created from __iter__, whose internal state we couldn't track. (We can easily track states like RNG, iteration number, buffer, etc. as they are going to be attached to the self instance.)

The source DataPipe is attached to each DataPipe, but the actual iteration happens at the Iterator level, so the graph constructed by DataLoaderV2 doesn't match the actual execution graph.

DataLoader triggers an Error if there are two DataPipe instances with the same id in the graph. (Another option is for DataLoader to fork automatically.) Users should use Forker whenever they want the same DataPipe to appear twice in the graph.
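A minimal sketch of the __next__ option, in plain Python, may help make the trade-off concrete. `CountingPipe`, `get_state`, `from_state`, and `fast_forward` are hypothetical names invented for this illustration and do not correspond to any existing API.

```python
import pickle


class CountingPipe:
    """Hypothetical singleton-iterator pipe: the instance is its own iterator."""

    def __init__(self, data):
        self.data = list(data)
        self.num_yielded = 0  # iteration state lives on the instance

    def __iter__(self):
        return self

    def __next__(self):
        if self.num_yielded >= len(self.data):
            raise StopIteration
        item = self.data[self.num_yielded]
        self.num_yielded += 1
        return item

    def fast_forward(self, n):
        """Skip ahead so that n items count as already yielded."""
        self.num_yielded = min(n, len(self.data))

    def get_state(self):
        return {"data": self.data, "num_yielded": self.num_yielded}

    @classmethod
    def from_state(cls, state):
        pipe = cls(state["data"])
        pipe.fast_forward(state["num_yielded"])
        return pipe


pipe = CountingPipe("abcde")
head = [next(pipe), next(pipe)]

# The state is plain data, so a snapshot can be pickled and resumed mid-stream.
snapshot = pickle.dumps(pipe.get_state())
restored = CountingPipe.from_state(pickle.loads(snapshot))
tail = list(restored)


def gen():
    yield 1


# A live generator object, by contrast, cannot be pickled.
try:
    pickle.dumps(gen())
    generator_picklable = True
except TypeError:
    generator_picklable = False
```

The cost of this design is visible in the sketch too: because the instance is its own iterator, two loops over the same `pipe` share one position instead of each starting from the beginning.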

cc: @VitalyFedyunin @NivekT
opened by ejguan 22
Issue during import of portalocker on windows
🐛 Describe the bug

Currently TorchText CI is broken on Windows due to the following error:

ImportError: DLL load failed while importing win32file: The specified module could not be found

The error occurred during import of portalocker.

cc: @vitaly-fedyunin

Versions

Latest from main
opened by parmeet 19
Refactor OnDiskCache
Fixes https://github.com/facebookexternal/torchdata/issues/114 and https://github.com/facebookexternal/torchdata/issues/140

Stack from ghstack:

#61 Refactor OnDiskCache

~This PR relies on a patch in PyTorch Core https://github.com/pytorch/pytorch/pull/67783~ (Landed)

Refactor OnDiskCacheHolder to track a sequence of DataPipe operations.

Yield filepath rather than file handle

filepath_fn also supports multiple outputs like list or tuple of file paths or generator function to yield multiple file paths.

hash_dict and hash_type are used to support hash check. If specified, the pipeline will check the data against the hash before saving it to the local file system. An Error is raised when the data doesn't match the hash.

Optional extra_check_fn can be used to do extra check for each file (This function should take filepath as input)

To track the sequence of DataPipe operations, users could use functional API or DataPipe constructor

The returned data at the end of operations should be (metadata, bytes/string) or (metadata, filehandle)
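The hash check described in this list can be sketched with the standard library alone. `save_with_hash_check` is a hypothetical helper written for this illustration, and the choice of sha256 and RuntimeError is an assumption; it is not the OnDiskCacheHolder implementation.

```python
import hashlib
import os
import tempfile


def save_with_hash_check(filepath, data, hash_dict, hash_type="sha256"):
    """Write data to filepath only if its digest matches hash_dict[filepath]."""
    digest = hashlib.new(hash_type, data).hexdigest()
    if hash_dict.get(filepath) != digest:
        raise RuntimeError(f"hash mismatch for {filepath}")
    with open(filepath, "wb") as f:
        f.write(data)
    return filepath  # yield the path, not an open file handle


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    payload = b"hello datapipes"
    expected = {path: hashlib.sha256(payload).hexdigest()}

    saved = save_with_hash_check(path, payload, expected)
    saved_ok = os.path.exists(saved)

    # Data that does not match the expected digest is never written.
    try:
        save_with_hash_check(path, b"corrupted", expected)
        rejected = False
    except RuntimeError:
        rejected = True
```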

For end_caching:

Refactor it to a separate DataPipe class

mode is used to determine how to save the data or how to read from file handles

filepath_fn is an optional function to be applied to the metadata of result DataPipe

same_filepath_fn is used to indicate that the same filepath_fn from OnDiskCacheHolder will be used.

skip_read is a flag to skip reading from file handles before saving to the local file system.

Features

Supports both functional API and DataPipe constructor

Supports multiple on_disk_cache in the pipeline

Use case

Single file with hash check

temp_dir = tempfile.TemporaryDirectory()
tar_file_dp = IterableWrapper([tar_file_url])

def _filepath_fn(url):
    filename = os.path.basename(url)
    return os.path.join(temp_dir.name, filename)

tar_hash_dict = {"xxxx": "yyyy"}
tar_cache_dp = tar_file_dp.on_disk_cache(filepath_fn=_filepath_fn, hash_dict=tar_hash_dict, hash_type="md5")

# Option 1
# Add map function to transform url to file path
# tar_cache_dp = HttpReader(tar_cache_dp).map(fn=_filepath_fn, input_col=0)
# tar_cache_dp = tar_cache_dp.end_caching(mode="wb")

# Option2 use `same_filepath_fn`
tar_cache_dp = HttpReader(tar_cache_dp).end_caching(mode="wb", same_filepath_fn=True)

Multiple files

# - csv.tar
# | - 0.csv
# | - 1.csv
# | - 2.csv
archive_dp = IterableWrapper([archive_file_path])

def _gen_filepath_fn(archive_path):  # Generator function
    for i in range(3):
        yield os.path.join(os.path.dirname(archive_path), "csv", "{}.csv".format(i))

file_cache_dp = OnDiskCacheHolder(archive_dp, filepath_fn=_gen_filepath_fn)
file_cache_dp = FileLoader(file_cache_dp, mode="rb")
file_cache_dp = TarArchiveReader(file_cache_dp)
file_cache_dp = file_cache_dp.map(fn=lambda x: x.read().decode(), input_col=1)

def _csv_filepath_fn(csv_path):
    return os.path.join(os.path.dirname(os.path.dirname(csv_path)), "csv", os.path.basename(csv_path))

# Text mode and skip_read as the data is read and decoded
file_cache_dp = EndOnDiskCacheHolder(file_cache_dp, mode="w", filepath_fn=_csv_filepath_fn, skip_read=True)

cc: @pmeier

Differential Revision: D31734382
CLA Signed Merged ciflow/slow
opened by ejguan 18
[DataPipe] Adding kwargs for `fs.open()` in fsspec DataPipes
Stack from ghstack:

-> #804

Fixes #803

I left FSSpecFileLister untouched since I don't think it will be useful for fs.ls() to accept kwargs.

Differential Revision: D40038331
CLA Signed
opened by NivekT 17
Exception: Could not get the file at https://s3.amazonaws.com/... [RequestException] None.

🐛 Describe the bug

Code that throws the exception:

from torchtext.datasets import WikiText2
train_iter, val_iter, test_iter = WikiText2()

The code refuses to download via torchtext.datasets, but I can download the data directly from the browser just fine.

Exception raised:

Traceback (most recent call last):
  File "C:\python\Python310\lib\site-packages\urllib3\connectionpool.py", line 700, in urlopen
    self._prepare_proxy(conn)
  File "C:\python\Python310\lib\site-packages\urllib3\connectionpool.py", line 994, in _prepare_proxy
    conn.connect()
  File "C:\python\Python310\lib\site-packages\urllib3\connection.py", line 364, in connect
    self.sock = conn = self._connect_tls_proxy(hostname, conn)
  File "C:\python\Python310\lib\site-packages\urllib3\connection.py", line 499, in connect_tls_proxy
    socket = ssl_wrap_socket(
  File "C:\python\Python310\lib\site-packages\urllib3\util\ssl.py", line 453, in ssl_wrap_socket
    ssl_sock = ssl_wrap_socket_impl(sock, context, tls_in_tls)
  File "C:\python\Python310\lib\site-packages\urllib3\util\ssl.py", line 495, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
  File "C:\python\Python310\lib\ssl.py", line 512, in wrap_socket
    return self.sslsocket_class._create(
  File "C:\python\Python310\lib\ssl.py", line 1070, in _create
    self.do_handshake()
  File "C:\python\Python310\lib\ssl.py", line 1341, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\python\Python310\lib\site-packages\requests\adapters.py", line 440, in send
    resp = conn.urlopen(
  File "C:\python\Python310\lib\site-packages\urllib3\connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "C:\python\Python310\lib\site-packages\urllib3\util\retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /research.metamind.io/wikitext/wikitext-2-v1.zip (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\python\Python310\lib\site-packages\torchdata\datapipes\iter\load\online.py", line 17, in _get_response_from_http
    r = session.get(url, stream=True)
  File "C:\python\Python310\lib\site-packages\requests\sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "C:\python\Python310\lib\site-packages\requests\sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\python\Python310\lib\site-packages\requests\sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "C:\python\Python310\lib\site-packages\requests\adapters.py", line 517, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /research.metamind.io/wikitext/wikitext-2-v1.zip (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\python\my\1. pytorch\language_modeling_transf.py", line 83, in
    vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=[''])
  File "C:\python\Python310\lib\site-packages\torchtext\vocab\vocab_factory.py", line 92, in build_vocab_from_iterator
    for tokens in iterator:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torchdata\datapipes\iter\util\plain_text_reader.py", line 116, in iter
    for path, file in self.source_datapipe:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\fileopener.py", line 60, in iter
    yield from get_file_binaries_from_pathnames(self.datapipe, self.mode, self.encoding)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\utils\common.py", line 85, in get_file_binaries_from_pathnames
    for pathname in pathnames:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\combining.py", line 46, in iter
    for data in dp:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\filelister.py", line 51, in iter
    for path in self.datapipe:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\grouping.py", line 140, in iter
    for element in self.datapipe:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\callable.py", line 112, in iter
    for data in self.datapipe:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 356, in next
    return next(self.iterator)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\combining.py", line 190, in get_generator_by_instance
    yield from self.main_datapipe.get_next_element_by_instance(self.instance_id)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\combining.py", line 301, in get_next_element_by_instance
    yield self._find_next(instance_id)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\combining.py", line 275, in _find_next
    value = next(self._datapipe_iterator)
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\combining.py", line 46, in iter
    for data in dp:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torchdata\datapipes\iter\util\saver.py", line 48, in iter
    for filepath, data in self.source_datapipe:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torchdata\datapipes\iter\util\hashchecker.py", line 62, in iter
    for file_name, data in self.source_datapipe:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\callable.py", line 112, in iter
    for data in self.datapipe:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torch\utils\data\datapipes\iter\callable.py", line 112, in iter
    for data in self.datapipe:
  File "C:\python\Python310\lib\site-packages\torch\utils\data_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "C:\python\Python310\lib\site-packages\torchdata\datapipes\iter\load\online.py", line 56, in iter
    yield _get_response_from_http(url, timeout=self.timeout)
  File "C:\python\Python310\lib\site-packages\torchdata\datapipes\iter\load\online.py", line 24, in _get_response_from_http
    raise Exception(f"Could not get the file at {url}. [RequestException] {e.response}.")
Exception: Could not get the file at https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip. [RequestException] None.

Versions

PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19044
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Ti
Nvidia driver version: 512.15
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\bin\cudnn_ops_train64_8.dll
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.11.0+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchdata==0.3.0
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0+cu113
[conda] Could not collect
bug

opened by Creative-Ataraxia 17
Implement DistributedReadingService
Add DistributedReadingService

Single process

Share shuffle seeds across distributed process

Automatically distributed sharding

Add tests for both DataLoader2 and DataLoader.

Spawn processes

Elastic training
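As a rough sketch of why a shared shuffle seed matters for sharding, the plain-Python example below shuffles with one seed on every rank and then keeps every world_size-th sample. The helper names and the seed value are assumptions made for illustration; this is not the DistributedReadingService implementation.

```python
import random


def shard(samples, rank, world_size):
    """Keep every world_size-th sample, starting at this rank's offset."""
    for i, sample in enumerate(samples):
        if i % world_size == rank:
            yield sample


def shuffled(samples, seed):
    """Shuffle with an explicit seed so every rank sees the same order."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    return samples


data = list(range(10))
shared_seed = 1234  # assumed to be agreed on by all ranks beforehand
world_size = 3

# Each rank shuffles identically, then takes its own disjoint slice.
per_rank = [
    list(shard(shuffled(data, shared_seed), rank, world_size))
    for rank in range(world_size)
]
```

If the ranks used different seeds, their shuffled orders would differ and the slices could overlap or miss samples, which is the situation a shared seed avoids.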

CLA Signed
opened by ejguan 16
update s3 test cases
Please read through our contribution guide prior to creating your pull request.

Note that there is a section on requirements related to adding a new DataPipe.

Fixes #460

Changes

update the s3 test cases due to an update in the public dataset

CLA Signed
opened by ydaiming 16

Weird behaviour of `InMemoryCacheHolder` not really speeding things up

🐛 Describe the bug

Weird behaviour of InMemoryCacheHolder not really speeding things up

First iteration took 9s, all the others 4s. Why? Shouldn't it be cached?

# download camvid and place it here
import torchdata.datapipes.iter as pipes
from pathlib import Path
from torchvision.io import read_image
from torch.utils.data import DataLoader
from time import perf_counter
from PIL import Image

dataset_dir = Path('./camvid')

pipe = pipes.Zipper(
    pipes.FileLister([dataset_dir / "images"], masks='*png'),
).map(lambda x: (read_image(x[0])))

pipe = pipes.InMemoryCacheHolder(pipe, size=32000).sharding_filter() # 8GB
dl = DataLoader(pipe, batch_size=32, num_workers=8, persistent_workers=True, prefetch_factor=2)

for i in range(10):
    start = perf_counter()
    for data in dl:
        # print(image.shape)
        continue

    print(f"[{i}]Elapsed {perf_counter() - start: .2f}")

Output

[0]Elapsed  18.8
[1]Elapsed  4.41
[2]Elapsed  4.47
[3]Elapsed  4.75
[4]Elapsed  4.53
[5]Elapsed  4.41
[6]Elapsed  4.38
[7]Elapsed  4.41
[8]Elapsed  4.41
[9]Elapsed  4.41

If I set num_workers=1, the first iteration is faster, and then all the others are the same

If I use .batch(32), which is useless in RL since, to my understanding, I need more workers to prepare the next batches, I see a speedup

...
pipe = pipes.Zipper(
    pipes.FileLister([dataset_dir / "images"], masks='*png'),
).map(lambda x: (read_image(x[0])))

pipe = pipes.InMemoryCacheHolder(pipe, size=32000).batch(32) # 8GB

for i in range(10):
    start = perf_counter()
    for data in pipe:
        # print(image.shape)
        continue

    print(f"[{i}]Elapsed {perf_counter() - start: .2f}")

[0]Elapsed  15.99
[1]Elapsed  0.03
[2]Elapsed  0.03
[3]Elapsed  0.03
[4]Elapsed  0.03
[5]Elapsed  0.03
[6]Elapsed  0.03
[7]Elapsed  0.03
[8]Elapsed  0.03
[9]Elapsed  0.03

Thanks!

Versions

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 14 2022, 12:59:47)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] torch==1.13.1
[pip3] torchdata==0.5.1
[pip3] torchvision==0.14.1
[conda] Could not collect

opened by FrancescoSaverioZuppichini 0

`_DataPipeSerializationWrapper` doesn't work with multiprocessing Queue
🐛 Describe the bug

After https://github.com/pytorch/data/pull/919 landed, a hang occurs on macOS or Windows, where spawn is used to create subprocesses by default. See: https://github.com/pytorch/data/actions/runs/3794926183 I was able to mitigate the issue by removing the SerializationWrapper from https://github.com/pytorch/data/blob/e15e1453967ce2f25f6fcd2838caadfd0e2fa811/torchdata/dataloader2/dataloader2.py#L112

The reason the SerializationWrapper doesn't work is that a multiprocessing.Queue is attached to a DataPipe and sent to subprocesses. Even though I am able to solve my hanging problem in a different way, it's better to solve this problem directly via SerializationWrapper.

The following should be a minimum repro example

ctx = mp.get_context("spawn")
q = ctx.Queue()
dp = IterableWrapper(list(range(10)))
# Attach a Queue
dp.q = q
dl = DataLoader2(dp, reading_service=PrototypeMultiProcessingReadingService(2, "spawn"))
for d in dl:
    pass

cc: @NivekT

Versions

main
opened by ejguan 1
Start to graduate `PrototypeMultiProcessingReadingService` from "prototype mode"
unrelated: we should start to graduate it from "prototype mode" and start finding initial pioneering use case adoption~

Originally posted by @wenleix in https://github.com/pytorch/data/pull/919#discussion_r1055161856
opened by ejguan 3
Potential circular import in `prefetcher`

🐛 Describe the bug

In prefetcher.py, dataloader2 is imported. There is a potential circular import issue if dataloader2 needs to take some dependency on DataPipe. https://github.com/pytorch/data/blob/fbee6f75c9e630ea793116caea58911d5ad7d6e0/torchdata/datapipes/iter/util/prefetcher.py#L13

We need to guarantee the dependency flow is dataloader2 depends on datapipes but not vice versa.

Versions

main

opened by ejguan 10
Emphasize `shutdown` should be called for `DataLoader2` at the end of loop
📚 The doc issue

When a ReadingService is provided, it's better to call shutdown on DataLoader2 to properly clean up either the distributed process group or the persistent worker processes.

[ ] We should add a note regarding shutdown to DataLoader2.

[ ] We need to add a tutorial section for DataLoader2

[ ] Beyond documentation, we need to clean up our test cases to make sure shutdown is called as examples for our customers.
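
A minimal sketch of the clean-up pattern described above, written without torchdata so it runs anywhere. `FakeLoader` and `shutting_down` are hypothetical names introduced for illustration; the only thing assumed about the real `DataLoader2` is that it exposes a `shutdown()` method, as this issue states.

```python
from contextlib import contextmanager


class FakeLoader:
    """Hypothetical stand-in for a loader that owns worker processes."""

    def __init__(self, data):
        self.data = list(data)
        self.is_shut_down = False

    def __iter__(self):
        return iter(self.data)

    def shutdown(self):
        # A real loader would join workers or tear down a process group here.
        self.is_shut_down = True


@contextmanager
def shutting_down(loader):
    """Yield the loader and always call loader.shutdown() afterwards."""
    try:
        yield loader
    finally:
        # Runs on normal exit and when the loop body raises.
        loader.shutdown()


loader = FakeLoader(range(3))
with shutting_down(loader) as dl:
    seen = [item for item in dl]
```

Wrapping the loop in `try`/`finally` (here through a context manager) means the clean-up also happens when the training loop exits early with an exception.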

Suggest a potential alternative/fix

No response
opened by ejguan 0

Releases(v0.5.1)

v0.5.1(Dec 16, 2022)

This is a minor release to update PyTorch dependency from 1.13.0 to 1.13.1. Please check the release note of TorchData 0.5.0 major release for more detail.

v0.5.0(Oct 27, 2022)

TorchData 0.5.0 Release Notes

Highlights
Backwards Incompatible Change
Deprecations
New Features
Improvements
Bug Fixes
Performance
Documentation
Future Plans
Beta Usage Note

Highlights

We are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.5.0 updates are focused on consolidating the DataLoader2 and ReadingService APIs and benchmarking. Highlights include:

Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. Detailed tutorial can be found here
- AWS S3 Benchmarking result
Consolidated API for DataLoader2 and provided a few ReadingServices, with detailed documentation now available here
Provided more comprehensive DataPipe operations, e.g., random_split, repeat, set_length, and prefetch.
Provided pre-compiled torchdata binaries for arm64 Apple Silicon

Backwards Incompatible Change

DataPipe

Changed the returned value of `MapDataPipe.shuffle` to an `IterDataPipe` (https://github.com/pytorch/pytorch/pull/83202)

IterDataPipe is used to preserve data order.

MapDataPipe.shuffle

0.4.1

>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False

0.5.0

>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True

`on_disk_cache` no longer accepts generator functions for the `filepath_fn` argument (https://github.com/pytorch/data/pull/810)

on_disk_cache

0.4.1

>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)

0.5.0

>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
# AssertionError
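
To illustrate what "generator function" means here, the sketch below shows one way such a function can be detected with the standard library. `check_filepath_fn` and `filepath_plain_fn` are hypothetical helpers; how torchdata performs its own check is not stated in the release note, and only the `AssertionError` outcome is taken from it.

```python
import inspect


def check_filepath_fn(fn):
    """Reject generator functions, similar in effect to the 0.5.0 behavior."""
    # Assumption: inspect.isgeneratorfunction is used purely for illustration.
    if inspect.isgeneratorfunction(fn):
        raise AssertionError("filepath_fn must not be a generator function")
    return fn


def filepath_gen_fn(url):
    # Generator version from the release note: contains `yield`, so it is rejected.
    yield from [url + f"/{i}" for i in range(3)]


def filepath_plain_fn(url):
    # Plain function that returns a value instead of yielding.
    return url + "/0"


check_filepath_fn(filepath_plain_fn)  # passes silently

try:
    check_filepath_fn(filepath_gen_fn)
    rejected = False
except AssertionError:
    rejected = True
```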

DataLoader2

Imposed single iterator constraint on `DataLoader2` (https://github.com/pytorch/data/pull/700)

DataLoader2 with a single iterator

0.4.1

>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl) # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2

0.5.0

>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl) # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises exception, since it1 is no longer valid
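
The single iterator constraint can be pictured with a small pure-Python sketch. This is not the DataLoader2 implementation; `SingleIteratorSource` and `_TrackedIterator` are hypothetical classes that only reproduce the observable behavior in the 0.5.0 example: a new iterator restarts from the beginning and invalidates the previous one.

```python
class SingleIteratorSource:
    """Iterable that keeps only its most recently created iterator valid."""

    def __init__(self, data):
        self._data = list(data)
        self._valid_id = 0  # id of the only iterator allowed to advance

    def __iter__(self):
        # Creating a new iterator invalidates every older one.
        self._valid_id += 1
        return _TrackedIterator(self, self._valid_id)


class _TrackedIterator:
    def __init__(self, source, iterator_id):
        self._source = source
        self._id = iterator_id
        # Each new iterator restarts from the first element (the "reset").
        self._inner = iter(source._data)

    def __iter__(self):
        return self

    def __next__(self):
        if self._id != self._source._valid_id:
            raise RuntimeError("This iterator has been invalidated")
        return next(self._inner)


src = SingleIteratorSource(range(10))
it1 = iter(src)
first = next(it1)
it2 = iter(src)  # resets the source and invalidates it1
second = next(it2)
try:
    next(it1)
    it1_still_valid = True
except RuntimeError:
    it1_still_valid = False
```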

Deep copy `DataPipe` during `DataLoader2` initialization or restoration (https://github.com/pytorch/data/pull/786, https://github.com/pytorch/data/pull/833)

Previously, if a DataPipe was passed to multiple DataLoaders, its state could be altered by any of them. In some cases this raised an exception due to the single iterator constraint; in others, behavior changed because of the adapters (e.g. shuffling) applied by another DataLoader.

Deep copy DataPipe during DataLoader2 constructor

0.4.1

>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
# RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...

0.5.0

>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
0 0
1 1
2 2
3 3
4 4
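
Why a deep copy isolates the two loaders can be sketched in plain Python. `CountingPipe` and `CopyingLoader` are hypothetical classes, not torchdata types; the sketch only assumes, as the release note says, that the pipeline is deep-copied when the loader is constructed.

```python
import copy


class CountingPipe:
    """Hypothetical stateful pipeline: remembers how many items it yielded."""

    def __init__(self, data):
        self.data = list(data)
        self.num_yielded = 0

    def __iter__(self):
        for item in self.data:
            self.num_yielded += 1
            yield item


class CopyingLoader:
    """Hypothetical loader that deep-copies its pipeline at construction."""

    def __init__(self, pipe):
        # Each loader owns a private copy, so iteration state is never shared.
        self.pipe = copy.deepcopy(pipe)

    def __iter__(self):
        return iter(self.pipe)


dp = CountingPipe([0, 1, 2, 3, 4])
dl1, dl2 = CopyingLoader(dp), CopyingLoader(dp)
pairs = list(zip(dl1, dl2))
```

The original `dp` is left untouched: only the private copies held by `dl1` and `dl2` advance.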

Deprecations

DataLoader2

Deprecated `traverse` function and `only_datapipe` argument (https://github.com/pytorch/pytorch/pull/85667)

Please use traverse_dps, which behaves the same as traverse with only_datapipe=True. (https://github.com/pytorch/data/pull/793)

DataPipe traverse function

0.4.1

>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)

0.5.0

>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.

New Features

DataPipe

Added AIStore DataPipe (https://github.com/pytorch/data/pull/545, https://github.com/pytorch/data/pull/667)
Added support for IterDataPipe to trace DataFrames operations (https://github.com/pytorch/pytorch/pull/71931,
Added support for DataFrameMakerIterDataPipe to accept dtype_generator to solve unserializable dtype (https://github.com/pytorch/data/pull/537)
Added graph snapshotting by counting number of successful yields for IterDataPipe (https://github.com/pytorch/pytorch/pull/79479, https://github.com/pytorch/pytorch/pull/79657)
Implemented drop operation for IterDataPipe to drop column(s) (https://github.com/pytorch/data/pull/725)
Implemented FullSyncIterDataPipe to synchronize distributed shards (https://github.com/pytorch/data/pull/713)
Implemented slice and flatten operations for IterDataPipe (https://github.com/pytorch/data/pull/730)
Implemented repeat operation for IterDataPipe (https://github.com/pytorch/data/pull/748)
Added LengthSetterIterDataPipe (https://github.com/pytorch/data/pull/747)
Added RandomSplitter (without buffer) (https://github.com/pytorch/data/pull/724)
Added padded_tokens to max_token_bucketize to bucketize samples based on total padded token length (https://github.com/pytorch/data/pull/789)
Implemented thread based PrefetcherIterDataPipe (https://github.com/pytorch/data/pull/770, https://github.com/pytorch/data/pull/818, https://github.com/pytorch/data/pull/826, https://github.com/pytorch/data/pull/842)

DataLoader2

Added CacheTimeout Adapter to redefine cache timeout of the DataPipe graph (https://github.com/pytorch/data/pull/571)
Added DistributedReadingService to support uneven data sharding (https://github.com/pytorch/data/pull/727)
Added PrototypeMultiProcessingReadingService
- Added prefetching (https://github.com/pytorch/data/pull/826)
- Fixed process termination (https://github.com/pytorch/data/pull/837)
- Enabled deterministic training in distributed/non-distributed environment (https://github.com/pytorch/data/pull/827)
- Handled empty queue exception properly (https://github.com/pytorch/data/pull/785)

Releng

Provided pre-compiled torchdata binaries for arm64 Apple Silicon (https://github.com/pytorch/data/pull/692)

Improvements

DataPipe

Fixed error message coming from the single iterator constraint (https://github.com/pytorch/pytorch/pull/79547)
Enabled profiler record context in __next__ for IterDataPipe (https://github.com/pytorch/pytorch/pull/79757)
Raised warning for unpicklable local function (#547) (https://github.com/pytorch/pytorch/pull/80232, https://github.com/pytorch/data/pull/547)
Cleaned up opened streams on a best-effort basis (https://github.com/pytorch/data/pull/560, https://github.com/pytorch/pytorch/pull/78952)
Used streaming reading mode for unseekable streams in TarArchiveLoader (https://github.com/pytorch/data/pull/653)
Improved GDrive 'content-disposition' error message (https://github.com/pytorch/data/pull/654)
Added as_tuple argument for CSVParserIterDataPipe to convert output from list to tuple (https://github.com/pytorch/data/pull/646)
Raised an error when HTTPReader gets a 404 response (#160) (https://github.com/pytorch/data/pull/569)
Added default no-op behavior for flatmap (https://github.com/pytorch/data/pull/749)
Added support to validate input_col with the provided map function for DataPipe (https://github.com/pytorch/pytorch/pull/80267, https://github.com/pytorch/data/pull/755, https://github.com/pytorch/pytorch/pull/84279)
Made ShufflerIterDataPipe support snapshotting (#83535)
Unified implementations between in_batch_shuffle with shuffle for IterDataPipe (https://github.com/pytorch/data/pull/745)
Made IterDataPipe.to_map_datapipe load data lazily (https://github.com/pytorch/data/pull/765)
Added kwargs to open files for FSSpecFileLister and FSSpecSaver (https://github.com/pytorch/data/pull/804)
Added missing functional name for FileLister (#86497)

DataLoader

Controlled shuffle option to all DataPipes with set_shuffle API (https://github.com/pytorch/pytorch/pull/83741)
Made distributed process group lazily initialized & share seed via the process group (https://github.com/pytorch/pytorch/pull/85279)

DataLoader2

Improved graph traverse function
- Added support for unhashable DataPipe (https://github.com/pytorch/pytorch/pull/80509, https://github.com/pytorch/data/pull/559)
- Added support for all python collection objects (https://github.com/pytorch/pytorch/pull/84079, https://github.com/pytorch/data/pull/773)
Ensured finalize and finalize_iteration are called during shutdown or exception (https://github.com/pytorch/data/pull/846)

Releng

Enabled conda release to support GLIBC_2.27 (https://github.com/pytorch/data/pull/859)

Bug Fixes

DataPipe

Fixed error for static typing (https://github.com/pytorch/data/pull/572, https://github.com/pytorch/data/pull/645, https://github.com/pytorch/data/pull/651, https://github.com/pytorch/pytorch/pull/81275, https://github.com/pytorch/data/pull/758)
Fixed fork and unzip operations for the case of a single child (https://github.com/pytorch/pytorch/pull/81502)
Corrected the type of exception that is being raised by ShufflerMapDataPipe (https://github.com/pytorch/pytorch/pull/82666)
Fixed buffer overflow for unzip when columns_to_skip is specified (https://github.com/pytorch/data/pull/658)
Fixed TarArchiveLoader to skip open for opened TarFile stream (https://github.com/pytorch/data/pull/679)
Fixed mishandling of exception message in IterDataPipe (https://github.com/pytorch/pytorch/pull/84676)
Fixed interface generation in setup.py (#87081)

Performance

DataLoader2

Added benchmarking for DataLoader2
- Added AWS cloud configurations (https://github.com/pytorch/data/pull/680)
- Added benchmark from torchvision training references (https://github.com/pytorch/data/pull/714)

Documentation

DataPipe

Added examples for data loading with DataPipe
- Read Criteo TSV and Parquet files and apply TorchArrow operations (https://github.com/pytorch/data/pull/561)
- Read caltech256 and coco with AIStoreDataPipe (https://github.com/pytorch/data/pull/582)
- Read from tigergraph database (https://github.com/pytorch/data/pull/783)
Improved docstring for DataPipe
- DataPipe converters (https://github.com/pytorch/data/pull/710)
- S3 DataPipe (https://github.com/pytorch/data/pull/784)
- FileOpenerIterDataPipe (https://github.com/pytorch/pytorch/pull/81407)
- buffer_size for MaxTokenBucketizer (https://github.com/pytorch/data/pull/834)
- Prefetcher (https://github.com/pytorch/data/pull/835)
Added tutorial to load from Cloud Storage Provider including AWS S3, Google Cloud Platform and Azure Blob Storage (https://github.com/pytorch/data/pull/812, https://github.com/pytorch/data/pull/836)
Improved tutorial
- Fixed tutorial for newline on Windows in generate_csv (https://github.com/pytorch/data/pull/675)
- Improved note on shuffling behavior (https://github.com/pytorch/data/pull/688)
- Fixed tutorial about shuffling before sharding (https://github.com/pytorch/data/pull/715)
- Added random_split example (https://github.com/pytorch/data/pull/843)
Simplified long type names for online doc (https://github.com/pytorch/data/pull/838)

DataLoader2

Improved docstring for DataLoader2 (https://github.com/pytorch/data/pull/581, https://github.com/pytorch/data/pull/817)
Added training examples using DataLoader2, ReadingService and DataPipe (https://github.com/pytorch/data/pull/563, https://github.com/pytorch/data/pull/664, https://github.com/pytorch/data/pull/670, https://github.com/pytorch/data/pull/787)

Releng

Added contribution guide for third-party library (https://github.com/pytorch/data/pull/663)

Future Plans

We will continue benchmarking over datasets on local disk and cloud storage using TorchData. We will also continue making DataLoader2 and the related ReadingService more stable and provide more features, such as snapshotting the data pipeline and restoring it from the serialized state. Stay tuned; we welcome any feedback.

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.


v0.4.1(Aug 5, 2022)
TorchData 0.4.1 Release Notes

Bug fixes

Fixed DataPipe working with DataLoader in the distributed environment (https://github.com/pytorch/pytorch/pull/80348, https://github.com/pytorch/pytorch/pull/81071)

Documentation

Updated TorchData tutorial (#675, #688, #715)

Releng

Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)

Python [3.8~3.10]


v0.4.0(Jun 28, 2022)

TorchData 0.4.0 Release Notes

Highlights
Backwards Incompatible Change
Deprecations
New Features
Improvements
Performance
Documentation
Future Plans
Beta Usage Note

Highlights

We are excited to announce the release of TorchData 0.4.0. This release is composed of about 120 commits since 0.3.0, made by 23 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.4.0 updates are focused on consolidating the DataPipe APIs and supporting more remote file systems. Highlights include:

DataPipe graph is now backward compatible with DataLoader regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. Please check the tutorial here.
AWSSDK is integrated to support listing/loading files from AWS S3.
Added support for reading from TFRecord and Hugging Face Hub.
DataLoader2 became available in prototype mode. For more details, please check our future plans.

Backwards Incompatible Change

DataPipe

Updated `Multiplexer` (functional API `mux`) to stop merging multiple `DataPipes` whenever the shortest one is exhausted (https://github.com/pytorch/pytorch/pull/77145)

Please use MultiplexerLongest (functional API mux_longest) to achieve the previous functionality.

0.3.0

>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]
>>> len(output_dp)
13

0.4.0

>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> len(output_dp)
9
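
The difference between stopping at the shortest input and continuing to the longest can be sketched with `itertools`. The helper functions `mux_shortest` and `mux_longest` below are hypothetical and operate on plain iterables; they illustrate the two round-robin policies and are not the `Multiplexer` or `MultiplexerLongest` implementations.

```python
from itertools import chain, zip_longest

_MISSING = object()  # sentinel marking "this input is exhausted"


def mux_shortest(*iterables):
    """Round-robin over the inputs, stopping when the shortest one ends."""
    # zip stops as soon as any input is exhausted.
    return list(chain.from_iterable(zip(*iterables)))


def mux_longest(*iterables):
    """Round-robin over the inputs, skipping inputs that are exhausted."""
    rounds = zip_longest(*iterables, fillvalue=_MISSING)
    return [x for x in chain.from_iterable(rounds) if x is not _MISSING]


inputs = (range(3), range(10, 15), range(20, 25))
shortest = mux_shortest(*inputs)  # 9 items: three full rounds
longest = mux_longest(*inputs)    # 13 items: every element of every input
```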

Enforced a single valid iterator for `IterDataPipes` with or without multiple outputs (https://github.com/pytorch/pytorch/pull/70479, https://github.com/pytorch/pytorch/pull/75995)

If you need to reference the same IterDataPipe multiple times, please apply .fork() on the IterDataPipe instance.

IterDataPipe with a single output

0.3.0

>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)
0
>>> next(it1)
1
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
[(0, 0), ..., (9, 9)]

0.4.0

>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp) # This doesn't raise any warning or error
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2) # Invalidates `it1`
0
>>> next(it1)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.

IterDataPipe with multiple outputs

0.3.0

>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)
# Basically share the same reference as `it1`
# doesn't reset because `cdp1` hasn't been read since reset
>>> next(it1)
0
>>> next(it2)
0
>>> next(it3)
1
# The next line resets all ChildDataPipe
# because `cdp2` has started reading
>>> it4 = iter(cdp2)
>>> next(it3)
0
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

0.4.0

>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1) # This invalidates `it1` and `it2`
>>> next(it1)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it2)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it3)
0
# The next line should not invalidate anything, as there was no new iterator created
# for `cdp2` after `it2` was invalidated
>>> it4 = iter(cdp2)
>>> next(it3)
1
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Deprecations

DataPipe

Deprecated functional APIs of `open_file_by_fsspec` and `open_file_by_iopath` for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/78970, https://github.com/pytorch/pytorch/pull/79302)

Please use open_files_by_fsspec and open_files_by_iopath instead.

0.3.0

>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec() # No Warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath() # No Warning

0.4.0

>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()
FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_fsspec()` instead.
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()
FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_iopath()` instead.
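
A deprecated alias of this kind is commonly implemented by forwarding the old name to the new one while emitting a `FutureWarning`. The sketch below uses a hypothetical `Opener` class with made-up method names and return values; it shows the general pattern, not how torchdata registers its functional APIs.

```python
import warnings


class Opener:
    """Hypothetical class with a renamed method and a deprecated alias."""

    def open_files(self, paths):
        # New plural name: the supported entry point.
        return [f"opened:{p}" for p in paths]

    def open_file(self, paths):
        # Old singular name: still works, but warns and forwards the call.
        warnings.warn(
            "`.open_file()` is deprecated and will be removed in a future "
            "release. Please use `.open_files()` instead.",
            FutureWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return self.open_files(paths)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    result = Opener().open_file(["a.txt"])
```

Callers keep working during the deprecation window and see the replacement name in the warning text.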

Argument `drop_empty_batches` of `Filter` (functional API `filter`) is deprecated and will be removed in a future release (https://github.com/pytorch/pytorch/pull/76060)

0.3.0

>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)

0.4.0

>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
FutureWarning: The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.

New Features

DataPipe

Added utility to visualize DataPipe graphs (https://github.com/pytorch/data/pull/330)

IterDataPipe

Added Bz2FileLoader with functional API of load_from_bz2 (https://github.com/pytorch/data/pull/312)
Added BatchMapper (functional API: map_batches) and FlatMapper (functional API: flat_map) (https://github.com/pytorch/data/pull/359)
Added support for WebDataset-style archives (https://github.com/pytorch/data/pull/367)
Added MultiplexerLongest with functional API of mux_longest (https://github.com/pytorch/data/pull/372)
Added ZipperLongest with functional API of zip_longest (https://github.com/pytorch/data/pull/373)
Added MaxTokenBucketizer with functional API of max_token_bucketize (https://github.com/pytorch/data/pull/283)
Added S3FileLister (functional API: list_files_by_s3) and S3FileLoader (functional API: load_files_by_s3) integrated with the native AWSSDK (https://github.com/pytorch/data/pull/165)
Added HuggingFaceHubReader (https://github.com/pytorch/data/pull/490)
Added TFRecordLoader with functional API of load_from_tfrecord (https://github.com/pytorch/data/pull/308)

MapDataPipe

Added UnZipper with functional API of unzip (https://github.com/pytorch/data/pull/325)
Added MapToIterConverter with functional API of to_iter_datapipe (https://github.com/pytorch/data/pull/327)
Added InMemoryCacheHolder with functional API of in_memory_cache (https://github.com/pytorch/data/pull/328)

Releng

Added nightly releases for TorchData. Users should be able to install nightly TorchData via
- pip install --pre torchdata -f https://download.pytorch.org/whl/nightly/cpu
- conda install -c pytorch-nightly torchdata
Added support of AWSSDK enabled DataPipes. See: README
- AWSSDK was pre-compiled and assembled in TorchData for both nightly and 0.4.0 releases

Improvements

DataPipe

Added optional encoding argument to FileOpener (https://github.com/pytorch/pytorch/pull/72715)
Renamed BucketBatcher argument to avoid name collision (https://github.com/pytorch/data/pull/304)
Removed default parameter of ShufflerIterDataPipe (https://github.com/pytorch/pytorch/pull/74370)
Made the profiler wrapper delegate function calls to the DataPipe iterator (https://github.com/pytorch/pytorch/pull/75275)
Added input_col argument to flatmap for applying fn to the specific column(s) (https://github.com/pytorch/data/pull/363)
Improved debug message when exceptions are raised within IterDataPipe (https://github.com/pytorch/pytorch/pull/75618)
Improved debug message when argument is a tuple/list of DataPipes (https://github.com/pytorch/pytorch/pull/76134)
Added functional API to StreamReader (functional API: read_from_stream) and FileOpener (functional API: open_files) (https://github.com/pytorch/pytorch/pull/76233)
Enabled graph traversal for MapDataPipe (https://github.com/pytorch/pytorch/pull/74851)
Added input_col argument to filter for applying filter_fn to the specific column(s) (https://github.com/pytorch/pytorch/pull/76060)
Added functional APIs for OnlineReaders (https://github.com/pytorch/data/pull/369)
- HTTPReaderIterDataPipe: read_from_http
- GDriveReaderDataPipe: read_from_gdrive
- OnlineReaderIterDataPipe: read_from_remote
Cleared buffer for DataPipe during __del__ (https://github.com/pytorch/pytorch/pull/76345)
Overrode wrong python https proxy on Windows (https://github.com/pytorch/data/pull/371)
Exposed functional API of 'to_map_datapipe' from IterDataPipe's pyi interface (https://github.com/pytorch/data/pull/326)
Moved buffer for IterDataPipe from iterator to instance (self) (https://github.com/pytorch/data/pull/388)
Improved DataPipe serialization:
- Enabled serialization of ForkerIterDataPipe (https://github.com/pytorch/pytorch/pull/73118)
- Fixed issue with DataPipe serialization with dill (https://github.com/pytorch/pytorch/pull/72896)
- Applied special serialization when dill is installed (https://github.com/pytorch/pytorch/pull/74958)
- Applied dill serialization for demux and added cache to graph traverse (https://github.com/pytorch/pytorch/pull/75034)
- Revamped serialization logic of DataPipes (https://github.com/pytorch/pytorch/pull/74984)
- Prevented automatic reset after state is restored (https://github.com/pytorch/pytorch/pull/77774)
Moved IterDataPipe buffers from iter to instance (self) (#76999)
Refactored buffer of Multiplexer from __iter__ to instance (self) (https://github.com/pytorch/pytorch/pull/77775)
Made GDriveReader handle the Virus Scan Warning (https://github.com/pytorch/data/pull/442)
Added **kwargs arguments to HttpReader to specify extra parameters for HTTP requests (https://github.com/pytorch/data/pull/392)
Updated FSSpecFileLister and IoPathFileLister to support multiple root paths and updated FSSpecFileLister to support S3 urls (https://github.com/pytorch/data/pull/383)
Fixed race conditions when writing files in multiprocessing
- Added filelock to IoPathSaver to prevent race conditions (https://github.com/pytorch/data/pull/413)
- Added lock mechanism to prevent on_disk_cache from downloading twice (https://github.com/pytorch/data/pull/409)
- Added instructions about ImportError for portalocker (https://github.com/pytorch/data/pull/506)
Added a 's' to the functional names of open/list DataPipes (https://github.com/pytorch/data/pull/479)
Added list_file functional API to FSSpecFileLister and IoPathFileLister (https://github.com/pytorch/data/pull/463)
Added list_files functional API to FileLister (https://github.com/pytorch/pytorch/pull/78419)
Improved FSSpec DataPipes to accept extra keyword arguments (https://github.com/pytorch/data/pull/495)
Passed kwargs through to the json.loads call in JsonParser (https://github.com/pytorch/data/pull/518)
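
The input_col argument added to filter above can be pictured with a small dependency-free sketch. This is not torchdata's implementation; the names filter_by_column and is_csv are hypothetical, and the sketch only assumes that the predicate sees the selected column while the whole row is kept or dropped.

```python
from typing import Any, Callable, Iterable, Iterator, Sequence


def filter_by_column(
    rows: Iterable[Sequence[Any]],
    filter_fn: Callable[[Any], bool],
    input_col: int,
) -> Iterator[Sequence[Any]]:
    """Yield whole rows whose `input_col` entry satisfies `filter_fn`."""
    for row in rows:
        # Only the selected column is passed to the predicate.
        if filter_fn(row[input_col]):
            yield row


def is_csv(name: str) -> bool:
    """Hypothetical predicate: keep entries whose name ends in .csv."""
    return name.endswith(".csv")


rows = [("train.csv", 120), ("notes.txt", 3), ("test.csv", 40)]
kept = list(filter_by_column(rows, is_csv, input_col=0))
```

Using a named function rather than a lambda mirrors the documentation change listed further down (examples updated to avoid lambdas).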

DataLoader

Added ability to use dill to pass DataPipes in multiprocessing (https://github.com/pytorch/pytorch/pull/77288)
DataLoader now automatically applies sharding to the DataPipe graph in single-process, multi-process and distributed environments (https://github.com/pytorch/pytorch/pull/78762, https://github.com/pytorch/pytorch/pull/78950, https://github.com/pytorch/pytorch/pull/79041, https://github.com/pytorch/pytorch/pull/79124, https://github.com/pytorch/pytorch/pull/79524)
Made ShufflerDataPipe deterministic with DataLoader in single-process, multi-process and distributed environments (https://github.com/pytorch/pytorch/pull/77741, https://github.com/pytorch/pytorch/pull/77855, https://github.com/pytorch/pytorch/pull/78765, https://github.com/pytorch/pytorch/pull/79829)
Prevented overriding shuffle settings in DataLoader for DataPipe (https://github.com/pytorch/pytorch/pull/75505)
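
To make the sharding and deterministic-shuffle items above concrete, here is a minimal sketch in plain Python. It assumes a round-robin scheme in which each worker keeps every n-th sample, and a seed shared by all workers; it does not reproduce DataLoader internals, and the names shuffled and shard are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffled(items: Iterable[T], seed: int) -> List[T]:
    """Return a shuffled copy; the same seed gives the same order everywhere."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out


def shard(items: Iterable[T], num_shards: int, shard_id: int) -> Iterator[T]:
    """Keep every `num_shards`-th item, starting at offset `shard_id`."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    for index, item in enumerate(items):
        if index % num_shards == shard_id:
            yield item


samples = list(range(10))
# Every worker shuffles with the same seed, then takes its own slice.
shards = [list(shard(shuffled(samples, seed=0), 3, i)) for i in range(3)]
```

Because every worker sees the same shuffled order before slicing, the shards are disjoint and together cover each sample exactly once.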

Releng

Made requirements.txt the single source of truth for TorchData version (https://github.com/pytorch/data/pull/414)
Prohibited Release GHA workflows running on forked branches. (https://github.com/pytorch/data/pull/361)

Performance

DataPipe

Lazily generated exception message for performance (https://github.com/pytorch/pytorch/pull/78673)
- Fixes regression introduced from single iterator constraint related PRs.
Disabled profiler for IterDataPipe by default (https://github.com/pytorch/pytorch/pull/78674)
- By skipping over the record function when the profiler is not enabled, the speedup is up to 5-6x for DataPipes when their internal operations are very simple (e.g. IterableWrapper)

Documentation

DataPipe

Fixed typo in TorchVision example (https://github.com/pytorch/data/pull/311)
Updated DataPipe naming guidelines (https://github.com/pytorch/data/pull/428)
Updated documents from DataSet to PyTorch Dataset (https://github.com/pytorch/data/pull/292)
Added examples for graphs, meshes and point clouds using DataPipe (https://github.com/pytorch/data/pull/337)
Added examples for semantic segmentation and time series using DataPipe (https://github.com/pytorch/data/pull/340)
Expanded the contribution guide, especially including instructions to add a new DataPipe (https://github.com/pytorch/data/pull/354)
Updated tutorial about placing sharding_filter (https://github.com/pytorch/data/pull/487)
Improved graph visualization documentation (https://github.com/pytorch/data/pull/504)
Added instructions about ImportError for portalocker (https://github.com/pytorch/data/pull/506)
Updated examples to avoid lambdas (https://github.com/pytorch/data/pull/524)
Updated documentation for S3 DataPipes (https://github.com/pytorch/data/pull/534)
Updated links for tutorial (https://github.com/pytorch/data/pull/543)

IterDataPipe

Fixed documentation for IterToMapConverter, S3FileLister and S3FileLoader (https://github.com/pytorch/data/pull/381)
Updated documentation for S3 DataPipes (https://github.com/pytorch/data/pull/534)

MapDataPipe

Updated contributing guide and added guidance for MapDataPipe (https://github.com/pytorch/data/pull/379)
- Rather than re-implementing the same functionalities twice for both IterDataPipe and MapDataPipe, we encourage users to use the built-in functionalities of IterDataPipe and use the converter to MapDataPipe as needed.
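
The guidance above (build with iterable-style pieces, then convert to map-style when keyed access is needed) can be sketched without any dependencies. The class IterToMapSketch is a hypothetical stand-in, not the library's converter; it assumes the source yields (key, value) pairs, or that key_value_fn turns each item into one.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, Optional, Tuple


class IterToMapSketch:
    """Lazily materialize an iterable of (key, value) pairs into a mapping."""

    def __init__(
        self,
        source: Iterable[Any],
        key_value_fn: Optional[Callable[[Any], Tuple[Hashable, Any]]] = None,
    ) -> None:
        self._source = source
        self._key_value_fn = key_value_fn
        self._map: Optional[Dict[Hashable, Any]] = None

    def _load(self) -> Dict[Hashable, Any]:
        # The source is consumed only once, on first access.
        if self._map is None:
            self._map = {}
            for item in self._source:
                key, value = self._key_value_fn(item) if self._key_value_fn else item
                self._map[key] = value
        return self._map

    def __getitem__(self, key: Hashable) -> Any:
        return self._load()[key]

    def __len__(self) -> int:
        return len(self._load())


lookup = IterToMapSketch([("cat", 0), ("dog", 1)])
squares = IterToMapSketch(range(4), key_value_fn=lambda n: (n, n * n))
```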

DataLoader/DataLoader2

Fixed tutorial about DataPipe working with DataLoader (https://github.com/pytorch/data/pull/458)
Updated examples and tutorial after automatic sharding has landed (https://github.com/pytorch/data/pull/505)
Added README for DataLoader2 (https://github.com/pytorch/data/pull/526, https://github.com/pytorch/data/pull/541)

Releng

Added nightly documentation for TorchData in https://pytorch.org/data/main/
Fixed instruction to install TorchData (https://github.com/pytorch/data/pull/455)

Future Plans

For DataLoader2, we are introducing new ways for DataPipes, the data loading API, and backends (aka ReadingServices) to interact. The feature is stable in terms of API, but not yet functionally complete. We welcome early adopters and feedback, as well as potential contributors.

Beta Usage Note


v0.3.0(Mar 10, 2022)
0.3.0 Release Notes

We are delighted to present the Beta release of TorchData. This is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. Based on community feedback, we have found that the existing DataLoader bundled too many features together and can be difficult to extend. Moreover, different use cases often require rewriting the same data loading utilities over and over again. The goal here is to enable composable data loading through Iterable-style and Map-style building blocks called “DataPipes” that work well out of the box with PyTorch’s DataLoader.

Highlights

What are DataPipes?

Usage Example

New Features

Documentation

Usage in Domain Libraries

Future Plans

Beta Usage Note

Highlights

We are releasing DataPipes, which come in two kinds: the Iterable-style DataPipe (IterDataPipe) and the Map-style DataPipe (MapDataPipe).

What are DataPipes?

Early on, we observed widespread confusion between the PyTorch DataSets which represented reusable loading tooling (e.g. TorchVision's ImageFolder), and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

DataPipe is simply a renaming and repurposing of the PyTorch DataSet for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the file names and deserialized data:

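    import json

    from torchdata.datapipes.iter import IterDataPipe


    class JsonParserIterDataPipe(IterDataPipe):
        def __init__(self, source_datapipe, **kwargs) -> None:
            self.source_datapipe = source_datapipe
            # Extra keyword arguments are forwarded to json.loads.
            self.kwargs = kwargs

        def __iter__(self):
            # Each upstream item is a (file name, readable stream) pair.
            for file_name, stream in self.source_datapipe:
                data = stream.read()
                yield file_name, json.loads(data, **self.kwargs)

        def __len__(self):
            # Parsing is one-to-one, so the length matches the source.
            return len(self.source_datapipe)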

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.

Under this naming convention, DataSet simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes.
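
As a rough illustration of chaining and of the factory-function idea, the following dependency-free sketch composes two toy pipes. The classes JsonParserSketch and FieldSelectorSketch and the function build_toy_dataset are hypothetical and do not subclass the real IterDataPipe; they only mimic the pattern of wrapping one iterable in another.

```python
import io
import json
from typing import Any, Iterable, Iterator, Tuple


class JsonParserSketch:
    """Turn (file name, stream) pairs into (file name, parsed JSON) pairs."""

    def __init__(self, source: Iterable[Tuple[str, Any]]) -> None:
        self.source = source

    def __iter__(self) -> Iterator[Tuple[str, Any]]:
        for file_name, stream in self.source:
            yield file_name, json.loads(stream.read())


class FieldSelectorSketch:
    """Keep a single field from each parsed record."""

    def __init__(self, source: Iterable[Tuple[str, Any]], field: str) -> None:
        self.source = source
        self.field = field

    def __iter__(self) -> Iterator[Tuple[str, Any]]:
        for file_name, record in self.source:
            yield file_name, record[self.field]


def build_toy_dataset(files: Iterable[Tuple[str, Any]], field: str) -> FieldSelectorSketch:
    """Factory function: the 'dataset' is just the composed graph of pipes."""
    return FieldSelectorSketch(JsonParserSketch(files), field)


files = [
    ("a.json", io.StringIO('{"label": 1, "text": "good"}')),
    ("b.json", io.StringIO('{"label": 0, "text": "bad"}')),
]
# The in-memory streams can be read only once, so iterate a single time.
result = list(build_toy_dataset(files, "label"))
```

Nothing is read until the outermost pipe is iterated, which is what keeps the operation streamed.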

Usage Example

In this example, we have a compressed TAR archive file stored in Google Drive and accessible via a URL. We demonstrate how you can use DataPipes to download the archive, cache the result, decompress the archive, filter for specific files, and parse and return the CSV content. The full example with a detailed explanation is included in the example folder.

    from torchdata.datapipes.iter import FileOpener, GDriveReader, IterableWrapper

    # Excerpt from inside a dataset function; setup steps are elided.
    url_dp = IterableWrapper([URL])
    cache_compressed_dp = GDriveReader(cache_compressed_dp)
    # cache_decompressed_dp = ...  # See source file for full code example

    # Opens and loads the content of the TAR archive file.
    cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").load_from_tar()
    # Filters for specific files based on the file name.
    cache_decompressed_dp = cache_decompressed_dp.filter(
        lambda fname_and_stream: _EXTRACTED_FILES[split] in fname_and_stream[0]
    )
    # Saves the decompressed file onto disk.
    cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
    data_dp = FileOpener(cache_decompressed_dp, mode="b")
    # Parses content of the decompressed CSV file and returns the result line by line.
    return data_dp.parse_csv().map(fn=lambda t: (int(t[0]), " ".join(t[1:])))

New Features

[Beta] IterDataPipe

We have implemented over 50 Iterable-style DataPipes across 10 different categories. They cover different functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the fsspec and iopath DataPipes will allow you to do so. The documentation provides detailed explanations and usage examples of each IterDataPipe.

[Beta] MapDataPipe

As with IterDataPipe, MapDataPipes are available for a variety of functionalities, though there are fewer of them. Support for more MapDataPipes will come later. If the existing ones do not meet your needs, you can write a custom DataPipe.

Documentation

The documentation for TorchData is now live. It contains a tutorial that covers how to use DataPipes, use them with DataLoader, and implement custom ones.

Usage in Domain Libraries

In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the popular datasets provided by the library are implemented using DataPipes, and a section of its SST-2 binary text classification tutorial demonstrates how you can use DataPipes to preprocess data for your model. There are also other prototype implementations of datasets with DataPipes in TorchVision (available in nightly releases) and in TorchRec. You can find more specific examples here.

Future Plans

There will be a new version of DataLoader in the next release. At a high level, the plan is that DataLoader V2 will only be responsible for multiprocessing, distributed, and similar functionalities, not data processing logic. All data processing features, such as shuffling and batching, will be moved out of DataLoader and into DataPipes. At the same time, the current version of DataLoader should remain available, and you can use DataPipes with it as well.

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.


Releases(v0.5.1)

v0.5.1(Dec 16, 2022)

v0.5.0(Oct 27, 2022)

TorchData 0.5.0 Release Notes

Highlights

Backwards Incompatible Change

DataPipe

Changed the returned value of MapDataPipe.shuffle to an IterDataPipe (https://github.com/pytorch/pytorch/pull/83202)

on_disk_cache no longer accepts generator functions for the filename_fn argument (https://github.com/pytorch/data/pull/810)

DataLoader2

Imposed single iterator constraint on DataLoader2 (https://github.com/pytorch/data/pull/700)

DataLoader2 now deep copies the DataPipe during initialization or restoration (https://github.com/pytorch/data/pull/786, https://github.com/pytorch/data/pull/833)

Deprecations

DataLoader2

Deprecated traverse function and only_datapipe argument (https://github.com/pytorch/pytorch/pull/85667)

New Features

DataPipe

DataLoader2

Releng

Improvements

DataPipe

DataLoader

DataLoader2

Releng

Bug Fixes

DataPipe

Performance

DataLoader2

Documentation

DataPipe

DataLoader2

Releng

Future Plans

Beta Usage Note

v0.4.1(Aug 5, 2022)

TorchData 0.4.1 Release Notes

Bug fixes

Refactor `OnDiskCacheHolder` to track a sequence of DataPipe operations.

For `end_caching`:

Updated `Multiplexer` (functional API `mux`) to stop merging multiple `DataPipes` whenever the shortest one is exhausted (https://github.com/pytorch/pytorch/pull/77145)

Enforced a single valid iterator for `IterDataPipes` with or without multiple outputs (https://github.com/pytorch/pytorch/pull/70479, https://github.com/pytorch/pytorch/pull/75995)

Deprecated functional APIs of `open_file_by_fsspec` and `open_file_by_iopath` for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/78970, https://github.com/pytorch/pytorch/pull/79302)

Argument `drop_empty_batches` of `Filter` (functional API `filter`) is deprecated and will be removed in a future release (https://github.com/pytorch/pytorch/pull/76060)
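
The `Multiplexer` change listed above (stop merging once the shortest input is exhausted) can be pictured with a short plain-Python sketch. It assumes round-robin order with one item taken from each input per round; the function name mux_until_shortest is hypothetical and this is not the library's code.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def mux_until_shortest(*sources: Iterable[T]) -> Iterator[T]:
    """Yield one item from each source per round, stopping at the shortest."""
    # zip() ends as soon as any source runs out, so only complete
    # rounds are emitted and leftovers from longer sources are dropped.
    for round_items in zip(*sources):
        yield from round_items


merged = list(mux_until_shortest([1, 2, 3], [10, 20], [100, 200, 300]))
```

Here the second input has only two items, so two full rounds are produced and the remaining items of the longer inputs are never yielded.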