Installation:

pip install lm_dataloader

Design Philosophy

A library to unify language model (LM) data loading at large scale.
Simple interface: any tokenizer can be integrated.
Minimal changes are needed to go from small to large scale (many multi-GPU nodes).
Follows the 'mmap' data format used by fairseq / Megatron, with the following improvements:
- Easily combine multiple datasets
- Easily split a dataset into train / val / test splits
- Easily build a weighted dataset out of a list of existing ones along with weights.
- Unified into a single 'file' (which is actually a directory containing a .bin and an .idx file)
- Index files that are built on the fly are hidden files, leaving less clutter in the directory
- More straightforward interface, better documentation.
- Inspectable with a command line tool
- Can load from urls
- Can load from S3 buckets
- Can load from GCS buckets
- Can tokenize on the fly instead of preprocessing
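
To make the 'mmap' idea above concrete, here is a minimal sketch of a .bin / .idx pair: the .bin file holds all token ids back to back, and the .idx file holds the offsets at which each document starts. This is an illustration only. The layout, the `write_indexed` and `read_document` helpers, and the dtypes are assumptions made for this example; they are not the on-disk format or API of lm_dataloader, fairseq, or Megatron.

```python
import os
import tempfile

import numpy as np


def write_indexed(prefix, documents, dtype=np.uint16):
    """Write token ids to <prefix>.bin and document offsets to <prefix>.idx.

    Hypothetical layout for illustration, not the library's real format.
    """
    lengths = np.array([len(doc) for doc in documents], dtype=np.int64)
    # offsets[i] is where document i starts; offsets[-1] is the total token count
    offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.int64)
    tokens = np.concatenate([np.asarray(doc, dtype=dtype) for doc in documents])
    tokens.tofile(prefix + ".bin")
    offsets.tofile(prefix + ".idx")


def read_document(prefix, i, dtype=np.uint16):
    """Return document i, touching only the bytes of that document in the .bin file."""
    offsets = np.fromfile(prefix + ".idx", dtype=np.int64)
    tokens = np.memmap(prefix + ".bin", dtype=dtype, mode="r")
    # Copy the slice out of the memory map so the result outlives the mapping
    return np.array(tokens[offsets[i]:offsets[i + 1]])


with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "toy")
    write_indexed(prefix, [[1, 2, 3], [4, 5], [6]])
    second_doc = read_document(prefix, 1)
    print(second_doc.tolist())
```

Because the token file is memory-mapped, reading one document does not load the whole dataset into RAM, which is what makes this kind of format practical at large scale.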

Misc. TODO:

- [ ] Option to set mpu globally (for distributed dataloading)

Example usage

To tokenize a dataset contained in a .jsonl file (where the text to be tokenized can be accessed under the 'text' key):

import lm_dataloader as lmdl
from transformers import GPT2TokenizerFast 

jsonl_path = "test.jsonl"
output = "my_dataset.lmd"
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

lmdl.encode(
    jsonl_path,
    output_prefix=output,
    tokenize_fn=tokenizer.encode,
    tokenizer_vocab_size=len(tokenizer),
    eod_token=tokenizer.eos_token_id,
)

This will create a dataset at "my_dataset.lmd" which can be loaded as an indexed torch dataset like so:

from lm_dataloader import LMDataset
from transformers import GPT2TokenizerFast 

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
seq_length = tokenizer.model_max_length # or whatever the sequence length of your model is

dataset = LMDataset("my_dataset.lmd", seq_length=seq_length)

# peek at 0th index
print(dataset[0])
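
For reference, the input to the encode step above is a JSON Lines file: one JSON object per line, with the text stored under the 'text' key. The sketch below writes such a file using only the standard library. The `write_jsonl` helper and the sample records are made up for this example and are not part of lm_dataloader.

```python
import json

# Sample documents, invented for illustration
records = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "A second document, stored as its own line."},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


write_jsonl("test.jsonl", records)
```

The resulting test.jsonl matches the jsonl_path used in the tokenization example.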

Command line utilities

Command line utilities are also provided to inspect and merge datasets, e.g.:

lm-dataloader inspect my_dataset.lmd

This launches an interactive terminal to inspect the data in my_dataset.lmd.

And:

lm-dataloader merge my_dataset.lmd,my_dataset_2.lmd new_dataset.lmd

This merges the datasets at "my_dataset.lmd" and "my_dataset_2.lmd" into a new dataset at "new_dataset.lmd".

Dataloader tools for language modelling
