Data pipelines for both TensorFlow and PyTorch!

Overview

rapidnlp-datasets

Python package PyPI version Python

Data pipelines for both TensorFlow and PyTorch !

If you want to load public datasets, try:

If you want to load local, personal dataset with minimized boilerplate, use rapidnlp-datasets!

installation

pip install -U rapidnlp-datasets

If you work with PyTorch, you should install PyTorch first.

If you work with TensorFlow, you should install TensorFlow first.

Usage

Here are few examples to show you how to use this library.

sequence-classification-quickstart

In PyTorch,

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForSequenceClassification
>>> dataset = DatasetForSequenceClassification.from_jsonl_files(
        input_files=["testdata/sequence_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
    )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForSequenceClassifiation
>>> dataset, d = TFDatasetForSequenceClassifiation.from_jsonl_files(
        input_files=["testdata/sequence_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
        return_self=True,
    )
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

Especially, you can save dataset to tfrecord format when working with TensorFlow, and then build dataset from tfrecord files directly!

>>> d.save_tfrecord("testdata/sequence_classification.tfrecord")
2021-12-08 14:52:41,295    INFO             utils.py  128] Finished to write 2 examples to tfrecords.
>>> dataset = TFDatasetForSequenceClassifiation.from_tfrecord_files("testdata/sequence_classification.tfrecord")
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

question-answering-quickstart

In PyTorch:

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForQuestionAnswering
>>>
>>> dataset = DatasetForQuestionAnswering.from_jsonl_files(
        input_files="testdata/qa.jsonl",
        vocab_file="testdata/vocab.txt",
    )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForQuestionAnswering
>>> dataset, d = TFDatasetForQuestionAnswering.from_jsonl_files(
        input_files="testdata/qa.jsonl",
        vocab_file="testdata/vocab.txt",
        return_self=True,
    )
2021-12-08 15:09:06,747    INFO question_answering_dataset.py  101] Read 3 examples in total.
>>> for idx, batch in enumerate(iter(dataset)):
        print()
        print("NO.{} batch: \n{}".format(idx, batch))
... 

Especially, you can save dataset to tfrecord format when working with TensorFlow, and then build dataset from tfrecord files directly!

>>> d.save_tfrecord("testdata/qa.tfrecord")
2021-12-08 15:09:31,329    INFO             utils.py  128] Finished to write 3 examples to tfrecords.
>>> dataset = TFDatasetForQuestionAnswering.from_tfrecord_files(
        "testdata/qa.tfrecord",
        batch_size=32,
        padding="batch",
    )
>>> for idx, batch in enumerate(iter(dataset)):
        print()
        print("NO.{} batch: \n{}".format(idx, batch))
... 

token-classification-quickstart

masked-language-models-quickstart

simcse-quickstart

You might also like...
In this project we use both Resnet and Self-attention layer for cat, dog and flower classification.
In this project we use both Resnet and Self-attention layer for cat, dog and flower classification.

cdf_att_classification classes = {0: 'cat', 1: 'dog', 2: 'flower'} In this project we use both Resnet and Self-attention layer for cdf-Classification.

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

🤗 Push your spaCy pipelines to the Hugging Face Hub
🤗 Push your spaCy pipelines to the Hugging Face Hub

spacy-huggingface-hub: Push your spaCy pipelines to the Hugging Face Hub This package provides a CLI command for uploading any trained spaCy pipeline

AI pipelines for Nvidia Jetson Platform

Jetson Multicamera Pipelines Easy-to-use realtime CV/AI pipelines for Nvidia Jetson Platform. This project: Builds a typical multi-camera pipeline, i.

This is a repository for a No-Code object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operating systems.
This is a repository for a No-Code object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operating systems.

OpenVINO Inference API This is a repository for an object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operati

Machine learning framework for both deep learning and traditional algorithms
Machine learning framework for both deep learning and traditional algorithms

NeoML is an end-to-end machine learning framework that allows you to build, train, and deploy ML models. This framework is used by ABBYY engineers for

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

CPT This repository contains code and checkpoints for CPT. CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Gener

A transformer which can randomly augment VOC format dataset (both image and bbox) online.
A transformer which can randomly augment VOC format dataset (both image and bbox) online.

VocAug It is difficult to find a script which can augment VOC-format dataset, especially the bbox. Or find a script needs complex requirements so it i

Official repository for GCR rerank, a GCN-based reranking method for both image and video re-ID

Official repository for GCR rerank, a GCN-based reranking method for both image and video re-ID

Releases(v0.2.0)
Implementation of "Efficient Regional Memory Network for Video Object Segmentation" (Xie et al., CVPR 2021).

RMNet This repository contains the source code for the paper Efficient Regional Memory Network for Video Object Segmentation. Cite this work @inprocee

Haozhe Xie 76 Dec 14, 2022
Plaything for Autistic Children (demo for PaddlePaddle/Wechaty/Mixlab project)

星星的孩子 - 一款为孤独症孩子设计的聊天机器人游戏 孤独症儿童是目前常常被忽视的一类群体。他们有着类似性格内向的特征,实际却受着广泛性发育障碍的折磨。 项目背景 这类儿童在与人交往时存在着沟通障碍,其特点表现在: 社交交流差,互动障碍明显 认知能力有限,被动认知 兴趣狭窄,重复刻板,缺乏变化和想象

Tianyi Pan 35 Nov 24, 2022
Microscopy Image Cytometry Toolkit

Cytokit Cytokit is a collection of tools for quantifying and analyzing properties of individual cells in large fluorescent microscopy datasets with a

Hammer Lab 106 Jan 06, 2023
Masked regression code - Masked Regression

Masked Regression MR - Python Implementation This repositery provides a python implementation of MR (Masked Regression). MR can efficiently synthesize

Arbish Akram 1 Dec 23, 2021
基于DouZero定制AI实战欢乐斗地主

DouZero_For_Happy_DouDiZhu: 将DouZero用于欢乐斗地主实战 本项目基于DouZero 环境配置请移步项目DouZero 模型默认为WP,更换模型请修改start.py中的模型路径 运行main.py即可 SL (baselines/sl/): 基于人类数据进行深度学习

1.5k Jan 08, 2023
GLODISMO: Gradient-Based Learning of Discrete Structured Measurement Operators for Signal Recovery

GLODISMO: Gradient-Based Learning of Discrete Structured Measurement Operators for Signal Recovery This is the code to the paper: Gradient-Based Learn

3 Feb 15, 2022
Python package to add text to images, textures and different backgrounds

nider Python package for text images generation and watermarking Free software: MIT license Documentation: https://nider.readthedocs.io. nider is an a

Vladyslav Ovchynnykov 131 Dec 30, 2022
FinEAS: Financial Embedding Analysis of Sentiment 📈

FinEAS: Financial Embedding Analysis of Sentiment 📈 (SentenceBERT for Financial News Sentiment Regression) This repository contains the code for gene

LHF Labs 31 Dec 13, 2022
CNNs for Sentence Classification in PyTorch

Introduction This is the implementation of Kim's Convolutional Neural Networks for Sentence Classification paper in PyTorch. Kim's implementation of t

Shawn Ng 956 Dec 19, 2022
Easy-to-use,Modular and Extendible package of deep-learning based CTR models .

DeepCTR DeepCTR is a Easy-to-use,Modular and Extendible package of deep-learning based CTR models along with lots of core components layers which can

浅梦 6.6k Jan 08, 2023
A simple Python configuration file operator.

A simple Python configuration file operator This project provides a common way to read configurations using config42. Installation It is possible to i

Scott Lau 2 Nov 08, 2021
PyTorch implementation of MICCAI 2018 paper "Liver Lesion Detection from Weakly-labeled Multi-phase CT Volumes with a Grouped Single Shot MultiBox Detector"

Grouped SSD (GSSD) for liver lesion detection from multi-phase CT Note: the MICCAI 2018 paper only covers the multi-phase lesion detection part of thi

Sang-gil Lee 36 Oct 12, 2022
Official implementation of the PICASO: Permutation-Invariant Cascaded Attentional Set Operator

PICASO Official PyTorch implemetation for the paper PICASO:Permutation-Invariant Cascaded Attentive Set Operator. Requirements Python 3 torch = 1.0 n

Samira Zare 0 Dec 23, 2021
An educational AI robot based on NVIDIA Jetson Nano.

JetBot Looking for a quick way to get started with JetBot? Many third party kits are now available! JetBot is an open-source robot based on NVIDIA Jet

NVIDIA AI IOT 2.6k Dec 29, 2022
Control-Robot-Arm-using-PS4-Controller - A Robotic Arm based on Raspberry Pi and Arduino that controlled by PS4 Controller

Control-Robot-Arm-using-PS4-Controller You can see all details about this Robot

MohammadReza Sharifi 5 Jan 01, 2022
codes for IKM (arXiv2021, Submitted to IEEE Trans)

Image-specific Convolutional Kernel Modulation for Single Image Super-resolution This repository is for IKM introduced in the following paper Yuanfei

Yuanfei Huang 9 Dec 29, 2022
[제 13회 투빅스 컨퍼런스] OK Mugle! - 장르부터 멜로디까지, Content-based Music Recommendation

Ok Mugle! 🎵 장르부터 멜로디까지, Content-based Music Recommendation 'Ok Mugle!'은 제13회 투빅스 컨퍼런스(2022.01.15)에서 진행한 음악 추천 프로젝트입니다. Description 📖 본 프로젝트에서는 Kakao

SeongBeomLEE 5 Oct 09, 2022
Group Activity Recognition with Clustered Spatial Temporal Transformer

GroupFormer Group Activity Recognition with Clustered Spatial-TemporalTransformer Backbone Style Action Acc Activity Acc Config Download Inv3+flow+pos

28 Dec 12, 2022
Code for the bachelors-thesis flaky fault localization

Flaky_Fault_Localization Scripts for the Bachelors-Thesis: "Flaky Fault Localization" by Christian Kasberger. The thesis examines the usefulness of sp

Christian Kasberger 1 Oct 26, 2021
Learning to Initialize Neural Networks for Stable and Efficient Training

GradInit This repository hosts the code for experiments in the paper, GradInit: Learning to Initialize Neural Networks for Stable and Efficient Traini

Chen Zhu 124 Dec 30, 2022