Download and preprocess popular sequential recommendation datasets

Overview

Sequential Recommendation Datasets

This repository collects sequential recommendation datasets commonly used in recent research papers and provides a tool for downloading, preprocessing and batch-loading them. Preprocessing can be customized for the task at hand, for example short-term recommendation (including session-based recommendation) or long-short term recommendation. A faster loading option is also available, which integrates the PyTorch DataLoader.

Datasets

Install this tool

Stable version

pip install -U srdatasets --user

Latest version

pip install git+https://github.com/guocheng2018/sequential-recommendation-datasets.git --user

Download datasets

Run the command below to download datasets. Some datasets are not directly accessible; for those, you will be asked to download them manually and place them in the location the tool indicates.

srdatasets download --dataset=[dataset_name]

To view the download and processing status of all datasets, run

srdatasets info

Process datasets

The generic processing command is

srdatasets process --dataset=[dataset_name] [--options]

Splitting options

Two dataset splitting methods are provided: user-based and time-based. User-based splitting is applied to each user's behavior sequence according to the validation and test ratios, while time-based splitting is based on the dates of user behaviors. Splitting a dataset produces two processed datasets: one for development, which uses the validation set as its test set, and one for test, which contains the full training set.

--split-by     User or time (default: user)
--test-split   Proportion of test set to full dataset (default: 0.2)
--dev-split    Proportion of validation set to full training set (default: 0.1)
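The user-based split described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name split_user_sequence, the rounding with int(), and the returned dictionary layout are all assumptions made for the example.

```python
def split_user_sequence(sequence, test_split=0.2, dev_split=0.1):
    """Split one user's chronologically ordered items (hypothetical helper).

    test_split is a proportion of the full sequence; dev_split is a
    proportion of the full training part, as in the option descriptions.
    """
    n_test = int(len(sequence) * test_split)
    full_train = sequence[: len(sequence) - n_test]
    test = sequence[len(sequence) - n_test:]

    n_dev = int(len(full_train) * dev_split)
    dev_train = full_train[: len(full_train) - n_dev]
    dev = full_train[len(full_train) - n_dev:]

    return {
        # Development dataset: the validation part plays the role of the test set.
        "development": {"train": dev_train, "test": dev},
        # Test dataset: trains on the full training part.
        "test": {"train": full_train, "test": test},
    }


splits = split_user_sequence(list(range(1, 21)))
print(splits["development"]["test"], splits["test"]["test"])
```

With 20 items and the default ratios, the last 4 items form the test set and the last item of the remaining 16 forms the validation set.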

NOTE: time-based splitting requires you to enter the number of days manually at the console. Because you may not know the time span of a dataset, the tool shows you its total number of days first.
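One way to picture time-based splitting is shown below. It is only a sketch under stated assumptions: the names total_days and split_by_time are hypothetical, timestamps are assumed to be Unix seconds, and the last test_days days are assumed to form the test set with the dev_days days before them forming the validation set. The tool's real rules may differ.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_days(interactions):
    """Number of days covered by (user, item, timestamp) tuples."""
    times = [t for _, _, t in interactions]
    return (max(times) - min(times)) // SECONDS_PER_DAY + 1


def split_by_time(interactions, test_days, dev_days):
    """Split interactions by date (hypothetical helper, see assumptions above)."""
    last = max(t for _, _, t in interactions)
    test_start = last - test_days * SECONDS_PER_DAY
    dev_start = test_start - dev_days * SECONDS_PER_DAY

    full_train = [x for x in interactions if x[2] <= test_start]
    test = [x for x in interactions if x[2] > test_start]
    dev_train = [x for x in full_train if x[2] <= dev_start]
    dev = [x for x in full_train if x[2] > dev_start]

    return {
        "development": {"train": dev_train, "test": dev},
        "test": {"train": full_train, "test": test},
    }


# Ten interactions, one per day; the item id equals the day index.
events = [("u1", day, day * SECONDS_PER_DAY) for day in range(10)]
print(total_days(events))
result = split_by_time(events, test_days=2, dev_days=1)
```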

Task related options

For the short-term recommendation task, the previous input-len items are used to predict the next target-len items. To make user interests more focused, user behavior sequences can also be cut into sessions if session-interval is given. If the number of previous items is smaller than input-len, the input is left-padded with 0.
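The input/target windows and the left padding can be illustrated with a short sketch. The function name make_short_term_examples is hypothetical, and emitting one example per position in the sequence is an assumption; what the tool actually generates depends on its augmentation settings.

```python
def make_short_term_examples(sequence, input_len=5, target_len=1):
    """Build (input_items, target_items) pairs from one sequence (hypothetical helper)."""
    examples = []
    for end in range(1, len(sequence) - target_len + 1):
        # Take at most input_len items that precede the targets.
        history = sequence[max(0, end - input_len):end]
        # Left-pad with the padding item 0 up to input_len.
        padded = [0] * (input_len - len(history)) + history
        targets = sequence[end:end + target_len]
        examples.append((padded, targets))
    return examples


for inputs, targets in make_short_term_examples([11, 12, 13]):
    print(inputs, targets)
```

For the sequence [11, 12, 13] this yields the inputs [0, 0, 0, 0, 11] with target [12], then [0, 0, 0, 11, 12] with target [13].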

For the long and short term recommendation task, the pre-sessions previous sessions and the current session are used to predict target-len items. The target items are picked from the current session either randomly or from its end. The length of the current session is therefore max-session-len - target-len, while the length of each previous session is max-session-len. Any previous or current session shorter than its preset length is left-padded with 0.
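The lengths described above can be checked with a small sketch. The helper names left_pad and make_long_short_example are hypothetical. The sketch picks targets from the end of the current session, assumes max_session_len > target_len >= 1, and assumes that missing previous sessions are filled with all-zero sessions; none of this is confirmed as the tool's exact behavior.

```python
def left_pad(items, length):
    """Keep at most the last `length` items and left-pad with 0."""
    items = items[-length:]
    return [0] * (length - len(items)) + items


def make_long_short_example(sessions, pre_sessions=10, target_len=1, max_session_len=20):
    """Build one example from a user's sessions, oldest first (hypothetical helper)."""
    *previous, current = sessions

    # Targets come from the end of the current session.
    targets = current[-target_len:]
    cur_items = left_pad(current[:-target_len], max_session_len - target_len)

    # Keep the most recent pre_sessions sessions; fill missing ones with empty sessions.
    previous = previous[-pre_sessions:]
    previous = [[] for _ in range(pre_sessions - len(previous))] + previous

    # Flatten to length pre_sessions * max_session_len.
    pre_items = [i for s in previous for i in left_pad(s, max_session_len)]
    return pre_items, cur_items, targets


pre, cur, tgt = make_long_short_example(
    [[1, 2], [3, 4, 5], [6, 7, 8, 9]], pre_sessions=2, target_len=1, max_session_len=4
)
print(pre, cur, tgt)
```

The flattened previous sessions have length pre_sessions * max_session_len and the current session has length max_session_len - target_len, matching the shapes listed in the iteration template.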

--task              Short or long-short (default: short)
--input-len         Number of previous items (default: 5)
--target-len        Number of target items (default: 1)
--pre-sessions      Number of previous sessions (default: 10)
--pick-targets      Pick targets randomly or from the end of the current session (default: random)
--session-interval  Session splitting interval (minutes) (default: 0)
--min-session-len   Sessions shorter than this will be dropped (default: 2)
--max-session-len   Sessions longer than this will be cut (default: 20)
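The session options can be read as the following sketch. The function name cut_into_sessions is hypothetical, timestamps are assumed to be in seconds, and "cut" is assumed here to mean keeping only the first max_session_len items of an overlong session; the tool may handle long sessions differently.

```python
def cut_into_sessions(events, session_interval=30, min_session_len=2, max_session_len=20):
    """Group time-ordered (item, timestamp) pairs into sessions (hypothetical helper).

    A new session starts when the gap to the previous event exceeds
    session_interval minutes.
    """
    sessions, current = [], []
    last_time = None
    for item, timestamp in events:
        if last_time is not None and timestamp - last_time > session_interval * 60:
            sessions.append(current)
            current = []
        current.append(item)
        last_time = timestamp
    if current:
        sessions.append(current)

    # Drop short sessions, then truncate long ones (truncation is an assumption).
    return [s[:max_session_len] for s in sessions if len(s) >= min_session_len]


events = [(1, 0), (2, 60), (3, 120), (4, 4000), (5, 8000), (6, 8060)]
print(cut_into_sessions(events))
```

Here the single-item session [4] is dropped because it is shorter than min_session_len, leaving [[1, 2, 3], [5, 6]].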

Common options

--min-freq-item        Items with frequency below this will be dropped (default: 5)
--min-freq-user        Users with frequency below this will be dropped (default: 5)
--no-augment           Do not use data augmentation (default: False)
--remove-duplicates    Remove duplicated items in user sequence or user session (if split) (default: False)
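Frequency filtering of this kind can be sketched as below. The function name filter_by_frequency is hypothetical, and both the order of the steps (items first, then users) and the single pass are assumptions for illustration.

```python
from collections import Counter


def filter_by_frequency(interactions, min_freq_item=5, min_freq_user=5):
    """Drop rare items, then rare users, from (user, item) pairs (hypothetical helper)."""
    item_counts = Counter(item for _, item in interactions)
    kept = [(u, i) for u, i in interactions if item_counts[i] >= min_freq_item]

    # User frequencies are counted after rare items have been removed.
    user_counts = Counter(u for u, _ in kept)
    return [(u, i) for u, i in kept if user_counts[u] >= min_freq_user]


pairs = [("a", 1), ("a", 2), ("b", 1), ("b", 3), ("c", 1)]
print(filter_by_frequency(pairs, min_freq_item=2, min_freq_user=1))
```

Only item 1 appears at least twice in the toy data, so the pairs involving items 2 and 3 are dropped.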

Dataset related options

--rating-threshold  Interactions with rating less than this will be dropped (Amazon, Movielens, Yelp) (default: 4)
--item-type         Recommend artists or songs (Lastfm) (default: song)

Version

Different options produce different processed versions of a dataset. Run the command below to list the configurations and statistics of all processed versions of a dataset. The config id shown in the output is a required argument of DataLoader.

srdatasets info --dataset=[dataset_name]

DataLoader

DataLoader is a built-in class that makes loading processed datasets easy. Once you have initialized a dataloader by passing the dataset name, the processed version (config id), batch_size and a flag selecting training or test data, you can iterate over it to get batches. Because some models use rank-based learning, negative sampling is integrated into DataLoader. Negatives are sampled according to popularity from all items except those in the current data. Negative sampling is turned off by default (negatives_per_target is 0). Since the time of user behaviors is sometimes an important feature, you can also include it in the batch data by setting include_timestamp to True.
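Popularity-based negative sampling can be illustrated with the standard library alone. This sketch is not the DataLoader's internal code: the function name sample_negatives is hypothetical, item counts are used as popularity weights, and sampling with replacement is an assumption.

```python
import random
from collections import Counter


def sample_negatives(item_counts, exclude, num_targets, negatives_per_target, rng=random):
    """Sample negatives for one example, weighted by popularity (hypothetical helper).

    exclude holds the items of the current example (inputs and targets),
    which must never be drawn as negatives.
    """
    candidates = [item for item in item_counts if item not in exclude]
    weights = [item_counts[item] for item in candidates]
    # One row of negatives per target: shape (num_targets, negatives_per_target).
    return [
        rng.choices(candidates, weights=weights, k=negatives_per_target)
        for _ in range(num_targets)
    ]


popularity = Counter([1, 1, 1, 2, 2, 3, 4])
negatives = sample_negatives(
    popularity, exclude={1, 4}, num_targets=2, negatives_per_target=3, rng=random.Random(0)
)
print(negatives)
```

Stacking these rows over a batch gives the (batch_size, target_len, negatives_per_target) shape listed in the iteration template.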

Arguments

  • dataset_name: dataset name (case insensitive)
  • config_id: configuration id
  • batch_size: batch size (default: 1)
  • train: load training dataset (default: True)
  • development: load the dataset aiming for development (default: False)
  • negatives_per_target: number of negative samples per target (default: 0)
  • include_timestamp: add timestamps to batch data (default: False)
  • drop_last: drop last incomplete batch (default: False)

Attributes

  • num_users: total users in training dataset
  • num_items: total items in training dataset (not including the padding item 0)

Initialization example

from srdatasets.dataloader import DataLoader

trainloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=True, negatives_per_target=5, include_timestamp=True)
testloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=False, include_timestamp=True)

For PyTorch users, there is a wrapper built on torch.utils.data.DataLoader. With it you can set keyword arguments such as num_workers and pin_memory to speed up data loading.

from srdatasets.dataloader_pytorch import DataLoader

trainloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=True, negatives_per_target=5, include_timestamp=True, num_workers=8, pin_memory=True)
testloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=False, include_timestamp=True, num_workers=8, pin_memory=True)

Iteration template

For short term recommendation task

for epoch in range(10):
    # Train
    for users, input_items, target_items, input_item_timestamps, target_item_timestamps, negative_samples in trainloader:
        # Shape
        #   users:                  (batch_size,)
        #   input_items:            (batch_size, input_len)
        #   target_items:           (batch_size, target_len)
        #   input_item_timestamps:  (batch_size, input_len)
        #   target_item_timestamps: (batch_size, target_len)
        #   negative_samples:       (batch_size, target_len, negatives_per_target)
        #
        # DataType
        #   numpy.ndarray or torch.LongTensor
        pass

    # Test
    for users, input_items, target_items, input_item_timestamps, target_item_timestamps in testloader:
        pass

For long and short term recommendation task

for epoch in range(10):
    # Train
    for users, pre_sessions_items, cur_session_items, target_items, pre_sessions_item_timestamps, cur_session_item_timestamps, target_item_timestamps, negative_samples in trainloader:
        # Shape
        #   users:                          (batch_size,)
        #   pre_sessions_items:             (batch_size, pre_sessions * max_session_len)
        #   cur_session_items:              (batch_size, max_session_len - target_len)
        #   target_items:                   (batch_size, target_len)
        #   pre_sessions_item_timestamps:   (batch_size, pre_sessions * max_session_len)
        #   cur_session_item_timestamps:    (batch_size, max_session_len - target_len)
        #   target_item_timestamps:         (batch_size, target_len)
        #   negative_samples:               (batch_size, target_len, negatives_per_target)
        #
        # DataType
        #   numpy.ndarray or torch.LongTensor
        pass

    # Test
    for users, pre_sessions_items, cur_session_items, target_items, pre_sessions_item_timestamps, cur_session_item_timestamps, target_item_timestamps in testloader:
        pass

Disclaimers

This repo does not host or distribute any of the datasets. It is your responsibility to determine whether you have permission to use a dataset under its license.
