MAVE: : A Product Dataset for Multi-source Attribute Value Extraction

The dataset contains 3 million attribute-value annotations across 1257 unique categories created from 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for product attribute extraction study.

More details can be found in paper: https://arxiv.org/abs/2112.08663

The dataset is in JSON Lines format, where each line is a json object with the following schema:

, "category": , "paragraphs": [ { "text": , "source": }, ... ], "attributes": [ { "key": , "evidences": [ { "value": , "pid": , "begin": , "end": }, ... ] }, ... ] }">

{
   "id": 
           
            ,
   "category": 
            
             ,
   "paragraphs": [
      {
         "text": 
             
              ,
         "source": 
              
               
      },
      ...
   ],
   "attributes": [
      {
         "key": 
               
                , "evidences": [ { "value": 
                
                 , "pid": 
                 
                  , "begin": 
                  
                   , "end": 
                   
                     }, ... ] }, ... ] }

The product id is exactly the ASIN number in the All_Amazon_Meta.json file in the Amazon Review Data (2018). In this repo, we don't store paragraphs, instead we only store the labels. To obtain the full version of the dataset contaning the paragraphs, we suggest to first request the Amazon Review Data (2018), then run our binary to clean its product metadata and join with the labels as described below.

A json object contains a product and multiple attributes. A concrete example is shown as follows

{
   "id":"B0002H0A3S",
   "category":"Guitar Strings",
   "paragraphs":[
      {
         "text":"D'Addario EJ26 Phosphor Bronze Acoustic Guitar Strings, Custom Light, 11-52",
         "source":"title"
      },
      {
         "text":".011-.052 Custom Light Gauge Acoustic Guitar Strings, Phosphor Bronze",
         "source":"description"
      },
      ...
   ],
   "attributes":[
      {
         "key":"Core Material",
         "evidences":[
            {
               "value":"Bronze Acoustic",
               "pid":0,
               "begin":24,
               "end":39
            },
            ...
         ]
      },
      {
         "key":"Winding Material",
         "evidences":[
            {
               "value":"Phosphor Bronze",
               "pid":0,
               "begin":15,
               "end":30
            },
            ...
         ]
      },
      {
         "key":"Gauge",
         "evidences":[
            {
               "value":"Light",
               "pid":0,
               "begin":63,
               "end":68
            },
            {
               "value":"Light Gauge",
               "pid":1,
               "begin":17,
               "end":28
            },
            ...
         ]
      }
   ]
}

In addition to positive examples, we also provide a set of negative examples, i.e. (product, attribute name) pairs without any evidence. The overall statistics of the positive and negative sets are as follows

Counts	Positives	Negatives
# products	2226509	1248009
# product-attribute pairs	2987151	1780428
# products with 1-2 attributes	2102927	1140561
# products with 3-5 attributes	121897	99896
# products with >=6 attributes	1685	7552
# unique categories	1257	1114
# unique attributes	705	693
# unique category-attribute pairs	2535	2305

Creating the full version of the dataset

In this repo, we only open source the labels of the MAVE dataset and the code to deterministically clean the original Amazon product metadata in the Amazon Review Data (2018), and join with the labels to generate the full version of the MAVE dataset. After this process, the attribute values, paragraph ids and begin/end span indices will be consistent with the cleaned product profiles.

Step 1

Gain access to the Amazon Review Data (2018) and download the All_Amazon_Meta.json file to the folder of this repo.

Step 2

Run script

./clean_amazon_product_metadata_main.sh

to clean the Amazon metadata and join with the positive and negative labels in the labels/ folder. The output full MAVE dataset will be stored in the reproduce/ folder.

The script runs the clean_amazon_product_metadata_main.py binary using an apache beam pipeline. The binary will run on a single CPU core, but distributed setup can be enabled by changing pipeline options. The binary contains all util functions used to clean the Amazon metadata and join with labels. The pipeline will finish within a few hours on a single Intel Xeon 3GHz CPU core.

MAVE: : A Product Dataset for Multi-source Attribute Value Extraction

Related tags

Overview

MAVE: : A Product Dataset for Multi-source Attribute Value Extraction

Creating the full version of the dataset

Step 1

Step 2

Owner

Google Research Datasets

Code for the paper: Hierarchical Reinforcement Learning With Timed Subgoals, published at NeurIPS 2021

This repository contains the code for the ICCV 2019 paper "Occupancy Flow - 4D Reconstruction by Learning Particle Dynamics"

[CVPR 2021] Forecasting the panoptic segmentation of future video frames

High frequency AI based algorithmic trading module.

An official PyTorch Implementation of Boundary-aware Self-supervised Learning for Video Scene Segmentation (BaSSL)

Activity tragle - Google is tracking everything, we just look at it

Real-Time Seizure Detection using EEG: A Comprehensive Comparison of Recent Approaches under a Realistic Setting

This repository contains the code for the CVPR 2020 paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision"

A fast MoE impl for PyTorch

Code for "Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks", CVPR 2021

A library for differentiable nonlinear optimization.

ML-based medical imaging using Azure

Model Zoo of BDD100K Dataset

Do you like Quick, Draw? Well what if you could train/predict doodles drawn inside Streamlit? Also draws lines, circles and boxes over background images for annotation.

Fully Convolutional DenseNets for semantic segmentation.

House-GAN++: Generative Adversarial Layout Refinement Network towards Intelligent Computational Agent for Professional Architects

No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency

Pytorch reimplementation of the Mixer (MLP-Mixer: An all-MLP Architecture for Vision)

这是一个deeplabv3-plus-pytorch的源码，可以用于训练自己的模型。

Kaggleship: Kaggle Notebooks