Ecommerce product title recognition package

Last update: Mar 03, 2022

Overview

revizor

This package solves the task of splitting a product title string into components such as type, brand, model and article (or SKU, or product code, or whatever you call it).
Think of classic named entity recognition, but performed on product titles.

Install

revizor requires Python 3.8+ on Linux or macOS. Windows isn't supported yet, but contributions are welcome.

$ pip install revizor

Usage

from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.article == "CY.563781.P273"

Boring numbers

Actually, this is just the output of the flair training log:

Corpus: "Corpus: 138959 train + 15440 dev + 51467 test sentences"
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
ARTICLE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND      tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL      tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE       tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789
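
The precision, recall and F1 figures above follow from the tp/fp/fn counts in the log. A minimal sketch that recomputes them, assuming micro F1 is taken over the pooled counts and macro F1 is the unweighted mean of the per-class F1 scores (the names `counts` and `prf` are illustrative, not part of revizor or flair):

```python
# (tp, fp, fn) per class, copied from the training log above.
counts = {
    "ARTICLE": (9893, 1899, 3268),
    "BRAND": (47977, 2335, 514),
    "MODEL": (35187, 11824, 9995),
    "TYPE": (25044, 637, 443),
}


def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


per_class = {label: prf(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

# Micro F1: pool tp, fp and fn over all classes first.
total_tp, total_fp, total_fn = (sum(col) for col in zip(*counts.values()))
micro_f1 = prf(total_tp, total_fp, total_fn)[2]

for label, (p, r, f1) in per_class.items():
    print(f"{label:<8} precision: {p:.4f} - recall: {r:.4f} - f1-score: {f1:.4f}")
print(f"F1-score (micro) {micro_f1:.4f}")  # 0.8843
print(f"F1-score (macro) {macro_f1:.4f}")  # 0.8766
```

MODEL and ARTICLE have the most false positives and false negatives relative to their true positives, which is why they pull both averages below the BRAND and TYPE scores.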

Dataset

The model was trained on an automatically annotated corpus. Since the corpus may be affected by DMCA, we will not publish it.
But we can give a hint on how to obtain it, can't we?
A dataset can be created by scraping any large marketplace, such as goods, yandex.market or ozon.
We extract the product title and the table with product info, then parse the brand and model strings from that table.
Now we have the product title, brand and model. Then we can split the product title by the brand string, e.g.:

product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

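# Split on the first occurrence of the brand only, then strip the
# whitespace left around the brand on both sides.
product_type, _, product_model_plus_some_random_info = product_title.partition(brand)
product_type = product_type.strip()
product_model_plus_some_random_info = product_model_plus_some_random_info.strip()

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'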
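
To go from this split to token-level training labels, one option is to emit BIO tags per whitespace token. The following is only a sketch of that idea, not the annotation code actually used for revizor: the `annotate` function, the BIO scheme, and the choice to tag everything before the brand as TYPE and everything after the model as `O` are all assumptions.

```python
def annotate(title, brand, model):
    """Return a list of (token, tag) pairs, or None if the brand is not in the title.

    Hypothetical helper: tags use the BIO scheme with TYPE, BRAND and MODEL labels.
    """
    before, sep, after = title.partition(brand)
    if not sep:
        return None  # brand string not found, so the title cannot be annotated

    tagged = []

    def tag_span(text, label):
        # First token of a span gets B-, the remaining tokens get I-.
        for i, token in enumerate(text.split()):
            tagged.append((token, ("B-" if i == 0 else "I-") + label))

    tag_span(before, "TYPE")
    tag_span(brand, "BRAND")

    rest = after.strip()
    if model and rest.startswith(model):
        tag_span(model, "MODEL")
        rest = rest[len(model):]

    # Whatever is left (capacity, colour, ...) is outside any entity.
    for token in rest.split():
        tagged.append((token, "O"))
    return tagged


pairs = annotate("Смартфон Apple iPhone 12 Pro 128 Gb Space Gray", "Apple", "iPhone 12 Pro")
for token, tag in pairs:
    print(token, tag)
```

Titles where the brand is missing or the model does not directly follow the brand would need extra handling (or could simply be dropped from the corpus).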

License

This package is licensed under the MIT license.

Owner

Bureaucratic Labs
