A method for cleaning and classifying text using transformers.

Overview

NLP Translation and Classification

The repository contains a method for classifying and cleaning text using NLP transformers.

Overview

The input data are web-scraped product names gathered from various e-shops. The products are either monitors or printers. Each product in the dataset has a scraped name containing information about the product brand, and product model name, but also unwanted noise - irrelevant information about the item. Additionally, only some records are relevant, meaning that they belong to the correct category: monitor or printer, while other records belong to unwanted categories like accessories or TVs.

The goal of the tasks is to preprocess web-scraped data by removing noisy records and cleaning product names. Preliminary experiments showed that classic machine learning methods like tf-idf vectorization and classification struggled to achieve good results. Instead NLP transformers were employed:

  • First, DistilBERT was utilized for removing irrelevant records. The available data are monitors with annotated labels where the records are classified into three classes: "Monitor", "TV", and "Noise".
  • After, T5 was applied for cleaning product names by translating scraped name into clean name containing only product brand and product model name. For instance, for the given input "monitor led aoc 24g2e 24" ips 1080 ..." the desired output is "aoc | 24g2e". The available data are monitors and printers with annotated targets.

The datasets are split into training, validation and test sets without overlapping records.

The results and details about training and evaluation procedure can be found in the Jupyter Notebooks, see Content section below.

Content

The repository contains Jupyter Notebooks for training and evaluating NNs:

  • 01_data_exploration.ipynb - The notebook contains an exploration of the datasets for sequence classification and translation. It includes visualization of distributions of targets, and overview of available metadata.
  • 02a_classification_fine_tuning.ipynb - The notebook fine-tunes a DistilBERT classifier using training and validation sets, and saves the trained checkpoint.
  • 02b_classification_evaluation.ipynb - The notebook evaluates classification scores on the test set. It includes: a classification report with precision, recall and F1 scores; and a confusion matrix.
  • 03a_translation_fine_tuning.ipynb - The notebook fine-tunes a T5 translation network using training and validation sets, and saves the trained checkpoint.
  • 03b_translation_evaluation.ipynb - The notebook evaluates translation metrics on the test set. The metrics are: Text Accuracy (exact match of target and predicted sequences); Levenshtein Score (normalized reversed Levenshtein Distance where 1 is the best and 0 is the worst); and Jaccard Index.
  • 04_benchmarking.ipynb - The notebook evaluates GPU memory and time needed for running inference on DistilBERT and T5 models using various values of batch size and sequence length.

Getting Started

Package Dependencies

The method were developed using Python=3.7 with transformers=4.8 framework that uses PyTorch=1.9 machine learning framework on a backend. Additionally, the repository requires packages: numpy, pandas, matplotlib and datasets.

To install required packages with PyTorch for CPU run:

pip install -r requirements.txt

For PyTorch with GPU run:

pip install -r requirements_gpu.txt

The requirement files do not contain jupyterlab nor any other IDE. To install jupyterlab run

pip install jupyterlab

Contact

Rail Chamidullin - [email protected] - Github account

Owner
Ray Chamidullin
Ray Chamidullin
Python SDK for working with Voicegain Speech-to-Text

Voicegain Speech-to-Text Python SDK Python SDK for the Voicegain Speech-to-Text API. This API allows for large vocabulary speech-to-text transcription

Voicegain 3 Dec 14, 2022
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend

gpt-j-api 🦜 An API to interact with the GPT-J language model. You can use and test the model in two different ways: Streamlit web app at http://api.v

Víctor Gallego 276 Dec 31, 2022
:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 6.4k Jan 09, 2023
PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

760 Jan 03, 2023
Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

KoSimCSE Korean Simple Contrastive Learning of Sentence Embeddings implementation using pytorch SimCSE Installation git clone https://github.com/BM-K/

34 Nov 24, 2022
🐍 A hyper-fast Python module for reading/writing JSON data using Rust's serde-json.

A hyper-fast, safe Python module to read and write JSON data. Works as a drop-in replacement for Python's built-in json module. This is alpha software

Matthias 479 Jan 01, 2023
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022
Repository for the paper "Optimal Subarchitecture Extraction for BERT"

Bort Companion code for the paper "Optimal Subarchitecture Extraction for BERT." Bort is an optimal subset of architectural parameters for the BERT ar

Alexa 461 Nov 21, 2022
txtai: Build AI-powered semantic search applications in Go

txtai: Build AI-powered semantic search applications in Go txtai executes machine-learning workflows to transform data and build AI-powered semantic s

NeuML 49 Dec 06, 2022
CDLA: A Chinese document layout analysis (CDLA) dataset

CDLA: A Chinese document layout analysis (CDLA) dataset 介绍 CDLA是一个中文文档版面分析数据集,面向中文文献类(论文)场景。包含以下10个label: 正文 标题 图片 图片标题 表格 表格标题 页眉 页脚 注释 公式 Text Title

buptlihang 84 Dec 28, 2022
MRC approach for Aspect-based Sentiment Analysis (ABSA)

B-MRC MRC approach for Aspect-based Sentiment Analysis (ABSA) Paper: Bidirectional Machine Reading Comprehension for Aspect Sentiment Triplet Extracti

Phuc Phan 1 Apr 05, 2022
Simple python code to fix your combo list by removing any text after a separator or removing duplicate combos

Combo List Fixer A simple python code to fix your combo list by removing any text after a separator or removing duplicate combos Removing any text aft

Hamidreza Dehghan 3 Dec 05, 2022
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 06, 2022
Modeling cumulative cases of Covid-19 in the US during the Covid 19 Delta wave using Bayesian methods.

Introduction The goal of this analysis is to find a model that fits the observed cumulative cases of COVID-19 in the US, starting in Mid-July 2021 and

Alexander Keeney 1 Jan 05, 2022
Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Polish Wordnet Python library Simple, easy-to-use and reasonably fast library for using the Słowosieć (also known as PlWordNet) - a lexico-semantic da

Max Adamski 12 Dec 23, 2022
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
Python library for parsing resumes using natural language processing and machine learning

CVParser Python library for parsing resumes using natural language processing and machine learning. Setup Installation on Linux and Mac OS Follow the

nafiu 0 Jul 29, 2021
TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

Alexa 98 Dec 09, 2022
Blazing fast language detection using fastText model

Luga A blazing fast language detection using fastText's language models Luga is a Swahili word for language. fastText provides a blazing fast language

Prayson Wilfred Daniel 18 Dec 20, 2022
Final Project Bootcamp Zero

The Quest (Pygame) Descripción Este es el repositorio de código The-Quest para el proyecto final Bootcamp Zero de KeepCoding. El juego consiste en la

Seven-z01 1 Mar 02, 2022