A method for cleaning and classifying text using transformers.

Overview

NLP Translation and Classification

The repository contains a method for classifying and cleaning text using NLP transformers.

Overview

The input data are web-scraped product names gathered from various e-shops. The products are either monitors or printers. Each product in the dataset has a scraped name containing information about the product brand, and product model name, but also unwanted noise - irrelevant information about the item. Additionally, only some records are relevant, meaning that they belong to the correct category: monitor or printer, while other records belong to unwanted categories like accessories or TVs.

The goal of the tasks is to preprocess web-scraped data by removing noisy records and cleaning product names. Preliminary experiments showed that classic machine learning methods like tf-idf vectorization and classification struggled to achieve good results. Instead NLP transformers were employed:

  • First, DistilBERT was utilized for removing irrelevant records. The available data are monitors with annotated labels where the records are classified into three classes: "Monitor", "TV", and "Noise".
  • After, T5 was applied for cleaning product names by translating scraped name into clean name containing only product brand and product model name. For instance, for the given input "monitor led aoc 24g2e 24" ips 1080 ..." the desired output is "aoc | 24g2e". The available data are monitors and printers with annotated targets.

The datasets are split into training, validation and test sets without overlapping records.

The results and details about training and evaluation procedure can be found in the Jupyter Notebooks, see Content section below.

Content

The repository contains Jupyter Notebooks for training and evaluating NNs:

  • 01_data_exploration.ipynb - The notebook contains an exploration of the datasets for sequence classification and translation. It includes visualization of distributions of targets, and overview of available metadata.
  • 02a_classification_fine_tuning.ipynb - The notebook fine-tunes a DistilBERT classifier using training and validation sets, and saves the trained checkpoint.
  • 02b_classification_evaluation.ipynb - The notebook evaluates classification scores on the test set. It includes: a classification report with precision, recall and F1 scores; and a confusion matrix.
  • 03a_translation_fine_tuning.ipynb - The notebook fine-tunes a T5 translation network using training and validation sets, and saves the trained checkpoint.
  • 03b_translation_evaluation.ipynb - The notebook evaluates translation metrics on the test set. The metrics are: Text Accuracy (exact match of target and predicted sequences); Levenshtein Score (normalized reversed Levenshtein Distance where 1 is the best and 0 is the worst); and Jaccard Index.
  • 04_benchmarking.ipynb - The notebook evaluates GPU memory and time needed for running inference on DistilBERT and T5 models using various values of batch size and sequence length.

Getting Started

Package Dependencies

The method were developed using Python=3.7 with transformers=4.8 framework that uses PyTorch=1.9 machine learning framework on a backend. Additionally, the repository requires packages: numpy, pandas, matplotlib and datasets.

To install required packages with PyTorch for CPU run:

pip install -r requirements.txt

For PyTorch with GPU run:

pip install -r requirements_gpu.txt

The requirement files do not contain jupyterlab nor any other IDE. To install jupyterlab run

pip install jupyterlab

Contact

Rail Chamidullin - [email protected] - Github account

Owner
Ray Chamidullin
Ray Chamidullin
Ray-based parallel data preprocessing for NLP and ML.

Wrangl Ray-based parallel data preprocessing for NLP and ML. pip install wrangl # for latest pip install git+https://github.com/vzhong/wrangl See exa

Victor Zhong 33 Dec 27, 2022
To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset.

To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset.

Ragesh Hajela 0 Feb 08, 2022
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing Trankit is a light-weight Transformer-based Pyth

652 Jan 06, 2023
ConferencingSpeech2022; Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge

ConferencingSpeech 2022 challenge This repository contains the datasets list and scripts required for the ConferencingSpeech 2022 challenge. For more

21 Dec 02, 2022
GSoC'2021 | TensorFlow implementation of Wav2Vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Vasudev Gupta 73 Nov 28, 2022
Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge Correlation Explanation (CorEx) is a topic model that yields rich topics tha

Greg Ver Steeg 592 Dec 18, 2022
NLP tool to extract emotional phrase from tweets 🀩

Emotional phrase extractor Extract phrase in the given text that is used to express the sentiment. Capturing sentiment in language is important in the

Shahul ES 38 Oct 17, 2022
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Meta Research 6.4k Jan 08, 2023
λ‚΄λΆ€ μž‘μ—…μš© django + vue(vuetify) boilerplate. μ§  ν•˜λ©΄ λŒμ•„κ°.

Pocket Galaxy μ•„μ£Ό κ°„λ‹¨ν•œ 개인용, ν˜Ήμ€ λ‚΄λΆ€μš© νˆ΄μ„ λ§Œλ“€μ–΄μ•Όν•˜λŠ”λ° 이왕이면 웹이 νŽΈν•˜μ£ ? κ·ΈλŸ΄λ•Œλ₯Ό μœ„ν•΄ λ§Œλ“€μ–΄λ‘” django와 vue(vuetify)둜 이뀄진 boilerplate μž…λ‹ˆλ‹€. 각 폴더에 μžˆλŠ” μ„€λͺ…μ„œλŒ€λ‘œ 싀행을 μ‹œν‚€λ©΄ 일단 λ‹Ήμž₯ λ­”κ°€κ°€ λŒμ•„κ°‘λ‹ˆ

Jamie J. Seol 16 Dec 03, 2021
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023
End-to-End Speech Processing Toolkit

ESPnet: end-to-end speech processing toolkit system/pytorch ver. 1.0.1 1.1.0 1.2.0 1.3.1 1.4.0 1.5.1 1.6.0 1.7.1 1.8.1 ubuntu18/python3.8/pip ubuntu18

ESPnet 5.9k Jan 03, 2023
AMUSE - financial summarization

AMUSE AMUSE - financial summarization Unzip data.zip Train new model: python FinAnalyze.py --task train --start 0 --count how many files,-1 for all

1 Jan 11, 2022
Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS)

TOPSIS implementation in Python Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) CHING-LAI Hwang and Yoon introduced TOPSIS

Hamed Baziyad 8 Dec 10, 2022
Behavioral Testing of Clinical NLP Models

Behavioral Testing of Clinical NLP Models This repository contains code for testing the behavior of clinical prediction models based on patient letter

Betty van Aken 2 Sep 20, 2022
Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

GoDataDriven 28 Nov 21, 2022
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 08, 2021
This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project

Common Voice Utils This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project. It aims t

Francis Tyers 40 Dec 20, 2022
πŸ’¬ Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Jan 03, 2023
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022