Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Last update: Jan 01, 2022

Overview

Product Reviews Summarizer

Version 1.0.0

A quick guide on installation of important libraries and running the code.

The project has three .ipynb files - Data Scraper.ipynb, cosine-similarity-wo-tf-idf.ipynb, and cosine-similarity-w-tf-idf.ipynb.

Data Scraper

For the Data Scraper python script, we need to import the following three libraries - requests, BeautifulSoup, and pandas. The installation process can be viewed by clicking on the respective library names.

Splash

In this project, instead of using the default web browser to scrape data, we have created a splash container using docker. Splash is a light-weight javascript rendering service with an HTTP API. For easy installation, you can watch this amazing video by John Watson Rooney on YouTube.

https://www.youtube.com/watch?v=8q2K41QC2nQ&t=361s

Note: You need to make sure that you give the Splash Localhost URL to the requests.get().

Running the code

After you have installed and configured everything, you can run the code by providing the URL of your choice. Suppose, you are taking a product from Amazon, make sure to go to All Reviews page and go to page #2. Copy this URL upto the last '=' and paste it as an f-string in the code. Add a '{x}' after the '='. The code is ready to run. It will scrape the product name, review title, star rating, and the review body from each page, until the last page is encountered, and save it in .xlsx format.

Note: Specify the required output name and destination.

cosine-similarity-wo-tf-idf

For the cosine similarity model, first we need to download the pretrained GloVe Word Embeddings. Run the Load GloVe Word Embeddings section in the script once. It is only required if the kernel is restarted.

For this script, we need to import the following libraries - numpy, pandas, nltk, nltk.tokenize, nltk.corpus, re, sklearn.metrics.pairwise, networkx, transformers, and time. Also run the nltk.download('punkt') and nltk.download('stopwords') lines to download them.

Next step is to load the data as a dataframe. Make sure to give the correct address. Pre-processing of the reviews is done for efficient results. The pre-processing steps include converting to string datatype, converting alphabetical characters to lowercase, removing stopwords, replacing non-alphabetical characters with blank character and tokenizing the sentences.

The pre-processed data is then grouped based on star ratings and sent to the cosine similarity and pagerank algorithm. The top 10 ranked sentences after the applying the pagerank algorithm are sent to huggingface transformers to create an extractive summary (min_lenght = 75, max_length = 300). The summary, along with the product name, star rating, no of reviews, % of total reviews, and the top 5 frequent words along with the count are saved in .xlsx format.

Note: Specify the required output name and destination.

cosine-similarity-w-tf-idf

For this model, along with the above libraries, we need to import the following additional libraries - spacy, and heapq. The cosine similarity algorithm has a time complexity of O(n^2). In order to have a fast execution, in this method, we are using tf-idf measure to score the frequent words, and hence the corresponding sentences. Only the top 1000 sentences are then sent to the cosine similarity algorithm. Usage of the tf-idf measure, ensures that each product, irrespective of the number of sentences in the reviews, gives an output within 120 seconds. This method makes sure no important feature is lost, giving similar results as the previous method but in considerately less time.

Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Related tags

Overview

Product Reviews Summarizer

Data Scraper

Splash

Running the code

cosine-similarity-wo-tf-idf

cosine-similarity-w-tf-idf

Contributors

Owner

Parv Bhatt

NeurIPS'21: Probabilistic Margins for Instance Reweighting in Adversarial Training (Pytorch implementation).

A Survey of Natural Language Generation in Task-Oriented Dialogue System (TOD): Recent Advances and New Frontiers

Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Neural-Machine-Translation - Implementation of revolutionary machine translation models

Predict an emoji that is associated with a text

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

Code for Findings at EMNLP 2021 paper: "Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning"

Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

To be a next-generation DL-based phenotype prediction from genome mutations.

Conditional probing: measuring usable information beyond a baseline

Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Dual languaged (rus+eng) tool for packing and unpacking archives of Silky Engine.

Community and sentiment analysis based on tweets

100+ Chinese Word Vectors 上百种预训练中文词向量

Segmenter - Transformer for Semantic Segmentation

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

Constituency Tree Labeling Tool

Outreachy TFX custom component project