Abstractive opinion summarization system (SelSum) and the largest dataset of Amazon product summaries (AmaSum). EMNLP 2021 conference paper.

Overview

Learning Opinion Summarizers by Selecting Informative Reviews

This repository contains the codebase and the dataset for the corresponding EMNLP 2021 paper. Please star the repository and cite the paper if you find it useful.

SelSum is a probabilistic (latent) model that selects informative reviews from large collections and subsequently summarizes them as shown in the diagram below.

AmaSum is the largest abstractive opinion summarization dataset, consisting of more than 33,000 human-written summaries for Amazon products. Each summary is paired, on average, with more than 320 customer reviews. Summaries consist of verdicts, pros, and cons, see the example below.

Verdict: The Olympus Evolt E-500 is a compact, easy-to-use digital SLR camera with a broad feature set for its class and very nice photo quality overall.

Pros:

  • Compact design
  • Strong autofocus performance even in low-light situations
  • Intuitive and easy-to-navigate menu system
  • Wide range of automated and manual features to appeal to both serious hobbyists and curious SLR newcomers

Cons:

  • Unreliable automatic white balance in some conditions
  • Slow start-up time when dust reduction is enabled
  • Compatible Zuiko lenses don't indicate focal distance

1. Setting up

1.1. Environment

The easiest way to proceed is to create a separate conda environment with Python 3.7.0.

conda create -n selsum python=3.7.0

Further, install PyTorch as shown below.

conda install -c pytorch pytorch=1.7.0

In addition, install the essential python modules:

pip install -r requirements.txt

The codebase relies on FairSeq. To avoid version conflicts, please download our version and store it to ../fairseq_lib. Please follow the installation instructions in the unzipped directory.

1.2. Environmental variables

Before running scripts, please add the environmental variables below.

export PYTHONPATH=../fairseq_lib/.:$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MKL_THREADING_LAYER=GNU

1.3. Data

The dataset in various formats is available in the dataset folder. To run the model, please binarize the fairseq specific version.

1.4. Checkpoints

We also provide the checkpoints of the trained models. These should be allocated to artifacts/checkpoints.

2. Training

2.1. Posterior and Summarizer training

First, the posterior and summarizer need to be trained. The summarizer is initialized using the BART base model, please download the checkpoint and store it to artifacts/bart. Note: please adjust hyper-parameters and paths in the script if needed.

bash selsum/scripts/training/train_selsum.sh

Please note that REINFORCE-based loss for the posterior training can be negative as the forward pass does not correspond to the actual loss function. Instead, the loss is re-formulated to compute gradients in the backward pass (Eq. 5 in the paper).

2.2. Selecting reviews with the Posterior

Once the posterior is trained (jointly with the summarizer), informative reviews need to be selected. The script below produces binary tags indicating selected reviews.

python selsum/scripts/inference/posterior_select_revs.py --data-path=../data/form  \
--checkpoint-path=artifacts/checkpoints/selsum.pt \
--bart-dir=artifacts/bart \
--output-folder-path=artifacts/output/q_sel \
--split=test \
--ndocs=10 \
--batch-size=30

The output can be downloaded and stored to artifacts/output/q_sel.

2.3. Fitting the Prior

Once tags are produced by the posterior, we can fit the prior to approximate it.

bash selsum/scripts/training/train_prior.sh

2.4. Selecting Reviews with the Prior

After the prior is trained, we select informative reviews for downstream summarization.

python selsum/scripts/inference/prior_select_revs.py --data-path=../data/form \
--checkpoint-path=artifacts/checkpoints/prior.pt \
--bart-dir=artifacts/bart \
--output-folder-path=artifacts/output/p_sel \
--split=test \
--ndocs=10 \
--batch-size=10

The output can be downloaded and stored to artifacts/output/p_sel.

3. Inference

3.1. Summary generation

To generate summaries, run the command below:

python selsum/scripts/inference/gen_summs.py --data-path=artifacts/output/p_sel/ \
--bart-dir=artifacts/bart \
--checkpoint-path=artifacts/checkpoints/selsum.pt \
--output-folder-path=artifacts/output/p_summs \
--split=test \
--batch-size=20

The model outputs are also available at artifacts/summs.

3.2. Evaluation

For evaluation, we used a wrapper over ROUGE and the CoreNLP tokenizer.

The tokenizer requires the CoreNLP library to be downloaded. Please unzip it to the artifacts/misc folder. Further, make it visible in the classpath as shown below.

export CLASSPATH=artifacts/misc/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

After the installations, please adjust the paths and use the commands below.

GEN_FILE_PATH=artifacts/summs/test.verd
GOLD_FILE_PATH=../data/form/eval/test.verd

# tokenization
cat "${GEN_FILE_PATH}" | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > "${GEN_FILE_PATH}.tokenized"
cat "${GOLD_FILE_PATH}" | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > "${GOLD_FILE_PATH}.tokenized"

# rouge evaluation
files2rouge "${GOLD_FILE_PATH}.tokenized" "${GEN_FILE_PATH}.tokenized"

Citation

@inproceedings{bražinskas2021learning,
      title={Learning Opinion Summarizers by Selecting Informative Reviews}, 
      author={Arthur Bražinskas and Mirella Lapata and Ivan Titov},
      booktitle={Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
      year={2021},
}

License

Codebase: MIT

Dataset: non-commercial

Notes

  • Occasionally logging stops being printed while the model is training. In this case, the log can be displayed either with a gap or only at the end of the epoch.
  • SelSum is trained with a single data worker process because otherwise cross-parallel errors are encountered.
Owner
Arthur Bražinskas
PhD in NLP at the University of Edinburgh, UK. I work on abstractive opinion summarization.
Arthur Bražinskas
This is 2nd term discrete maths project done by UCU students that uses backtracking to solve various problems.

Backtracking Project Sponsors This is a project made by UCU students: Olha Liuba - crossword solver implementation Hanna Yershova - sudoku solver impl

Dasha 4 Oct 17, 2021
The Self-Supervised Learner can be used to train a classifier with fewer labeled examples needed using self-supervised learning.

Published by SpaceML • About SpaceML • Quick Colab Example Self-Supervised Learner The Self-Supervised Learner can be used to train a classifier with

SpaceML 92 Nov 30, 2022
A lightweight library designed to accelerate the process of training PyTorch models by providing a minimal

A lightweight library designed to accelerate the process of training PyTorch models by providing a minimal, but extensible training loop which is flexible enough to handle the majority of use cases,

Chris Hughes 110 Dec 23, 2022
Learning from History: Modeling Temporal Knowledge Graphs with Sequential Copy-Generation Networks

CyGNet This repository reproduces the AAAI'21 paper “Learning from History: Modeling Temporal Knowledge Graphs with Sequential Copy-Generation Network

CunchaoZ 89 Jan 03, 2023
Deep Ensemble Learning with Jet-Like architecture

Ransomware analysis using DEL with jet-like architecture comprising two CNN wings, a sparse AE tail, a non-linear PCA to produce a diverse feature space, and an MLP nose

Ahsen Nazir 2 Feb 06, 2022
OcclusionFusion: realtime dynamic 3D reconstruction based on single-view RGB-D

OcclusionFusion (CVPR'2022) Project Page | Paper | Video Overview This repository contains the code for the CVPR 2022 paper OcclusionFusion, where we

Wenbin Lin 193 Dec 15, 2022
Pytorch Implementation for CVPR2018 Paper: Learning to Compare: Relation Network for Few-Shot Learning

LearningToCompare Pytorch Implementation for Paper: Learning to Compare: Relation Network for Few-Shot Learning Howto download mini-imagenet and make

Jackie Loong 246 Dec 19, 2022
A simple baseline for the 2022 IEEE GRSS Data Fusion Contest (DFC2022)

DFC2022 Baseline A simple baseline for the 2022 IEEE GRSS Data Fusion Contest (DFC2022) This repository uses TorchGeo, PyTorch Lightning, and Segmenta

isaac 24 Nov 28, 2022
A python comtrade load library accelerated by go

Comtrade-GRPC Code for python used is mainly from dparrini/python-comtrade. Just patch the code in BinaryDatReader.parse for parsing a little more eff

Bo 1 Dec 27, 2021
Train a deep learning net with OpenStreetMap features and satellite imagery.

DeepOSM Classify roads and features in satellite imagery, by training neural networks with OpenStreetMap (OSM) data. DeepOSM can: Download a chunk of

TrailBehind, Inc. 1.3k Nov 24, 2022
Implementation based on Paper - Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling

Implementation based on Paper - Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling

HamasKhan 3 Jul 08, 2022
Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation

Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation Requirements This repository needs mmsegmentation Training To train

Adelaide Intelligent Machines (AIM) Group 7 Sep 12, 2022
Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation

Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation By Qiang Zhou*, Zilong Huang*, Lichao Huang, Han Shen, Yon

Forest 117 Apr 01, 2022
The Curious Layperson: Fine-Grained Image Recognition without Expert Labels (BMVC 2021)

The Curious Layperson: Fine-Grained Image Recognition without Expert Labels Subhabrata Choudhury, Iro Laina, Christian Rupprecht, Andrea Vedaldi Code

Subhabrata Choudhury 18 Dec 27, 2022
Real-time Object Detection for Streaming Perception, CVPR 2022

StreamYOLO Real-time Object Detection for Streaming Perception Jinrong Yang, Songtao Liu, Zeming Li, Xiaoping Li, Sun Jian Real-time Object Detection

Jinrong Yang 237 Dec 27, 2022
A "gym" style toolkit for building lightweight Neural Architecture Search systems

A "gym" style toolkit for building lightweight Neural Architecture Search systems

Jack Turner 12 Nov 05, 2022
code for Fast Point Cloud Registration with Optimal Transport

robot This is the repository for the paper "Accurate Point Cloud Registration with Robust Optimal Transport". We are in the process of refactoring the

28 Jan 04, 2023
Repo for EchoVPR: Echo State Networks for Visual Place Recognition

EchoVPR Repo for EchoVPR: Echo State Networks for Visual Place Recognition Currently under development Dirs: data: pre-collected hidden representation

Anil Ozdemir 4 Oct 04, 2022
Implementations for the ICLR-2021 paper: SEED: Self-supervised Distillation For Visual Representation.

Implementations for the ICLR-2021 paper: SEED: Self-supervised Distillation For Visual Representation.

Jacob 27 Oct 23, 2022
Fast and Easy Infinite Neural Networks in Python

Neural Tangents ICLR 2020 Video | Paper | Quickstart | Install guide | Reference docs | Release notes Overview Neural Tangents is a high-level neural

Google 1.9k Jan 09, 2023