BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

Related tags

Deep Learningbooksum
Overview

BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev

Introduction

The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset poses a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.

Paper link: https://arxiv.org/abs/2105.08209

Table of Contents

  1. Updates
  2. Citation
  3. Legal Note
  4. License
  5. Usage
  6. Get Involved

Updates

4/15/2021

Initial commit

Citation

@article{kryscinski2021booksum,
      title={BookSum: A Collection of Datasets for Long-form Narrative Summarization}, 
      author={Wojciech Kry{\'s}ci{\'n}ski and Nazneen Rajani and Divyansh Agarwal and Caiming Xiong and Dragomir Radev},
      year={2021},
      eprint={2105.08209},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Legal Note

By downloading or using the resources, including any code or scripts, shared in this code repository, you hereby agree to the following terms, and your use of the resources is conditioned on and subject to these terms.

  1. You may only use the scripts shared in this code repository for research purposes. You may not use or allow others to use the scripts for any other purposes and other uses are expressly prohibited.
  2. You will comply with all terms and conditions, and are responsible for obtaining all rights, related to the services you access and the data you collect.
  3. We do not make any representations or warranties whatsoever regarding the sources from which data is collected. Furthermore, we are not liable for any damage, loss or expense of any kind arising from or relating to your use of the resources shared in this code repository or the data collected, regardless of whether such liability is based in tort, contract or otherwise.

License

The code is released under the BSD-3 License (see LICENSE.txt for details).

Usage

1. Chapterized Project Guteberg Data

The chapterized book text from Gutenberg, for the books we use in our work, has been made available through a public GCP bucket. It can be fetched using:

gsutil cp gs://sfr-books-dataset-chapters-research/all_chapterized_books.zip .

or downloaded directly here.

2. Data Collection

Data collection scripts for the summary text are organized by the different sources that we use summaries from. Note: At the time of collecting the data, all links in literature_links.tsv were working for the respective sources.

For each data source, run get_works.py to first fetch the links for each book, and then run get_summaries.py to get the summaries from the collected links.

python scripts/data_collection/cliffnotes/get_works.py
python scripts/data_collection/cliffnotes/get_summaries.py

3. Data Cleaning

Data Cleaning is performed through the following steps:

First script for some basic cleaning operations, like removing parentheses, links etc from the summary text

python scripts/data_cleaning_scripts/basic_clean.py

We use intermediate alignments in summary_chapter_matched_all_sources.jsonl to identify which summaries are separable, and separates them, creating new summaries (eg. Chapters 1-3 summary separated into 3 different files - Chapter 1 summary, Chapter 2 summary, Chapter 3 summary)

python scripts/data_cleaning_scripts/split_aggregate_chaps_all_sources.py

Lastly, our final cleaning script using various regexes to separate out analysis/commentary text, removes prefixes, suffixes etc.

python scripts/data_cleaning_scripts/clean_summaries.py

Data Alignments

Generating paragraph alignments from the chapter-level-summary-alignments, is performed individually for the train/test/val splits:

Gather the data from the summaries and book chapters into a single jsonl

python paragraph-level-summary-alignments/gather_data.py

Generate alignments of the paragraphs with sentences from the summary using the bi-encoder paraphrase-distilroberta-base-v1

python paragraph-level-summary-alignments/align_data_bi_encoder_paraphrase.py

Aggregate the generated alignments for cases where multiple sentences from chapter-summaries are matched to the same paragraph from the book

python paragraph-level-summary-alignments/aggregate_paragraph_alignments_bi_encoder_paraphrase.py

Troubleshooting

  1. The web archive links we collect the summaries from can often be unreliable, taking a long time to load. One way to fix this is to use higher sleep timeouts when one of the links throws an exception, which has been implemented in some of the scripts.
  2. Some links that constantly throw errors are aggregated in a file called - 'section_errors.txt'. This is useful to inspect which links are actually unavailable and re-running the data collection scripts for those specific links.

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce
Salesforce
[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Disentangled Representation Learning for Text-Video Retrieval This is a PyTorch implementation of the paper Disentangled Representation Learning for T

Qiang Wang 49 Dec 18, 2022
A simple rest api serving a deep learning model that classifies human gender based on their faces. (vgg16 transfare learning)

this is a simple rest api serving a deep learning model that classifies human gender based on their faces. (vgg16 transfare learning)

crispengari 5 Dec 09, 2021
This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEECH" submitted to ICASSP 2022

CPC_DeepCluster This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEEC

LEAP Lab 2 Sep 15, 2022
[ICCV 2021] Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Counterfactual Attention Learning Created by Yongming Rao*, Guangyi Chen*, Jiwen Lu, Jie Zhou This repository contains PyTorch implementation for ICCV

Yongming Rao 90 Dec 31, 2022
Based on the paper "Geometry-aware Instance-reweighted Adversarial Training" ICLR 2021 oral

Geometry-aware Instance-reweighted Adversarial Training This repository provides codes for Geometry-aware Instance-reweighted Adversarial Training (ht

Jingfeng 47 Dec 22, 2022
Implementation of Deformable Attention in Pytorch from the paper "Vision Transformer with Deformable Attention"

Deformable Attention Implementation of Deformable Attention from this paper in Pytorch, which appears to be an improvement to what was proposed in DET

Phil Wang 128 Dec 24, 2022
Code for our ACL 2021 paper "One2Set: Generating Diverse Keyphrases as a Set"

One2Set This repository contains the code for our ACL 2021 paper “One2Set: Generating Diverse Keyphrases as a Set”. Our implementation is built on the

Jiacheng Ye 63 Jan 05, 2023
Code for the Active Speakers in Context Paper (CVPR2020)

Active Speakers in Context This repo contains the official code and models for the "Active Speakers in Context" CVPR 2020 paper. Before Training The c

43 Oct 14, 2022
Train neural network for semantic segmentation (deep lab V3) with pytorch in less then 50 lines of code

Train neural network for semantic segmentation (deep lab V3) with pytorch in 50 lines of code Train net semantic segmentation net using Trans10K datas

17 Dec 19, 2022
Simple-Neural-Network From Scratch in Python

Simple-Neural-Network From Scratch in Python This is a simple Neural Network created without any Machine Learning Libraries. The only dependencies are

Aum Shah 1 Dec 28, 2021
O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning (CoRL 2021)

O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning Object-object Interaction Affordance Learning. For a given object-object int

Kaichun Mo 26 Nov 04, 2022
Contrastive Learning with Non-Semantic Negatives

Contrastive Learning with Non-Semantic Negatives This repository is the official implementation of Robust Contrastive Learning Using Negative Samples

39 Jul 31, 2022
Toward Spatially Unbiased Generative Models (ICCV 2021)

Toward Spatially Unbiased Generative Models Implementation of Toward Spatially Unbiased Generative Models (ICCV 2021) Overview Recent image generation

Jooyoung Choi 88 Dec 01, 2022
FCOSR: A Simple Anchor-free Rotated Detector for Aerial Object Detection

FCOSR: A Simple Anchor-free Rotated Detector for Aerial Object Detection FCOSR: A Simple Anchor-free Rotated Detector for Aerial Object Detection arXi

59 Nov 29, 2022
《K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters》(2020)

K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters This repository is the implementation of the paper "K-Adapter: Infusing Knowledge

Microsoft 118 Dec 13, 2022
BraTs-VNet - BraTS(Brain Tumour Segmentation) using V-Net

BraTS(Brain Tumour Segmentation) using V-Net This project is an approach to dete

Rituraj Dutta 7 Nov 27, 2022
Command-line tool for downloading and extending the RedCaps dataset.

RedCaps Downloader This repository provides the official command-line tool for downloading and extending the RedCaps dataset. Users can seamlessly dow

RedCaps dataset 33 Dec 14, 2022
You Only Look Once for Panopitic Driving Perception

You Only 👀 Once for Panoptic 🚗 Perception You Only Look at Once for Panoptic driving Perception by Dong Wu, Manwen Liao, Weitian Zhang, Xinggang Wan

Hust Visual Learning Team 1.4k Jan 04, 2023
A GUI to automatically create a TOPAS-readable MLC simulation file

Python script to create a TOPAS-readable simulation file descriring a Multi-Leaf-Collimator. Builds the MLC using the data from a 3D .stl file.

Sebastian Schäfer 0 Jun 19, 2022
Comp445 project - Data Communications & Computer Networks

COMP-445 Data Communications & Computer Networks Change Python version in Conda

Peng Zhao 2 Oct 03, 2022