Two-stage text summarization with BERT and BART

Last update: Oct 22, 2022

Overview

Two-Stage Text Summarization

Description

We experiment with a two-stage summarization model on the CNN/DailyMail dataset. The first stage filters informative sentences, as in extractive summarization, and the second stage paraphrases them, as in abstractive summarization. Our best model achieves a ROUGE-L F1 score of 39.82, outperforming both the strong Lead-3 baseline and BERTSumEXT. Qualitative analysis indicates better readability and factual accuracy. Further, fine-tuning both stages with our oracle as the gold reference shows the potential to outperform BART.
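To make the metric and the baseline concrete, here is a minimal sketch of a simplified ROUGE-L F1 and of Lead-3. It is not the evaluation script behind the reported 39.82: it assumes lowercased whitespace tokenization and a single LCS over the whole summary, with no stemming, whereas standard ROUGE toolkits differ in these details. The function names `lcs_length`, `rouge_l_f1`, and `lead_3` are illustrative and do not come from the repository.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on whitespace tokens, scaled to 0-100."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 100.0 * 2 * precision * recall / (precision + recall)


def lead_3(sentences: List[str]) -> str:
    """Lead-3 baseline: the first three sentences of the article."""
    return " ".join(sentences[:3])


if __name__ == "__main__":
    # Made-up toy article, already split into sentences.
    article = [
        "The council approved the new budget on Monday.",
        "The plan raises spending on schools.",
        "Critics say the plan ignores road repairs.",
        "A final vote is expected next month.",
    ]
    reference = "The council approved a budget that raises spending on schools."
    print(rouge_l_f1(lead_3(article), reference))
```

Lead-3 is a strong baseline on news data because articles tend to put the key facts in the opening sentences, so a model has to beat simple truncation to show that its sentence selection adds value.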

Results

Environment

conda create -n text-sum python=3.8
conda activate text-sum
pip install -r src/requirements.txt

Extraction stage

See here

Abstraction stage

See here
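The way the two stages fit together can be sketched as below. This is a hypothetical skeleton, not the repository's code: `two_stage_summarize`, `score_sentence`, `rewrite`, and `top_k` are names introduced for illustration. Given the project title, the sentence scorer would presumably be a BERT-based extractor and the rewriter a BART model, but here both are plain callables so the control flow can be read and run without any model.

```python
from typing import Callable, List


def two_stage_summarize(
    sentences: List[str],
    score_sentence: Callable[[str], float],
    rewrite: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Stage 1 keeps the top_k highest-scoring sentences; stage 2 rewrites them."""
    # Rank sentence indices by score, highest first. Python's sort is stable,
    # so tied sentences keep their document order.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i]),
        reverse=True,
    )
    # Put the selected sentences back in document order before rewriting.
    kept = sorted(ranked[:top_k])
    extracted = " ".join(sentences[i] for i in kept)
    return rewrite(extracted)


if __name__ == "__main__":
    # Stand-ins (assumptions): score by word count, "rewrite" by upper-casing.
    doc = ["a b c", "d", "e f", "g h i j"]
    print(two_stage_summarize(doc, lambda s: len(s.split()), str.upper, top_k=2))
```

Keeping the stages behind two callables means either one can be swapped or evaluated on its own, for example by replacing the scorer with an oracle selection while leaving the rewriter unchanged.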

Owner

Yukai Yang (Alexis)

Passionate about scalable systems for video/data analytics. Software engineer, open source lover
