Does Pretraining for Summarization Require Knowledge Transfer?

Overview

This repository is the official implementation of the paper Does Pretraining for Summarization Require Knowledge Transfer?, to appear in Findings of EMNLP 2021.
You can find the paper on arXiv here: https://arxiv.org/abs/2109.04953

Requirements

This code requires Python 3 (tested with version 3.6).

To install requirements, run:

pip install -r requirements.txt
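
If you want to keep the dependencies isolated, one minimal setup is sketched below. This assumes a python3.6 binary is on your PATH; the environment name is arbitrary:

# create and activate a fresh virtual environment
python3.6 -m venv venv
source venv/bin/activate

# install the pinned dependencies into it
pip install -r requirements.txt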

Preparing finetuning datasets

To prepare a summarization dataset for finetuning, run the corresponding script in the finetuning_datasetgen folder. For example, to prepare the cnn-dailymail dataset, run:

cd finetuning_datasetgen
python cnndm.py

Running finetuning experiment

Here we show how to run the training, prediction, and evaluation steps for a finetuning experiment. We assume that you have downloaded the pretrained models into the pretrained_models folder from the provided Google Drive link (see pretrained_models/README.md). If you want to pretrain models yourself, see the latter part of this README for instructions.

All models in our work are trained using AllenNLP config files, which are in .jsonnet format. To run a finetuning experiment, simply run:

# for t5-like models
./pipeline_t5.sh <path-to-experiment-folder>

# for pointer-generator models
./pipeline_pg.sh <path-to-experiment-folder>

For example, to finetune a T5 model on the cnn-dailymail dataset, starting from a model pretrained on the ourtasks-nonsense pretraining dataset, run:

./pipeline_t5.sh finetuning_experiments/cnndm/t5-ourtasks-nonsense

Similarly, to finetune a randomly initialized pointer-generator model, run:

./pipeline_pg.sh finetuning_experiments/cnndm/pg-randominit
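
For orientation, AllenNLP .jsonnet configs generally follow the structure sketched below. All "type" values and data paths here are hypothetical placeholders, not the actual settings used by the configs in this repository:

// illustrative AllenNLP config sketch; the "type" values and paths
// below are hypothetical placeholders, not this repository's settings
{
  "dataset_reader": { "type": "summarization_reader" },
  "train_data_path": "dataset_root/finetuning_datasets/cnndm/train.jsonl",
  "validation_data_path": "dataset_root/finetuning_datasets/cnndm/val.jsonl",
  "model": { "type": "t5_summarizer" },
  "data_loader": { "batch_size": 8, "shuffle": true },
  "trainer": {
    "num_epochs": 5,
    "optimizer": { "type": "adam", "lr": 1e-4 }
  }
}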

The trained model and output files will be available in the folder created by the script:

model.tar.gz contains the trained (finetuned) model.

test_outputs.jsonl contains the outputs of the model on the test split.

test_genmetrics.json contains the ROUGE scores of the outputs.
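
As a quick sanity check on these files, a few lines of Python suffice. This sketch only prints what is there rather than assuming particular field names:

import json

# each line of the .jsonl file is one JSON record for one test example
with open("test_outputs.jsonl") as f:
    first_record = json.loads(next(f))
print(first_record.keys())  # inspect the actual field names

# the ROUGE scores are stored as a single JSON object
with open("test_genmetrics.json") as f:
    print(json.load(f))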

Creating pretraining datasets

We have provided the nonsense pretraining datasets used in our work via Google Drive (see dataset_root/pretraining_datasets/README.md for instructions).

However, if you want to generate your own pretraining corpus, you can run:

cd pretraining_datasetgen
# for generating a dataset using our pretraining tasks
python ourtasks.py
# for generating a dataset using STEP pretraining tasks
python steptasks.py

These commands create pretraining datasets made of nonsense text. If you want to create datasets starting from Wikipedia documents instead, look into the two scripts, which guide you through doing that by commenting/uncommenting two blocks of code.

Pretraining models

Although we provide the pretrained model checkpoints via Google Drive, you can pretrain your own models using the corresponding pretraining config file. As an example, we have provided a config file that pretrains on the ourtasks-nonsense dataset. Make sure that the pretraining dataset files exist (either created by you or downloaded from Google Drive) before running the pretraining command. Pretraining uses the same shell scripts as the finetuning experiments. For example, to pretrain a model on the ourtasks-nonsense dataset, simply run:

./pipeline_t5.sh pretraining_experiments/pretraining_t5_ourtasks_nonsense