GATER

This repository contains the code for our EMNLP 2021 paper “Heterogeneous Graph Neural Networks for Keyphrase Generation”.

Our implementation is built on the source code from keyphrase-generation-rl and fastNLP. We thank the authors for their work.

If you use this code, please cite our paper:

@inproceedings{ye2021heterogeneous,
  title={Heterogeneous Graph Neural Networks for Keyphrase Generation},
  author={Ye, Jiacheng and Cai, Ruijian and Gui, Tao and Zhang, Qi},
  booktitle={Proceedings of EMNLP},
  year={2021}
}

Dependencies

  • python 3.5+
  • pytorch 1.0+
  • dgl 0.4.3
  • sentence_transformers 1.1.0
  • faiss 1.6.3

Dataset

The datasets can be downloaded from here; they are tokenized versions of the datasets provided by Ken Chen:

  • The testsets directory contains the five datasets for testing (i.e., inspec, krapivin, nus, semeval and kp20k), where each of the datasets contains test_src.txt and test_trg.txt.
  • The kp20k_sorted directory contains the training and validation files (i.e., train_src.txt, train_trg.txt, valid_src.txt and valid_trg.txt).
  • Each line of a *_src.txt file is a source document, containing the tokenized words of the title and abstract in the form title <eos> abstract.
  • Each line of a *_trg.txt file contains the target keyphrases separated by a ; character, e.g., present keyphrase one;present keyphrase two;absent keyphrase one;absent keyphrase two (see the parsing sketch below).
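
As a concrete illustration of this format, the following minimal Python sketch reads a parallel pair of lines from the *_src.txt and *_trg.txt files and splits them into title, abstract and keyphrase list (the parse_pair helper and the exact file paths are illustrative, not part of this repository):

# A minimal sketch of the dataset format above (parse_pair is illustrative).
def parse_pair(src_line, trg_line):
    # Source line: tokenized "title <eos> abstract"
    title, _, abstract = src_line.strip().partition(" <eos> ")
    # Target line: keyphrases separated by ";"
    keyphrases = [kp.strip() for kp in trg_line.strip().split(";") if kp.strip()]
    return title, abstract, keyphrases

with open("data/kp20k_sorted/train_src.txt") as f_src, \
        open("data/kp20k_sorted/train_trg.txt") as f_trg:
    for src_line, trg_line in zip(f_src, f_trg):
        title, abstract, keyphrases = parse_pair(src_line, trg_line)
        print(title, keyphrases)
        break  # inspect the first sample only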

Quick Start

The whole process includes the following steps:

  • Build tfidf: The retrievers/build_tfidf.py script is used to build the index for document retrieval.
  • Preprocessing: The preprocess.py script numericalizes the train_src.txt, train_trg.txt, valid_src.txt and valid_trg.txt files, and produces train.one2many.pt, valid.one2many.pt and vocab.pt.
  • Training: The train.py script loads the train.one2many.pt, valid.one2many.pt and vocab.pt files and performs training. We evaluate the model on the validation set every 8000 batches, and save the model whenever the validation loss improves.
  • Decoding: The predict.py script loads the trained model and performs decoding on the five test datasets. The predictions are saved to a file in the form predicted keyphrase one;predicted keyphrase two;….
  • Evaluation: The evaluate_prediction.py script loads the ground-truth and predicted keyphrases, and calculates the F1@5 and F1@M metrics (a simplified sketch follows this list).
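
To make the metric concrete, here is a simplified sketch of F1@k for a single sample. It matches exact token strings only; the f1_at_k helper is illustrative, and evaluate_prediction.py remains the reference implementation (which may differ in details such as stemming and the present/absent keyphrase split):

# Simplified F1@k for one sample (illustrative only; evaluate_prediction.py
# is the reference and may differ in details such as stemming).
def f1_at_k(predicted, gold, k=5):
    topk = predicted[:k]
    num_correct = len(set(topk) & set(gold))
    precision = num_correct / len(topk) if topk else 0.0
    recall = num_correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = "graph neural network;keyphrase generation;text mining".split(";")
gold = "keyphrase generation;graph neural network".split(";")
print(f1_at_k(pred, gold))  # P=2/3, R=1.0 -> F1=0.8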

For the sake of simplicity, we provide one-click scripts in the scripts directory. You can use the following commands to run the whole process with the GATER model:

# under `One2One` paradigm
bash scripts/run_gater_one2one.sh

# under `One2Seq` paradigm
bash scripts/run_gater_one2seq.sh

You can also run the baseline model with the following commands:

# under `One2One` paradigm
bash scripts/run_one2one.sh

# under `One2Seq` paradigm
bash scripts/run_one2seq.sh

Note:

  • Please download and unzip the datasets into the ./data directory first.
  • The preprocessing procedure takes time because we pre-retrieve similar references for each sample, and we store them in preparation for the training stage (see the retrieval sketch at the end of this section).
  • To run all the bash files smoothly, you may need to specify the correct home_dir (i.e., the absolute path to the kg_gater directory) and the GPU id for CUDA_VISIBLE_DEVICES. We provide a small amount of data so you can quickly check that your environment is set up correctly. You can test by running the following command:
bash scripts/run_small_gater_one2seq.sh
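
To picture what the pre-retrieval step does, the sketch below builds a toy TF-IDF index and retrieves the most similar references for a query document. It uses scikit-learn purely for illustration (scikit-learn is not a listed dependency, and retrievers/build_tfidf.py is the actual implementation):

# Toy TF-IDF retrieval (scikit-learn used for illustration only;
# retrievers/build_tfidf.py is the actual implementation in this repo).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "graph neural networks for keyphrase generation",
    "reinforcement learning for keyphrase generation",
    "image classification with convolutional networks",
]
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(corpus)  # one TF-IDF row per reference

query = vectorizer.transform(["heterogeneous graph networks for keyphrases"])
scores = cosine_similarity(query, index)[0]
top2 = scores.argsort()[::-1][:2]  # indices of the two closest references
print([(corpus[i], round(float(scores[i]), 3)) for i in top2])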