EMNLP 2021 Findings' paper, SCICAP: Generating Captions for Scientific Figures

SCICAP: Scientific Figures Dataset

This is the GitHub repo for the EMNLP 2021 Findings paper SCICAP: Generating Captions for Scientific Figures (Hsu et al., 2021).

SCICAP is a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. SCICAP contains 410k figures of one dominant figure type, the graph plot, extracted from over 290,000 papers.

How to Cite?

@inproceedings{hsu2021scicap,
  title={SciCap: Generating Captions for Scientific Figures},
  author={Hsu, Ting-Yao E. and Giles, C. Lee and Huang, Ting-Hao K.},
  booktitle={Findings of 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021 Findings)},
  year={2021}
}

Download the Dataset

You can download the SCICAP dataset here: Download Link (18.15 GB)

Folder Structure

scicap_data.zip
├── SciCap-Caption-All                  # caption text for all figures
│   ├── Train
│   ├── Val
│   └── Test
├── SciCap-No-Subfig-Img                # image files for the figures without subfigures
│   ├── Train
│   ├── Val
│   └── Test
├── SciCap-Yes-Subfig-Img               # image files for the figures with subfigures
│   ├── Train
│   ├── Val
│   └── Test
├── arxiv-metadata-oai-snapshot.json    # arXiv paper metadata (from the arXiv dataset)
└── List-of-Files-for-Each-Experiments  # list of figure names used in each experiment
    ├── Single-Sentence-Caption
    │   ├── No-Subfig
    │   │   ├── Train
    │   │   ├── Val
    │   │   └── Test
    │   └── Yes-Subfig
    │       ├── Train
    │       ├── Val
    │       └── Test
    ├── First-Sentence                  # same structure as Single-Sentence-Caption
    └── Caption-No-More-Than-100-Tokens # same structure as Single-Sentence-Caption
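
After unzipping, a quick sanity check is to count the files in each split folder. The sketch below assumes the archive was extracted into a directory whose layout matches the tree above. The function name `count_files_per_split` and the `scicap_data` root path are illustrative and not part of the release.

```python
from pathlib import Path

SPLITS = ("Train", "Val", "Test")

def count_files_per_split(root, subset):
    """Count regular files under <root>/<subset>/<split> for each split."""
    counts = {}
    for split in SPLITS:
        split_dir = Path(root) / subset / split
        # A missing folder is reported as 0 rather than raising an error.
        counts[split] = (
            sum(1 for p in split_dir.rglob("*") if p.is_file())
            if split_dir.is_dir()
            else 0
        )
    return counts

if __name__ == "__main__":
    # "scicap_data" is an assumed extraction directory.
    for subset in ("SciCap-Caption-All", "SciCap-No-Subfig-Img", "SciCap-Yes-Subfig-Img"):
        print(subset, count_files_per_split("scicap_data", subset))
```

Comparing these counts with the subset sizes reported for each experiment is one way to catch an incomplete download or extraction.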

Number of Figures in Each Subset

Data Collection         Does the figure have subfigures?  Train    Validate  Test
First Sentence          Yes                               226,608  28,326    28,327
First Sentence          No                                106,834  13,354    13,355
Single-Sent Caption     Yes                               123,698  15,469    15,531
Single-Sent Caption     No                                75,494   9,242     9,459
Caption w/ <=100 words  Yes                               216,392  27,072    27,036
Caption w/ <=100 words  No                                105,687  13,215    13,226
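
The counts above work out to roughly an 80/10/10 train/validation/test split in every subset. The short sketch below recomputes the totals and fractions from the table. The numbers are copied from the table, and `split_summary` is an illustrative helper name.

```python
# Figure counts copied from the table above: (train, validation, test).
SUBSETS = {
    ("First Sentence", "Yes"): (226_608, 28_326, 28_327),
    ("First Sentence", "No"): (106_834, 13_354, 13_355),
    ("Single-Sent Caption", "Yes"): (123_698, 15_469, 15_531),
    ("Single-Sent Caption", "No"): (75_494, 9_242, 9_459),
    ("Caption w/ <=100 words", "Yes"): (216_392, 27_072, 27_036),
    ("Caption w/ <=100 words", "No"): (105_687, 13_215, 13_226),
}

def split_summary(counts):
    """Return the total and the per-split fractions rounded to 3 decimals."""
    total = sum(counts)
    return total, tuple(round(c / total, 3) for c in counts)

for (collection, subfig), counts in SUBSETS.items():
    total, fractions = split_summary(counts)
    print(f"{collection:<24} subfig={subfig:<3} total={total:>7,} fractions={fractions}")
```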

JSON Data Format

Example Data Instance (Caption and Figure)

An actual JSON object from SCICAP:

{
  "contains-subfigure": true, 
  "Img-text": ["(b)", "s]", "[m", "fs", "et", "e", "of", "T", "im", "Attack", "duration", "[s]", "350", "300", "250", "200", "150", "100", "50", "0", "50", "100", "150", "200", "250", "300", "0", "(a)", "]", "[", "m", "fs", "et", "e", "of", "ta", "nc", "D", "is", "Attack", "duration", "[s]", "10000", "9000", "8000", "7000", "6000", "5000", "4000", "3000", "2000", "1000", "0", "50", "100", "150", "200", "250", "300", "0"], 
  "paper-ID": "1001.0025v1", 
  "figure-ID": "1001.0025v1-Figure2-1.png", 
  "figure-type": "Graph Plot", 
  "0-originally-extracted": "Figure 2: Impact of the replay attack, as a function of the spoofing attack duration. (a) Location offset or error: Distance between the attack-induced and the actual victim receiver position. (b) Time offset or error: Time difference between the attack-induced clock value and the actual time.", 
  "1-lowercase-and-token-and-remove-figure-index": {
    "caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .", 
    "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."], 
    "token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
  }, 
  "2-normalized": {
    "2-1-basic-num": {
      "caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .", 
      "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."], 
      "token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
    },
    "2-2-advanced-euqation-bracket": {
      "caption": "impact of the replay attack , as a function of the spoofing attack duration . BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position . BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time .", 
      "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position .", "BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time ."], 
      "tokens": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "BRACKET-TK", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "BRACKET-TK", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
    }
  }
}


Corresponding Figure: 1001.0025v1-Figure2-1.png
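
A record can be mapped to its image file through the `contains-subfigure` and `figure-ID` fields. The sketch below assumes that images sit directly under `SciCap-Yes-Subfig-Img/<split>/` or `SciCap-No-Subfig-Img/<split>/` and are named by `figure-ID`. That layout is an assumption, so check it against your extracted copy. `figure_image_path` and the `scicap_data` root are illustrative names, and the `Train` split used here is only a placeholder.

```python
from pathlib import Path

def figure_image_path(root, record, split):
    """Build the expected image path for one caption record.

    Assumes images are stored as <root>/<subfig folder>/<split>/<figure-ID>.
    """
    folder = (
        "SciCap-Yes-Subfig-Img"
        if record["contains-subfigure"]
        else "SciCap-No-Subfig-Img"
    )
    return Path(root) / folder / split / record["figure-ID"]

# Fields taken from the example record above; the split is a placeholder.
record = {"contains-subfigure": True, "figure-ID": "1001.0025v1-Figure2-1.png"}
print(figure_image_path("scicap_data", record, "Train"))
```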

JSON Schema

  • contains-subfigure: boolean; true if the figure contains subfigures
  • paper-ID: the unique paper ID in the arXiv dataset
  • figure-ID: the ID of the extracted figure within the paper (the index is not the same as the figure label in the caption)
  • figure-type: the figure type
  • 0-originally-extracted: the original caption of the figure
  • 1-lowercase-and-token-and-remove-figure-index: the caption lowercased and tokenized, with the figure index removed
  • 2-normalized:
    • 2-1-basic-num: the caption after replacing numbers
    • 2-2-advanced-euqation-bracket: the caption after replacing equations and bracketed content
  • Img-text: text extracted from the figure, such as labels and legends
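
To connect a record to its paper, `paper-ID` can be looked up in `arxiv-metadata-oai-snapshot.json`. The sketch below assumes that the snapshot stores one JSON object per line with an `id` field that carries no version suffix, so the trailing `v1` is stripped before matching. Both points are assumptions about the snapshot format, and `strip_version` and `find_metadata` are illustrative names.

```python
import json
import re

def strip_version(paper_id):
    """Drop a trailing version suffix, e.g. '1001.0025v1' -> '1001.0025'."""
    return re.sub(r"v\d+$", "", paper_id)

def find_metadata(snapshot_path, paper_id):
    """Scan the snapshot line by line and return the matching entry, or None."""
    target = strip_version(paper_id)
    with open(snapshot_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)
            if entry.get("id") == target:
                return entry
    return None
```

A linear scan keeps memory use low on a large snapshot. For many lookups, building a dictionary keyed by `id` once would be faster.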

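Each caption entry (at every normalization level) has three attributes:

  • caption: the caption text at that normalization level
  • sentence: a list of segmented sentences
  • token: a list of tokenized words (in the example above, this key appears as tokens under 2-2-advanced-euqation-bracket)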
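
As a minimal sketch of reading these attributes, the helper below returns the caption, sentences, and tokens for a chosen normalization level. It assumes each caption file holds a single JSON object shaped like the example above. `LEVELS`, `get_caption`, and `load_record` are illustrative names, and the short level aliases are not part of the dataset.

```python
import json

# Short aliases (hypothetical) mapped to the key path inside a record.
LEVELS = {
    "lowercase": ("1-lowercase-and-token-and-remove-figure-index",),
    "basic-num": ("2-normalized", "2-1-basic-num"),
    "advanced": ("2-normalized", "2-2-advanced-euqation-bracket"),
}

def get_caption(record, level="lowercase"):
    """Return (caption, sentences, tokens) for one normalization level."""
    node = record
    for key in LEVELS[level]:
        node = node[key]
    # The example record uses "tokens" under 2-2 and "token" elsewhere,
    # so both spellings are accepted here.
    tokens = node.get("token", node.get("tokens", []))
    return node["caption"], node["sentence"], tokens

def load_record(path):
    """Load one caption file, assuming it contains a single JSON object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A typical call would be `caption, sentences, tokens = get_caption(load_record(path), "advanced")`.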

Normalized Token

In the paper, we used [NUM], [BRACKET], and [EQUATION]. In the final data release, we switched to NUM-TK, BRACKET-TK, and EQUAT-TK to avoid the extra problems caused by "[]".

Token       Description
NUM-TK      Numbers (e.g., 0, -0.2, 3.44%, 1,000,000).
BRACKET-TK  Text spans enclosed by any type of bracket pair, including {}, [], and ().
EQUAT-TK    Math equations identified using regular expressions.
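
The replacements described above can be approximated with regular expressions. The sketch below is not the pipeline used to build SCICAP. The patterns are assumptions written for illustration, nested brackets are not handled, and equation detection (EQUAT-TK) is left out because the original expressions are not given here.

```python
import re

# Bracket spans: (), [], {} without nesting (a simplification).
BRACKET_RE = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}")
# Numbers such as 0, -0.2, 3.44%, 1,000,000 that are not glued to a word.
NUMBER_RE = re.compile(r"(?<![\w.])[-+]?\d+(?:,\d{3})*(?:\.\d+)?%?(?!\w)")

def normalize_caption(caption):
    """Replace bracket spans, then numbers, with placeholder tokens."""
    caption = BRACKET_RE.sub("BRACKET-TK", caption)
    return NUMBER_RE.sub("NUM-TK", caption)

print(normalize_caption("( b ) time offset or error : 3.44% over 1,000,000 runs"))
```

Note that the dataset applies these steps at separate levels (numbers in 2-1-basic-num, equations and brackets in 2-2-advanced-euqation-bracket), whereas this sketch applies both replacements in one pass.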

Baseline Performance

To examine the feasibility and challenges of creating an image-captioning model for scientific figures, we established several baselines and tested them using SCICAP. Caption quality was measured by BLEU-4, using the test set of the corresponding data collection as the reference. We trained the models on each data collection with varying levels of data filtering and text normalization. Table 2 shows the results. We also designed three variations of the baseline models: Vision-only, Vision+Text, and Text-only. Table 3 shows the results.
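
For readers who want to score their own generated captions, the following is a minimal, self-contained sketch of corpus-level BLEU-4 with one reference per hypothesis and no smoothing. It is not the evaluation script behind the reported numbers, and `corpus_bleu4` is an illustrative name. Inputs are token lists such as the `token` field of a record.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu4(references, hypotheses):
    """Corpus BLEU-4 with uniform weights; one reference per hypothesis."""
    matches = [0] * 4
    totals = [0] * 4
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, 5):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Scores from this sketch may differ from those of other BLEU implementations, which vary in tokenization and smoothing.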

Data License

The arXiv dataset uses the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license, which grants permission to remix, remake, annotate, and publish the data.

Acknowledgements

We thank Chieh-Yang Huang, Hua Shen, and Chacha Chen for helping with the data annotation. We thank Chieh-Yang Huang for the feedback and strong technical support. We also thank the anonymous reviewers for their constructive feedback. This research was partially supported by the Seed Grant (2020) from the College of Information Sciences and Technology (IST), Pennsylvania State University.
