Table Extraction Tool

Overview

Tree Structure - Table Extraction

Fonduer has been successfully extended to perform information extraction from richly formatted data such as tables. A crucial step in this process is the construction of the hierarchical tree of context objects such as text blocks, figures, tables, etc. The system currently uses PDF to HTML conversion provided by Adobe Acrobat converter. Adobe Acrobat converter is not an open source tool and this can be very inconvenient for Fonduer users. We therefore need to build our own module as replacement to Adobe Acrobat. Several open source tools are available for pdf to html conversion but these tools do not preserve the cell structure in a table. Our goal in this project is to develop a tool that extracts text, figures and tables in a pdf document and maintains the structure of the document using a tree data structure.

This project is using the table-extraction tool (https://github.com/xiao-cheng/table-extraction).

Dependencies

pip install -r requirements.txt

Environment variables

First, set environment variables. The DATAPATH folder should contain the pdf files that need to be processed.

source set_env.sh

Tutorial

The table-extraction/tutorials/ folder contains a notebook table-extraction-demo.ipynb. In this demo we detail the different steps of the table extraction tool and display some examples of table detection results for paleo papers. However, to extract tables for new documents, the user should directly use the command line tool detailed in the next section.

Command Line Usage

To use the tool via command line, run:

source set_env.sh

python table-extraction/ml/extract_tables.py [-h]

usage: extract_tables.py [-h] [--mode MODE] [--train-pdf TRAIN_PDF]
                         [--test-pdf TEST_PDF] [--gt-train GT_TRAIN]
                         [--gt-test GT_TEST] [--model-path MODEL_PATH]
                         [--iou-thresh IOU_THRESH]

Script to extract tables bounding boxes from PDF files using a machine
learning approach. if model.pkl is saved in the model-path, the pickled model
will be used for prediction. Otherwise the model will be retrained. If --mode
is test (by default), the script will create a .bbox file containing the
tables for the pdf documents listed in the file --test-pdf. If --mode is dev,
the script will also extract ground truth labels fot the test data and compute
some statistics. To run the script on new documents, specify the path to the
list of pdf to analyze using the argument --test-pdf. Those files must be
saved in the DATAPATH folder.

optional arguments:
  -h, --help            show this help message and exit
  --mode MODE           usage mode dev or test, default is test
  --train-pdf TRAIN_PDF
                        list of pdf file names used for training. Those files
                        must be saved in the DATAPATH folder (cf set_env.sh)
                        must be saved in the DATAPATH folder (cf set_env.sh)
  --test-pdf TEST_PDF   list of pdf file names used for testing. Those files
                        must be saved in the DATAPATH folder (cf set_env.sh)
  --gt-train GT_TRAIN   ground truth train tables
  --gt-test GT_TEST     ground truth test tables
  --model-path MODEL_PATH
                        pretrained model
  --iou-thresh IOU_THRESH
                        intersection over union threshold to remove duplicate
                        tables

Each document must be saved in the DATAPATH folder.

The script will create a .bbox file where each row contains tables coordinates of the corresponding row document in the --test_pdf file.

The bounding boxes are stored in the format (page_num, page_width, page_height, top, left, bottom, right) and are separated with ";".

Evaluation

We provide an evaluation code to compute recall, precision and F1 score at the character level.

python table-extraction/evaluation/char_level_evaluation.py [-h] pdf_files extracted_bbox gt_bbox

usage: char_level_evaluation.py [-h] pdf_files extracted_bbox gt_bbox

Computes scores for the table localization task. Returns Recall and Precision
for the sub-objects level (characters in text). If DISPLAY=TRUE, display GT in
Red and extracted bboxes in Blue

positional arguments:
  pdf_files       list of paths of PDF file to process
  extracted_bbox  extracting bounding boxes (one line per pdf file)
  gt_bbox         ground truth bounding boxes (one line per pdf file)

optional arguments:
  -h, --help      show this help message and exit
Owner
HazyResearch
We are a CS research group led by Prof. Chris Ré.
HazyResearch
GDB python tool to pretty print and debug c++ xtensor containers

gdb_xt2np GDB python tool to pretty print, examine, and debug c++ Xtensor containers. Xtensor is a c++ library for scientific computing using multidim

Christopher Burke 4 Oct 29, 2021
textspotter - An End-to-End TextSpotter with Explicit Alignment and Attention

An End-to-End TextSpotter with Explicit Alignment and Attention This is initially described in our CVPR 2018 paper. Getting Started Installation Clone

Tong He 323 Nov 10, 2022
Ocular is a state-of-the-art historical OCR system.

Ocular Ocular is a state-of-the-art historical OCR system. Its primary features are: Unsupervised learning of unknown fonts: requires only document im

228 Dec 30, 2022
This is a tensorflow re-implementation of PSENet: Shape Robust Text Detection with Progressive Scale Expansion Network.My blog:

PSENet: Shape Robust Text Detection with Progressive Scale Expansion Network Introduction This is a tensorflow re-implementation of PSENet: Shape Robu

Michael liu 498 Dec 30, 2022
An Implementation of the seglink alogrithm in paper Detecting Oriented Text in Natural Images by Linking Segments

Tips: A more recent scene text detection algorithm: PixelLink, has been implemented here: https://github.com/ZJULearning/pixel_link Contents: Introduc

dengdan 484 Dec 07, 2022
This project is basically to draw lines with your hand, using python, opencv, mediapipe.

Paint Opencv 📷 This project is basically to draw lines with your hand, using python, opencv, mediapipe. Screenshoots 📱 Tools ⚙️ Python Opencv Mediap

Williams Ismael Bobadilla Torres 3 Nov 17, 2021
A Tensorflow model for text recognition (CNN + seq2seq with visual attention) available as a Python package and compatible with Google Cloud ML Engine.

Attention-based OCR Visual attention-based OCR model for image recognition with additional tools for creating TFRecords datasets and exporting the tra

Ed Medvedev 933 Dec 29, 2022
PAGE XML format collection for document image page content and more

PAGE-XML PAGE XML format collection for document image page content and more For an introduction, please see the following publication: http://www.pri

PRImA Research Lab 46 Nov 14, 2022
Rotational region detection based on Faster-RCNN.

R2CNN_Faster_RCNN_Tensorflow Abstract This is a tensorflow re-implementation of R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detecti

UCAS-Det 581 Nov 22, 2022
Basic functions manipulating images using the OpenCV library

OpenCV Basic functions manipulating images using the OpenCV library. Reading Ima

Shatha Siala 3 Feb 17, 2022
Pixie - A full-featured 2D graphics library for Python

Pixie - A full-featured 2D graphics library for Python Pixie is a 2D graphics library similar to Cairo and Skia. pip install pixie-python Features: Ty

treeform 65 Dec 30, 2022
This is the open source implementation of the ICLR2022 paper "StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis"

StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image

Meta Research 840 Dec 26, 2022
A PyTorch implementation of ECCV2018 Paper: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes A PyTorch implement of TextSnake: A Flexible Representation for Detecting

Prince Wang 417 Dec 12, 2022
Automatic Number Plate Recognition (ANPR) is a highly accurate system capable of reading vehicle number plates without human intervention

ANPR ANPR is therefore the underlying technology used to find a vehicle license/number plate and it, in turn, supplies this information to a next stag

Melih Emin Kılıçoğlu 1 Jan 09, 2022
A document scanner application for laptops/desktops developed using python, Tkinter and OpenCV.

DcoumentScanner A document scanner application for laptops/desktops developed using python, Tkinter and OpenCV. Directly install the .exe file to inst

Harsh Vardhan Singh 1 Oct 29, 2021
Here use convulation with sobel filter from scratch in opencv python .

Here use convulation with sobel filter from scratch in opencv python .

Tamzid hasan 2 Nov 11, 2021
FOTS Pytorch Implementation

News!!! Recognition branch now is added into model. The whole project has beed optimized and refactored. ICDAR Dataset SynthText 800K Dataset detectio

Ning Lu 599 Dec 19, 2022
A simple demo program for using OpenCV on Android

Kivy OpenCV Demo A simple demo program for using OpenCV on Android Build with: buildozer android debug deploy run Run (on desktop) with: python main.p

Andrea Ranieri 13 Dec 29, 2022
Repository for Scene Text Detection with Supervised Pyramid Context Network with tensorflow.

Scene-Text-Detection-with-SPCNET Unofficial repository for [Scene Text Detection with Supervised Pyramid Context Network][https://arxiv.org/abs/1811.0

121 Oct 15, 2021
Sort By Face

Sort-By-Face This is an application with which you can either sort all the pictures by faces from a corpus of photos or retrieve all your photos from

0 Nov 29, 2021