Python library to extract tabular data from images and scanned PDFs

Last update: Dec 31, 2022

Overview

ExtractTable - API to extract tabular data from images and scanned PDFs

The motivation is to make it easy for developers to extract tabular data from images or scanned PDF files without worrying about the table area, column coordinates, rotation, etc.

Prerequisite

API Key: All requests to ExtractTable are authorized by an API Key. Free credits are available. The same API Key can also be used for in-browser conversions at Web Pro.

Installation

pip install -U ExtractTable

Basic Usage

OK, enough selling. Let the ease of coding do the talking, and let the output encourage you to buy credits; start a timer and count the lines of code.

from ExtractTable import ExtractTable

et_sess = ExtractTable(api_key=YOUR_API_KEY)  # Replace YOUR_API_KEY with your valid API Key
print(et_sess.check_usage())  # Checks the API Key validity and shows the associated plan usage
table_data = et_sess.process_file(filepath=Location_of_Image_with_Tables, output_format="df")

# To process a PDF, use the pages parameter ("1", "1,3-4", "all") of process_file
table_data = et_sess.process_file(filepath=Location_of_PDF_with_Tables, output_format="df", pages="all")
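
The pages parameter accepts values such as "1", "1,3-4" and "all". As a rough illustration of what such a specification means, the sketch below expands it into page numbers. The helper expand_pages and its total_pages argument are hypothetical and written only for this example; this is not the library's own parser, which may treat edge cases differently.

```python
from typing import List


def expand_pages(spec: str, total_pages: int) -> List[int]:
    """Expand a page spec such as "1", "1,3-4" or "all" into sorted page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, total_pages + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "3-4" covers both end points
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Keep only pages that exist in the document
    return sorted(p for p in pages if 1 <= p <= total_pages)


print(expand_pages("1,3-4", total_pages=5))  # [1, 3, 4]
print(expand_pages("all", total_pages=3))    # [1, 2, 3]
```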

Detailed Library Usage

The tutorial takes you through:

1. Installation
2. Import and check version
3. Create Session & Validate API Key
    3.1 Create Session with your API Key
    3.2 Validate the Key and check the plan usage
    3.3 Check Usage Details
4. Trigger the extraction process
    4.1 Accepted Input Types
    4.2 Process an IMAGE Input
    4.3 Process a PDF Input
    4.4 Output options
    4.5 Explore session objects
5. Explore the Output
    5.1 Output Structure
    5.2 Output Details
6. Make Corrections
    6.1 Split Merged Rows
    6.2 Split Merged Columns
    6.3 Fix Decimal Format
    6.4 Fix Date Format
7. Helpful Code Snippets
    7.1 Get text data
    7.2 Table output to Excel

Whoa, as simple as that?!

Certainly. Did you know that current ExtractTable users use it for:

Bank Statement
Medical Records
Invoice Details
Tax forms
Tender Notices

It's up to you now to explore the ways.

Explore

Check the complete server response of the latest job with et_sess.ServerResponse.json()

{
    "JobStatus": <string>,                              # Status of the triggered Process  @ JOB-LEVEL
    "Pages": <integer>,                                 # Number of pages processed in this request @ PAGE-LEVEL
    "Tables": [<list of key-value objects of table>     # List of all tables found @ TABLE-LEVEL
        {
            "Page": <integer>,                              ## Page number in which this table is found
            "CharacterConfidence": <float>,                 ## Accuracy of Characters recognized from the input-page
            "LayoutConfidence": <float>,                    ## Accuracy of table layout's design decision
            "TableJson": <dict>,                            ## Table Cell Text in key-value format with index orientation - {row#: {col#: <str>}}
            "TableCoordinates": <dict>,                     ## Top-left & Bottom-right Cell Coordinates - {row#: {col#: <list(x1,y1,x2,y2)>}}
            "TableConfidence": <dict>                       ## Cell level accuracy of detected characters - {row#: {col#: <float>}}
        },
    {...}                                               ## ... more "Tables" objects
    ],
    "Lines": [<list of key-value objects>               # Pagewise Line details @ PAGE-LEVEL
        {
            "Page": <integer>,                          # Page number in which the lines are found
            "CharacterConfidence": <float>,             # Average Accuracy of all Characters recognized from the input-page
            "LinesArray": [
                <list of key-value objects of line>     # Ordered list of lines in this page @ LINE-LEVEL
                {
                    "Line": <str>,                          ## Detected text of the complete line
                    "WordsArray": [
                        <list of key-value objects>         ## Word level details in this line @ WORD-LEVEL
                        {
                            "Conf": <float>,                    ### Accuracy of recognized characters of the word
                            "Word": <str>,                      ### Detected text of the word
                            "Loc": [x1, y1, x2, y2]             ### Top-left & Bottom-right coordinates, w.r.t the input-page width-height dimensions
                        },
                    {...}                                   ### More "WordsArray" objects
                    ]
                },
            {...}                                       ## More "LinesArray" objects
            ]
        },
    {...}                                               # More Pagewise "Lines" details
    ]
}
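
As a minimal sketch of how the structure above might be walked, the example below turns a TableJson mapping into ordered rows and joins the detected lines of a page into text. The sample_response dictionary is hypothetical (all values are made up, including the JobStatus string), and table_to_rows and page_text are helper names invented for this illustration, not part of the library. It assumes the row and column keys are numeric strings, as they would be after JSON decoding.

```python
# Hypothetical sample that mirrors the documented structure; all values are made up.
sample_response = {
    "JobStatus": "Success",  # placeholder status string, not a documented value
    "Pages": 1,
    "Tables": [
        {
            "Page": 1,
            "CharacterConfidence": 0.98,
            "LayoutConfidence": 0.91,
            "TableJson": {
                "0": {"0": "Date", "1": "Amount"},
                "1": {"0": "2022-01-05", "1": "120.50"},
            },
        }
    ],
    "Lines": [
        {
            "Page": 1,
            "CharacterConfidence": 0.98,
            "LinesArray": [
                {"Line": "Statement of account", "WordsArray": []},
                {"Line": "Date Amount", "WordsArray": []},
            ],
        }
    ],
}


def table_to_rows(table_json):
    """Turn {row#: {col#: text}} into a list of rows, ordered by numeric index."""
    rows = []
    for row_key in sorted(table_json, key=int):
        cols = table_json[row_key]
        rows.append([cols[col_key] for col_key in sorted(cols, key=int)])
    return rows


def page_text(response, page):
    """Join the detected lines of one page into a single text block."""
    for entry in response.get("Lines", []):
        if entry["Page"] == page:
            return "\n".join(line["Line"] for line in entry["LinesArray"])
    return ""


for table in sample_response["Tables"]:
    print("Page", table["Page"], table_to_rows(table["TableJson"]))
print(page_text(sample_response, page=1))
```

With a live session, the same helpers could be given et_sess.ServerResponse.json() instead of the sample dictionary.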

Bug Reports

Bug reports/fixes are most welcome and greatly appreciated with API credits. For support reach us at [email protected]

License

This project is licensed under the Apache License 2.0, see the LICENSE file for details.

Comments

bug: program hangs after processing some samples

Describe the bug: The program keeps hanging. My API key prefix is o6No6aqYRhrQ2MWxtDDyTeHiiUg****

bug

opened by franztao 5
bug: et_sess.save_output(output_folder, output_format="csv") writes output files whose names are missing some characters of the original full name

Describe the bug: My image files all have the .png suffix, for example "[email protected]_14-1-4.png".

bug

opened by franztao 3
Found some bugs, listed below

Describe the bug:
1. Text that spans columns is not recognized; when the table is recognized, the text is illogically split across two sides.
2. The plus-minus sign is not recognized, e.g. 31.2 + 4.98.
3. Subscripts and superscripts are not recognized.
4. OCR drops some characters.
5. For long tables, the bottom part is not recognized.
6. Tables whose cells contain chemical formulas are not recognized.

Expected behavior: I can work with you to solve these problems.
bug

opened by franztao 2
question: what does LayoutConfidence mean?

"CharacterConfidence": <float>, # Average Accuracy of all Characters recognized from the input-page
"LayoutConfidence": <float>, ## Accuracy of table layout's design decision
Please give a detailed description of, or the code used to calculate, CharacterConfidence and LayoutConfidence.
good first issue

opened by franztao 2
Invalid cross-device link
Describe the bug: On some operating systems, the output file cannot be saved to a temporary directory (say, /tmp) and then moved to a new location. It throws the following error:

os.replace(each_tbl_path, os.path.join(output_folder, input_fname+os.path.basename(each_tbl_path)))
OSError: [Errno 18] Invalid cross-device link: '/tmp/tmp7hqcm0fh/_table_1.csv' -> '/var/www/python/app/tmp/details_table_1.csv'

After checking the source code, it appears that ExtractTable uses os.replace to move the file. This method does not support moving a file from one partition to another: https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link

To Reproduce: I use Python 3.6 in a venv. You need two different filesystems, and must invoke save_output from the ExtractTable-py library so that the file is moved from one filesystem to the other. I have not tried it, but I think you can reproduce this bug simply by invoking os.replace, without calling ExtractTable-py.

Expected behavior: The file is moved from one filesystem to the other. I think shutil.move would be preferable to os.replace for moving files.
bug
opened by Elegye 2
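
The suggestion in the report above can be sketched as follows. The helper move_output and its prefix argument are hypothetical names chosen for this illustration and are not the library's code; the sketch relies only on the standard-library behavior that shutil.move falls back to copying and then deleting the source when a plain rename is not possible, for example across filesystems.

```python
import os
import shutil
import tempfile


def move_output(src_path, output_folder, prefix=""):
    """Move src_path into output_folder, even when the two are on different filesystems."""
    os.makedirs(output_folder, exist_ok=True)
    dest_path = os.path.join(output_folder, prefix + os.path.basename(src_path))
    # Unlike os.replace, shutil.move copies and then removes the source
    # when a simple rename fails (e.g. across partitions).
    shutil.move(src_path, dest_path)
    return dest_path


# Small demonstration inside a throwaway directory
with tempfile.TemporaryDirectory() as work_dir:
    src = os.path.join(work_dir, "_table_1.csv")
    with open(src, "w") as fh:
        fh.write("a,b\n1,2\n")
    moved = move_output(src, os.path.join(work_dir, "out"), prefix="details")
    print(os.path.basename(moved))  # details_table_1.csv
```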
MakeCorrections API - How do you chain corrections

Hi there, I'm trying to use multiple correction commands but it isn't working as the object becomes a list after the first correction. Is there something I'm missing here? Thanks!
good first issue

opened by kylebutts 1
Can the character OCR support LaTeX format?


opened by franztao 1
Do you have a tool to convert the ExtractTable output format to the COCO format (or another open-source detection format)?


opened by franztao 1
Custom output path when the output_format is csv

Is your feature request related to a problem? Please describe. When output_format is set to csv, the CSV file is written to a random path under /tmp.

Describe the solution you'd like: Add a parameter to process_file, such as output_file, that takes the absolute path (including the file name) where the file should be written.

opened by padmano 1
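
Until such a parameter exists, one possible workaround is to write the CSV yourself from the TableJson part of the server response documented in the Explore section. The sketch below is only an illustration under that assumption: write_table_csv is a hypothetical helper, the sample table is made up, and the row and column keys are assumed to be numeric strings.

```python
import csv
import os
import tempfile


def write_table_csv(table_json, output_file):
    """Write a {row#: {col#: text}} mapping to output_file as CSV and return the path."""
    rows = []
    for row_key in sorted(table_json, key=int):
        cols = table_json[row_key]
        rows.append([cols[col_key] for col_key in sorted(cols, key=int)])
    # Create the target folder if it does not exist yet
    os.makedirs(os.path.dirname(os.path.abspath(output_file)), exist_ok=True)
    with open(output_file, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(rows)
    return output_file


# Made-up table in the documented TableJson shape
sample_table_json = {
    "0": {"0": "Date", "1": "Amount"},
    "1": {"0": "2022-01-05", "1": "120.50"},
}

with tempfile.TemporaryDirectory() as work_dir:
    target = os.path.join(work_dir, "reports", "statement_table_1.csv")
    print(os.path.basename(write_table_csv(sample_table_json, target)))
```

With a live session, each entry of et_sess.ServerResponse.json()["Tables"] carries a "TableJson" value that could be passed to the helper in the same way.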
Is it possible to get the data in Excel while maintaining the table structure?


opened by jcthink 1

Character and Layout Confidence

Hi, I need some reference material defining Character and Layout Confidence, such as how the values printed by the code below are calculated mathematically. Thanks.

for idx, each_table in enumerate(et_sess.ServerResponse.json()['Tables']):
    print("CharacterConfidence = ", each_table['CharacterConfidence'])
    print("LayoutConfidence = ", each_table['LayoutConfidence'])

good first issue

opened by muhdzubair 1
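
The source does not say how the server computes these scores, so the sketch below only shows one way the reported values might be consumed: flagging cells whose TableConfidence falls below a cutoff. The helper low_confidence_cells, the 0.9 threshold, the sample values, and the assumption that confidences lie between 0 and 1 are all hypothetical.

```python
def low_confidence_cells(table, threshold=0.9):
    """Return (row, col, text, confidence) for cells whose confidence is below threshold."""
    flagged = []
    for row_key, cols in table["TableConfidence"].items():
        for col_key, conf in cols.items():
            if conf < threshold:
                text = table["TableJson"][row_key][col_key]
                flagged.append((int(row_key), int(col_key), text, conf))
    return sorted(flagged)


# Made-up table entry in the documented shape
sample_table = {
    "Page": 1,
    "CharacterConfidence": 0.96,
    "LayoutConfidence": 0.91,
    "TableJson": {
        "0": {"0": "Date", "1": "Amount"},
        "1": {"0": "2022-01-05", "1": "120.50"},
    },
    "TableConfidence": {
        "0": {"0": 0.99, "1": 0.97},
        "1": {"0": 0.95, "1": 0.62},
    },
}

print(low_confidence_cells(sample_table))  # [(1, 1, '120.50', 0.62)]
```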

Consider user hints on the table structure information

Is your feature request related to a problem? Please describe. "While you do whatever you want, why not consider our hints?" is feedback developers have given on many occasions.

Describe alternatives you've considered: Developers are handling this with their own custom post-processing.

Describe the solution you'd like: Pros: it may be worth a look, as most post-processing takes similar approaches that resolve the majority of issues. Cons: computing cost.
feature/idea

opened by akshowhini 0
Capture Vertically center aligned columns

Refer: https://stackoverflow.com/questions/58238981/extracting-table-from-a-pdf-table-without-vertical-lines

Do not miss: Joelgeraci's comment to the question
feature/idea

opened by akshowhini 0

Releases(v2.4.0)

v2.4.0(Jul 18, 2022)

Use corrections.save_output() to save the output to a folder
Source code(tar.gz)
Source code(zip)
ExtractTable-2.4.0-py3-none-any.whl(18.90 KB)
ExtractTable-2.4.0.tar.gz(16.16 KB)
v2.3.1(May 6, 2022)
Fix processing of split PDFs

Support downloading BigFile

Source code(tar.gz)
Source code(zip)
ExtractTable-2.3.1-py3-none-any.whl(18.38 KB)
ExtractTable-2.3.1.tar.gz(16.01 KB)
v2.2.0(Apr 20, 2021)
View your transactions processed in the last 24 hours

Give user the ability to make character error corrections

Source code(tar.gz)
Source code(zip)
ExtractTable-2.2.0-py3-none-any.whl(20.52 KB)
v2.1.2(Nov 6, 2020)

To provide user control on whether to output the row & column numbers in the output file
Source code(tar.gz)
Source code(zip)
ExtractTable-2.1.2-py3-none-any.whl(20.13 KB)
ExtractTable-2.1.2.tar.gz(12.04 KB)
v2.1.0(Aug 27, 2020)
Data cleaning on the server output made easy with the MakeCorrections class, which provides these functionalities:

split_merged_rows
split_merged_columns
fix_decimal_format
fix_date_format

added server_response attribute to the session for easy reference

Update Google Colab Tutorial

Save tables to multiple sheets of a single excel file

save_output functionality in session to save Tables & Text output to local

Updated Tutorial in example-code.ipynb
Source code(tar.gz)
Source code(zip)
ExtractTable-2.1.0-py3-none-any.whl(20.12 KB)
ExtractTable-2.1.0.tar.gz(12.03 KB)
v2.0.2(Jul 4, 2020)

To handle Invalid Object Exception when processing big files
Source code(tar.gz)
Source code(zip)
ExtractTable-2.0.2-py3-none-any.whl(17.28 KB)
ExtractTable-2.0.2.tar.gz(9.27 KB)
v2.0.1(Jul 3, 2020)
#28

Maintain column and row indices order

Display JobId & Wait message for async transactions

Source code(tar.gz)
Source code(zip)
ExtractTable-2.0.1-py3-none-any.whl(17.27 KB)
ExtractTable-2.0.1.tar.gz(9.27 KB)
v2.0.0(Apr 30, 2020)

To support the features below, added at the API level:
• Tables + Text Data: non-tabular text along with the tabular data (when no tables are detected, the response contains text data by default)
• Text Accuracy Details: page-level character accuracy details
• Cell / Word Coordinates: x,y coordinates of all words and of the table's cell data
• Cell / Word Level Accuracy: word-level accuracy details
• Non-English characters: support for non-English scripts such as Mandarin, Japanese, etc.
Source code(tar.gz)
Source code(zip)
ExtractTable-2.0.0-py3-none-any.whl(17.04 KB)
ExtractTable-2.0.0.tar.gz(9.09 KB)
v1.2.1.2(Dec 1, 2019)

Fixed: columns were not in order in the output
Source code(tar.gz)
Source code(zip)
ExtractTable-1.2.1.2-py3-none-any.whl(14.09 KB)
ExtractTable-1.2.1.2.tar.gz(8.01 KB)
v1.1.0(Oct 20, 2019)

Support URL as input which downloads the file to the temporary directory for the instance, auto deleted on success
Source code(tar.gz)
Source code(zip)
v1.0.1(Oct 7, 2019)

The first stable library to make it easier for Python developers to use ExtractTable's API to extract tabular data (tables) from images and scanned PDFs without worrying about table area, column regions, image rotation, etc.
Source code(tar.gz)
Source code(zip)
ExtractTable-1.0.1-py3-none-any.whl(12.42 KB)
ExtractTable-1.0.1.tar.gz(6.54 KB)

Owner

Org. Account

You, I and they have the same problem to solve !?!?

GitHub Repository https://extracttable.com

Textboxes_plusplus implementation with Tensorflow (python)

TextBoxes++-TensorFlow TextBoxes++ re-implementation using tensorflow. This project is greatly inspired by slim project And many functions are modifie

81 Dec 07, 2022

Using Opencv ,based on Augmental Reality(AR) and will show the feature matching of image and then by finding its matching

Using Opencv ,this project is based on Augmental Reality(AR) and will show the feature matching of image and then by finding its matching ,it will just mask that image . This project ,if used in cctv

1 Feb 13, 2022

Fully-automated scripts for collecting AI-related papers

AI-Paper-Collector Web demo: https://ai-paper-collector.vercel.app/ (recommended) Colab notebook: here Motivation Fully-automated scripts for collecti

772 Dec 30, 2022

Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Quick and Dirty OCR of Facebook Papers Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review. As lu

2 Oct 28, 2021

code for our ICCV 2021 paper "DeepCAD: A Deep Generative Network for Computer-Aided Design Models"

DeepCAD This repository provides source code for our paper: DeepCAD: A Deep Generative Network for Computer-Aided Design Models Rundi Wu, Chang Xiao,

85 Dec 31, 2022

The official code for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates".

SpeechDrivesTemplates The official repo for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates". [arxiv

53 Dec 23, 2022

TextBoxes++: A Single-Shot Oriented Scene Text Detector

TextBoxes++: A Single-Shot Oriented Scene Text Detector Introduction This is an application for scene text detection (TextBoxes++) and recognition (CR

930 Jan 04, 2023

[EMNLP 2021] Improving and Simplifying Pattern Exploiting Training

ADAPET This repository contains the official code for the paper: "Improving and Simplifying Pattern Exploiting Training". The model improves and simpl

138 Dec 26, 2022

An official PyTorch implementation of the paper "Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences", ICCV 2021.

PyTorch implementation of Learning by Aligning (ICCV 2021) This is an official PyTorch implementation of the paper "Learning by Aligning: Visible-Infr

30 Nov 05, 2022

Detect textlines in document images

Textline Detection Detect textlines in document images Introduction This tool performs border, region and textline detection from document image data

70 Jun 30, 2022

Code for the paper "Controllable Video Captioning with an Exemplar Sentence"

SMCG Code for the paper "Controllable Video Captioning with an Exemplar Sentence" Introduction We investigate a novel and challenging task, namely con

10 Dec 04, 2022

A real-time dolly zoom camera effect

Dolly-Zoom I've always been amazed by the gradual perspective change of dolly zoom, and I have some experience in python and OpenCV, so I decided to c

52 Dec 08, 2022

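Auto-play music assistant for the Genshin Impact Windblume Festival

GenshinAutoPlayBalladsofBreeze Auto-play music assistant for the Genshin Impact Windblume Festival (adapted for 1920*1080 resolution). This program is based on OpenCV image recognition and carries no risk of an account ban. Accuracy depends on your CPU performance; even a 10900K will not necessarily get everything right. Because image recognition has errors, it is impossible to determine when a mistake will occur, let alone for it to be detected.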

20 Oct 27, 2022

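Sample program that specifies a circle with three clicks and performs a polar coordinate transform

click-warpPolar A sample program that specifies a circle with three clicks and performs a polar coordinate transform. Requirements OpenCV 3.4.2 or Later Usage Run it as follows. After launching, click three points with the mouse to specify the circle. python click-warpPol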

17 Dec 30, 2022

Educational application aimed at automating user-defined workflows for the mobile game, "Granblue Fantasy", using a variety of CV technologies in the backend such as OpenCV, PyAutoGUI and EasyOCR and a frontend coded in Typescript.

Granblue Automation using Template Matching (It is like Full Auto, but with Full Customization!) Discord here: https://discord.gg/5Yv4kqjAbm Android v

71 Dec 30, 2022

Script to control mouse movement using Python and OpenCV with a real-time camera that detects hand landmarks and tracks gesture patterns instead of a physical mouse.

mouserController Script to control mouse movement using Python and OpenCV with a real-time camera that detects hand landmarks,

6 Jun 28, 2022

This is the implementation of the paper "Gated Recurrent Convolution Neural Network for OCR"

Gated Recurrent Convolution Neural Network for OCR This project is an implementation of the GRCNN for OCR. For details, please refer to the paper: htt

90 Dec 22, 2022

This is a project to detect gestures to zoom in or out, using the real-time distance between the index finger and the thumb. It's based on OpenCV and Mediapipe.

Pinch-zoom This is a python project based on real-time hand-gesture detection, to zoom in or out, using the distance between the index finger and the

6 Jul 11, 2022

Histogram specification using openCV in python .

histogram specification using openCV in python . Have to input miu and sigma to draw gausssian distribution which will be used to map the input image . Example input can be miu = 128 sigma = 30

6 Nov 17, 2021

👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

Quick Info this library tries to solve language detection of very short words and phrases, even shorter than tweets makes use of both statistical and

532 Dec 28, 2022
