Deep learning based page layout analysis

Overview

Deep Learning Based Page Layout Analyze

This is a Python implementaion of page layout analyze tool. The goal of page layout analyze is to segment page images into different regions and recoginize them into different classes. A sementic segmentation model is trained to predict a pixel-wise probability map and a simple post-processing procedure is utilized to generate final detection bounding boxs and their corresponding labels and confidence scores. Here is what this code can do:

visual result

Requirements

This repository is mostly written in Python, so Python is essential to run the code. For now, only Python2 (Python2.7 is tested) is supported and it needs some minor modifications if you want to run this code on Python3.

The core of this repository is DeepLab_v2, which is a image semantic segmentation model. We use a TensorFlow implementation of DeepLab_v2 DeepLab-ResNet-TensorFlow written by DrSleep, so TensorFlow needs to be installed before running this code. We use TensorFlow v1.4 and it may need some slightly change if you are using other version of TensorFlow. Also, we need all the requirements from DeepLab-ResNet-TensorFlow.

  • Cython>=0.19.2
  • numpy>=1.7.1
  • matplotlib>=1.3.1
  • Pillow>=2.3.0
  • six>=1.1.0

Also Scikit-image is required for image processing.

  • scikit-image>=0.13.1

To install all the required python packages, you can just run

pip install -r requirements.txt

or for a local installation, run

pip install -U -r requirements.txt

Usage

The code is packaged into a Python function and a Python module with a main function, which will produce exactly the same final detection results. For simplifying the usage, all the parameters are fixed except for the number of classes and visual result flag. One can easily extend the function to accept the parameters that need to be altered.

First, you need to save all the images in a folder and all the images should be in 'jpg' format. Then, a output directory need to be specified to save all the output predictions that include the down-sampled images, probability maps, visualization results and a JSON file with final detection results. The output directory does not have to exist before running the code (if there isn't one, we will create one for you). Finally you can run this code by calling the function in a bash terminal or import the module in Python and either way will do.

Function

PageAnalyze.py is the good point to start with. Just call this function and magic will happen.

python PageAnalyze.py --img_dir=./test/test_images \
                      --out_dir=./test/test_outputs \
                      --num_class=2 \
                      --save=True

Module

example.py gives the easiest way to import the module and call the main function.

import sys
# Add the path of the great module and python sripts.
sys.path.append('utils/')

# Import the great module.
import page_analyze

# The image directory containing all the images need to be processed.
# All the images should be the '.jpg' format.
IMAGE_DIR = 'test/test_images/'

# The output directory saving the output masks and results.
OUTPUT_DIR = 'test/test_outputs/'

# Classes number: 2 or 4.
# 2 --- text / non-text: trained on CNKI data.
# 4 --- figure / table / equation / text: trained on POD data (beta).
CLASS_NUM = 2

# Calling the great function in the great module.
page_analyze.great_function(image_dir=IMAGE_DIR, \
	                        output_dir=OUTPUT_DIR, \
	                        class_num=CLASS_NUM, \
	                        save=True)

# The final detection results will be saved in a single JSON file.
import json
RESULTS = json.load(open(OUTPUT_DIR + 'results.json', 'r'))

Output

If the visual result flag is set to True, then the visualization results will be saved at output_dir/predictions/. To save the running time, the default value of this flag is False.

Final results will be coded into a single JSON file at output_dir/results.json. Here is the example of the JSON file.

{
	"3005": 
	{
		"confs": [0.5356797385620915, 0.913904087544255, 0.7526443014705883, 0.9095219564478454, 0.8951748322262306, 0.6817004971002486, 0.9001002744772497, 0.9337936032277651, 0.8377339456847807, 0.7026428593008852, 0.8779250071028856, 0.8281628004780167, 0.8653182372135079, 0.7315979265269327, 0.5775715633847122, 0.6177185356901381], 
		"labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2], 
		"bboxs": [[130, 219, 158, 477], [395, 347, 484, 1990], [543, 725, 714, 1605], [800, 257, 1068, 2082], [1137, 1209, 2007, 2168], [1175, 185, 1230, 429], [1268, 171, 1910, 1123], [1897, 164, 2316, 1123], [2055, 1209, 2986, 2165], [2364, 175, 2567, 1120], [2691, 171, 2942, 1123], [3038, 1213, 3272, 2165], [3052, 168, 3261, 1123], [2594, 563, 2694, 749], [2608, 1055, 2684, 1113], [2979, 1464, 3048, 2161]]
	}, 
	
	"3004": 
	{
		"confs": [0.630786120259585, 0.7335585214477949, 0.8283411346491016, 0.7394772139625081, 0.6203790052606408, 0.7930634196637656, 0.9062854529210855, 0.8209901351845086, 0.9105478240273018, 0.6283956118923438, 0.9496875863021265, 0.8075845380525092, 0.9290070507407969, 0.899121940386255, 0.9245526964953498], 
		"labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2], 
		"bboxs": [[996, 1297, 1054, 1581], [1037, 212, 1201, 1026], [1102, 1208, 1259, 2115], [1807, 143, 2163, 1102], [1988, 1293, 2043, 1574], [2094, 1317, 2245, 2016], [2272, 1191, 2385, 2142], [2437, 1191, 2491, 1742], [2529, 1180, 3265, 2149], [2728, 164, 2820, 1067], [2875, 140, 3258, 1112], [219, 164, 1026, 2135], [1211, 191, 1776, 1037], [1287, 1235, 1981, 2105], [2197, 277, 2683, 985]]
	},
	
} 

Descriptions

The pipeline of our model contains two parts: the semantic segmentation stage and the post-processing stage. We adopt DeepLab_v2 to preform a semantic segmentation procedure, and use the TensorFlow implementation DeepLab-ResNet-TensorFlow to train our semantic segmentation model. For post-processing part, we get the bounding box locations along with their confidence scores and labels by analysing the connected component on the probability map generated from our segmentation model.

Training

In order to train the segmentation model, we prepare the pixel-wise mask labels on CNKI dataset and POD dataset.

  • CNKI dataset contains total 14503 images with text / non-text annotations, and most of them are in Chinese language.

  • POD dataset contains 11333 images all in English with a four-class labeling which is figure / table / equation / text.

Noted that CNKI dataset is noisy because it is annotated by a software, but POD dataset has a lot less noise. Also, text are annotated as regions in CNKI dataset while each text line is labeled in POD dataset. Here are the examples of CNKI dataset and POD dataset.

data examples

For CNKI dataset and POD dataset, we train two DeepLab models separately. We initialize the network by the model pre-trained on MSCOCO. And each model was trained by 200k steps with batch size 5 and random scale 321*321 inputs. It took roughly 2 days to converge the model on a single GTX1080 GPU. During the training, all the other hyper paremeters we use are the default values in DeepLab-ResNet-TensorFlow. The mAPs on the training sets of two datasets are 0.909 and 0.837.

This repository does not contain the code for training in the reason that we want this repository keep an acceptable size (or else we need to pack the data as well). But we do put the trained models mdoel_cnki and mdoel_pod in the models folder for inference.

Testing

At inference time, there are four steps and each one is written in a Python module in the utils folder. Either the function or the module calls the page_analyze module which calls those four modules in turn.

  • Configuration module generates the image list file and configuration dictionary.

  • Pre-processing module re-scales (down-samples) large images and dump the scale JSON file. Due to the limit of GPU memory, we down-sample the image with height larger than 1000 pixels to 1000 and keep the aspect ratio.

  • Segmentation module is the core module of this code. It sets up a FIFO queue for all the input images and feeds them into the deep neural network. The probability maps generated by the DeepLab model will be saved in the output directory.

  • Post-Processing module reads the original images and generated probability masks ro get the bounding box locations and their labels and confidence scores. If the save flag is set to Ture, the detection results will be drew on the original images and saved in the output directory. The final results will be written in a single JSON file we have mentioned before.

Here are a simple demo of the detection pipline.

pipline demo

Running Time

We conduct some simple running time analysis on our server (14 cores E5-2680-v4*2 and 8GB GTX1080*2). Configuration module takes nearly no time and the rest of the modules running time is linear to the image number. So we run the code on 51 test images and compute the average time per image for each module. For GPU we only use one GTX1080 GPU and perform multi-processing on CPU.

pre_process inference post_process
CPU time 0.53s / 0.05s 3.98s 0.07s / 0.10s
GPU time - 0.27s -
  • For pre_process, a size 3000*2000 image is quite large for deep learning so we have to down-sample the input image (due to the limit of GPU memory) and this is the most time-consuming part. We took the down-sampled images as inputs and run the code again, and it only took 0.05s per image for the reason we don't need to re-sacle the input images.

  • For inference, it is a feed-forward precedure through the deep neural network, so the time gap between CPU and GPU is enormous. Noted that we only use one GTX1080, so it should be at least twice faster when running on a decent GPU like Titan X.

  • For post_process, connected component analysis is time-consuming, but suprisingly fast on this case. Also, there is a slightly difference between 2 classification and 4 classification which is 0.07s versus 0.10s

In general, now it takes us about a second to process one image. But if the sizes of the input images are smaller, it is likely to achieve 4 to 5 FPS, that is 4 or 5 images per second, with the help of a nice GPU of course.

Problem

We analyze the weakness of this algorithm by 51 test images and the main problem is from the post-processing procedure. Since the DeepLab model achieves 0.909 mAP, there is not a lot space to improve on the deep learning model. We categorize the problem into three types.

problem example

  • Fragmentary text regions (especially in captions). This is because the CNKI data annotated all the captions as text regions, and these extremely small text regions are very close to the non-text region (like a figure or a table) which can harm the training of deep neural network. So at the inference time, the network may predict some separable text regions on the captions causing the bad results.

  • Inseparable non-text regions (causing overlaping between regions). There is only two classes (text / non-text) in CNKI dataset, so the network can not tell the difference between figures and tables. Sometimes, some different but close table and figure regions may be predicted as one non-text region together which may cause overlaping with other regions (which is very bad for the recognition procedure afterwards).

  • Poor results on 4 classification (different data distribution). Since the 4 classes model is trained on POD dataset which has different distribution compared with CNKI dataset (language, layout and different text regions). So there is inevitable some bad results on CNKI test sets when we try to use the model trained on POD dataset. (We have already beaten the second place on POD competition by training the figure / table / equation model and using basically the same post processing procedure.)

Todo

  • Improve the post processing procedure to get a better result.

  • Modify the code to run on Python3.

Statements

  • Sorry we can not make the source code public yet.

  • For more details, please refer to our paper:

@inproceedings{li2018deeplayout,

title={DeepLayout: A Semantic Segmentation Approach to Page Layout Analysis},

author={Li, Yixin and Zou, Yajun and Ma, Jinwen},

booktitle={International Conference on Intelligent Computing},

pages={266--277},

year={2018},

organization={Springer}

}

Python-based tools for document analysis and OCR

ocropy OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do so

OCRopus 3.2k Dec 31, 2022
TableBank: A Benchmark Dataset for Table Detection and Recognition

TableBank TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on th

844 Jan 04, 2023
Python Computer Vision Aim Bot for Roblox's Phantom Forces

Python-Phantom-Forces-Aim-Bot Python Computer Vision Aim Bot for Roblox's Phanto

drag0ngam3s 2 Jul 11, 2022
Course material for the Multi-agents and computer graphics course

TC2008B Course material for the Multi-agents and computer graphics course. Setup instructions Strongly recommend using a custom conda environment. Ins

16 Dec 13, 2022
Face_mosaic - Mosaic blur processing is applied to multiple faces appearing in the video

動機 face_recognitionを使用して得られる顔座標は長方形であり、この座標をそのまま用いてぼかし処理を行った場合得られる画像は醜い。 それに対してモ

Yoshitsugu Kesamaru 6 Feb 03, 2022
Maze generator and solver with python

Procedural-Maze-Generator-Algorithms Check out my youtube channel : Auctux Ressources Thanks to Jamis Buck Book : Mazes for programmers Requirements P

Joseph 19 Dec 07, 2022
A Python script to capture images from multiple webcams at once and save them into your local machine

Capturing multiple images at once from Webcam Using OpenCV Capture multiple image by accessing the webcam of your system and save it to your machine.

Fazal ur Rehman 2 Apr 16, 2022
Deep learning based page layout analysis

Deep Learning Based Page Layout Analyze This is a Python implementaion of page layout analyze tool. The goal of page layout analyze is to segment page

186 Dec 29, 2022
scene-linear test images

Scene-Referred Image Collection A collection of OpenEXR Scene-Referred images, encoded as max 2048px width, DWAA 80 compression. All exrs are encoded

Gralk Klorggson 7 Aug 25, 2022
Markup for note taking

Subtext: markup for note-taking Subtext is a text-based, block-oriented hypertext format. It is designed with note-taking in mind. It has a simple, pe

Gordon Brander 224 Jan 01, 2023
Amazing 3D explosion animation using Pygame module.

3D Explosion Animation 💣 💥 🔥 Amazing explosion animation with Pygame. 💣 Explosion physics An Explosion instance is made of a set of Particle objec

Dylan Tintenfich 12 Mar 11, 2022
Face Detection with DLIB

Face Detection with DLIB In this project, we have detected our face with dlib and opencv libraries. Setup This Project Install DLIB & OpenCV You can i

Can 2 Jan 16, 2022
The open source extract transaction infomation by using OCR.

Transaction OCR Mã nguồn trích xuất thông tin transaction từ file scaned pdf, ở đây tôi lựa chọn tài liệu sao kê công khai của Thuy Tien. Mã nguồn có

Nguyen Xuan Hung 18 Jun 02, 2022
TextField: Learning A Deep Direction Field for Irregular Scene Text Detection (TIP 2019)

TextField: Learning A Deep Direction Field for Irregular Scene Text Detection Introduction The code and trained models of: TextField: Learning A Deep

Yukang Wang 101 Dec 12, 2022
Give a solution to recognize MaoYan font.

猫眼字体识别 该 github repo 在于帮助xjtlu的同学们识别猫眼的扭曲字体。已经打包上传至 pypi ,可以使用 pip 直接安装。 猫眼字体的识别不出来的原理与解决思路在采茶上 使用方法: import MaoYanFontRecognize

Aruix 4 Jun 30, 2022
BoxToolBox is a simple python application built around the openCV library

BoxToolBox is a simple python application built around the openCV library. It is not a full featured application to guide you through the w

František Horínek 1 Nov 12, 2021
Introduction to image processing, most used and popular functions of OpenCV

👀 OpenCV 101 Introduction to image processing, most used and popular functions of OpenCV go here.

Vusal Ismayilov 3 Jul 02, 2022
This is a pytorch re-implementation of EAST: An Efficient and Accurate Scene Text Detector.

EAST: An Efficient and Accurate Scene Text Detector Description: This version will be updated soon, please pay attention to this work. The motivation

Dejia Song 544 Dec 20, 2022
Code release for our paper, "SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo"

SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo Thomas Kollar, Michael Laskey, Kevin Stone, Brijen Thananjeyan

68 Dec 14, 2022
Kornia is a open source differentiable computer vision library for PyTorch.

Open Source Differentiable Computer Vision Library

kornia 7.6k Jan 06, 2023