~1000 book pages + OpenCV + Python = page regions identified as paragraphs, lines, images, captions, etc.

Last update: Dec 06, 2022

Overview

cosc428-structor

I had an open-ended Computer Vision assignment to complete, and an out-of-copyright book that I wanted to turn into an ebook. Conventional OCR engines like Tesseract weren't able to accurately recognise the page structure, which led to many transcription errors. If I could tell Tesseract to ignore certain regions (like images or repeated headers), then I could greatly reduce the number of errors in the resulting ebook. Thus: for my assignment, I wrote a program that takes an image and uses computer vision magick to determine the page's structure. So far, my program can detect and locate:

lines of text,
paragraphs,
section titles,
images and their associated captions,
boilerplate like page numbers, and
chapter titles.

Ain't it grand?

Dependencies

The project is written in Python 2.7.3 and uses the cv2 library for interacting with OpenCV. It also uses NumPy for some of the mathematical operations. On Windows, the best way to get these dependencies is to install the Python(x,y) suite (https://code.google.com/p/pythonxy/), which combines Python with a customisable set of scientific computing libraries.

Program Structure

The program's root is main.py, but this simply iterates through images in a folder and constructs a Page instance from each image. Thus, the real work happens in page.py.
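As a rough illustration of that loop, here is a minimal sketch. It is not the code in main.py: the names list_page_images, process_folder, and make_page, as well as the list of file extensions, are assumptions made for the example. make_page stands in for the Page constructor.

```python
import os

# Assumed set of extensions; the README does not say which formats main.py accepts.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def list_page_images(folder):
    """Return the sorted paths of the image files directly inside `folder`."""
    paths = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            paths.append(os.path.join(folder, name))
    return paths


def process_folder(folder, make_page):
    """Build one page object per image.

    `make_page` is a hypothetical callable standing in for the Page
    constructor; it receives the path of a single image.
    """
    return [make_page(path) for path in list_page_images(folder)]
```

Passing the constructor in as an argument keeps the sketch independent of page.py; in the real program the loop would presumably construct Page instances directly.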

page.py contains a few utility methods and the Page class. The constructor calls the appropriate methods in order to determine the logical structure of the page. This structure is stored in three objects: self.margin, self.content, and self.boilerplate (which contains such non-content text objects as the page number and header).
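The shape of such a class might look like the sketch below. Only the attribute names margin, content, and boilerplate come from the description above. The Line record, its role labels, and the _fit_margin helper are assumptions for illustration, and the real constructor starts from an image, not from pre-labelled lines.

```python
from collections import namedtuple

# Hypothetical stand-in for a detected text line: a bounding box plus a role label.
Line = namedtuple("Line", "x0 y0 x1 y1 role")

# Assumed labels for non-content text such as the page number and header.
BOILERPLATE_ROLES = ("header", "page_number")


class Page(object):
    """Sketch of a page whose constructor derives the three structure objects."""

    def __init__(self, lines):
        self.margin = self._fit_margin(lines)
        self.boilerplate = [ln for ln in lines if ln.role in BOILERPLATE_ROLES]
        self.content = [ln for ln in lines if ln.role not in BOILERPLATE_ROLES]

    @staticmethod
    def _fit_margin(lines):
        """Return the smallest (x0, y0, x1, y1) box enclosing every line."""
        return (
            min(ln.x0 for ln in lines),
            min(ln.y0 for ln in lines),
            max(ln.x1 for ln in lines),
            max(ln.y1 for ln in lines),
        )
```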

The getBuildingBlocks method is responsible for finding words, grouping words into textual lines, discarding marginal noise, and fitting a Margin instance around the remaining lines. Most of these tasks are performed by calling other functions.
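To make the word-grouping and margin-fitting steps concrete, here is a simplified sketch that works on plain (x, y, w, h) word boxes. It is not the project's implementation: the function names and the grouping rule (a word joins a line when its vertical centre falls inside the vertical span of that line's first word) are assumptions, and noise removal is left out.

```python
def group_words_into_lines(words):
    """Group (x, y, w, h) word boxes into text lines, ordered top to bottom.

    Assumed rule: a word joins the current line when its vertical centre
    lies within the vertical span of the first word placed on that line.
    """
    lines = []
    for word in sorted(words, key=lambda b: b[1] + b[3] / 2.0):
        centre = word[1] + word[3] / 2.0
        if lines:
            first = lines[-1][0]
            if first[1] <= centre <= first[1] + first[3]:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Order the words within each line from left to right.
    return [sorted(line, key=lambda b: b[0]) for line in lines]


def bounding_box(boxes):
    """Return the (x, y, w, h) box that encloses all the given boxes."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    return (x0, y0, x1 - x0, y1 - y0)
```

In this sketch, a margin would be fitted by calling bounding_box on the words that survive noise removal.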

The self.content object is found by passing the set of lines to the Content() constructor. This uses a state machine to group lines into figures, paragraphs, section titles, etc. The Content class, along with a class for each content type, is found in content.py.
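The general idea of a state machine over lines can be sketched as follows. The states, the per-line features (centred, indented), and the transition rules are all assumptions chosen for the example. The actual rules live in content.py and also cover figures and captions, which this sketch omits.

```python
def group_lines(lines):
    """Group line records into ('title', [...]) and ('paragraph', [...]) blocks.

    Each record is a dict with 'text', 'centred' and 'indented' keys.
    Assumed rules: a centred line is a section title, an indented line
    starts a new paragraph, and any other line continues the paragraph
    that is currently open.
    """
    blocks = []
    state = "start"
    for line in lines:
        if line["centred"]:
            blocks.append(("title", [line["text"]]))
            state = "after_title"
        elif state == "in_paragraph" and not line["indented"]:
            blocks[-1][1].append(line["text"])
        else:
            blocks.append(("paragraph", [line["text"]]))
            state = "in_paragraph"
    return blocks
```

The state variable is what lets the same unindented line be read either as the start of a paragraph (right after a title) or as a continuation (inside a paragraph).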

The other files can generally be ignored when trying to understand the program; they are largely just convenience classes which represent page elements (such as points, geometric lines, words, text lines, and boxes), as well as supporting tools such as the Stopwatch.

How to Run the Code

Run main.py using the Python interpreter. This will process each page in ./images, and for each page a series of 'snapshot' images will be displayed to illustrate the algorithm. To show only the final result for each image, set showSteps in main.py to False.

Comments

The getBuildingBlocks

Hello. I have recently been working on a document layout analysis task, and the description in README.md matches it closely. But when I run the code as described under "How to Run the Code", there are just some red lines on each double word, and no result for the detection and location of lines of text, paragraphs, section titles, etc. I would like to know what has happened to the code. Thank you very much.

opened by lvbohui 3

Releases(v1.0)

v1.0(Nov 7, 2013)

This is the version that I used to write the first draft of my conference paper.

Owner

Chad Oliver
