A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.

Overview

The project is based on older versions of tesseract and other tools, and is now superseded by another project which allows for more granular control over the text recognition process.

go-ocr

A tool for extracting plain text from scanned documents (pdf or djvu), with user-defined postprocessing.

Motivation

I once had to OCR a number of scanned documents in pdf format. I quickly built a pipeline of tools to extract images from the input files and convert them to plain text, but then realised that modern OCR software is still far from ideal at recognising text, so a good deal of postprocessing was needed to remove at least some of the OCR artefacts and irregularities. I ended up with a long pipeline of sed/grep filters which also had to be adjusted for each document and each document language. What I wanted was a tool that could combine invoking the OCR tools with applying the filters, while also giving an easy way of modifying and combining the filter definitions.

The tool

Given an input file in either pdf or djvu format, the tool performs the following steps:

  1. Images are extracted from the input file using the pdfimages or ddjvu tool;
  2. The extracted images are converted to plain text using the tesseract tool, in parallel;
  3. The specified filters are applied to the text.

Invocation

go-ocr [OPTION]... FILE

Command line options:

-f,--first N        first page number (optional, default: 1)
-l,--last  N        last page number (optional, default: last page of the document)
-F,--filter FILE    filter specification file name (optional, may be given multiple times)
-L,--language LANG  document language (optional, default: 'eng')
-o,--output FILE    output file name (optional, default: stdout)
-h,--help           display this help and exit
-v,--version        output version information and exit

Example

The following command processes a document some.pdf in Russian, from page 12 to page 26 (inclusive), without any postprocessing, storing the result in the file document.txt:

./go-ocr --first 12 --last 26 --language rus --output document.txt some.pdf

Filter definitions

A filter definition file is a plain text file containing rewriting rules and C-style comments. Each rewriting rule has the following format:

scope type "match" "substitution"

where

  • scope is either line or text;
  • type is either word or regex;
  • match and substitution are Go strings.

Each rule must be on one line.
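
To make the format concrete, here is a minimal sketch of how a single rule line could be parsed. It is not the tool's actual parser: the Rule type and the parseRule function are hypothetical names, and the sketch assumes that the standard text/scanner and strconv.Unquote are an acceptable way to read Go strings and skip C-style comments.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"text/scanner"
)

// Rule is a hypothetical in-memory form of one rewriting rule.
type Rule struct {
	Scope, Type, Match, Subst string
}

// parseRule parses one rule line, e.g.: line regex `\s+` " "
func parseRule(line string) (Rule, error) {
	var s scanner.Scanner
	s.Init(strings.NewReader(line))
	// Recognise identifiers and Go strings; treat C-style comments as whitespace.
	s.Mode = scanner.ScanIdents | scanner.ScanStrings | scanner.ScanRawStrings |
		scanner.ScanComments | scanner.SkipComments
	s.Error = func(*scanner.Scanner, string) {} // keep stderr quiet; ErrorCount is checked below

	var fields []string
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		text := s.TokenText()
		switch {
		case tok == scanner.Ident && len(fields) < 2:
			// scope and type are bare words
			fields = append(fields, text)
		case (tok == scanner.String || tok == scanner.RawString) && len(fields) >= 2:
			// match and substitution are Go strings, quoted or raw
			value, err := strconv.Unquote(text)
			if err != nil {
				return Rule{}, fmt.Errorf("bad string %s: %v", text, err)
			}
			fields = append(fields, value)
		default:
			return Rule{}, fmt.Errorf("unexpected token %q", text)
		}
	}
	if s.ErrorCount > 0 || len(fields) != 4 {
		return Rule{}, fmt.Errorf("malformed rule: %q", line)
	}
	r := Rule{fields[0], fields[1], fields[2], fields[3]}
	if r.Scope != "line" && r.Scope != "text" {
		return Rule{}, fmt.Errorf("unknown scope %q", r.Scope)
	}
	if r.Type != "word" && r.Type != "regex" {
		return Rule{}, fmt.Errorf("unknown type %q", r.Type)
	}
	return r, nil
}

func main() {
	r, err := parseRule("line regex `\\s+` \" \" // collapse whitespace")
	fmt.Printf("%+v %v\n", r, err)
}
```

Because the strings are unquoted with Go's own rules, a match written in backticks keeps its backslashes verbatim, which is convenient for regular expressions.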

Each rule of the scope line is applied to each line of the text. There is no processing done to the line by the tool itself other than trimming the trailing whitespace, which means that a line does not have a trailing newline symbol when the rule is applied. After that all the lines get combined into text with newline symbols inserted between them.

Each rule of the scope text is applied to the whole text after all the line rules. All newline symbols are visible to the rule which allows for combining multiple lines into one.

The reason for having two different scopes for the rules is that applying a rule to a line is computationally cheaper than applying it to the whole text. Also, this makes the line regular expressions a bit simpler as, for example, the \s regex cannot match a newline.
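
The difference between the two scopes can be illustrated with a short sketch that applies the same `\s+` replacement once per line and once to the whole text. The function names are hypothetical and the code only mirrors the behaviour described above (trailing whitespace trimmed, lines joined with newlines); it is not taken from the tool's source.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
)

var spaces = regexp.MustCompile(`\s+`)

// trimmedLines splits the input into lines and trims trailing whitespace,
// so no line carries a newline when a rule sees it.
func trimmedLines(input string) []string {
	lines := strings.Split(input, "\n")
	for i, l := range lines {
		lines[i] = strings.TrimRightFunc(l, unicode.IsSpace)
	}
	return lines
}

// collapseAsLineRule applies the replacement to each line separately.
func collapseAsLineRule(input string) string {
	lines := trimmedLines(input)
	for i, l := range lines {
		lines[i] = spaces.ReplaceAllString(l, " ")
	}
	return strings.Join(lines, "\n")
}

// collapseAsTextRule applies the replacement to the joined text,
// where newlines are visible to the regular expression.
func collapseAsTextRule(input string) string {
	text := strings.Join(trimmedLines(input), "\n")
	return spaces.ReplaceAllString(text, " ")
}

func main() {
	in := "first  line   \nsecond\t\tline"
	fmt.Printf("%q\n", collapseAsLineRule(in)) // "first line\nsecond line"
	fmt.Printf("%q\n", collapseAsTextRule(in)) // "first line second line"
}
```

With the line scope the line break survives, while the same expression used with the text scope also swallows the newline.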

Rules of type word do a simple substitution replacing any match string with its corresponding substitution string.

Rules of type regex search the input for any match of the match regular expression and replace it with the substitution string. The syntax of the regular expression is that of the Go regexp engine. The substitution string may contain references to the content of capturing groups from the corresponding match regular expression. From the Go documentation, each reference

is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores. A purely numeric name like $1 refers to the submatch with the corresponding index; other names refer to capturing parentheses named with the (?P<name>...) syntax. A reference to an out of range or unmatched index or a name that is not present in the regular expression is replaced with an empty slice.

In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.

To insert a literal $ in the output, use $$ in the template.
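
The following small program shows these template rules in action using the standard regexp package. The pattern, the input and the helper name expand are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// expand replaces every match of pattern in input using the given template.
func expand(pattern, template, input string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(input, template)
}

func main() {
	// Group 1 is named "word", group 2 is the number.
	const pattern = `(?P<word>[a-z]+)-(\d+)`
	const in = "page-12"

	fmt.Printf("%q\n", expand(pattern, "${2}:${word}", in)) // "12:page"
	fmt.Printf("%q\n", expand(pattern, "${2}x", in))        // "12x"
	fmt.Printf("%q\n", expand(pattern, "$2x", in))          // "" because $2x means ${2x}, which does not exist
	fmt.Printf("%q\n", expand(pattern, "$$$2", in))         // "$12"
}
```

When a reference is followed directly by a letter, digit or underscore, the braced form ${name} is the safer choice.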

All filter definition files are always processed in the order in which they are specified on the command line. Within each file, the rules are grouped by the scope, and applied in the order of specification. This allows for each rule to rely on the outcome of all the rules before it.

Rewriting rules examples

Rule to replace ellipsis with a single utf-8 symbol:

line word	"..."  "…"

Rule to replace all whitespace sequences with a single space character:

line regex	`\s+`	" "

Rule to remove all newline characters from the middle of a sentence:

text regex	`([a-z\(\),])\n+([a-z\(\)])` "${1} ${2}"
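
As a worked example, the sketch below applies the three rules above to a short sample, using bytes.Replace for the word rule and Regexp.ReplaceAll for the regex rules. The sample text and the function name applyExamples are assumptions made for illustration; the rules are hard-coded here rather than read from a filter file.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

var (
	spaces   = regexp.MustCompile(`\s+`)
	midBreak = regexp.MustCompile(`([a-z\(\),])\n+([a-z\(\)])`)
)

// applyExamples runs the two line rules on every line, joins the lines,
// and then runs the text rule on the result.
func applyExamples(input string) string {
	lines := bytes.Split([]byte(input), []byte("\n"))
	for i, line := range lines {
		line = bytes.TrimRight(line, " \t\r")                      // trailing whitespace is trimmed first
		line = bytes.Replace(line, []byte("..."), []byte("…"), -1) // line word "..." "…"
		line = spaces.ReplaceAll(line, []byte(" "))                // line regex `\s+` " "
		lines[i] = line
	}
	text := bytes.Join(lines, []byte("\n"))
	text = midBreak.ReplaceAll(text, []byte("${1} ${2}")) // text regex rule
	return string(text)
}

func main() {
	in := "Wait...  the  scanned\npage ends here.\nNext line"
	fmt.Printf("%q\n", applyExamples(in))
	// "Wait… the scanned page ends here.\nNext line"
}
```

The first line break is removed because it sits between two lowercase letters, while the break after the full stop is kept.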

More examples can be found in the files filter-eng and filter-rus.

In practice, it is often useful to maintain one filter definition file with rules to remove common OCR artefacts, and another file with rules specific to a particular document. In general, it is probably impossible to avoid all manual editing altogether by using this tool, but from my experience, a few hours spent on setting up the appropriate filters for a 700 pages document can dramatically reduce the amount of manual work needed afterwards.

Other tools

Internally the program relies on the pdfimages and ddjvu tools for extracting images from the input file, and on the tesseract program for the actual OCR'ing. The pdfimages tool is usually part of the poppler-utils package, ddjvu comes from the djvulibre-bin package, and tesseract is included in the tesseract-ocr package. By default, tesseract comes with English language support only; other languages must be installed separately. For example, run sudo apt install tesseract-ocr-rus to install Russian language support. To find out which languages are currently installed, type tesseract --list-langs.

Compilation

Invoke make (or make debug) from the project directory to compile the code with debug information included, or make release to compile without debug symbols. This creates the executable file go-ocr.

Technical details

The tool first runs the pdfimages or ddjvu program to extract images to a temporary directory, and then invokes tesseract on each image in parallel to produce lines of plain text. Those lines are then passed through the line filters, if any, then assembled into one text string and passed through the text filters, if any. Filters of type regex are implemented using the Regexp.ReplaceAll() function, and word filters are invocations of the bytes.Replace() function.
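
The parallel OCR step could look roughly like the sketch below. This is only one possible arrangement, not necessarily what the tool does: ocrAll and recognise are hypothetical names, recognise is a stand-in that does not call tesseract, and the worker limit is an assumption. Each goroutine writes to its own slot of the result slice, so the pages stay in document order.

```go
package main

import (
	"fmt"
	"sync"
)

// recognise is a stand-in for running tesseract on one image file.
func recognise(image string) string {
	return "text of " + image
}

// ocrAll converts all images concurrently, running at most `workers`
// conversions at a time, and returns the results in input order.
func ocrAll(images []string, workers int, ocr func(string) string) []string {
	if workers < 1 {
		workers = 1
	}
	pages := make([]string, len(images))
	sem := make(chan struct{}, workers) // limits the number of concurrent jobs
	var wg sync.WaitGroup
	for i, img := range images {
		wg.Add(1)
		go func(i int, img string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			pages[i] = ocr(img)      // each goroutine owns exactly one index
		}(i, img)
	}
	wg.Wait()
	return pages
}

func main() {
	for _, page := range ocrAll([]string{"p-001.tif", "p-002.tif", "p-003.tif"}, 2, recognise) {
		fmt.Println(page)
	}
}
```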

Known issues

Older versions of the pdfimages tool do not have the -tiff option, resulting in an error.

Platform

Linux (tested on Linux Mint 18 64bit, based on Ubuntu 16.04); it will probably work on macOS as well.

Tools:

$ go version
go version go1.6.2 linux/amd64
$ tesseract --version
tesseract 3.04.01
...
$ pdfimages --version
pdfimages version 0.41.0
...
$ ddjvu --help
DDJVU --- DjVuLibre-3.5.27
...

License: BSD