An application of high resolution GANs to dewarp images of perturbed documents

Last update: Dec 25, 2022

Overview

Docuwarp

This project dewarps document images using pix2pixHD, a GAN for general image-to-image translation. The goal is to take images of documents that are warped, folded, crumpled, etc. and convert them to a "dewarped" state, using pix2pixHD for both training and inference. All of the model code is borrowed directly from the official pix2pixHD repository.

Some of the intuition behind doing this is inspired by these two papers:

May 8, 2020 : Important Update

This project does not contain a pretrained model. I currently have neither the resources nor the bandwidth to train a model on an open source dataset. If anyone would like to contribute a pretrained model and share their model checkpoints, feel free to do so; I will likely accept any PR that does this. Thanks!

Prerequisites

This project requires Python and the following platform, hardware, and Python libraries:

Linux or OSX
scikit-learn
NVIDIA GPU (11 GB memory or larger) + CUDA cuDNN
PyTorch
Pillow
OpenCV

Getting Started

Installation

Install PyTorch and dependencies from http://pytorch.org
Install the Python library dominate:

pip install dominate

Clone this repo:

git clone https://github.com/thomasjhuang/deep-learning-for-document-dewarping
cd deep-learning-for-document-dewarping
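
After installation, it can help to confirm that the dependencies listed under Prerequisites are importable. The following is a minimal sketch, not part of this repository: the helper name `find_missing` is hypothetical, and the script only checks that each module can be located, not that its version is compatible or that a GPU is available. Note that several packages are imported under a different name than the one used to install them.

```python
import importlib.util

# Display name -> import name. Several import names differ from the package names.
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "Pillow": "PIL",
    "OpenCV": "cv2",
    "scikit-learn": "sklearn",
    "dominate": "dominate",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the display names of packages whose import name cannot be found."""
    return [name for name, module in modules.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```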

Training

Train the Kaggle model with 256x256 crops:

python train.py --name kaggle --label_nc 0 --no_instance --no_flip --netG local --ngf 32 --fineSize 256

To view training results, check out the intermediate results in ./checkpoints/kaggle/web/index.html. If you have TensorFlow installed, you can see TensorBoard logs in ./checkpoints/kaggle/logs by adding --tf_log to the training scripts.

Training with your own dataset

If you want to train with your own dataset, please generate label maps which are one-channel images whose pixel values correspond to the object labels (i.e. 0,1,...,N-1, where N is the number of labels). This is because we need to generate one-hot vectors from the label maps. Please also specify --label_nc N during both training and testing.
If your input is not a label map, please just specify --label_nc 0 which will directly use the RGB colors as input. The folders should then be named train_A, train_B instead of train_label, train_img, where the goal is to translate images from A to B.
If you don't have instance maps or don't want to use them, please specify --no_instance.
The default setting for preprocessing is scale_width, which will scale the width of all training images to opt.loadSize (1024) while keeping the aspect ratio. If you want a different setting, please change it by using the --resize_or_crop option. For example, scale_width_and_crop first resizes the image to have width opt.loadSize and then does random cropping of size (opt.fineSize, opt.fineSize). crop skips the resizing step and only performs random cropping. If you don't want any preprocessing, please specify none, which will do nothing other than making sure the image dimensions are divisible by 32.
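
To make the size arithmetic behind these preprocessing options concrete, here is a minimal sketch that only computes output dimensions. It is an illustration, not the repository's preprocessing code: the function names are hypothetical, and rounding to the nearest multiple of 32 (rather than, say, rounding down) is an assumption.

```python
def scale_width_size(size, load_size=1024):
    """Return (width, height) after scaling the width to load_size, keeping the aspect ratio."""
    width, height = size
    if width == load_size:
        return (width, height)
    return (load_size, int(load_size * height / width))


def round_to_multiple(size, base=32):
    """Round each dimension to the nearest multiple of base, never going below base.

    Assumption: "divisible by 32" is achieved by rounding to the nearest multiple.
    """
    return tuple(max(base, int(round(d / base)) * base) for d in size)


if __name__ == "__main__":
    original = (540, 420)                  # example input size (width, height)
    print(scale_width_size(original))      # size under scale_width
    print(round_to_multiple(original))     # size under none
```

For a 540x420 input, the first call gives a width of 1024 with the height scaled by the same factor, while the second leaves the image close to its original size and only adjusts each side to a multiple of 32.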

Testing

Test the model:

python test.py --name kaggle --label_nc 0 --netG local --ngf 32 --resize_or_crop crop --no_instance --no_flip --fineSize 256

The test results will be saved to ./results/kaggle/test_latest/.

Dataset

I use the Kaggle Denoising Dirty Documents dataset. To train a model on the full dataset, please download it from the official website. After downloading, put it under the datasets folder, with warped images in a directory named train_A and unwarped images in a directory named train_B. Your test images are warped images and should go in a directory named test_A. Below is an example dataset directory structure.
```
    .
    ├── ...
    ├── datasets                  
    │   ├── train_A               # warped images
    │   ├── train_B               # unwarped, "ground truth" images
    │   └── test_A                # warped images used for testing
    └── ...
```
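
Before training, it may be worth checking that the directory layout above is in place. The sketch below is not part of this repository: the helper names are hypothetical, the list of image extensions is an assumption, and it assumes each warped image in train_A has a counterpart with the same file name in train_B.

```python
from pathlib import Path

# File extensions treated as images; an assumption of this sketch, not a rule from the repo.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}
EXPECTED_DIRS = ("train_A", "train_B", "test_A")


def image_names(folder):
    """Return the sorted image file names found directly inside folder."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)


def check_dataset(root="datasets"):
    """Return a list of problems found; an empty list means the layout looks consistent."""
    root = Path(root)
    problems = [f"missing directory: {name}" for name in EXPECTED_DIRS
                if not (root / name).is_dir()]
    if problems:
        return problems
    warped = set(image_names(root / "train_A"))
    clean = set(image_names(root / "train_B"))
    if not warped:
        problems.append("train_A contains no images")
    # Assumption: training pairs share the same file name in train_A and train_B.
    for name in sorted(warped - clean):
        problems.append(f"{name} is in train_A but has no counterpart in train_B")
    for name in sorted(clean - warped):
        problems.append(f"{name} is in train_B but has no counterpart in train_A")
    if not image_names(root / "test_A"):
        problems.append("test_A contains no images")
    return problems


if __name__ == "__main__":
    for problem in check_dataset():
        print(problem)
```

Run from the repository root, the script prints one line per problem and nothing when the three directories exist and the training pairs line up.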

Multi-GPU training

Train a model using multiple GPUs (bash ./scripts/train_kaggle_256_multigpu.sh):

#!./scripts/train_kaggle_256_multigpu.sh
python train.py --name kaggle_256_multigpu --label_nc 0 --netG local --ngf 32 --resize_or_crop crop --no_instance --no_flip --fineSize 256 --batchSize 32 --gpu_ids 0,1,2,3,4,5,6,7
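
If you have fewer than eight GPUs, the command above needs a shorter --gpu_ids list and probably a smaller batch size. The sketch below builds the same command for a given GPU count using only the flags shown above. The function name is hypothetical, and keeping 4 images per GPU (32 / 8 in the example) is an assumption, not a recommendation from this project.

```python
def build_multigpu_command(num_gpus, per_gpu_batch=4, name="kaggle_256_multigpu"):
    """Return the train.py argument list for num_gpus GPUs, numbered from 0."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    gpu_ids = ",".join(str(i) for i in range(num_gpus))
    return [
        "python", "train.py",
        "--name", name,
        "--label_nc", "0",
        "--netG", "local",
        "--ngf", "32",
        "--resize_or_crop", "crop",
        "--no_instance",
        "--no_flip",
        "--fineSize", "256",
        # Assumption: scale the total batch size linearly with the number of GPUs.
        "--batchSize", str(per_gpu_batch * num_gpus),
        "--gpu_ids", gpu_ids,
    ]


if __name__ == "__main__":
    # Print the command for a 2-GPU machine; pass the list to subprocess.run to launch it.
    print(" ".join(build_multigpu_command(2)))
```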

Training with Automatic Mixed Precision (AMP) for faster speed

To train with mixed precision support, please first install apex from: https://github.com/NVIDIA/apex
You can then train the model by adding --fp16. For example,

#!./scripts/train_512p_fp16.sh
python -m torch.distributed.launch train.py --name label2city_512p --fp16

In my test case, it trains about 80% faster with AMP on a Volta machine.

More Training/Test Details

Flags: see options/train_options.py and options/base_options.py for all the training flags; see options/test_options.py and options/base_options.py for all the test flags.
Instance map: we take in both label maps and instance maps as input. If you don't want to use instance maps, please specify the flag --no_instance.

Owner

Thomas Huang
