Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Overview

PDFMiner

PDFMiner is a text extraction tool for PDF documents.

Build Status PyPI

Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but this project is largely dormant. For the active project, check out its fork pdfminer.six.

Features:

  • Pure Python (3.6 or above).
  • Supports PDF-1.7. (well, almost)
  • Obtains the exact location of text as well as other layout information (fonts, etc.).
  • Performs automatic layout analysis.
  • Can convert PDF into other formats (HTML/XML).
  • Can extract an outline (TOC).
  • Can extract tagged contents.
  • Supports basic encryption (RC4 and AES).
  • Supports various font types (Type1, TrueType, Type3, and CID).
  • Supports CJK languages and vertical writing scripts.
  • Has an extensible PDF parser that can be used for other purposes.

How to Use:

  1. > pip install pdfminer
  2. > pdf2txt.py samples/simple1.pdf

Command Line Syntax:

pdf2txt.py

pdf2txt.py extracts all the texts that are rendered programmatically. It also extracts the corresponding locations, font names, font sizes, writing direction (horizontal or vertical) for each text segment. It does not recognize text in images. A password needs to be provided for restricted PDF documents.

> pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
             [-O output_dir] [-c encoding] [-s scale] [-R rotation]
             [-Y normal|loose|exact] [-p pagenos] [-m maxpages]
             [-S] [-C] [-n] [-A] [-V]
             [-M char_margin] [-L line_margin] [-W word_margin]
             [-F boxes_flow] [-d]
             input.pdf ...
  • -P password : PDF password.
  • -o output : Output file name.
  • -t text|html|xml|tag : Output type. (default: automatically inferred from the output file name.)
  • -O output_dir : Output directory for extracted images.
  • -c encoding : Output encoding. (default: utf-8)
  • -s scale : Output scale.
  • -R rotation : Rotates the page in degree.
  • -Y normal|loose|exact : Specifies the layout mode. (only for HTML output.)
  • -p pagenos : Processes certain pages only.
  • -m maxpages : Limits the number of maximum pages to process.
  • -S : Strips control characters.
  • -C : Disables resource caching.
  • -n : Disables layout analysis.
  • -A : Applies layout analysis for all texts including figures.
  • -V : Automatically detects vertical writing.
  • -M char_margin : Speficies the char margin.
  • -W word_margin : Speficies the word margin.
  • -L line_margin : Speficies the line margin.
  • -F boxes_flow : Speficies the box flow ratio.
  • -d : Turns on Debug output.

dumppdf.py

dumppdf.py is used for debugging PDFs. It dumps all the internal contents in pseudo-XML format.

> dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
             [-o output] [-r|-b|-t] [-T] [-O directory] [-d]
             input.pdf ...
  • -P password : PDF password.
  • -a : Extracts all objects.
  • -p pageid : Extracts a Page object.
  • -i objid : Extracts a certain object.
  • -o output : Output file name.
  • -r : Raw mode. Dumps the raw compressed/encoded streams.
  • -b : Binary mode. Dumps the uncompressed/decoded streams.
  • -t : Text mode. Dumps the streams in text format.
  • -T : Tagged mode. Dumps the tagged contents.
  • -O output_dir : Output directory for extracted streams.

TODO

  • Replace STRICT variable with something better.
  • Improve the debugging functions.
  • Use logging module instead of sys.stderr.
  • Proper test cases.
  • PEP-8 and PEP-257 conformance.
  • Better documentation.
  • Crypto stream filter support.

Related Projects

A simple pdf size compressing telegram robot witten in python.

Pdf Compressor Telegram Bot ##About : A simple pdf size compressing telegram robot witten in python. Mostly useful for digital documentation. Deploy t

Renjith Mangal 22 Oct 28, 2022
Convert Lecture Videos to PDF

Convert Lecture Videos to PDF Description Want to go through lecture videos faster without missing any information? Wish you can read the lecture vide

Emilio Kartono 20 Nov 25, 2022
Svg2pdfgen - Svg To PDF gen with python

Svg2pdfgen - Svg To PDF gen with python

Robert Urbańczyk 3 May 30, 2022
Program that locks/unlocks pdf files🐍

🐍 📄 PDFtools 📄 🐍 Programa que bloqueia/desbloqueia arquivos pdf Requisitos • Como usar • Capturas de Tela 🚨 Aviso 🚨 Altere os caminhos referente

João Victor Vilela dos Santos 1 Nov 04, 2021
Simple pdf editor while preserving structure and format.

SIMPdf Simple pdf editor while preserving structure and format.

Shashwat Singh 242 Jan 04, 2023
Produce pdf in python backend from simple bootstrap vue frontend and download to browser

vollmacht produce pdf in python backend from simple bootstrap vue frontend and download to browser Frontend in one file with bootstrap-vue (allthough

Otto 1 Nov 08, 2020
pikepdf is a Python library for reading and writing PDF files.

A Python library for reading and writing PDF, powered by qpdf

1.6k Jan 03, 2023
rst2pdf: Use a text editor. Make a PDF.

rst2pdf: Use a text editor. Make a PDF.

rst2pdf 487 Jan 06, 2023
Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene

Jonas Lejon 1.9k Jan 01, 2023
WeasyPrint is a smart solution helping web developers to create PDF documents.

WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous statistical reports, invoices, tickets…

Kozea 5.4k Jan 08, 2023
A bot for PDF for doing Many Things....

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

Mr. Developer 60 Dec 27, 2022
A bulk pdf generator. This application can generate PDFs in bulk by using just one click.

A bulk html pdf generator. This application can generate PDFs in bulk by using just one click. Screenshots Requirements 🧱 Your system must have the f

Aman Nirala 3 Apr 23, 2022
JoplinPdf2Images - Converts a PDF to images in Joplin and adds it to the specified note as a printout

joplinPdf2Images Converts a PDF to images in Joplin and adds it to the specified

Morten Haahr Kristensen 2 Apr 20, 2022
Generate a preview image for a PDF.

PDF ➡️ Preview A simple tool to save me time on Illustrator. Generates a preview image for a PDF file. Useful for sneak peeks to academic publications

David Chuan-En Lin 51 Sep 22, 2022
borb is a library for reading, creating and manipulating PDF files in python.

borb is a library for reading, creating and manipulating PDF files in python.

Joris Schellekens 2.9k Jan 01, 2023
Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

1.8k Dec 29, 2022
PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files

Matthew Stamy 5k Jan 04, 2023
A backend for mdbook in Python for generating PDF based on Chrome DevTools Protocol.

mdbook-pdf A backend for mdbook written in Python for generating PDF based on Chrome DevTools Protocol. Python library dependency Usage Put mdbook-pdf

Hollow Man 49 Dec 27, 2022
A simple Python script to convert multiple images (well technically also a single image) into a pdf.

PythonImage2PDF A simple Python script to convert multiple images into a single PDF-document. Created basically for only my own needs for converting m

Joona Gynther 1 Jun 28, 2022
Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

Karthikeyan JC 3 Mar 13, 2022