Camelot is a Python library that can help you extract tables from PDFs!

Overview

Camelot: PDF Table Extraction for Humans

tests Documentation Status codecov.io image image image Gitter chat image

Camelot is a Python library that can help you extract tables from PDFs!

Note: You can also check out Excalibur, the web interface to Camelot!


Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables

>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]

>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name KI (1/km) Distance (mi) Percent Fuel Savings
Improved Speed Decreased Accel Eliminate Stops Decreased Idle
2012_2 3.30 1.3 5.9% 9.5% 29.2% 17.4%
2145_1 0.68 11.2 2.4% 0.1% 9.5% 2.7%
4234_1 0.59 58.7 8.5% 1.3% 8.5% 3.3%
2032_2 0.17 57.8 21.7% 0.3% 2.7% 1.2%
4171_1 0.07 173.9 58.1% 1.6% 2.1% 0.5%

Camelot also comes packaged with a command-line interface!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

You can check out some frequently asked questions here.

Why Camelot?

  • Configurability: Camelot gives you control over the table extraction process with tweakable settings.
  • Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
  • Output: Each table is extracted into a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows. You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.

See comparison with similar libraries and tools.

Support the development

If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.

Installation

Using conda

The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:

$ pip install "camelot-py[base]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[base]"

Documentation

The documentation is available at http://camelot-py.readthedocs.io/.

Wrappers

Contributing

The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License, see the LICENSE file for details.

Owner
Camelot and Excalibur: PDF Table Extraction for Humans
pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

pystitcher pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a mark

Nemo 387 Dec 10, 2022
Zen-Knit is a formal (PDF), informal (HTML) report generator for data analyst and data scientist who wants to use python.

About Zen-Knit: Zen-Knit is a formal (PDF), informal (HTML) report generator for data analyst and data scientist who wants to use python. Inspired fro

Zen Reportz 27 Jul 13, 2022
rst2pdf: Use a text editor. Make a PDF.

rst2pdf: Use a text editor. Make a PDF.

rst2pdf 487 Jan 06, 2023
JoplinPdf2Images - Converts a PDF to images in Joplin and adds it to the specified note as a printout

joplinPdf2Images Converts a PDF to images in Joplin and adds it to the specified

Morten Haahr Kristensen 2 Apr 20, 2022
A simple pdf size compressing telegram robot witten in python.

Pdf Compressor Telegram Bot ##About : A simple pdf size compressing telegram robot witten in python. Mostly useful for digital documentation. Deploy t

Renjith Mangal 22 Oct 28, 2022
Table automatically extraction from PDF Document

PDF Table Extractor Table automatically extraction from PDF Document Our Icon 📌 Name : PDF Table Extractor 📌 Authors : Minku Koo Jiyong Park 📌 Deve

1 Jan 10, 2022
Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

1.8k Dec 29, 2022
PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

PDFSanitizer Renders possibly malicious PDF files and outputs harmless PDF files

9 Jan 30, 2022
Busca no nome e conteúdo de arquivos PDF no diretório e subdiretórios.

PDF Finder Este script auxilia na pesquisa em pastas com inúmeros arquivos PDF. A pesquisa é feita em todos os arquivos do doretório e subdiretórios.

William Pilger 1 Nov 27, 2021
Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

Karthikeyan JC 3 Mar 13, 2022
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted. ocrmypdf

8k Jan 08, 2023
Convert Lecture Videos to PDF

Convert Lecture Videos to PDF Description Want to go through lecture videos faster without missing any information? Wish you can read the lecture vide

Emilio Kartono 20 Nov 25, 2022
A simple Python script to convert multiple images (well technically also a single image) into a pdf.

PythonImage2PDF A simple Python script to convert multiple images into a single PDF-document. Created basically for only my own needs for converting m

Joona Gynther 1 Jun 28, 2022
Convert given source code into .pdf with syntax highlighting and more features

Code2pdf 📠 Convert given source code into .pdf with syntax highlighting and more features Build Status Version Downloads Python Demo Installation Bui

Tushar Gautam 343 Jan 05, 2023
Mipdfcompressor - 💕A simple pdf size compressing telegram robot

Pdf Compressor Telegram Bot A simple pdf size compressing telegram robot. Useful for digital documentation. Mandatory Variables API_HASH - Your A

Madhavan Mi 1 Feb 14, 2022
A Python tool to generate a static HTML file that represents the internal structure of a PDF file

PDFSyntax A Python tool to generate a static HTML file that represents the internal structure of a PDF file At some point the low-level functions deve

Martin D. 394 Dec 30, 2022
DietPDF aims at reducing PDF file size while not degrading quality nor losing metadata

DietPDF aims at reducing PDF file size while not degrading quality nor losing metadata

Frédéric BISSON 6 Jul 27, 2022
this is simple program, that converts pdf file to png

author: a5892731 last update:2021-11-01 version: 1.1 resources: -https://pypi.org/project/pdf2image/ -https://github.com/oschwartz10612/poppler-window

1 Nov 01, 2021
borb is a library for reading, creating and manipulating PDF files in python.

borb is a library for reading, creating and manipulating PDF files in python.

Joris Schellekens 2.9k Jan 01, 2023
Python script that split PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

Leandro Padula 5 Apr 02, 2022