thailand-budget-pdf2csv

Let's create a tool to convert Thailand Government Budgeting from PDF to CSV!

รวมพลัง Dev แปลงงบ จาก PDF สู่ Machine-readable

เพื่อการตรวจสอบงบประมาณแผ่นดินที่ง่ายมากขึ้น

Usage

PDF -> TXT

You can download the results and see the source code in each approach under ./txt-extraction folder, or, just download output files from shortcut links below:

tee4cute-gcloud-vision: Google Drive folder.

TXT -> CSV

You can download the results and see the source code in each approach under ./csv-extraction folder, or, just download output files from shortcut links below:

napatswift-coordintes: Google Drive folder.

Translations

English version

napatswift-coordintes (partially translated using Google Translation API): Google Sheet, see @asiripanich's repo for code.

Let's Code!

Download source budget PDF files from budget-pdf (เล่มขาวคาดแดง) and do some secret magics to generate output csv files with exepcted format below:

Expected Output Format (V2)

Field Name	Formal Thai Name	Data Type / Format	Description	Since Version
`ITEM_ID`	-	str / [`REF_DOC`].[RUNNING_NO]	Unique Id ของแต่ละ row, สำหรับ `REF_DOC` = ดูที่ field `REF_DOC`, RUNNING_NO = เลข running no ของแต่ละ row ในเล่มงบ (pdf) ไฟล์นั้น ๆ	v1
`REF_DOC`	-	str / [FY].[ฉบับ].[เล่ม]	เลขที่เอกสารเล่มงบ (pdf), [FY]=ปีงบประมาณของเล่มงบ, [ฉบับ]=ฉบับที่, [เล่ม]=เล่มที่ (บางเล่มจะมีวงเล็บต่อท้ายด้วย)	v1
`REF_PAGE_NO`	-	int	หน้าของเอกสารในเล่มงบที่แสดงอยู่บริเวณหัวกระดาษของ row นั้น (โปรดระวัง! เกือบทุกกรณี หน้าเอกสารจะไม่ใช่ pdf page)	v1
`MINISTRY`	กระทรวง/หน่วยงานเทียบเท่ากระทรวง	str		v1
`BUDGETARY_UNIT`	หน่วยรับงบประมาณ	str	ส่วนใหญ่เป็นกรม/หน่วยงานเทียบเท่ากรม	v1
`CROSS_FUNC?`		bool	เป็น row (งบประมาณ) ภายใต้แผนงานบูรณาการ ใช่หรือไม่?, แผนงานบูรณาการ หมายถึง แผนงานที่มีชื่อขึ้นต้นด้วยคำว่า "แผนงานบูรณาการ", See: `BUDGET_PLAN`	v1
`BUDGET_PLAN`	แผนงาน	str	ชื่อแผนงานตาม พ.ร.บ.วิธีการงบประมาณฯ	v1
`OUTPUT`	ผลผลิต	str	ภายใต้แผนงานจะมี `0-n` ผลผลิต/โครงการ, 1 row จะสามารถอยู่ภายใต้ 1 ผลผลิต `XOR` 1 โครงการ อย่างใดอย่างหนึ่ง	v1
`PROJECT`	โครงการ	str	ภายใต้แผนงานจะมี `0-n` ผลผลิต/โครงการ, 1 row จะสามารถอยู่ภายใต้ 1 ผลผลิต `XOR` 1 โครงการ อย่างใดอย่างหนึ่ง	v1
`CATEGORY_LV1`	งบรายจ่าย	str	หมวดงบรายจ่าย `level-1` จะประกอบไปด้วย งบบุคลากร, งบดำเนินงาน, งบลงทุน, งบเงินอุดหนุน, งบรายจ่ายอื่น เท่านั้น (ยกเว้น "งบกลาง" ที่อาจมีรายการอื่น ๆ นอกเหนือจากนี้ได้)	v1
`CATEGORY_LV2`	งบรายจ่าย	str	หมวดงบรายจ่าย `level-2`, ในเอกสาร pdf จะปรากฏอยู่ใน line item ที่มีเลข (ordered list) นำหน้าอยู่ใน format `x.y.z`	v1
`CATEGORY_LV3`	งบรายจ่าย	str	หมวดงบรายจ่าย `level-3`, ในเอกสาร pdf จะปรากฏอยู่ใน line item ที่มีเลข (ordered list) นำหน้าอยู่ใน format `x.y.z`	v1
`CATEGORY_LV4`	งบรายจ่าย	str	หมวดงบรายจ่าย `level-4`, ในเอกสาร pdf จะปรากฏอยู่ใน line item ที่มีเลข (ordered list) นำหน้าอยู่ใน format `x.y.z`	v1
`CATEGORY_LV5`	งบรายจ่าย	str	หมวดงบรายจ่าย `level-5`, ในเอกสาร pdf จะปรากฏอยู่ใน line item ที่มีเลข (ordered list) นำหน้าอยู่ใน format `x.y.z`	v1
`CATEGORY_LV6`	งบรายจ่าย	str	หมวดงบรายจ่าย `level-6`, ในเอกสาร pdf จะปรากฏอยู่ใน line item ที่มีเลข (ordered list) นำหน้าอยู่ใน format `x.y.z`	v1
`ITEM_DESCRIPTION`	-	str	ชื่อรายการ, ในเอกสาร pdf จะปรากฏอยู่ใน line item ที่มีเลข (ordered list) นำหน้าอยู่ใน format `(x)`, บาง row อาจไม่มี `ITEM_DESCRIPTION` ก็ได้	v1
`FISCAL_YEAR`	ปีงบประมาณ	str / ปี ค.ศ.	มีโอกาสที่ 1 line item อาจมีหลาย row ได้หากรายการนั้นเป็นรายการ งบผูกพัน	v1
`AMOUNT`	-	float	จำนวนเงินงบประมาณ	v1
`OBLIGED?`	-	bool	มีค่าเป็น TRUE ก็ต่อเมื่อ เป็น line item ที่มีข้อมูลหลาย row `FISCAL_YEAR`	v1
`DEBUG_LOG`	-	str	Log message สำหรับแจ้ง error ที่เกิดขึ้นระหว่างการ extract row นั้น ๆ	v2

Note: Please see output example in output_example_vx.xlsx and output_example_vx.csv at repository root.

Release Notes

29 Jul 2021

Send messages to DEBUG_LOG to cleary inform user about the source of error where it was orignated from: Syntactic Error or OCR Error.
- Invalid CATEGORY_LV1 values will be reported in DEBUG_LOG as follows: "CATEGORY_LV1 is not as described". issue#15-comment
- Invalid AMOUNT values will be reported in DEBUG_LOG as follows: "AMOUNT FORMAT IS WRONG".

25 Jul 2021

Fix some of Syntactic Errors reported by issue#15.
Fix Compiler Error for wrong AMOUNT output on obliged item written in "XXXX - YYYY ZZZZ บาท" format.
- For example, if the obliged entry is written as "2562 - 2564 30,000,000 บาท", the output will be:
```
  2562    10,000,000
  2563    10,000,000
  2564    10,000,000
```
  instead of
```
  2562    30,000,000
  2563    30,000,000
  2564    30,000,000
```
Sending OCR Error reported by issue#11 to DEBUG_LOG to make it clear that the error was originated from the OCR Tool and needed to be cleaned by hand.

21 Jul 2021

First version release
You can download the first version in CSV format here.

Powered by This Dataset

Budget Overview by korlan rayong

https://public.tableau.com/app/profile/korlan.rayong2953/viz/OverviewBudget65/Dashboard1
2022 Thai Budget Structure by Thanawit Prasongpongchai

Visualization: https://taepras.github.io/thaibudget65 Repository: https://github.com/taepras/thaibudget65

Talk

"ก้าวGeek Community", Line Group: http://line.me/ti/g/STUxfMX87U

Let's create a tool to convert Thailand budget from PDF to CSV.

Related tags

Overview

thailand-budget-pdf2csv

Let's create a tool to convert Thailand Government Budgeting from PDF to CSV!

Usage

PDF -> TXT

TXT -> CSV

Translations

English version

Let's Code!

Expected Output Format (V2)

Release Notes

29 Jul 2021

25 Jul 2021

21 Jul 2021

Powered by This Dataset

Talk

Owner

Kao.Geek

PyTorch Implementation of PIXOR: Real-time 3D Object Detection from Point Clouds

graph-theoretic framework for robust pairwise data association

StarGAN - Official PyTorch Implementation (CVPR 2018)

a Lightweight library for sequential learning agents, including reinforcement learning

BaseCls BaseCls 是一个基于 MegEngine 的预训练模型库，帮助大家挑选或训练出更适合自己科研或者业务的模型结构

A free, multiplatform SDK for real-time facial motion capture using blendshapes, and rigid head pose in 3D space from any RGB camera, photo, or video.

Deep Q-network learning to play flappybird.

This repository is for Competition for ML_data class

Code for "NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video", CVPR 2021 oral

The official implementation of NeurIPS 2021 paper: Finding Optimal Tangent Points for Reducing Distortions of Hard-label Attacks

Bayesian Meta-Learning Through Variational Gaussian Processes

TimeSHAP explains Recurrent Neural Network predictions.

Official code for paper Exemplar Based 3D Portrait Stylization.

3rd place solution for the Weather4cast 2021 Stage 1 Challenge

Monocular 3D Object Detection: An Extrinsic Parameter Free Approach (CVPR2021)

PyTorch reimplementation of REALM and ORQA

Implementation of SwinTransformerV2 in TensorFlow.

Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

PyJokes - Joking around with Python library pyjokes

Multi-modal co-attention for drug-target interaction annotation and Its Application to SARS-CoV-2