Table-Extraction-using-OCR

Description

This is a Python implementation for converting tables in PDF documents to Excel format using Optical Character Recognition (OCR) and OpenCV. The input PDF document can be found in input/test_input.pdf. A screenshot of the PDF document used is shown below.

The table is extracted and written to an Excel file at output/pdf2excel.xlsx.

Requirements

The code uses Python 3.6. The required dependencies are listed in requirements.txt and can be installed with the following command:

    pip install -r requirements.txt

How does it work

The code, as shown in main.ipynb, consists of a few steps:

Row detection : The first step is to detect the rows in the table. We use the morphological transformations provided by OpenCV to extract horizontal and vertical lines, then combine them to detect the table. After detection, we look for contours in the table to detect the rows and crop them.

OCR : The next step is to take the cropped rows and apply OCR to extract the text along with a bounding box giving its location. We use pytesseract to perform OCR. It's important to capture the location in order to handle multiple words in the same column and missing values.
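One way to obtain both text and bounding boxes from pytesseract is its `image_to_data` function with dictionary output. The sketch below is an illustration under that assumption, not the notebook's code: the helper names `words_from_data` and `ocr_row` and the `min_conf` parameter are hypothetical. Parsing is kept separate from the OCR call so the filtering logic can be checked without Tesseract installed.

```python
def words_from_data(data, min_conf=0.0):
    """Turn the dict returned by pytesseract.image_to_data into a list of words.

    Entries with empty text (layout-only records) or a confidence below
    `min_conf` are dropped; Tesseract reports -1 for non-word entries.
    """
    words = []
    fields = zip(data["text"], data["conf"], data["left"],
                 data["top"], data["width"], data["height"])
    for text, conf, left, top, width, height in fields:
        text = text.strip()
        # conf may arrive as a string, int, or float depending on the version.
        if text and float(conf) >= min_conf:
            words.append({
                "text": text,
                "left": int(left),
                "top": int(top),
                "width": int(width),
                "height": int(height),
            })
    return words


def ocr_row(row_image, min_conf=0.0):
    """Run Tesseract on one cropped row and return its words with positions."""
    # Imported here so that words_from_data stays usable without Tesseract.
    import pytesseract

    data = pytesseract.image_to_data(row_image, output_type=pytesseract.Output.DICT)
    return words_from_data(data, min_conf=min_conf)
```

Each returned word keeps its `left` and `width`, which is what a later step needs to decide which column the word belongs to.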

Post-processing : The final step is to use the position of each bounding box to assign its text to the right column, and then write the resulting table to an Excel file.
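As a rough illustration of this step, the sketch below assigns each word to a column by the horizontal center of its bounding box, joins multiple words that fall in the same column, and leaves an empty string where a column has no text. The names `row_to_cells` and `write_excel` are hypothetical, and so is the assumption that the column boundaries are known as a sorted list of x-coordinates (for example, taken from the detected vertical lines); the notebook may do this differently.

```python
from bisect import bisect_right


def row_to_cells(words, column_edges):
    """Map OCR words to table cells for one row.

    words: dicts with "text", "left" and "width" keys (pixel units).
    column_edges: sorted x-coordinates; column i spans edges[i] to edges[i + 1].
    """
    cells = [[] for _ in range(len(column_edges) - 1)]
    # Left-to-right order keeps multi-word cells readable.
    for word in sorted(words, key=lambda w: w["left"]):
        center = word["left"] + word["width"] / 2
        col = bisect_right(column_edges, center) - 1
        if 0 <= col < len(cells):
            cells[col].append(word["text"])
    # Columns with no words become empty strings (missing values).
    return [" ".join(parts) for parts in cells]


def write_excel(table_rows, path):
    """Write a list of rows (lists of cell strings) to an .xlsx file."""
    # pandas needs an Excel engine such as openpyxl for .xlsx output.
    import pandas as pd

    pd.DataFrame(table_rows).to_excel(path, index=False, header=False)
```

For example, `write_excel([row_to_cells(words, edges) for words in all_rows], "output/pdf2excel.xlsx")` would produce one spreadsheet row per detected table row, where `all_rows` and `edges` stand for the per-row OCR results and the column boundaries.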

ajay960singh/Table-Extraction-using-OCR
