Table-Extraction-using-OCR

Description

This is a Python implementation for converting tables in PDF documents to Excel format using Optical Character Recognition (OCR) and OpenCV. The input PDF document can be found in input/test_input.pdf. A screenshot of that PDF document is shown below.

The table is extracted and written as an Excel file to output/pdf2excel.xlsx.

Requirements

The code uses Python 3.6. The required dependencies are listed in requirements.txt and can be installed with the following command:

    pip install -r requirements.txt

How does it work

The code, as shown in main.ipynb, consists of three steps:

Row detection: The first step is to detect the rows in the table. We use the morphological transformations provided by OpenCV to extract horizontal and vertical lines, then combine them to detect the table. After detection, we look for contours in the table to find the rows and crop them.

OCR: The next step is to take the cropped rows and apply OCR to extract the text along with bounding boxes that record its location. We use pytesseract to perform OCR. It's important to capture the location in order to account for multiple words in the same column and for missing values.
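As a hedged illustration of this step, the sketch below reads words and their bounding boxes from one cropped row using pytesseract's `image_to_data`. The `Word` type, the helper names, and the `min_conf` cutoff are assumptions for the example and are not taken from main.ipynb.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One recognised word and its bounding box inside the row image."""
    text: str
    left: int
    top: int
    width: int
    height: int


def words_from_data(data, min_conf=0.0):
    """Keep the recognised words from an image_to_data-style dict."""
    words = []
    for i, raw in enumerate(data["text"]):
        text = raw.strip()
        # Structural entries (page, block, line) come back with empty text
        # and a confidence of -1, so both checks drop them.
        if not text or float(data["conf"][i]) < min_conf:
            continue
        words.append(
            Word(
                text,
                int(data["left"][i]),
                int(data["top"][i]),
                int(data["width"][i]),
                int(data["height"][i]),
            )
        )
    return words


def ocr_row(row_image, min_conf=0.0):
    """Run Tesseract on one cropped row and return its words with boxes."""
    # Imported here so words_from_data stays usable without Tesseract installed.
    import pytesseract

    data = pytesseract.image_to_data(
        row_image, output_type=pytesseract.Output.DICT
    )
    return words_from_data(data, min_conf)
```

Keeping the `left` and `width` of every word is what later makes it possible to tell which column a word belongs to, and to notice when a column has no word at all.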

Post-processing: The final step is to match the positions of the bounding boxes with their text and write the result to an Excel file.
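One way to picture the post-processing step is shown below. It is a sketch under stated assumptions, not the notebook's implementation: it assumes the column boundaries are already known as a list of x-coordinates (`column_edges`, a hypothetical input that could come from the detected vertical lines), and it assigns each word to a column by the centre of its bounding box. The `save_rows` helper uses pandas, which needs an Excel engine such as openpyxl to write .xlsx files.

```python
from bisect import bisect_right


def row_to_cells(words, column_edges):
    """Place the OCR words of one row into table columns by x-position.

    words:        iterable of (text, left, width) tuples for one cropped row.
    column_edges: sorted x-coordinates of the column boundaries, so
                  n + 1 edges describe n columns.
    """
    cells = [[] for _ in range(len(column_edges) - 1)]
    # Sort by left edge so words in the same cell keep their reading order.
    for text, left, width in sorted(words, key=lambda w: w[1]):
        center = left + width / 2
        col = bisect_right(column_edges, center) - 1
        if 0 <= col < len(cells):  # ignore words outside the table
            cells[col].append(text)
    # Several words in one column are joined; an empty column stays "".
    return [" ".join(parts) for parts in cells]


def save_rows(rows, path, header=None):
    """Write the list of rows to an Excel file with pandas."""
    import pandas as pd  # imported lazily; only needed when writing

    frame = pd.DataFrame(rows, columns=header)
    frame.to_excel(path, index=False, header=header is not None)
```

For example, with edges `[0, 100, 200, 300]`, the words "New" and "York" both fall in the first column and are joined into one cell, while a row with no word between x = 100 and x = 200 gets an empty second cell instead of shifting the later values left.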
