Table-Extraction-using-OCR

Description

This is a Python implementation that converts tables in PDF documents to Excel format using Optical Character Recognition (OCR) and OpenCV. The input PDF document can be found in input/test_input.pdf. A screenshot of the PDF document used is shown below.

The table is extracted and written as an Excel file to output/pdf2excel.xlsx.

Requirements

The code uses Python 3.6. The required dependencies are listed in requirements.txt and can be installed with the following command:

    pip install -r requirements.txt

How does it work

The code, as shown in main.ipynb, consists of a few steps:

  1. Row detection: The first step is to detect the rows in the table. We use the morphological transformations provided by OpenCV to extract the horizontal and vertical lines, then combine them to locate the table. After detection, we look for contours in the table to find the rows and crop them.


  2. OCR: The next step is to apply OCR to the cropped rows to extract the text together with a bounding box giving its location. We use pytesseract to perform OCR. It's important to capture the location to account for multiple words in the same column and for missing values.

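As a rough sketch of the OCR step, pytesseract's `image_to_data` returns per-word text along with `left`, `top`, `width`, `height`, and `conf` values. The helper names `words_from_ocr_data` and `ocr_row`, the `min_conf` cutoff, and the output dict layout below are illustrative assumptions, not the notebook's actual code.

```python
def words_from_ocr_data(data, min_conf=0):
    """Convert an image_to_data dict into a list of word boxes."""
    words = []
    fields = zip(data["text"], data["left"], data["top"],
                 data["width"], data["height"], data["conf"])
    for text, left, top, width, height, conf in fields:
        # Page, block, and line entries carry empty text and a confidence of -1.
        # conf may arrive as a string or a number depending on the version.
        if text.strip() and float(conf) >= min_conf:
            words.append({
                "text": text.strip(),
                "left": int(left),
                "top": int(top),
                "right": int(left) + int(width),
                "bottom": int(top) + int(height),
            })
    return words


def ocr_row(row_image):
    """Run Tesseract on one cropped row and return its word boxes."""
    # Imported here so the parsing helper works without Tesseract installed.
    import pytesseract

    data = pytesseract.image_to_data(row_image, output_type=pytesseract.Output.DICT)
    return words_from_ocr_data(data)
```

Keeping the parsing separate from the Tesseract call makes it possible to check the box handling on hand-written data before running OCR on real images.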

  3. Post-processing: The final step is to match the bounding-box positions with their text and convert the result to an Excel file.
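One possible way to do the position matching is sketched below. It assumes each word is a dict with `text`, `left`, and `right` keys and that the column x-boundaries (`column_edges`) are already known, for example from the detected vertical lines; both are assumptions for illustration, as are the function names. Using pandas to write the file is likewise an assumed choice, not necessarily what main.ipynb does.

```python
from bisect import bisect_right


def words_to_cells(words, column_edges):
    """Place word boxes into columns; columns with no words stay as ''."""
    cells = [[] for _ in range(len(column_edges) - 1)]
    for word in sorted(words, key=lambda w: w["left"]):
        # Use the horizontal centre of the box to pick its column.
        center = (word["left"] + word["right"]) / 2
        col = bisect_right(column_edges, center) - 1
        if 0 <= col < len(cells):
            cells[col].append(word["text"])
    # Several words in one column are joined left to right.
    return [" ".join(parts) for parts in cells]


def write_excel(table, path):
    """Write a list of rows (each a list of cell strings) to an .xlsx file."""
    # pandas plus an Excel engine such as openpyxl is an assumed dependency.
    import pandas as pd

    pd.DataFrame(table).to_excel(path, index=False, header=False)
```

For example, with `column_edges = [0, 100, 200, 300]`, the words "Unit" and "price" in the first column and "9.50" in the third would give `["Unit price", "", "9.50"]`, so the missing middle value keeps its place in the row.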