ajay960singh/Table-Extraction-using-OCR

Table-Extraction-using-OCR

Description

This is a Python implementation that converts tables in PDF documents to Excel format using Optical Character Recognition (OCR) and OpenCV. The sample input PDF document is at input/test_input.pdf. A screenshot of the PDF document used is shown below.

The extracted table is saved as an Excel file at output/pdf2excel.xlsx.

Requirements

The code uses Python 3.6. The required dependencies are listed in requirements.txt and can be installed with the following command:

    pip install -r requirements.txt

How it works

The code, as shown in main.ipynb, consists of three steps:

  1. Row detection: The first step is to detect the rows in the table. We use morphological transformations provided by OpenCV to extract the horizontal and vertical lines and combine them to detect the table. We then look for contours within the detected table to locate the rows and crop them.

Screenshot 2020-08-13 at 3 31 25 PM
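
The row-detection step can be sketched roughly as follows. This is a minimal illustration, not the code from main.ipynb: it assumes the PDF page has already been rendered to a grayscale image, and the function names, the `scale` parameter, and the `tolerance` parameter are all hypothetical. OpenCV is imported inside the functions so that the pure-Python grouping helper can be used on its own.

```python
def detect_table_lines(gray, scale=40):
    """Return a binary mask containing only the table's ruling lines.

    `gray` is assumed to be an 8-bit grayscale image (a NumPy array).
    """
    import cv2  # imported lazily; requires the opencv-python package

    # Invert and binarise so that ink becomes white on a black background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    height, width = binary.shape

    # Long, thin kernels: opening keeps only strokes at least as long as the kernel,
    # which removes text and leaves the horizontal or vertical lines.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (max(width // scale, 1), 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(height // scale, 1)))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

    # Combine both sets of lines into a single grid mask.
    return cv2.bitwise_or(horizontal, vertical)


def group_boxes_into_rows(boxes, tolerance=10):
    """Merge (x, y, w, h) boxes whose top edges lie within `tolerance` pixels.

    Returns one (x, y, w, h) box per row, ordered from top to bottom.
    """
    rows = []  # each entry is the list of boxes that make up one row
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(box[1] - rows[-1][0][1]) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])

    merged = []
    for row in rows:
        x0 = min(b[0] for b in row)
        y0 = min(b[1] for b in row)
        x1 = max(b[0] + b[2] for b in row)
        y1 = max(b[1] + b[3] for b in row)
        merged.append((x0, y0, x1 - x0, y1 - y0))
    return merged


def detect_row_boxes(gray, tolerance=10):
    """Find the cells enclosed by the grid lines and merge them into row boxes."""
    import cv2

    mask = detect_table_lines(gray)
    # [-2:] copes with both the OpenCV 3.x and 4.x return signatures.
    contours, hierarchy = cv2.findContours(
        mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE
    )[-2:]
    if hierarchy is None:
        return []
    # Contours that have a parent are holes inside the table outline, i.e. cells.
    cells = [
        cv2.boundingRect(contour)
        for contour, info in zip(contours, hierarchy[0])
        if info[3] != -1
    ]
    return group_boxes_into_rows(cells, tolerance)
```

Each returned (x, y, w, h) box can then be cropped from the page with `gray[y:y + h, x:x + w]` and passed on to the OCR step.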

  2. OCR: The next step is to run OCR on the cropped rows to extract the text together with a bounding box giving its location. We use pytesseract to perform OCR. It is important to capture the location in order to handle multiple words in the same column and missing values.

Screenshot 2020-08-13 at 4 00 30 PM
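
As an illustration of the OCR step, the sketch below uses `pytesseract.image_to_data`, which returns the recognised words along with their bounding boxes. It is a sketch under assumptions rather than the notebook's code: the helper names `words_with_boxes` and `ocr_row` and the `min_conf` threshold are hypothetical, and running `ocr_row` requires the Tesseract binary to be installed.

```python
def words_with_boxes(data, min_conf=0):
    """Turn the dict returned by pytesseract.image_to_data into word records.

    Each record holds the text and the position of its bounding box.
    Records are returned ordered from left to right.
    """
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        text = text.strip()
        # Layout-only entries have empty text and a confidence of -1.
        if not text or float(conf) < min_conf:
            continue
        words.append({
            "text": text,
            "left": int(left),
            "top": int(top),
            "width": int(width),
            "height": int(height),
        })
    return sorted(words, key=lambda word: word["left"])


def ocr_row(row_image):
    """Run Tesseract on one cropped row and return its words with positions."""
    import pytesseract  # imported lazily; needs the Tesseract binary installed
    from pytesseract import Output

    data = pytesseract.image_to_data(row_image, output_type=Output.DICT)
    return words_with_boxes(data)
```

Keeping the `left` and `width` of every word is what later allows several words to be joined into one cell and an empty cell to be recognised as a missing value.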

  3. Post-processing: The final step is to match the positions of the bounding boxes with the text and write the result to an Excel file.
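
One possible way to do this matching is sketched below. It assumes that the x-coordinates of the column boundaries are already known (for example from the vertical lines found during detection) and that words arrive as dicts with `text`, `left`, and `width` keys. The names `assign_to_columns`, `write_excel`, and `column_edges` are hypothetical, and the notebook may do this differently.

```python
import bisect


def assign_to_columns(words, column_edges):
    """Place each word in a column according to the centre of its bounding box.

    `column_edges` is a sorted list of x-coordinates; n edges define n - 1 columns.
    Words in the same column are joined with spaces, and a column that
    receives no word yields an empty string (a missing value).
    """
    cells = [[] for _ in range(len(column_edges) - 1)]
    for word in sorted(words, key=lambda w: w["left"]):
        centre = word["left"] + word["width"] / 2
        index = bisect.bisect_right(column_edges, centre) - 1
        if 0 <= index < len(cells):  # ignore words outside the table
            cells[index].append(word["text"])
    return [" ".join(cell) for cell in cells]


def write_excel(rows, path, header=None):
    """Write a list of row lists to an .xlsx file."""
    import pandas as pd  # imported lazily; .xlsx output also needs openpyxl

    frame = pd.DataFrame(rows, columns=header)
    frame.to_excel(path, index=False, header=header is not None)
```

For example, with column edges at 0, 100, 200, and 300, the words "New" and "York" that both fall between 100 and 200 end up in the same cell, and a column with no words stays empty rather than shifting the later values to the left.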
