This is a Python implementation for converting tables in PDF documents to Excel format using Optical Character Recognition (OCR) and OpenCV. The input PDF document can be found in input/test_input.pdf
. The screenshot of the PDF document used is shown below
The table is extracted and converted to excel in output/pdf2excel.xlsx
.
The code uses python3.6
and the dependencies required are described in requirements.txt
and can be installed using the following command:
pip install -r requirements.txt
The code, as shown in main.ipynb
, consists of a few steps:
- Row detection : The first step is to detect the rows in the table. We use morphological tranformation provided by
OpenCV
to extract horizontal and vertical lines and combine them to detect the table. After detection, we look for contours in the table to detect the rows and crop them.
- OCR : The next step is to take the cropped rows and apply OCR to extract text and bounding box for location. We use
pytesseract
to perform OCR. It's important to to capture the location to account for multiple words in the same column and misssing values.
- Post-processing : The final step is to match the position of bounding boxes with texts and convert the format to an Excel file.