Extracting Tables from Documents (GUI)
This is a script based on wxPython and PyMuPDF for browsing a document and extracting tables from it. It uses `ParseTab`, which is contained in the same (examples) directory.
The script first presents a file selection dialog to pick a document.
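Opening the picked document includes a password prompt when the file is encrypted. The following is only a minimal sketch of that step, not the script's actual code. It assumes an object that behaves like a PyMuPDF document (a `needs_pass` attribute and an `authenticate()` method that returns 0 on failure); `unlock`, `ask_password` and `max_tries` are hypothetical names chosen for illustration.

```python
def unlock(doc, ask_password, max_tries=3):
    """Return True once `doc` is readable, False if the user gives up.

    `doc` is assumed to behave like a PyMuPDF Document: it exposes
    `needs_pass` and `authenticate(password)`, where a return value
    of 0 means the password was wrong.
    `ask_password` is a hypothetical callback (for example one wrapping
    a wx password dialog) that returns the entered string, or None if
    the user cancelled.
    """
    if not doc.needs_pass:            # not password-protected
        return True
    for _ in range(max_tries):
        password = ask_password()
        if password is None:          # user cancelled the prompt
            return False
        if doc.authenticate(password):
            return True
    return False                      # too many wrong attempts
```

With PyMuPDF installed, this would be called on the result of opening the selected file, and the first page would only be rendered if the function returns True.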
If the document is encrypted, a decryption password is requested. The document's first page is then displayed in another dialog. A number of controls along the top and left sides of this dialog do the things listed below. Which controls are available depends on the situation: for example, you can only add a column with `New Col` after a rectangle has been painted.
- You can browse forward and backward in the document using the buttons or the mouse wheel.
- You can jump to a specific page.
- You can paint a rectangle on a displayed page using the `New Rect` button and fine-tune it with the spin controls. Pressing the `New Rect` button again destroys any rectangles and columns. The same happens if you leave the page.
- After a rectangle has been painted, you can paint one or more columns into it via the `New Col` button. Columns are shown as vertical lines. You can modify a column by selecting it in the choice box and using the spin control. The column being changed in this way switches its colour from red to blue. A column can be deleted by entering "0" or a value outside the rectangle's borders in the spin control.
- You can also change a rectangle via the spin controls after columns have been painted. This does not affect the columns, except that a column whose coordinate ends up outside the rectangle area is deleted.
- You can also move a rectangle around with the mouse (left button held down). In this case, any columns move with it and are not deleted.
- Any time after a rectangle has been painted, you can parse the text that it surrounds by pressing the `Get Table` button. The current script just prints the table to STDOUT when you do this (see the following example screens). You can press this button repeatedly, e.g. to check the effect of new or deleted columns.
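The rules above for keeping columns consistent with their rectangle can be sketched in a few lines. This is an illustration under assumptions, not the script's own code: a rectangle is taken to be a plain `(x0, y0, x1, y1)` tuple, a column is just its x-coordinate, and `prune_columns` and `move_rect` are hypothetical helper names. Treating a column that sits exactly on a border as "outside" is also an assumption.

```python
def prune_columns(rect, columns):
    """Drop columns that no longer lie inside the rectangle.

    Models resizing via the spin controls: columns are untouched
    unless their x-coordinate leaves the rectangle area.
    """
    x0, _, x1, _ = rect
    return [x for x in columns if x0 < x < x1]


def move_rect(rect, columns, dx, dy):
    """Shift the rectangle by (dx, dy) and carry all columns along.

    Models dragging with the mouse: columns keep their position
    relative to the rectangle, so none of them is deleted.
    """
    x0, y0, x1, y1 = rect
    moved = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    return moved, [x + dx for x in columns]


rect = (100, 50, 400, 300)
columns = [150, 250, 350]

# Shrinking the right border to x = 300 removes the column at 350.
print(prune_columns((100, 50, 300, 300), columns))   # [150, 250]

# Dragging the rectangle keeps all three columns.
print(move_rect(rect, columns, 20, -10))
# ((120, 40, 420, 290), [170, 270, 370])
```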
Displaying page 253 of Adobe's PDF manual:
After painting a rectangle around TABLE 4.16 and pressing `Get Table`, the table's content is displayed using automatic column detection:
After painting additional columns into the rectangle and pressing `Get Table` again, a slightly different analysis of the table is displayed, based on the column information supplied:
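Why the two runs can differ may be easier to see with a simplified model. The sketch below is not the `ParseTab` algorithm. It assumes word tuples shaped like `(x0, y0, x1, y1, text, ...)`, the layout PyMuPDF uses for `page.get_text("words")`, and a hypothetical function `words_to_table`. Without supplied columns it uses a crude stand-in for automatic detection (every distinct left edge starts a column); with supplied columns, the given x-coordinates are the only cell boundaries.

```python
from bisect import bisect_right


def words_to_table(words, rect, columns=None, y_tol=3.0):
    """Arrange the words inside `rect` into rows of cell strings."""
    x0, y0, x1, y1 = rect
    # Keep only words that lie completely inside the rectangle.
    inside = [w for w in words
              if x0 <= w[0] and w[2] <= x1 and y0 <= w[1] and w[3] <= y1]
    inside.sort(key=lambda w: (w[3], w[0]))

    # Words whose bottom edges are within y_tol form one row.
    rows = []
    for w in inside:
        if rows and abs(w[3] - rows[-1][0]) <= y_tol:
            rows[-1][1].append(w)
        else:
            rows.append((w[3], [w]))

    if columns is None:
        # Simplified automatic detection: each distinct left edge
        # (rounded) is treated as the start of a column.
        starts = sorted({round(w[0]) for w in inside})
        bounds = starts[1:]
    else:
        bounds = sorted(columns)

    table = []
    for _, row_words in rows:
        cells = [[] for _ in range(len(bounds) + 1)]
        for w in sorted(row_words, key=lambda w: w[0]):
            cells[bisect_right(bounds, round(w[0]))].append(w[4])
        table.append([" ".join(c) for c in cells])
    return table


words = [
    (100, 10, 130, 20, "Key"), (200, 10, 240, 20, "Value"),
    (100, 25, 130, 35, "Type"), (200, 25, 230, 35, "name"),
    (235, 25, 280, 35, "object"), (500, 25, 530, 35, "outside"),
]
rect = (90, 5, 300, 40)

print(words_to_table(words, rect))
# [['Key', 'Value', ''], ['Type', 'name', 'object']]
print(words_to_table(words, rect, columns=[180]))
# [['Key', 'Value'], ['Type', 'name object']]
```

In this toy data the automatic run splits "name object" into two cells because each word starts at a new x-position, while the single supplied column at x = 180 keeps the two words together in one cell.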
- `ParseTab`, and therefore also `wxTableExtract`, are not OCR programs. They are text extraction programs, so any images will be ignored.
- If a logical table is physically spread across more than one page of the document, it is up to you to bind the parts together by any logic invoked by `Get Table`.
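One possible way to bind such parts together is sketched below. It assumes each page's result is already available as a list of rows (each row a list of cell strings) and that a continuation page may repeat the header row; both are assumptions, and `join_table_parts` is a hypothetical helper, not part of the script.

```python
def join_table_parts(parts, repeated_header=True):
    """Concatenate per-page row lists into a single table.

    If `repeated_header` is true, the first row of a later part is
    dropped when it equals the first row of the first non-empty part.
    """
    table = []
    header = None
    for rows in parts:
        if not rows:
            continue                      # nothing extracted on this page
        if header is None:
            header = rows[0]
            table.extend(rows)
        elif repeated_header and rows[0] == header:
            table.extend(rows[1:])        # skip the repeated header row
        else:
            table.extend(rows)
    return table


page1 = [["Key", "Value"], ["a", "1"]]
page2 = [["Key", "Value"], ["b", "2"]]
print(join_table_parts([page1, page2]))
# [['Key', 'Value'], ['a', '1'], ['b', '2']]
```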