Comparing changes

base repository: chezou/tabula-py
base: v2.8.2rc
head repository: chezou/tabula-py
compare: master

Commits on Sep 27, 2023

  1. Update Custom.md

    Encourage to use GitHub discussions.
    chezou authored Sep 27, 2023

    Verified

    This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
    ac0b92f

Commits on Nov 11, 2023

  1. Update and rename Bug_report.md to Bug_report.yaml

    chezou authored Nov 11, 2023
    cae9d74
  2. Update Bug_report.yaml

    chezou authored Nov 11, 2023
    2c4040b
  3. Update Bug_report.yaml

    chezou authored Nov 11, 2023
    df5c33c
  4. Update Bug_report.yaml

    chezou authored Nov 11, 2023
    ca4574c
  5. Update and rename Feature_request.md to Feature_request.yaml

    chezou authored Nov 11, 2023
    02859d6
  6. Delete .github/workflows/autocloser.yml

    chezou authored Nov 11, 2023
    0e73e26

Commits on Nov 12, 2023

  1. Update Bug_report.yaml

    chezou authored Nov 12, 2023
    f327c84
  2. Delete .github/ISSUE_TEMPLATE.md

    chezou authored Nov 12, 2023
    9b6e952

Commits on Nov 19, 2023

  1. Make jpype optional

    chezou committed Nov 19, 2023
    a2aaf76
  2. Merge pull request #369 from chezou/opt-jpype

    Make jpype optional
    chezou authored Nov 19, 2023
    76db276

Commits on Nov 20, 2023

  1. Set encoding on SubprocessTabula initialization

    chezou committed Nov 20, 2023
    5c23bb2
  2. Set encoding on SubprocessTabula initialization

    chezou committed Nov 20, 2023
    782793d
  3. Merge pull request #371 from chezou/fix-encoding

    Set encoding on SubprocessTabula initialization
    chezou authored Nov 20, 2023
    635e51a
  4. Support Python 3.12

    chezou committed Nov 20, 2023
    9de531c
  5. Merge pull request #370 from chezou/py312

    Support Python 3.12
    chezou authored Nov 20, 2023
    cefa1f4

Commits on Jan 16, 2024

  1. Update README.md

    chezou authored Jan 16, 2024
    49a88ba
  2. Use Trusted Publisher

    chezou committed Jan 16, 2024
    3cf4bcf

Commits on Mar 10, 2024

  1. Update encoding everytime when SubprocessTabule is initialized

    chezou committed Mar 10, 2024
    72d9234
  2. Merge pull request #378 from chezou/update-encoding

    Update encoding everytime when SubprocessTabule is initialized
    chezou authored Mar 10, 2024
    235e25c

Commits on Mar 16, 2024

  1. e8036b7
  2. Add Python 3.12 test with jpype

    chezou committed Mar 16, 2024
    301da37
  3. Merge pull request #380 from chezou/to_numeric

    Suppress pandas FutureWarning of to_numeric with errors='ignore'
    chezou authored Mar 16, 2024
    3861295

Commits on May 14, 2024

  1. Build separately

    chezou committed May 14, 2024
    cdacc29
  2. Merge pull request #386 from chezou/fix-publish

    Build separately on Publish action
    chezou authored May 14, 2024
    1e78f69

Commits on May 15, 2024

  1. Improve error message on jpype importing

    It was confusing that the warning message shows "Error". It should use other wording.
    chezou authored May 15, 2024
    2dc676c

Commits on May 16, 2024

  1. Update Bug_report.yaml

    chezou authored May 16, 2024
    4ae6c65
  2. Update Bug_report.yaml

    chezou authored May 16, 2024
    8e18da5
  3. Remove patreon

    chezou authored May 16, 2024
    11b7f44
  4. Add Buy me a coffee

    chezou authored May 16, 2024
    7f2cad1

Commits on May 20, 2024

  1. 362ac06
  2. df3ba02
  3. Merge pull request #390 from rbubley/patch-1

    chezou authored May 20, 2024
    79792ad

Commits on May 21, 2024

  1. Fix type hint. follow-up #390

    chezou committed May 21, 2024
    04911e8
  2. Merge pull request #392 from chezou/follow-up-sequence

    Fix type hint. follow-up #390
    chezou authored May 21, 2024
    61a00ef
  3. Enable GH Actions on PR

    chezou committed May 21, 2024
    f3f9550

Commits on Aug 25, 2024

  1. Add download stats badge

    chezou authored Aug 25, 2024
    7ecbd0e

Commits on Sep 3, 2024

  1. Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows

    Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 3 to 4.1.7.
    - [Release notes](https://github.com/actions/download-artifact/releases)
    - [Commits](actions/download-artifact@v3...v4.1.7)
    
    ---
    updated-dependencies:
    - dependency-name: actions/download-artifact
      dependency-type: direct:production
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    dependabot[bot] authored Sep 3, 2024
    e4aa756
  2. Merge pull request #400 from chezou/dependabot/github_actions/dot-git…

    …hub/workflows/actions/download-artifact-4.1.7
    
    Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows
    chezou authored Sep 3, 2024
    cb6131c

Commits on Oct 17, 2024

  1. Support Python 3.13; Drop 3.8

    chezou committed Oct 17, 2024
    4c996e9
  2. Remove jpype from Python 3.13 test

    chezou committed Oct 17, 2024
    1ca64e6
  3. Use ruff

    chezou committed Oct 17, 2024
    f57323e
  4. Bump setup-java

    chezou committed Oct 17, 2024
    ea031a1
  5. Merge pull request #402 from chezou/313

    Support Python 3.13; Drop 3.8
    chezou authored Oct 17, 2024
    78a18c7
  6. Fix release workflow

    chezou committed Oct 17, 2024
    e6220ff

Commits on Dec 4, 2024

  1. Add test for Python 3.13 with jpype

    chezou authored Dec 4, 2024
    2d239b1
  2. Merge pull request #404 from chezou/py313-jpype

    Add test for Python 3.13 with jpype
    chezou authored Dec 4, 2024
    93b07ba

Commits on Dec 5, 2024

  1. Update documents about Python version

    chezou committed Dec 5, 2024
    d7a233b
2 changes: 1 addition & 1 deletion .github/FUNDING.yml
@@ -1,2 +1,2 @@
 github: chezou
-patreon: chezou
+buy_me_a_coffee: chezou
63 changes: 0 additions & 63 deletions .github/ISSUE_TEMPLATE.md

This file was deleted.

57 changes: 0 additions & 57 deletions .github/ISSUE_TEMPLATE/Bug_report.md

This file was deleted.

96 changes: 96 additions & 0 deletions .github/ISSUE_TEMPLATE/Bug_report.yaml
@@ -0,0 +1,96 @@
+name: Bug report
+description: File a bug report
+title: "[BUG] <title>"
+labels: ["triage"]
+body:
+  - type: input
+    id: summary
+    attributes:
+      label: Summary
+      description: Write a summary of your issue
+      placeholder: ex. Unable to import `tabula`
+    validations:
+      required: true
+  - type: checkboxes
+    id: faq
+    attributes:
+      label: Did you read the FAQ?
+      description: Please read the [FAQ](https://tabula-py.readthedocs.io/en/latest/faq.html)
+      options:
+        - label: I have read the FAQ
+          required: true
+  - type: checkboxes
+    id: issues
+    attributes:
+      label: Did you search GitHub issues?
+      description: Please search the [discussions](https://github.com/chezou/tabula-py/issues)
+      options:
+        - label: I have searched the issues
+          required: true
+  - type: checkboxes
+    id: discussions
+    attributes:
+      label: Did you search GitHub Discussions?
+      description: Please search the [discussions](https://github.com/chezou/tabula-py/discussions)
+      options:
+        - label: I have searched the discussions
+          required: true
+  - type: input
+    id: pdf_url
+    attributes:
+      label: "(Optional) PDF URL"
+      description: Provide your PDF URL. It's optional, but really helpful.
+    validations:
+      required: false
+  - type: textarea
+    id: environment
+    attributes:
+      label: About your environment
+      description: |
+        Paste the output of `import tabula; tabula.environment_info()` on Python REPL.
+        Or, paste the results of `python --version` and `java -version`, and write down your OS version
+      placeholder: |
+        put here if you executed `tabula.environment_info()` or write
+        - Python version: result of `python --version`
+        - Java version: result of `java -version`
+        - OS version: Ubuntu 22.04
+      render: markdown
+    validations:
+      required: true
+  - type: textarea
+    id: reproducible_info
+    attributes:
+      label: What did you do when you faced the problem?
+      description: Provide your information to reproduce the issue
+    validations:
+      required: true
+  - type: textarea
+    id: code
+    attributes:
+      label: Code
+      description: Paste your core code which minimum reproducible for the issue
+      placeholder: Paste your output in text
+    validations:
+      required: true
+  - type: textarea
+    id: expected_behavior
+    attributes:
+      label: Expected behavior
+      description: Write your expected results/outputs
+    validations:
+      required: true
+  - type: textarea
+    id: actuabl_behavior
+    attributes:
+      label: Actual behavior
+      description: Put the actual results/outputs
+      placeholder: Paste your output in text
+    validations:
+      required: true
+  - type: textarea
+    id: related_issues
+    attributes:
+      label: Related issues
+      description: "If there are any related issue, please put them"
+    validations:
+      required: false
4 changes: 1 addition & 3 deletions .github/ISSUE_TEMPLATE/Custom.md
@@ -9,6 +9,4 @@ THE ISSUE TRACKER IS NOT FOR QUESTIONS.

 DO NOT CREATE A NEW ISSUE TO ASK A QUESTION.

-IF YOU ARE HAVING PROBLEMS WITH YOUR TABULA-PY CODE, DO NOT ASK A QUESTION HERE.
-
-Many tabula-py questions have been asked and answered on StackOverflow; see https://stackoverflow.com/search?q=tabula-py . You can ask questions there or on other websites.
+You can use GitHub discussions for question instead. [Search the GH discussion](https://github.com/chezou/tabula-py/discussions?discussions_q=is%3Aopen), or [create a new discussion](https://github.com/chezou/tabula-py/discussions/new/choose).
17 changes: 0 additions & 17 deletions .github/ISSUE_TEMPLATE/Feature_request.md

This file was deleted.

33 changes: 33 additions & 0 deletions .github/ISSUE_TEMPLATE/Feature_request.yaml
@@ -0,0 +1,33 @@
+name: Feature request
+description: Suggest an idea for this project
+title: "[Feature request]: "
+labels: ["feature request", "triage"]
+body:
+  - type: textarea
+    id: problem
+    attributes:
+      label: Is your feature request related to a problem? Please describe.
+      description: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+    validations:
+      required: true
+  - type: textarea
+    id: solution
+    attributes:
+      label: Describe the solution you'd like
+      description: A clear and concise description of what you want to happen.
+    validations:
+      required: true
+  - type: textarea
+    id: alternative
+    attributes:
+      label: Describe alternatives you've considered
+      description: A clear and concise description of any alternative solutions or features you've considered.
+    validations:
+      required: true
+  - type: textarea
+    id: additional_context
+    attributes:
+      label: Additional context
+      description: Add any other context or screenshots about the feature request here.
+    validations:
+      required: false
12 changes: 0 additions & 12 deletions .github/workflows/autocloser.yml

This file was deleted.

50 changes: 33 additions & 17 deletions .github/workflows/pythonpublish.yml
@@ -5,22 +5,38 @@ on:
     types: [created]

 jobs:
-  deploy:
+  build:
+    name: Build distribution
     runs-on: ubuntu-latest
     steps:
-    - uses: actions/checkout@v3
-    - name: Set up Python
-      uses: actions/setup-python@v4
-      with:
-        python-version: '3.x'
-    - name: Install dependencies
-      run: |
-        python -m pip install --upgrade pip
-        pip install build twine
-    - name: Build and publish
-      env:
-        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
-        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
-      run: |
-        python -m build
-        twine upload dist/*
+    - uses: actions/checkout@v4
+    - uses: actions/setup-python@v5
+      with:
+        python-version: '3.x'
+    - name: Install pypa/build
+      run: python3 -m pip install build --user
+    - name: Build a binary wheel and a source tarball
+      run: python -m build
+    - name: Store the distribution packages
+      uses: actions/upload-artifact@v4
+      with:
+        name: release-dists
+        path: dist/
+  publish:
+    name: >-
+      Publish to PyPI
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/tabula-py
+    permissions:
+      id-token: write
+    steps:
+    - name: Download all the dists
+      uses: actions/download-artifact@v4
+      with:
+        name: release-dists
+        path: dist/
+    - name: Publish package distributions to PyPI
+      uses: pypa/gh-action-pypi-publish@release/v1
11 changes: 6 additions & 5 deletions .github/workflows/pythontest.yml
@@ -1,6 +1,6 @@
 name: Python testing

-on: [push]
+on: [push, pull_request]

 jobs:
   build:
@@ -12,20 +12,21 @@ jobs:
         os: [ubuntu-latest, windows-latest]

     steps:
-    - uses: actions/checkout@v3
-    - uses: actions/setup-java@v3
+    - uses: actions/checkout@v4
+    - uses: actions/setup-java@v4
       with:
         java-version: '17'
         java-package: jdk
         distribution: adopt
     - name: Set up Python
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v5
       with:
         python-version: |
-          3.8
           3.9
           3.10
           3.11
+          3.12
+          3.13
     - name: Install dependencies and test
       run: |
         python -m pip install --upgrade pip
12 changes: 10 additions & 2 deletions README.md
@@ -3,7 +3,9 @@
 [![Build Status](https://github.com/chezou/tabula-py/actions/workflows/pythontest.yml/badge.svg)](https://github.com/chezou/tabula-py/actions/workflows/pythontest.yml)
 [![PyPI version](https://badge.fury.io/py/tabula-py.svg)](https://badge.fury.io/py/tabula-py)
 [![Documentation Status](https://readthedocs.org/projects/tabula-py/badge/?version=latest)](https://tabula-py.readthedocs.io/en/latest/?badge=latest)
-[![Patreon](https://img.shields.io/badge/patreon-donate-orange.svg)](https://www.patreon.com/chezou)
+![PyPI - Downloads](https://img.shields.io/pypi/dw/tabula-py)
+[![](https://img.shields.io/badge/-Sponsor-fafbfc?logo=GitHub%20Sponsors
+)](https://github.com/sponsors/chezou)

 `tabula-py` is a simple Python wrapper of [tabula-java](https://github.com/tabulapdf/tabula-java), which can read tables in a PDF.
 You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.
@@ -15,7 +17,7 @@ You can see [the example notebook](https://nbviewer.jupyter.org/github/chezou/ta
 ## Requirements

 - Java 8+
-- Python 3.8+
+- Python 3.9+

 ### OS

@@ -35,6 +37,12 @@ Ensure you have a Java runtime and set the PATH for it.
 pip install tabula-py
 ```

+If you want to leverage faster execution with jpype, install with `jpype` extra.
+
+```sh
+pip install tabula-py[jpype]
+```
+
 ### Example

 tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.  
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -18,7 +18,7 @@
 # import os
 # import sys
 # sys.path.insert(0, os.path.abspath('.'))
-import sphinx_rtd_theme
+import sphinx_rtd_theme  # noqa: F401

 GH_ORGANIZATION = "chezou"
 GH_PROJECT = "tabula-py"
8 changes: 7 additions & 1 deletion docs/getting_started.rst
@@ -11,7 +11,7 @@ Requirements

 * Python

-  * 3.8+
+  * 3.9+


 Installation
@@ -26,6 +26,12 @@ You can install tabula-py from PyPI with ``pip`` command.

     pip install tabula-py

+If you want to leverage faster execution with jpype, install with `jpype` extra.
+
+.. code-block:: bash
+
+    pip install tabula-py[jpype]
+
 .. Note::
     conda recipe on conda-forge is not maintained by us.
     We recommend installing via ``pip`` to use the latest version of tabula-py.
9,263 changes: 4,634 additions & 4,629 deletions examples/tabula_example.ipynb

Large diffs are not rendered by default.

34 changes: 27 additions & 7 deletions noxfile.py
@@ -4,24 +4,44 @@
 @nox.session
 def lint(session):
     lint_tools = [
-        "black",
-        "isort",
-        "flake8",
+        "ruff",
         "mypy",
         "types-setuptools",
         "Flake8-pyproject",
     ]
     targets = ["tabula", "tests", "noxfile.py"]
     session.install(*lint_tools)
-    session.run("flake8", *targets)
-    session.run("black", "--diff", "--check", *targets)
-    session.run("isort", "--check-only", *targets)
+    session.run("ruff", "format", "--check", *targets)
+    session.run("ruff", "check", *targets)
     session.run("mypy", *targets)


 @nox.session
-def tests(session):
+@nox.parametrize(
+    "python,jpype",
+    [
+        ("3.9", True),
+        ("3.10", True),
+        ("3.11", True),
+        ("3.12", True),
+        ("3.13", False),
+        ("3.13", True),
+    ],
+)
+def tests(session, jpype):
+    if jpype:
+        tests_with_jpype(session)
+    else:
+        tests_without_jpype(session)
+
+
+def tests_without_jpype(session):
     session.install(".[test]")
     session.run("pytest", "-v", "tests/test_read_pdf_table.py")
+
+
+def tests_with_jpype(session):
+    session.install(".[jpype,test]")
+    session.run("pytest", "-v", "tests/test_read_pdf_table.py")
     session.run("pytest", "-v", "tests/test_read_pdf_jar_path.py")
     session.run("pytest", "-v", "tests/test_read_pdf_silent.py")
26 changes: 12 additions & 14 deletions pyproject.toml
@@ -17,31 +17,30 @@ classifiers = [
     "Development Status :: 5 - Production/Stable",
     "Topic :: Text Processing :: General",
     "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.12",
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.9",
-    "Programming Language :: Python :: 3.8",
 ]
 keywords = [
     "data frame",
     "pdf",
     "table",
 ]
-requires-python = ">=3.8"
+requires-python = ">=3.9"
 dependencies = [
     "pandas >= 0.25.3",
-    "numpy",
+    "numpy > 1.24.4",
     "distro",
-    "jpype1",
 ]
 dynamic = ["version"]

 [project.optional-dependencies]
+jpype = ["jpype1"]
 dev = [
     "pytest",
-    "flake8",
-    "black",
-    "isort",
+    "ruff",
     "mypy",
     "Flake8-pyproject",
 ]
@@ -60,21 +59,20 @@ doc = [
 [tool.setuptools_scm]


-[tool.isort]
-profile = "black"
-
-[tool.flake8]
-ignore = ["E203", "W503"]
-max-line-length = 88
+[tool.ruff]
+line-length = 88
 exclude = [
     ".git",
     "__pycache__",
     "build",
     "dist",
     ".venv",
-    "tabula/__init__.py",
 ]

+[tool.ruff.lint]
+select = ["E", "W"]
+ignore = ["E203"]
+
 [tool.mypy]
 ignore_missing_imports = true

4 changes: 2 additions & 2 deletions tabula/__init__.py
@@ -1,7 +1,7 @@
 from importlib.metadata import PackageNotFoundError, version

-from .io import convert_into, convert_into_by_batch, read_pdf, read_pdf_with_template
-from .util import environment_info
+from .io import convert_into, convert_into_by_batch, read_pdf, read_pdf_with_template  # noqa: F401
+from .util import environment_info  # noqa: F401

 try:
     __version__ = version("tabula-py")
52 changes: 33 additions & 19 deletions tabula/backend.py
@@ -3,9 +3,6 @@
 from logging import getLogger
 from typing import List, Optional

-import jpype
-import jpype.imports
-
 from .errors import JavaNotFoundError
 from .util import TabulaOption

@@ -27,35 +24,38 @@ def jar_path() -> str:

 class TabulaVm:
     def __init__(self, java_options: List[str], silent: Optional[bool]) -> None:
-        if not jpype.isJVMStarted():
-            jpype.addClassPath(jar_path())
-
-            # Workaround to enforce the silent option. See:
-            # https://github.com/tabulapdf/tabula-java/issues/231#issuecomment-397281157
-            if silent:
-                java_options.extend(
-                    (
-                        "-Dorg.slf4j.simpleLogger.defaultLogLevel=off",
-                        "-Dorg.apache.commons.logging.Log"
-                        "=org.apache.commons.logging.impl.NoOpLog",
+        try:
+            import jpype
+            import jpype.imports
+
+            if not jpype.isJVMStarted():
+                jpype.addClassPath(jar_path())
+
+                # Workaround to enforce the silent option. See:
+                # https://github.com/tabulapdf/tabula-java/issues/231#issuecomment-397281157
+                if silent:
+                    java_options.extend(
+                        (
+                            "-Dorg.slf4j.simpleLogger.defaultLogLevel=off",
+                            "-Dorg.apache.commons.logging.Log"
+                            "=org.apache.commons.logging.impl.NoOpLog",
+                        )
                     )
-                )
-
-            jpype.startJVM(*java_options, convertStrings=False)
+                jpype.startJVM(*java_options, convertStrings=False)

-        try:
             import java.lang as lang
             import technology.tabula as tabula
             from org.apache.commons.cli import DefaultParser

             self.tabula = tabula
             self.parser = DefaultParser()
             self.lang = lang

         except (ModuleNotFoundError, ImportError) as e:
             logger.warning(
-                "Error importing jpype dependencies. Fallback to subprocess."
+                "Failed to import jpype dependencies. Fallback to subprocess."
             )
+            logger.warning(jpype.java.lang.System.getProperty("java.class.path"))
             logger.warning(e)
             self.tabula = None
             self.parse = None
@@ -92,6 +92,20 @@ def __init__(
         self.java_options = java_options
         self.encoding = encoding

+    def update_encoding(
+        self, encoding: str, java_options: List[str], silent: Optional[bool]
+    ) -> None:
+        self.encoding = encoding
+        self.java_options = java_options
+        if silent:
+            self.java_options.extend(
+                (
+                    "-Dorg.slf4j.simpleLogger.defaultLogLevel=off",
+                    "-Dorg.apache.commons.logging.Log"
+                    "=org.apache.commons.logging.impl.NoOpLog",
+                )
+            )
+
     def call_tabula_java(
         self, options: TabulaOption, path: Optional[str] = None
     ) -> str:
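
Review note on the `TabulaVm` change in this file: `import jpype` moves from module level into `__init__`, so a missing optional dependency leads to the subprocess fallback rather than an import-time failure. The optional-import pattern can be sketched in isolation as below. `try_import` and `choose_backend` are hypothetical names that do not exist in tabula-py, and the module name is passed as a string only so the sketch runs without jpype installed. One thing the pattern makes visible: the `except` block should avoid touching the optional module's name, because that name is unbound when the import itself fails.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def try_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:  # ModuleNotFoundError is a subclass of ImportError
        # Only the module *name* (a string) and the exception are used here;
        # the module object itself does not exist on this path.
        logger.warning("Failed to import %s. Fallback to subprocess.", module_name)
        logger.warning(e)
        return None


def choose_backend(module_name: str = "jpype") -> str:
    """Hypothetical helper: report which execution path would be taken."""
    return "jvm" if try_import(module_name) is not None else "subprocess"


if __name__ == "__main__":
    print(choose_backend())
```

With this shape, `pip install tabula-py` alone would take the `"subprocess"` branch, and `pip install tabula-py[jpype]` would take the `"jvm"` branch.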
76 changes: 40 additions & 36 deletions tabula/io.py
@@ -26,7 +26,7 @@
 from copy import deepcopy
 from dataclasses import asdict
 from logging import getLogger
-from typing import Any, Dict, Iterable, List, Optional, Tuple, Union
+from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union

 import numpy as np
 import pandas as pd
@@ -44,8 +44,8 @@


 def _run(
-    java_options: List[str],
     options: TabulaOption,
+    java_options: Optional[List[str]] = None,
     path: Optional[str] = None,
     encoding: str = "utf-8",
     force_subprocess: bool = False,
@@ -62,6 +62,8 @@ def _run(
         "-Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.NoOpLog",
     }

+    java_options = _build_java_options(java_options, encoding)
+
     global _tabula_vm
     if force_subprocess:
         _tabula_vm = SubprocessTabula(
@@ -74,6 +76,10 @@ def _run(
             _tabula_vm = SubprocessTabula(
                 java_options=java_options, silent=options.silent, encoding=encoding
             )
+    elif isinstance(_tabula_vm, SubprocessTabula):
+        _tabula_vm.update_encoding(
+            encoding=encoding, java_options=java_options, silent=options.silent
+        )
     elif set(java_options) - IGNORED_JAVA_OPTIONS:
         logger.warning("java_options is ignored until rebooting the Python process.")

@@ -97,7 +103,7 @@ def read_pdf(
     stream: bool = False,
     password: Optional[str] = None,
     silent: Optional[bool] = None,
-    columns: Optional[Iterable[float]] = None,
+    columns: Optional[Sequence[float]] = None,
     relative_columns: bool = False,
     format: Optional[str] = None,
     batch: Optional[str] = None,
@@ -187,8 +193,9 @@ def read_pdf(
             Password to decrypt document. Default: empty
         silent (bool, optional):
             Suppress all stderr output.
-        columns (iterable, optional):
-            X coordinates of column boundaries.
+        columns (Sequence, optional):
+            X coordinates of column boundaries. Must be sorted and of a datatype that
+            preserves order, e.g. tuple or list
             Example:
                 ``[10.1, 20.2, 30.3]``
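
Review note on the `Iterable` to `Sequence` change: a `set` is an `Iterable` but has no defined order, which is why the docstring now asks for a sorted, order-preserving container. The sketch below illustrates the difference. `columns_arg` is a hypothetical helper written for this note, not the function tabula-py uses to build its command line, and the explicit sortedness check is an assumption about what "must be sorted" could look like if enforced.

```python
from collections.abc import Sequence


def columns_arg(columns: Sequence[float]) -> str:
    """Hypothetical helper: join column boundaries into a comma-separated string."""
    if not isinstance(columns, Sequence):
        raise TypeError("columns must be an ordered sequence such as a list or tuple")
    if list(columns) != sorted(columns):
        raise ValueError("columns must be sorted in ascending order")
    return ",".join(str(c) for c in columns)


if __name__ == "__main__":
    print(columns_arg([10.1, 20.2, 30.3]))
    print(columns_arg((10.1, 20.2, 30.3)))
    # A set is iterable but not a Sequence, so it is rejected up front.
    print(isinstance({10.1, 20.2}, Sequence))
```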
@@ -381,20 +388,6 @@ def read_pdf(
multiple_tables=multiple_tables,
)

if java_options is None:
java_options = []
elif isinstance(java_options, str):
java_options = shlex.split(java_options)

# to prevent tabula-py from stealing focus on every call on mac
if platform.system() == "Darwin":
if not any("java.awt.headless" in opt for opt in java_options):
java_options += ["-Djava.awt.headless=true"]

if encoding == "utf-8":
if not any("file.encoding" in opt for opt in java_options):
java_options += ["-Dfile.encoding=UTF8"]

path, temporary = localize_file(input_path, user_agent, use_raw_url=use_raw_url)

if not os.path.exists(path):
@@ -405,8 +398,8 @@ def read_pdf(

try:
output = _run(
java_options,
tabula_options,
java_options,
path,
encoding=encoding,
force_subprocess=force_subprocess,
@@ -462,7 +455,7 @@ def read_pdf_with_template(
stream: bool = False,
password: Optional[str] = None,
silent: Optional[bool] = None,
columns: Optional[List[float]] = None,
columns: Optional[Sequence[float]] = None,
relative_columns: bool = False,
format: Optional[str] = None,
batch: Optional[str] = None,
@@ -535,8 +528,9 @@ def read_pdf_with_template(
Password to decrypt document. Default: empty
silent (bool, optional):
Suppress all stderr output.
columns (iterable, optional):
X coordinates of column boundaries.
columns (Sequence, optional):
X coordinates of column boundaries. Must be sorted and of a datatype that
preserves order, e.g. tuple or list
Example:
``[10.1, 20.2, 30.3]``
@@ -708,7 +702,7 @@ def convert_into(
stream: bool = False,
password: Optional[str] = None,
silent: Optional[bool] = None,
columns: Optional[Iterable[float]] = None,
columns: Optional[Sequence[float]] = None,
relative_columns: bool = False,
format: Optional[str] = None,
batch: Optional[str] = None,
@@ -774,8 +768,9 @@ def convert_into(
Password to decrypt document. Default: empty
silent (bool, optional):
Suppress all stderr output.
columns (iterable, optional):
X coordinates of column boundaries.
columns (Sequence, optional):
X coordinates of column boundaries. Must be sorted and of a datatype that
preserves order, e.g. tuple or list
Example:
``[10.1, 20.2, 30.3]``
@@ -827,7 +822,6 @@ def convert_into(
output_path=output_path,
options=options,
)
java_options = _build_java_options(java_options)

path, temporary = localize_file(input_path)

@@ -838,7 +832,7 @@ def convert_into(
raise ValueError(f"{path} is empty. Check the file, or download it manually.")

try:
_run(java_options, tabula_options, path, force_subprocess=force_subprocess)
_run(tabula_options, java_options, path, force_subprocess=force_subprocess)
finally:
if temporary:
os.unlink(path)
@@ -856,7 +850,7 @@ def convert_into_by_batch(
stream: bool = False,
password: Optional[str] = None,
silent: Optional[bool] = None,
columns: Optional[Iterable[float]] = None,
columns: Optional[Sequence[float]] = None,
relative_columns: bool = False,
format: Optional[str] = None,
output_path: Optional[str] = None,
@@ -916,8 +910,9 @@ def convert_into_by_batch(
Password to decrypt document. Default: empty
silent (bool, optional):
Suppress all stderr output.
columns (iterable, optional):
X coordinates of column boundaries.
columns (Sequence, optional):
X coordinates of column boundaries. Must be sorted and of a datatype that
preserves order, e.g. tuple or list
Example:
``[10.1, 20.2, 30.3]``
@@ -948,8 +943,6 @@ def convert_into_by_batch(

format = _extract_format_for_conversion(output_format)

java_options = _build_java_options(java_options)

tabula_options = TabulaOption(
pages=pages,
guess=guess,
@@ -967,10 +960,12 @@ def convert_into_by_batch(
options=options,
)

_run(java_options, tabula_options, force_subprocess=force_subprocess)
_run(tabula_options, java_options, force_subprocess=force_subprocess)


def _build_java_options(_java_options: Optional[List[str]] = None) -> List[str]:
def _build_java_options(
_java_options: Optional[List[str]] = None, encoding: str = "utf-8"
) -> List[str]:
if _java_options is None:
_java_options = []
elif isinstance(_java_options, str):
@@ -982,6 +977,10 @@ def _build_java_options(_java_options: Optional[List[str]] = None) -> List[str]:
if not any(filter(r.find, _java_options)): # type: ignore
_java_options = _java_options + ["-Djava.awt.headless=true"]

if encoding == "utf-8":
if not any("file.encoding" in opt for opt in _java_options):
_java_options += ["-Dfile.encoding=UTF8"]

return _java_options
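
The hunk above adds an `encoding` parameter so that `-Dfile.encoding=UTF8` is appended only when the caller asks for UTF-8 and has not already set a `file.encoding` option. The following is a minimal standalone sketch of that logic, not the library's code: the name `build_java_options_sketch` is hypothetical, and the substring check for `java.awt.headless` is an assumption about the unchanged lines that the diff does not show in full.

```python
import shlex
from typing import List, Optional, Union


def build_java_options_sketch(
    java_options: Optional[Union[str, List[str]]] = None,
    encoding: str = "utf-8",
) -> List[str]:
    """Hypothetical stand-in for _build_java_options, for illustration only."""
    if java_options is None:
        opts: List[str] = []
    elif isinstance(java_options, str):
        # A single string such as "-Xmx256m -Dfoo=bar" is split shell-style.
        opts = shlex.split(java_options)
    else:
        # Copy so the caller's list is never modified.
        opts = list(java_options)

    # Assumed behavior of the unchanged lines: run headless unless already set.
    if not any("java.awt.headless" in opt for opt in opts):
        opts.append("-Djava.awt.headless=true")

    # New in this diff: add a UTF-8 default only when no encoding was given.
    if encoding == "utf-8" and not any("file.encoding" in opt for opt in opts):
        opts.append("-Dfile.encoding=UTF8")

    return opts


default_opts = build_java_options_sketch()
custom_opts = build_java_options_sketch("-Xmx256m -Dfile.encoding=cp932")
cp932_opts = build_java_options_sketch(encoding="cp932")
```

The sketch copies the incoming list before appending, so a list passed by the caller is left untouched. This is a design choice of the sketch and not something the diff states.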


@@ -1051,7 +1050,12 @@ def _extract_from(

if not pandas_options.get("dtype"):
for c in df.columns:
df[c] = pd.to_numeric(df[c], errors="ignore")
try:
df[c] = pd.to_numeric(df[c], errors="raise")
except (ValueError, TypeError):
# Same logic as errors='ignore' in pd.to_numeric
# https://github.com/pandas-dev/pandas/pull/57361/files#diff-08fed2587c15d0370931a8b02252eb1034d2c0a650df56760974440a5433a6e0L240-L243
pass
data_frames.append(df)

return data_frames
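
The `pd.to_numeric` hunk above replaces `errors="ignore"` with an explicit `try`/`except` that leaves a column unchanged when conversion fails. A minimal sketch of the same pattern on a small DataFrame follows; the helper name `coerce_numeric_columns` and the sample data are hypothetical and only illustrate the behavior.

```python
import pandas as pd


def coerce_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Convert each column to a numeric dtype where possible (illustrative helper)."""
    out = df.copy()
    for c in out.columns:
        try:
            out[c] = pd.to_numeric(out[c], errors="raise")
        except (ValueError, TypeError):
            # Conversion failed: keep the original column as-is,
            # mirroring the fallback used in the diff.
            pass
    return out


# Hypothetical sample data: one convertible column, one that is not.
sample = pd.DataFrame({"amount": ["1", "2.5"], "name": ["a", "b"]})
converted = coerce_numeric_columns(sample)
```

Here `amount` becomes a numeric column while `name` keeps its original values, because the failed conversion is caught and skipped.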
2 changes: 1 addition & 1 deletion tabula/template.py
@@ -49,7 +49,7 @@ def load_template(path_or_buffer: FileLikeObj) -> List[TabulaOption]:


def _convert_template_option(
template: Dict[str, Union[bool, float, int, str]]
template: Dict[str, Union[bool, float, int, str]],
) -> TabulaOption:
"""Convert Tabula app template to tabula-py option
11 changes: 6 additions & 5 deletions tabula/util.py
@@ -9,7 +9,7 @@
import shlex
from dataclasses import dataclass
from logging import getLogger
from typing import IO, Iterable, List, Optional, Union, cast
from typing import IO, Iterable, List, Optional, Sequence, Union, cast

logger = getLogger(__name__)

@@ -115,8 +115,9 @@ class TabulaOption:
Password to decrypt document. Default: empty
silent (bool, optional):
Suppress all stderr output.
columns (iterable, optional):
X coordinates of column boundaries.
columns (Sequence, optional):
X coordinates of column boundaries. Must be sorted and of a datatype that
preserves order, e.g. tuple or list
Example:
``[10.1, 20.2, 30.3]``
@@ -147,7 +148,7 @@ class TabulaOption:
stream: bool = False
password: Optional[str] = None
silent: Optional[bool] = None
columns: Optional[Iterable[float]] = None
columns: Optional[Sequence[float]] = None
relative_columns: bool = False
format: Optional[str] = None
batch: Optional[str] = None
@@ -235,7 +236,7 @@ def build_option_list(self) -> List[str]:
__options += ["--outfile", self.output_path]

if self.columns:
if self.columns != sorted(self.columns):
if list(self.columns) != sorted(self.columns):
raise ValueError("columns option should be sorted")

__columns = _format_with_relative(self.columns, self.relative_columns)
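
The change to the sorted check above matters because `sorted()` always returns a list, and a tuple never compares equal to a list, so an already sorted tuple would previously have been rejected. A minimal sketch of the check, using a hypothetical helper name `check_columns_sorted`:

```python
from typing import Sequence


def check_columns_sorted(columns: Sequence[float]) -> None:
    """Raise ValueError unless columns are in ascending order (illustrative helper)."""
    # list(...) normalizes tuples and other sequences before comparing,
    # since sorted() always returns a list.
    if list(columns) != sorted(columns):
        raise ValueError("columns option should be sorted")


tuple_columns = (47, 147, 256, 310, 375, 431, 504)

# Without the list() conversion, a sorted tuple still compares unequal.
naive_mismatch = tuple_columns != sorted(tuple_columns)

# With the conversion, the sorted tuple passes and no exception is raised.
check_columns_sorted(tuple_columns)
```

An unsorted input such as `[20.2, 10.1]` still raises `ValueError`, so the validation itself is unchanged for lists.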
3 changes: 2 additions & 1 deletion tests/test_read_pdf_jar_path.py
@@ -3,6 +3,7 @@
from subprocess import CalledProcessError
from unittest.mock import patch

import jpype
import pytest

import tabula
@@ -19,5 +20,5 @@ def test_read_pdf_with_jar_path(self, jar_func):
# Fallback to subprocess
with pytest.raises(CalledProcessError):
tabula.read_pdf(self.pdf_path, encoding="utf-8")
file_name = Path(tabula.backend.jpype.getClassPath()).name
file_name = Path(jpype.getClassPath()).name
self.assertEqual(file_name, "tabula-java.jar")
7 changes: 2 additions & 5 deletions tests/test_read_pdf_silent.py
@@ -2,19 +2,16 @@
import unittest
from unittest.mock import patch

import pytest

import tabula


class TestReadPdfJarPath(unittest.TestCase):
def setUp(self):
self.pdf_path = "tests/resources/data.pdf"

@patch("tabula.backend.jpype.startJVM")
@patch("jpype.startJVM")
def test_read_pdf_with_silent_true(self, jvm_func):
with pytest.raises(RuntimeError):
tabula.read_pdf(self.pdf_path, encoding="utf-8", silent=True)
tabula.read_pdf(self.pdf_path, encoding="utf-8", silent=True)

target_args = []
if platform.system() == "Darwin":
12 changes: 12 additions & 0 deletions tests/test_read_pdf_table.py
@@ -41,6 +41,9 @@ def test_read_pdf_with_force_subprocess(self):
self.assertTrue(len(df), 1)
self.assertTrue(isinstance(df[0], pd.DataFrame))
self.assertTrue(df[0].equals(pd.read_csv(self.expected_csv1)))
self.assertTrue(tabula.io._tabula_vm.encoding, "utf-8")
tabula.read_pdf(self.pdf_path, stream=True, encoding="cp932")
self.assertTrue(tabula.io._tabula_vm.encoding, "cp932")

def test_read_pdf_into_json(self):
expected_json = "tests/resources/data_1.json"
@@ -88,6 +91,15 @@ def test_read_pdf_with_columns(self):
)[0].equals(pd.read_csv(expected_csv))
)

def test_read_pdf_with_tuple_columns(self):
pdf_path = "tests/resources/campaign_donors.pdf"
expected_csv = "tests/resources/campaign_donors.csv"
self.assertTrue(
tabula.read_pdf(
pdf_path, columns=(47, 147, 256, 310, 375, 431, 504), guess=False
)[0].equals(pd.read_csv(expected_csv))
)

def test_read_pdf_with_relative_columns(self):
pdf_path = "tests/resources/campaign_donors.pdf"
expected_csv = "tests/resources/campaign_donors.csv"