Commit

Merge pull request #168 from chezou/black
Introduce black, isort, nox
chezou authored Jul 27, 2019
2 parents a58b552 + 0c35203 commit d6c65d3
Showing 17 changed files with 509 additions and 375 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -35,6 +35,7 @@ pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
7 changes: 3 additions & 4 deletions .travis.yml
@@ -7,11 +7,10 @@ python:
before_install:
- pip install --upgrade setuptools
install:
- pip install tox
- pip install tox-travis
- pip install coverage coveralls
- pip install nox
- pip install .
script:
- tox -r
- nox
deploy:
provider: pypi
user: chezou
44 changes: 28 additions & 16 deletions README.md
@@ -26,14 +26,21 @@ I confirmed working on macOS and Ubuntu. But some people confirm it works on Win

## Install

```
```bash
pip install tabula-py
```

If you want to become a contributor, you can install dependency for development of tabula-py as follows:
If you want to become a contributor, you can install dependency after cloning the repo as follows:

```bash
pip install -e ".[dev,test]"
pip install nox
```
pip install -r requirements.txt -c constraints.txt

To run the tests and the linter, run the nox command.

```bash
nox .
```

## Example
@@ -78,22 +85,23 @@ This instruction is originally written by [@lahoffm](https://github.com/lahoffm)
- Example: 1, '1-2,3', 'all' or [1,2]. Default is 1
- guess (bool, optional):
- Guess the portion of the page to analyze per page. Default `True`
- Note that as of tabula-java 1.0.3, guess option becomes independent from lattice and stream option, you can use guess and lattice/stream option at the same time.
- area (`list` of `float`, optional):
- Portion of the page to analyze(top,left,bottom,right).
- Example: [269.875, 12.75, 790.5, 561] or [[12.1,20.5,30.1,50.2],[1.0,3.2,10.5,40.2]]. Default is entire page
- Example: `[269.875, 12.75, 790.5, 561]` or `[[12.1,20.5,30.1,50.2],[1.0,3.2,10.5,40.2]]`. Default is entire page
- relative_area (bool, optional):
- If all area values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual height or width of the page. Default `False`.
- lattice (bool, optional):
- [`spreadsheet` option is deprecated] Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
- (`spreadsheet` option is deprecated) Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
- stream (bool, optional):
- [`nospreadsheet` option is deprecated] Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
- (`nospreadsheet` option is deprecated) Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
- password (bool, optional):
- Password to decrypt document. Default is empty
- silent (bool, optional):
- Suppress all stderr output.
- columns (list, optional):
- X coordinates of column boundaries.
- Example: [10.1, 20.2, 30.3]
- Example: `[10.1, 20.2, 30.3]`
- output_format (str, optional):
- Format for output file or extracted object.
- For `read_pdf()`: `json`, `dataframe`
@@ -106,7 +114,7 @@ This instruction is originally written by [@lahoffm](https://github.com/lahoffm)
- pandas_options (`dict`, optional):
- Set pandas options like `{'header': None}`.
- multiple_tables (bool, optional):
- (Experimental) Extract multiple tables. If used with multiple pages (e.g. `pages='all'`) will extract separate tables from each page.
- Extract multiple tables. If used with multiple pages (e.g. `pages='all'`) will extract separate tables from each page.
- This option uses JSON as an intermediate format, so if tabula-java output format will change, this option doesn't work.
- user_agent (str, optional)
- Set a custom user-agent when download a pdf from a url. Otherwise it uses the default urllib.request user-agent
@@ -124,7 +132,7 @@ You can check whether tabula-py can call `java` from Python process with `tabula

If you've installed `tabula`, it will be conflict the namespace. You should install `tabula-py` after removing `tabula`.

```
```bash
pip uninstall tabula
pip install tabula-py
```
@@ -137,15 +145,15 @@ pip install tabula-py

Yes. You can use `options` argument as following. The format is same as cli of tabula-java.

```py
```python
read_pdf(file_path, options="--columns 10.1,20.2,30.3")
```
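
As a minimal sketch of assembling that option string from Python values: the helper name `columns_option` below is hypothetical and not part of tabula-py. It only formats X coordinates the way the `--columns` flag in the snippet above is written (comma-separated, no spaces).

```python
def columns_option(boundaries):
    """Format X coordinates as a ``--columns`` argument string.

    Hypothetical helper for illustration; not part of tabula-py.
    """
    if not boundaries:
        raise ValueError("at least one column boundary is required")
    # str(float(x)) keeps values such as 10.1 unchanged and turns 10 into 10.0.
    joined = ",".join(str(float(x)) for x in boundaries)
    return "--columns " + joined


options = columns_option([10.1, 20.2, 30.3])
```

The resulting string could then be passed as `read_pdf(file_path, options=options)`.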

### How can I ignore useless area?

In short, you can extract with `area` and `spreadsheet` option.

```py
```python
In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
@@ -161,7 +169,7 @@ Out[4]:
8 F E E4 R 4
```

*How to use `area` option*
#### How to use `area` option

According to tabula-java wiki, there is a explain how to specify the area:
https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want
@@ -171,14 +179,14 @@ For example, using macOS's preview, I got area information of this [PDF](https:/
![image](https://cloud.githubusercontent.com/assets/916653/22047470/b201de24-dd6a-11e6-9cfc-7bc73e33e3b2.png)


```
```bash
java -jar ./target/tabula-1.0.1-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
```

given

```
Note the left, top, height, and width parameters and calculate the following:
```python
# Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
@@ -188,7 +196,7 @@ x2 = left + width

I confirmed with tabula-java:

```
```bash
java -jar ./tabula/tabula-1.0.1-jar-with-dependencies.jar -a "337.29,226.49,472.85,384.91" table.pdf
```
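
The `y1`/`x1`/`y2`/`x2` relations above can be sketched as a small Python function. The name `area_from_rect` is hypothetical and not part of tabula-py, and the width and height used in the call are illustrative values obtained by subtracting the coordinates in the command above.

```python
def area_from_rect(left, top, width, height):
    """Convert a selection rectangle into (top, left, bottom, right).

    Hypothetical helper for illustration; not part of tabula-py.
    """
    if width < 0 or height < 0:
        raise ValueError("width and height must be non-negative")
    y1 = top
    x1 = left
    y2 = top + height
    x2 = left + width
    return (y1, x1, y2, x2)


# Illustrative numbers only: width and height were derived by subtracting
# the coordinates used in the tabula-java command above.
area = area_from_rect(left=226.49, top=337.29, width=158.42, height=135.56)
```

The returned tuple follows the (top, left, bottom, right) order of the `area` option, so it could be passed as `area=area` to `tabula.read_pdf`.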

@@ -263,6 +271,10 @@ You can help by:
- [@CurtLH](https://github.com/CurtLH)
- [@nikhilgk](https://github.com/nikhilgk)
- [@krassowski](https://github.com/krassowski)
- [@alexandreio](https://github.com/alexandreio)
- [@rmnevesLH](https://github.com/rmnevesLH)
- [@red-bin](https://github.com/red-bin)
- [@Gallaecio](https://github.com/Gallaecio)

### Another support

2 changes: 2 additions & 0 deletions constraints.txt
@@ -5,6 +5,7 @@ attrs==19.1.0
backcall==0.1.0
black==19.3b0
Click==7.0
colorlog==3.2.0
decorator==4.4.0
distro==1.4.0
entrypoints==0.3
@@ -17,6 +18,7 @@ isort==4.3.21
jedi==0.14.1
mccabe==0.6.1
more-itertools==7.2.0
nox==2019.5.30
numpy==1.17.0
packaging==19.0
pandas==0.25.0
17 changes: 17 additions & 0 deletions noxfile.py
@@ -0,0 +1,17 @@
import nox


@nox.session
def lint(session):
    lint_tools = ["black", "isort", "flake8"]
    targets = ["tabula", "tests", "noxfile.py"]
    session.install(*lint_tools)
    session.run("flake8", *targets)
    session.run("black", "--diff", "--check", *targets)
    session.run("isort", "--check-only")


@nox.session
def tests(session):
    session.install(".[test]")
    session.run("pytest", "-v")
19 changes: 14 additions & 5 deletions setup.cfg
@@ -1,6 +1,15 @@
[wheel]
universal = 1

[flake8]
ignore = F401
max-line-length = 200
ignore = E203, W503
max-line-length = 88
exclude =
    .git,
    __pycache__,
    build,
    dist,
    .venv,
    tabula/__init__.py

[isort]
line_length=88
multi_line_output=3
include_trailing_comma=True
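
The new flake8 settings appear chosen to agree with black's output: black's default line length is 88, and its slice spacing and line breaks before binary operators are what pycodestyle reports as E203 and W503. A small sketch of code in that style follows; the variable names are made up, and the multi-line sum is kept short purely to show the break style (black itself would join an expression this short onto one line).

```python
ham = list(range(10))
lower, upper, offset = 1, 5, 2

# black puts spaces around ":" in slices with complex bounds; pycodestyle
# reports that as E203 (whitespace before ':').
window = ham[lower + offset : upper + offset]

# When an expression is too long for one line, black breaks before binary
# operators; pycodestyle reports that style as W503.
total = (
    sum(window)
    + lower
    + upper
)
```

With `E203` and `W503` ignored, flake8 should no longer flag either construct, so the `flake8` and `black --check` steps in the lint session do not contradict each other.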
59 changes: 26 additions & 33 deletions setup.py
@@ -1,54 +1,47 @@
from setuptools import setup
from setuptools import find_packages
import os

from setuptools import find_packages, setup


def read_file(filename):
    filepath = os.path.join(
        os.path.dirname(os.path.dirname(__file__)), filename)
    filepath = os.path.join(os.path.dirname(os.path.dirname(__file__)), filename)
    if os.path.exists(filepath):
        return open(filepath).read()
    else:
        return ''
        return ""


about = {}
with open(os.path.join(os.path.dirname(__file__), 'tabula', '__version__.py')) as f:
with open(os.path.join(os.path.dirname(__file__), "tabula", "__version__.py")) as f:
    exec(f.read(), about)

with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
    about['__long_description__'] = f.read()
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
    about["__long_description__"] = f.read()


setup(
    name=about['__title__'],
    version=about['__version__'],
    description=about['__description__'],
    long_description=about['__long_description__'],
    name=about["__title__"],
    version=about["__version__"],
    description=about["__description__"],
    long_description=about["__long_description__"],
    long_description_content_type="text/markdown",
    author=about['__author__'],
    author_email=about['__author_email__'],
    maintainer=about['__maintainer__'],
    maintainer_email=about['__maintainer_email__'],
    license=about['__license__'],
    url=about['__url__'],
    author=about["__author__"],
    author_email=about["__author_email__"],
    maintainer=about["__maintainer__"],
    maintainer_email=about["__maintainer_email__"],
    license=about["__license__"],
    url=about["__url__"],
    classifiers=[
        'Development Status :: 4 - Beta',
        'Topic :: Text Processing :: General',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.5',
        "Development Status :: 4 - Beta",
        "Topic :: Text Processing :: General",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.5",
    ],
    include_package_data=True,
    packages=find_packages(),
    keywords=['data frame', 'pdf', 'table'],
    install_requires=[
        'pandas',
        'numpy',
        'distro',
    ],
    extras_require={
        'dev': ['pytest', 'flake8', 'black', 'isort']
    },
    keywords=["data frame", "pdf", "table"],
    install_requires=["pandas", "numpy", "distro"],
    extras_require={"dev": ["pytest", "flake8", "black", "isort"], "test": ["pytest"]},
)
12 changes: 7 additions & 5 deletions tabula/__init__.py
@@ -1,6 +1,8 @@
from .wrapper import read_pdf
from .wrapper import read_pdf_with_template
from .wrapper import convert_into
from .wrapper import convert_into_by_batch
from .util import environment_info
from .__version__ import __version__
from .util import environment_info
from .wrapper import (
    convert_into,
    convert_into_by_batch,
    read_pdf,
    read_pdf_with_template,
)
18 changes: 9 additions & 9 deletions tabula/__version__.py
@@ -1,9 +1,9 @@
__title__ = 'tabula-py'
__version__ = '1.3.1'
__license__ = 'MIT License'
__description__ = 'Simple wrapper for tabula-java, read tables from PDF into DataFrame'
__author__ = 'Aki Ariga'
__author_email__ = 'chezou@gmail.com'
__maintainer__ = 'Aki Ariga'
__maintainer_email__ = 'chezou@gmail.com'
__url__ = 'https://github.com/chezou/tabula-py'
__title__ = "tabula-py"
__version__ = "1.3.1"
__license__ = "MIT License"
__description__ = "Simple wrapper for tabula-java, read tables from PDF into DataFrame"
__author__ = "Aki Ariga"
__author_email__ = "chezou@gmail.com"
__maintainer__ = "Aki Ariga"
__maintainer_email__ = "chezou@gmail.com"
__url__ = "https://github.com/chezou/tabula-py"
2 changes: 1 addition & 1 deletion tabula/errors/__init__.py
@@ -3,7 +3,7 @@

class CSVParseError(ParserError):
    def __init__(self, message, cause):
        super(CSVParseError, self).__init__(message + ', caused by ' + repr(cause))
        super(CSVParseError, self).__init__(message + ", caused by " + repr(cause))
        self.cause = cause

