Merge pull request #1161 from QuentinAndre11/develop
Add `extra_attrs` parameter to `.dedupe_chars(...)`
jsvine authored Jul 6, 2024
2 parents 6c9ecb2 + 2bdbb5b commit 923707a
Showing 7 changed files with 78 additions and 8 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,12 @@

All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/).

+## [0.11.2] - Not yet released
+
+### Added
+
+- Add `extra_attrs` parameter to `.dedupe_chars(...)` to adjust the properties used when deduplicating (h/t @QuentinAndre11). ([#1114](https://github.com/jsvine/pdfplumber/issues/1114))
+
## [0.11.1] - 2024-06-11

### Fixed
3 changes: 2 additions & 1 deletion README.md
@@ -329,7 +329,7 @@ Note: The methods above are built on Pillow's [`ImageDraw` methods](http://pillo
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation characters specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|
|`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |
-|`.dedupe_chars(tolerance=1)`| Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within `tolerance` x/y) as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|
+|`.dedupe_chars(tolerance=1, extra_attrs=("fontname", "size"))`| Returns a version of the page with duplicate chars — those sharing the same text, positioning (within `tolerance` x/y), and `extra_attrs` as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|

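A usage sketch for the new signature may help here. It assumes a local PDF path supplied by the caller; the function name `compare_dedupe_settings` and the setting labels are hypothetical, while `pdfplumber.open`, `.dedupe_chars(...)`, and `.extract_text()` are the methods documented in this README.

```python
from typing import Dict, Tuple


def compare_dedupe_settings(pdf_path: str) -> Dict[str, str]:
    """Extract page-one text under several `extra_attrs` settings."""
    # Imported lazily so this sketch can be loaded without pdfplumber installed.
    import pdfplumber

    # Hypothetical labels mapped to the attributes that must match
    # (in addition to text and position) for two chars to count as duplicates.
    settings: Dict[str, Tuple[str, ...]] = {
        "default": ("fontname", "size"),
        "size_only": ("size",),
        "text_and_position_only": (),
    }

    results: Dict[str, str] = {}
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        results["no_dedupe"] = page.extract_text()
        for label, attrs in settings.items():
            deduped = page.dedupe_chars(tolerance=1, extra_attrs=attrs)
            results[label] = deduped.extract_text()
    return results
```

Comparing the returned strings side by side is one way to judge whether overlaid copies in a given PDF differ only in font or size, in which case a shorter `extra_attrs` tuple is needed to collapse them.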
## Extracting tables

@@ -543,6 +543,7 @@ Many thanks to the following users who've contributed ideas, features, and fixes
 - [Echedey Luis](https://github.com/echedey-ls)
 - [Andy Friedman](https://github.com/afriedman412)
 - [Aron Weiler](https://github.com/aronweiler)
+- [Quentin André](https://github.com/QuentinAndre11)

## Contributing

5 changes: 3 additions & 2 deletions pdfplumber/page.py
@@ -559,8 +559,9 @@ def filter(self, test_function: Callable[[T_obj], bool]) -> "FilteredPage":

     def dedupe_chars(self, **kwargs: Any) -> "FilteredPage":
         """
-        Removes duplicate chars — those sharing the same text, fontname, size,
-        and positioning (within `tolerance`) as other characters on the page.
+        Removes duplicate chars — those sharing the same text and positioning
+        (within `tolerance`) as other characters in the set. Adjust extra_attrs
+        to be more/less restrictive with the properties checked.
         """
         p = FilteredPage(self, lambda x: True)
         p._objects = {kind: objs for kind, objs in self.objects.items()}
13 changes: 9 additions & 4 deletions pdfplumber/utils/text.py
@@ -781,12 +781,17 @@ def extract_text_simple(
     return "\n".join(collate_line(c, x_tolerance) for c in clustered)


-def dedupe_chars(chars: T_obj_list, tolerance: T_num = 1) -> T_obj_list:
+def dedupe_chars(
+    chars: T_obj_list,
+    tolerance: T_num = 1,
+    extra_attrs: Optional[Tuple[str, ...]] = ("fontname", "size"),
+) -> T_obj_list:
     """
-    Removes duplicate chars — those sharing the same text, fontname, size,
-    and positioning (within `tolerance`) as other characters in the set.
+    Removes duplicate chars — those sharing the same text and positioning
+    (within `tolerance`) as other characters in the set. Use extra_attrs to
+    be more restrictive with the properties shared by the matching chars.
     """
-    key = itemgetter("fontname", "size", "upright", "text")
+    key = itemgetter(*("upright", "text"), *(extra_attrs or tuple()))
     pos_key = itemgetter("doctop", "x0")

     def yield_unique_chars(chars: T_obj_list) -> Generator[T_obj, None, None]:
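The key construction in this hunk can be illustrated with a standalone sketch. It is a simplification under stated assumptions, not the library's implementation: the names `build_key` and `dedupe_sketch` are hypothetical, chars are plain dicts, and a pairwise tolerance check stands in for the clustering that the rest of `dedupe_chars` performs.

```python
from operator import itemgetter
from typing import Any, Callable, Dict, List, Optional, Tuple

Char = Dict[str, Any]


def build_key(
    extra_attrs: Optional[Tuple[str, ...]] = ("fontname", "size"),
) -> Callable[[Char], Tuple[Any, ...]]:
    # "upright" and "text" are always compared; extra_attrs adds more fields.
    # `None` is treated like an empty tuple, as in the diff above.
    return itemgetter("upright", "text", *(extra_attrs or ()))


def dedupe_sketch(
    chars: List[Char],
    tolerance: float = 1,
    extra_attrs: Optional[Tuple[str, ...]] = ("fontname", "size"),
) -> List[Char]:
    # Simplified stand-in: keep a char unless an already-kept char has the
    # same key and sits within `tolerance` on both x0 and doctop.
    key = build_key(extra_attrs)
    kept: List[Char] = []
    for char in chars:
        is_duplicate = any(
            key(other) == key(char)
            and abs(other["x0"] - char["x0"]) <= tolerance
            and abs(other["doctop"] - char["doctop"]) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# Made-up sample data: three overlaid copies of the letter "A".
base = {"text": "A", "upright": True, "size": 12, "doctop": 20.0}
sample = [
    {**base, "fontname": "Helvetica", "x0": 10.0},
    {**base, "fontname": "Helvetica-Bold", "x0": 10.5},
    {**base, "fontname": "Helvetica", "x0": 10.4},
]

print(len(dedupe_sketch(sample)))                  # 2: the bold copy survives
print(len(dedupe_sketch(sample, extra_attrs=())))  # 1: font is ignored
```

Because the key always contains at least `"upright"` and `"text"`, `itemgetter` always returns a tuple here, so keys built from different `extra_attrs` settings stay comparable in shape within a single call.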
2 changes: 1 addition & 1 deletion requirements-dev.txt
@@ -4,7 +4,7 @@ pytest-parallel==0.1.1
flake8==4.0.1
black==22.3.0
isort==5.10.1
-pandas==2.0.3
+pandas==2.2.2
mypy==0.981
pandas-stubs==1.2.0.58
types-Pillow==9.0.14
Binary file added tests/pdfs/issue-1114-dedupe-chars.pdf
Binary file not shown.
57 changes: 57 additions & 0 deletions tests/test_dedupe_chars.py
@@ -72,3 +72,60 @@ def test_extract_text2(self):
             page.dedupe_chars().extract_text(y_tolerance=6).splitlines()[4]
             == "UE 8. Circulation - Métabolismes"
         )
+
+    def test_extra_attrs(self):
+        path = os.path.join(HERE, "pdfs/issue-1114-dedupe-chars.pdf")
+        pdf = pdfplumber.open(path)
+        page = pdf.pages[0]
+
+        def dup_chars(s: str) -> str:
+            return "".join((char if char == " " else char + char) for char in s)
+
+        ground_truth = (
+            ("Simple", False, False),
+            ("Duplicated", True, True),
+            ("Font", "fontname", True),
+            ("Size", "size", True),
+            ("Italic", "fontname", True),
+            ("Weight", "fontname", True),
+            ("Horizontal shift", False, "HHoorrizizoonntatal ls shhifitft"),
+            ("Vertical shift", False, True),
+        )
+        gt = []
+        for text, should_dedup, dup_text in ground_truth:
+            if isinstance(dup_text, bool):
+                if dup_text:
+                    dup_text = dup_chars(text)
+                else:
+                    dup_text = text
+            gt.append((text, should_dedup, dup_text))
+
+        keys_list = ["no_dedupe", (), ("size",), ("fontname",), ("size", "fontname")]
+        for keys in keys_list:
+            if keys != "no_dedupe":
+                filtered_page = page.dedupe_chars(tolerance=2, extra_attrs=keys)
+            else:
+                filtered_page = page
+            for i, line in enumerate(
+                filtered_page.extract_text(y_tolerance=5).splitlines()
+            ):
+                text, should_dedup, dup_text = gt[i]
+                if keys == "no_dedupe":
+                    should_dedup = False
+                if isinstance(should_dedup, str):
+                    if should_dedup in keys:
+                        fail_msg = (
+                            f"{should_dedup} is not required to match "
+                            "so it should be duplicated"
+                        )
+                        assert line == dup_text, fail_msg
+                    else:
+                        fail_msg = (
+                            "Should not be duplicated "
+                            f"when requiring matching {should_dedup}"
+                        )
+                        assert line == text, fail_msg
+                elif should_dedup:
+                    assert line == text
+                else:
+                    assert line == dup_text
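The branching in this test can be restated as a small pure function, which may make the expectations easier to read. This is a sketch under assumptions: `expected_line` and its parameter names are hypothetical, and `None` stands in for the test's `"no_dedupe"` sentinel. The second argument follows the test's `should_dedup` column: `False` (the line is never collapsed), `True` (copies are identical and always collapsed), or the name of the attribute in which the overlaid copies differ.

```python
from typing import Optional, Tuple, Union


def dup_chars(s: str) -> str:
    # Double every non-space character, as overlaid copies appear in the PDF.
    return "".join(c if c == " " else c + c for c in s)


def expected_line(
    text: str,
    should_dedup: Union[bool, str],
    rendered: str,
    extra_attrs: Optional[Tuple[str, ...]],
) -> str:
    """Return the line expected from extraction for one `extra_attrs` setting."""
    if extra_attrs is None:
        # No deduplication at all: the text comes out as rendered.
        return rendered
    if isinstance(should_dedup, str):
        # Copies differ in this attribute. If it must match, they are not
        # treated as duplicates and the doubled text remains.
        return rendered if should_dedup in extra_attrs else text
    return text if should_dedup else rendered


font_line = dup_chars("Font")
print(expected_line("Font", "fontname", font_line, ("fontname",)))  # FFoonntt
print(expected_line("Font", "fontname", font_line, ("size",)))      # Font
```

Read this way, an attribute listed in `extra_attrs` is one that must match, so copies differing in it survive deduplication.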
