add ignore_char_properties arg in dedupe_chars #1161
Hi @QuentinAndre11, and thanks for your interest in
def dedupe_chars(chars: T_obj_list, tolerance: T_num = 1) -> T_obj_list:
    """
    Removes duplicate chars — those sharing the same text, fontname, size,
    and positioning (within `tolerance`) as other characters in the set.
    """
    key = itemgetter("fontname", "size", "upright", "text")
... doing this:
def dedupe_chars(
    chars: T_obj_list,
    tolerance: T_num = 1,
    properties: Tuple[str, ...] = ("fontname", "size", "upright", "text"),
) -> T_obj_list:
    """
    Removes duplicate chars — those sharing the same text, fontname, size,
    and positioning (within `tolerance`) as other characters in the set.
    """
    key = itemgetter(*properties)
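To illustrate why making the key configurable matters, here is a minimal sketch of how `itemgetter(*properties)` decides which chars land in the same comparison group. The sample dicts and the helper `count_key_groups` are hypothetical and exist only for this illustration; they are not part of the library.

```python
from collections import Counter
from operator import itemgetter
from typing import Any, Dict, List, Tuple

# Hypothetical sample chars: same glyph, but one reports a different font name.
sample_chars: List[Dict[str, Any]] = [
    {"text": "A", "fontname": "Helvetica", "size": 10.0, "upright": True},
    {"text": "A", "fontname": "Helvetica-Bold", "size": 10.0, "upright": True},
    {"text": "A", "fontname": "Helvetica", "size": 10.0, "upright": True},
]


def count_key_groups(chars: List[Dict[str, Any]], properties: Tuple[str, ...]) -> Counter:
    """Count how many chars share each key built from `properties`."""
    key = itemgetter(*properties)  # returns a tuple when given 2+ names
    return Counter(key(char) for char in chars)


# With fontname in the key, the bold copy forms its own group.
full = count_key_groups(sample_chars, ("fontname", "size", "upright", "text"))

# Without fontname, all three chars become candidates for deduplication.
reduced = count_key_groups(sample_chars, ("size", "upright", "text"))
```

Chars that fall into different groups are never compared by position, so dropping a property from the key is what lets otherwise identical chars with differing font names be treated as duplicates.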
Hi! I thought text and upright aren't properties that we would ever want to ignore. The only case I can think of for text would be visually similar letters like I and l, but why on earth would we want to deduplicate those? The same goes for characters that look the same in either orientation, like o, but that doesn't seem like a relevant feature either. PS: the tests are failing because of pandas; I guess I can't do anything about that?
If needed, I can replace the kwargs, going from dedupe_chars(tolerance) to dedupe_chars(tolerate_x: T_num = 1, tolerate_y: T_num = 1, ignore_font: bool = False, tolerate_size: T_num = 0). It's not that much additional work, @jsvine.
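As a rough sketch of what that alternative signature could mean in practice, the pairwise check below uses the parameter names from the comment above. The function `is_duplicate` and the assumption that chars are dicts with `x0` and `doctop` position keys are illustrative only; this variant was not adopted in the thread.

```python
from typing import Any, Dict

Char = Dict[str, Any]


def is_duplicate(
    a: Char,
    b: Char,
    tolerate_x: float = 1,
    tolerate_y: float = 1,
    ignore_font: bool = False,
    tolerate_size: float = 0,
) -> bool:
    """Hypothetical pairwise test: do `a` and `b` count as the same char?"""
    # Text and orientation must always match exactly.
    if a["text"] != b["text"] or a["upright"] != b["upright"]:
        return False
    # Font name is compared only when the caller has not opted out.
    if not ignore_font and a["fontname"] != b["fontname"]:
        return False
    # Sizes may differ by at most `tolerate_size`.
    if abs(a["size"] - b["size"]) > tolerate_size:
        return False
    # Positions are checked per axis with separate tolerances.
    return (
        abs(a["x0"] - b["x0"]) <= tolerate_x
        and abs(a["doctop"] - b["doctop"]) <= tolerate_y
    )


first = {"text": "A", "upright": True, "fontname": "Helvetica",
         "size": 10.0, "x0": 10.0, "doctop": 20.0}
second = {"text": "A", "upright": True, "fontname": "Helvetica-Bold",
          "size": 10.5, "x0": 10.25, "doctop": 20.0}

strict = is_duplicate(first, second)
relaxed = is_duplicate(first, second, ignore_font=True, tolerate_size=0.5)
```

The trade-off is a wider signature with four knobs instead of one, in exchange for finer control over each dimension of the comparison.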
Thanks, @QuentinAndre11. This is a good point:
Given that there's no real reason to change them, what about this updated definition?
def dedupe_chars(
    chars: T_obj_list,
    tolerance: T_num = 1,
    extra_attrs: Optional[Tuple[str, ...]] = ("fontname", "size"),
) -> T_obj_list:
    """
    Removes duplicate chars — those sharing the same text, fontname, size,
    and positioning (within `tolerance`) as other characters in the set.
    """
    key = itemgetter(*("upright", "text"), *(extra_attrs or tuple()))
Then, all you'd have to do for your particular use-case is call
Another advantage is that the parameter matches the name (and idea) of the
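For readers who want to try the idea outside the library, the following is a self-contained sketch of deduplication with an `extra_attrs` parameter. It assumes chars are plain dicts with `x0` and `doctop` position keys, and it replaces the library's clustering step with a simple greedy comparison, so `dedupe_chars_sketch` is a stand-in for illustration and not the actual implementation.

```python
from operator import itemgetter
from typing import Any, Dict, List, Optional, Tuple

Char = Dict[str, Any]


def dedupe_chars_sketch(
    chars: List[Char],
    tolerance: float = 1,
    extra_attrs: Optional[Tuple[str, ...]] = ("fontname", "size"),
) -> List[Char]:
    """Keep the first char of each near-identical group, in input order."""
    # "upright" and "text" are always part of the key; extra_attrs is optional.
    key = itemgetter(*("upright", "text"), *(extra_attrs or ()))
    kept_by_key: Dict[Any, List[Char]] = {}
    unique: List[Char] = []
    for char in chars:
        kept = kept_by_key.setdefault(key(char), [])
        # Greedy check (an assumption of this sketch): a char is a duplicate
        # if it sits within `tolerance` of an already-kept char on both axes.
        is_dup = any(
            abs(char["x0"] - other["x0"]) <= tolerance
            and abs(char["doctop"] - other["doctop"]) <= tolerance
            for other in kept
        )
        if not is_dup:
            kept.append(char)
            unique.append(char)
    return unique


page_chars = [
    {"text": "A", "upright": True, "fontname": "Helvetica",
     "size": 10.0, "x0": 10.0, "doctop": 20.0},
    {"text": "A", "upright": True, "fontname": "Helvetica-Bold",
     "size": 10.0, "x0": 10.2, "doctop": 20.1},
    {"text": "B", "upright": True, "fontname": "Helvetica",
     "size": 10.0, "x0": 18.0, "doctop": 20.0},
]

with_font = dedupe_chars_sketch(page_chars)
without_font = dedupe_chars_sketch(page_chars, extra_attrs=None)
```

With the default `extra_attrs`, the two overlapping "A" chars survive because their font names differ; passing `extra_attrs=None` compares only `upright` and `text`, so the second "A" is dropped.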
@jsvine Works for me (and my use case), so I implemented it your way!
Thanks! Pushed some documentation tweaks and a fix for the failing tests (we just needed to upgrade
Looks like there's still an issue with Python 3.8 (because the latest
See #1114
More precise parameters could be used for char matching in dedupe_chars (for example, a list of lists of similar font names, or a tolerance for the size), but for my use case simply ignoring these keys was sufficient, and it was easier to test.
PS: This is my first contribution, so I did not know whether I should add a new version to the Changelog or append this change to the latest one.