Convert CIDs to Unicode #33

Open
lukehsiao opened this issue Mar 19, 2018 · 2 comments

lukehsiao commented Mar 19, 2018

PDFMiner gives us a bunch of CID characters in our output (e.g. 25(cid:176) C instead of 25° C). It would be great to be able to convert these to their respective Unicode characters before outputting. Some potentially useful references: [1], [2].

[1] https://stackoverflow.com/questions/24089245/decode-cid-font-codes-to-equivalent-ascii-characters
[2] https://github.com/adobe-type-tools/cmap-resources/

Update: looking into it, it seems that PDFMiner actually tries to take care of this, but poorly created PDFs that don't include all of the necessary information mean it cannot always convert to Unicode. See pdfminer/pdfminer.six#35.
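
As a rough illustration of what a post-processing fallback could look like, here is a minimal sketch. It rests on a big assumption: that the number in (cid:N) happens to coincide with a Unicode code point, as it does for the (cid:176) example above. That is not true in general, since a CID is an index into a font's character collection, so this would mis-decode many PDFs. The names `CID_PATTERN` and `cid_to_unicode` are hypothetical and not part of PDFMiner.

```python
import re

# Matches the literal "(cid:N)" markers that show up in the extracted text.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def cid_to_unicode(text: str) -> str:
    """Replace each (cid:N) marker with chr(N).

    ASSUMPTION: N is treated as a Unicode code point. This holds for
    (cid:176) -> '°' but is not guaranteed for arbitrary fonts.
    """

    def _substitute(match: re.Match) -> str:
        code = int(match.group(1))
        try:
            char = chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the marker untouched.
            return match.group(0)
        # Keep the marker if the result is a control or other
        # non-printable character rather than emitting garbage.
        return char if char.isprintable() else match.group(0)

    return CID_PATTERN.sub(_substitute, text)


print(cid_to_unicode("25(cid:176) C"))
```

Markers that do not map to a printable character are left as-is, so they can still be detected downstream.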

lukehsiao changed the title from "Convert CIDs to Unicode?" to "Convert CIDs to Unicode" on Mar 19, 2018

lukehsiao commented Mar 19, 2018

It may also be that we are not using pdfminer to the fullest. For example, (cid:176) appears in the glyphlist [1], which makes me wonder why it still appears in our output.

This makes parsing problematic: we get a single set of coordinates for the "character", but the literal string (cid:%d) is passed along as its text and is then handled as an ordinary string.

https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/converter.py#L127

lukehsiao self-assigned this Mar 20, 2018
lukehsiao modified the milestones: v0.3.1, v0.3.2 Mar 20, 2018
lukehsiao commented

Right now we just use a regex in the Fonduer parser to replace each CID with a wildcard character ($, at the moment). In an ideal world, we could fix this here, though.
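
For reference, the workaround amounts to something like the sketch below. This is not the actual Fonduer code; the pattern and the names `CID_MARKER`, `WILDCARD`, and `replace_cids` are assumptions made for illustration.

```python
import re

# Hypothetical pattern for the literal "(cid:N)" markers in extracted text.
CID_MARKER = re.compile(r"\(cid:\d+\)")

# The wildcard character mentioned above; "$" at the moment.
WILDCARD = "$"


def replace_cids(text: str, wildcard: str = WILDCARD) -> str:
    """Swap every (cid:N) marker for a single wildcard character."""
    # A callable replacement avoids any backslash-escape processing
    # of the wildcard by re.sub.
    return CID_MARKER.sub(lambda _match: wildcard, text)


print(replace_cids("25(cid:176) C"))
```

Collapsing each marker to one character keeps the text length in line with the single set of coordinates reported for it, at the cost of losing the original glyph.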
