Convert CIDs to Unicode #33
Comments
It may also be that we are not using pdfminer to the fullest. The current behavior makes parsing problematic: we get a single set of coordinates output for the "character", but the placeholder (cid:%d) is passed along and interpreted as a literal string. See https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/converter.py#L127
Right now we're just using a regex in the Fonduer parser to replace each CID with a wildcard character.
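For illustration, the regex replacement described above might look roughly like the sketch below. This is not the actual Fonduer code: the names `CID_PATTERN`, `WILDCARD`, and `replace_cids` are hypothetical, and the wildcard character shown is a stand-in, since the one actually used is not given in this thread.

```python
import re

# Matches pdfminer's placeholder for a glyph it could not map, e.g. "(cid:176)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")

# ASSUMPTION: "?" is only a stand-in; the real wildcard character is not shown here.
WILDCARD = "?"


def replace_cids(text, wildcard=WILDCARD):
    """Replace every "(cid:N)" placeholder in text with a single wildcard character."""
    return CID_PATTERN.sub(wildcard, text)


print(replace_cids("25(cid:176) C"))
```

Collapsing the placeholder to one character keeps the text aligned with the single set of coordinates reported for that "character", at the cost of losing which glyph it was.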
PDFMiner gives us a bunch of CID characters in our output (e.g., `25(cid:176) C` instead of `25° C`). It would be great to be able to convert these to their respective Unicode characters before outputting. Some potentially useful references: [1], [2].
[1] https://stackoverflow.com/questions/24089245/decode-cid-font-codes-to-equivalent-ascii-characters
[2] https://github.com/adobe-type-tools/cmap-resources/
Update: looking into it, PDFMiner actually tries to take care of this, but poorly created PDFs that omit the necessary mapping information mean it cannot always convert to Unicode. See pdfminer/pdfminer.six#35.
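As a rough fallback for the `25(cid:176) C` case, one could try treating the CID number as a Unicode code point. This is only a heuristic sketch under a strong assumption: in general a CID is an index into a font's character collection, not a code point, so this works only when the two happen to coincide (as 176 does with U+00B0, the degree sign). A correct conversion needs the font's ToUnicode mapping or CMap resources such as [2]. The function name `cids_to_unicode` is hypothetical.

```python
import re

# Captures the numeric part of pdfminer's "(cid:N)" placeholder.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def cids_to_unicode(text):
    """Replace "(cid:N)" with chr(N), ASSUMING N happens to be a Unicode code point."""

    def _decode(match):
        code = int(match.group(1))
        if code > 0x10FFFF:
            # Not a valid code point: keep the placeholder unchanged.
            return match.group(0)
        char = chr(code)
        # Keep the placeholder for control characters, surrogates, and other
        # non-printable values, where the guess is almost certainly wrong.
        return char if char.isprintable() else match.group(0)

    return CID_PATTERN.sub(_decode, text)


print(cids_to_unicode("25(cid:176) C"))
```

Because the guess can silently produce the wrong character for fonts with custom encodings, it would be safer to apply it only to documents where the output has been spot-checked.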