Convert CIDs to Unicode #33

lukehsiao · 2018-03-19T19:34:57Z

PDFMiner gives us a bunch of CID characters in our output (e.g. 25(cid:176) C instead of 25° C). It would be great to be able to convert these to their respective Unicode characters before outputting. Some potentially useful references: [1], [2].

[1] https://stackoverflow.com/questions/24089245/decode-cid-font-codes-to-equivalent-ascii-characters
[2] https://github.com/adobe-type-tools/cmap-resources/

Update: looking into it, it seems that PDFMiner actually tries to take care of this, but poorly created PDFs that don't include all of the necessary information cannot always be converted to Unicode. See pdfminer/pdfminer.six#35.
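For PDFs that lack the mapping, one fallback would be a hand-maintained lookup table. Below is a minimal sketch, not anything PDFMiner provides: `CID_TO_UNICODE` and `replace_known_cids` are hypothetical names, and the table assumes that CID 176 really is the degree sign in the fonts at hand. CIDs are font-specific, so that assumption will not hold in general.

```python
import re

# Hypothetical, hand-maintained table mapping CID -> Unicode character.
# Assumption: these CIDs correspond to these glyphs in the fonts we see.
CID_TO_UNICODE = {176: "\u00b0"}  # degree sign

# Matches the "(cid:%d)" placeholder that shows up in the extracted text.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def replace_known_cids(text, table=CID_TO_UNICODE):
    """Replace (cid:N) tokens found in `table`; leave unknown ones untouched."""

    def _substitute(match):
        cid = int(match.group(1))
        # Fall back to the original token when the CID is not in the table.
        return table.get(cid, match.group(0))

    return CID_TOKEN.sub(_substitute, text)
```

For example, `replace_known_cids("25(cid:176) C")` gives `25° C`, while a token whose CID is not in the table is passed through unchanged.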


lukehsiao · 2018-03-19T21:53:33Z

It may also be that we are not using pdfminer to the fullest. For example, (cid:176) appears in the glyphlist [1], which makes me wonder why it still shows up in our output.

This makes parsing problematic: we get a single set of coordinates for the "character", but the text passed along with it is the literal string (cid:%d), which downstream code then treats as several characters rather than one.

https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/converter.py#L127

lukehsiao · 2018-03-20T21:11:25Z

Right now we're just using a regex in the Fonduer parser to replace each (cid:%d) with a wildcard character ($, at the moment). In an ideal world, we could fix this here, though.
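Roughly, the workaround looks like the sketch below. This is an illustration rather than the actual Fonduer code: the names `CID_PLACEHOLDER` and `mask_cids` are made up here, and the exact pattern used in the parser may differ.

```python
import re

# Matches the "(cid:%d)" placeholder emitted for unmapped characters.
CID_PLACEHOLDER = re.compile(r"\(cid:\d+\)")


def mask_cids(text, wildcard="$"):
    """Replace every (cid:N) token with a single wildcard character."""
    # A function is used as the replacement so the wildcard is inserted
    # literally, with no backslash-escape processing by re.sub.
    return CID_PLACEHOLDER.sub(lambda _match: wildcard, text)
```

With this, `mask_cids("25(cid:176) C")` yields `25$ C`, so each placeholder collapses back to one character, matching the single set of coordinates it came with.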

lukehsiao added enhancement help wanted labels Mar 19, 2018

lukehsiao changed the title ~~Convert CIDs to Unicode?~~ Convert CIDs to Unicode Mar 19, 2018

lukehsiao self-assigned this Mar 20, 2018

lukehsiao removed the help wanted label Mar 20, 2018

lukehsiao modified the milestones: v0.3.1, v0.3.2 Mar 20, 2018

lukehsiao added the wontfix label Mar 20, 2018

lukehsiao removed their assignment Apr 4, 2018

lukehsiao removed this from the v0.3.2 milestone Apr 7, 2018

jsvine mentioned this issue Apr 17, 2018

Extracting filled polygons and saving as new pdf jsvine/pdfplumber#57

Closed

HiromuHota mentioned this issue Oct 28, 2020

Treat "(cid:%d)" as a possible char to reduce "Out of order" warnings #102

Merged

4 tasks
