Add script to OCR text in PDFs #7

mistydemeo · 2016-01-20T23:41:59Z

refs #8639.

mistydemeo · 2016-02-29T18:25:53Z

This needs some more discussion/work; see the original ticket for details. Specifically, the problem is that the ImageMagick step responsible for converting the PNGs to PDFs is producing unreadable, terrible-quality flat images.

mistydemeo · 2016-02-29T18:51:01Z

Updated to fix the scaling issue. This can get squashed down to one commit if the client approves.

axfelix · 2016-03-22T17:08:09Z

The tesseract line doesn't seem to love being called with subprocess.check_call from within Archivematica in my experience: 3d8b463#diff-1090aa5b96aa7be6b941205f17dabdccR16

Had it working, and then a Python update on an old 12.04 server (?) made it start throwing a kwargs issue (I'll reproduce the exact one from my other machine's search history later today), as though there was something wrong with the cmd args being reconstituted from the passed list. Jury-rigging it into a single string, cmd = "command " + var + " command " + var, worked fine, but I still haven't been able to figure out why.
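For anyone comparing the two call styles mentioned above, here is a minimal sketch, not the code from this PR. The helper names (build_tesseract_cmd, to_shell_string), the file names, and the exact tesseract arguments are assumptions made for illustration, and the exact error from the comment above is not known. It assumes Python 3 (shlex.quote is not available on Python 2) and a POSIX shell. The list form hands each argument to the program untouched; the string form needs each piece quoted before it goes through a shell.

```python
import shlex
import subprocess
import sys


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Return an argument list for a hypothetical tesseract call."""
    # Every element must be a str; a None or an int here fails inside subprocess.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def to_shell_string(args):
    """Quote each argument so the joined string is safe to run with shell=True."""
    return " ".join(shlex.quote(a) for a in args)


cmd = build_tesseract_cmd("page 1.png", "page 1")
print(cmd)                   # the list form, one element per argument
print(to_shell_string(cmd))  # the equivalent quoted single string

# Harmless stand-in command, so both call styles can be run without tesseract.
demo = [sys.executable, "-c", "import sys; print(sys.argv[1])", "page 1.png"]
subprocess.check_call(demo)                                # list form, no shell
subprocess.check_call(to_shell_string(demo), shell=True)   # string form via shell
```

If the list form fails while the concatenated string works, one thing worth checking is whether every list element is really a string, since building the string by concatenation would hide or convert a bad value. That is only a guess at the cause, not a diagnosis.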

Add script to OCR text in PDFs

3d8b463

refs #8639.

PDF: use gs to scale, not imagemagick

2c2b677
