This repository has been archived by the owner on Sep 25, 2019. It is now read-only.

Add script to OCR text in PDFs #7

Open: wants to merge 2 commits into base qa/1.x

Conversation

mistydemeo
Contributor

refs #8639.

@mistydemeo
Contributor Author

This needs more discussion and work; see the original ticket for details. Specifically, the ImageMagick step that converts the PNGs to PDFs is producing unreadable, very low-quality flattened images.

@mistydemeo
Contributor Author

Updated to fix the scaling issue. This can get squashed down to one commit if the client approves.

@axfelix

axfelix commented Mar 22, 2016

The tesseract line doesn't seem to love being called with subprocess.check_call from within Archivematica in my experience: 3d8b463#diff-1090aa5b96aa7be6b941205f17dabdccR16

I had it working, and then a Python update on an old 12.04 server (?) made it start throwing a kwargs error (I'll reproduce the exact one from my other machine's search history later today), as though something were wrong with how the cmd args were being reconstituted from the passed list. Jury-rigging it into a single concatenated string, cmd = "command " + var + " command " + var, worked fine, but I still haven't been able to figure out why.
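For reference, here is a rough sketch of the two call styles being compared, not the code from this PR. It assumes Python 3 (for `shlex.quote`) and the basic `tesseract <image> <outputbase>` invocation; `build_tesseract_cmd`, `run_ocr`, and `as_shell_string` are hypothetical helper names. One thing that may be worth ruling out is a non-string element (for example `None`) ending up in the list, since string concatenation and list passing fail differently in that case.

```python
import shlex
import subprocess


def build_tesseract_cmd(image_path, output_base):
    """Return the argument list for a basic tesseract run (hypothetical helper).

    Every element is coerced to str so that a missing value fails loudly
    here instead of somewhere inside subprocess.
    """
    if image_path is None or output_base is None:
        raise ValueError("image_path and output_base are required")
    return ["tesseract", str(image_path), str(output_base)]


def run_ocr(image_path, output_base):
    """Run tesseract with the list form: no shell, so paths need no quoting."""
    cmd = build_tesseract_cmd(image_path, output_base)
    subprocess.check_call(cmd)


def as_shell_string(cmd):
    """Quote each argument so the list can be logged or run with shell=True."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

If the string form really is needed, `subprocess.check_call(as_shell_string(cmd), shell=True)` keeps the quoting safe for paths with spaces, whereas plain `"command " + var` concatenation does not.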


2 participants