Improve Progress Logs #598
Comments
When loading a language, if the traineddata doesn't exist in the cache, Tesseract downloads it first. But the log doesn't report any progress midway, only 0 at the beginning and 1 at the end. I can only show a fake progress value to imitate the real progress, which is obviously very inaccurate.
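For reference, one way download progress can be measured in general is to read a response body as a stream and count bytes against the `Content-Length` header. The sketch below is not taken from Tesseract.js; `readWithProgress` and `onProgress` are hypothetical names, and it assumes the server sends a `Content-Length` that matches the bytes delivered to the stream (which may not hold when the transfer is compressed). Whether this fits Tesseract.js's loading code is not something this thread establishes.

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Reads a fetch-style Response chunk by chunk and reports progress in [0, 1].
async function readWithProgress(response, onProgress) {
  // Assumption: Content-Length is present and matches the streamed byte count.
  const total = Number(response.headers.get('Content-Length')) || 0;
  const reader = response.body.getReader();
  const chunks = [];
  let received = 0;

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    received += value.length;
    // Without a usable total, progress stays at 0 until the download ends.
    onProgress(total > 0 ? Math.min(received / total, 1) : 0);
  }

  // Join the chunks into one Uint8Array.
  const result = new Uint8Array(received);
  let offset = 0;
  for (const chunk of chunks) {
    result.set(chunk, offset);
    offset += chunk.length;
  }
  return result;
}
```

In a browser this would be called as `readWithProgress(await fetch(url), p => ...)`, where `url` stands for wherever the language data is hosted.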
There are several distinct issues brought up here, so I'll try to respond to each below.
Verbiage Changing with Progress ("Initializing" vs. "Initialized", "Loading" vs. "Loaded")
I agree that the verbiage should be consistent--using "initializing" when progress is 0 and then "initialized" when progress is 1 unnecessarily complicates things. This is proven by the fact that switching from consistent verbiage to inconsistent verbiage broke the loading bars in this repo's own demo site, which remain broken to this day. Unfortunately, this issue was not introduced recently--it first appears in this commit from 2018 (in the alpha version of Tesseract.js v2). Therefore, "fixing" it will be a breaking change that will break people's code. I still think we should make this change, but it will need to happen in a major release (the next release will be v5).
Simplified Progress Reporting
I agree that a simplified progress reporting feature (whether at the worker or scheduler level) could be useful for new users trying to implement basic progress bars. I do not anticipate having the time to develop this; however, if somebody else were to implement an option for reporting simplified progress as you describe, and it works well, I would merge it in.
Language Data Loading Bar (@Mobbbb)
It is true that, at present, Tesseract.js loads a large amount of language data, and this can take a while and appear to stall any loading bar during that time. Unfortunately, I do not believe the Fetch API reports progress when downloading files, nor am I aware of any other way to implement this easily. However, I think this will largely become a moot point once we reduce the amount of language data downloaded by default. Once the changes described in #806 are implemented, the default English
The language in progress logs has been standardized in v5. Now, the same verbiage is used for the entirety of each step (no "initializing" vs. "initialized", "loading" vs. "loaded", etc.). Additionally, waiting for progress should be much less of an issue as v5 significantly reduced file sizes (50-75%). Given the above changes, I am closing this issue. If anybody here upgrades to Tesseract.js v5 and still finds reporting progress problematic, they should open a new issue.
Is your feature request related to a problem? Please describe.
I am running multiple recognize jobs on multiple workers. It is very hard to implement a simple progress bar for the process.
There are inconsistencies when it comes to the initialisation. E.g. we get the status 'initializing api' and then the status 'initialized api' when it's done. Why not have one status and make use of the progress property? I needed to implement a mapping table to unify the messages:
const statusMap = {
  'initializing api': 'initialized api',
  'initializing tesseract': 'initialized tesseract',
  'loading language traineddata': 'loaded language traineddata',
}
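As a rough illustration of how such a mapping table might be applied, the sketch below rewrites each log message so that every phase reports under a single status name. `normalizeMessage` is a hypothetical helper, and the `{ status, progress }` message shape is an assumption based on the description above.

```javascript
// Mapping table from the issue: "in progress" status -> "done" status.
const statusMap = {
  'initializing api': 'initialized api',
  'initializing tesseract': 'initialized tesseract',
  'loading language traineddata': 'loaded language traineddata',
};

// Hypothetical helper: returns a copy of the log message whose status
// is the unified name, so one phase always appears under one key.
function normalizeMessage(message) {
  const status = statusMap[message.status] ?? message.status;
  return { ...message, status };
}
```

With this in place, a progress bar only needs to track one key per phase and can read the 0 to 1 value from `progress`.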
When working with multiple workers, I need to keep track of the worker ids and multiple initialisation phases (api, training data), and when the job is running, I need to keep track of job ids that run on multiple workers.
So just implementing a user-friendly 0-100% progress bar is way more complicated than implementing the OCR process itself.
Describe the solution you'd like
It would be great to unify the different initialisation phases. Moreover, it would be nice to get a job pool progress to get the overall progress without needing to collect them manually.
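To make the request concrete, here is a minimal sketch of what collecting job pool progress by hand could look like. `createPoolProgress` is a hypothetical helper, not an existing Tesseract.js API. It assumes every logger message carries `jobId`, `status` and `progress` fields, and that recognition progress is reported under the status 'recognizing text'.

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Tracks the latest recognition progress per job and averages over all jobs.
function createPoolProgress(totalJobs) {
  const progressByJob = new Map();

  return {
    // Feed every logger message from every worker into this function.
    update(message) {
      // Assumption: only 'recognizing text' messages describe job progress;
      // initialisation messages are ignored in this sketch.
      if (message.status !== 'recognizing text' || !message.jobId) return;
      const previous = progressByJob.get(message.jobId) || 0;
      // Keep the highest value seen so the bar never moves backwards.
      progressByJob.set(message.jobId, Math.max(previous, message.progress));
    },

    // Overall progress in [0, 1]; jobs that have not started count as 0.
    overall() {
      let sum = 0;
      for (const p of progressByJob.values()) sum += p;
      return totalJobs > 0 ? sum / totalJobs : 0;
    },
  };
}
```

The same `update` function would be passed as the logger for each worker, and `overall()` multiplied by 100 gives the percentage for a single progress bar.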
Describe alternatives you've considered
An alternative would be to first ignore the initialisation and focus on the recognizing, but that in itself is very complicated.