Improve Progress Logs #598

Closed
jwedel opened this issue Jan 29, 2022 · 3 comments

@jwedel
jwedel commented Jan 29, 2022

Is your feature request related to a problem? Please describe.
I am running multiple recognize jobs on multiple workers. It is very hard to implement a simple progress bar for the process.

  1. There are inconsistencies when it comes to initialisation. E.g. we get the status initializing api and then the status initialized api when it's done. Why not have one status and make use of the progress property? I needed to implement a mapping table to unify the messages: const statusMap = { 'initializing api': 'initialized api', 'initializing tesseract': 'initialized tesseract', 'loading language traineddata': 'loaded language traineddata', }

  2. When working with multiple workers, I need to keep track of the worker ids and multiple initialisation phases (api, training data), and when the jobs are running, I need to keep track of job ids that run on multiple workers.

So just implementing a user-friendly 0-100% progress bar is far more complicated than implementing the OCR process itself.
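
The bookkeeping described above can be sketched roughly as follows. This is only a minimal sketch, not part of Tesseract.js: `createProgressTracker` and its `expectedJobs` parameter are hypothetical names, and the code assumes that every logger message has the shape `{ workerId, jobId, status, progress }` with `progress` between 0 and 1, and that recognition messages use the status `'recognizing text'`. It ignores the initialisation phases and only averages the recognition progress of all jobs.

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Assumes logger messages look like { workerId, jobId, status, progress },
// with progress in the range 0..1.

// Map the "in progress" wording onto the "done" wording so that each
// initialisation phase is reported under a single name.
const statusMap = {
  'initializing api': 'initialized api',
  'initializing tesseract': 'initialized tesseract',
  'loading language traineddata': 'loaded language traineddata',
};

function createProgressTracker(expectedJobs) {
  const jobProgress = new Map(); // jobId -> latest progress (0..1)

  return {
    // Feed every logger message from every worker into this function.
    // Returns the normalised status name.
    update(message) {
      const status = statusMap[message.status] || message.status;
      // Only recognition messages count towards the overall bar;
      // initialisation phases are ignored in this sketch.
      if (status === 'recognizing text' && message.jobId !== undefined) {
        jobProgress.set(message.jobId, message.progress);
      }
      return status;
    },
    // Overall progress in percent, averaged over all expected jobs.
    percent() {
      let sum = 0;
      for (const p of jobProgress.values()) sum += p;
      return Math.round((sum / expectedJobs) * 100);
    },
  };
}
```

The same `update` function would be passed as the logger of every worker, and `percent()` read whenever the bar is redrawn. Jobs that have not reported yet simply count as 0.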

Describe the solution you'd like
It would be great to unify the different initialisation phases. Moreover, it would be nice to get a job pool progress to get the overall progress without needing to collect them manually.

Describe alternatives you've considered
An alternative would be to ignore the initialisation at first and focus on the recognizing, but that in itself is very complicated.

@Mobbbb
Mobbbb commented May 7, 2022

When loading a language, if the traineddata doesn't exist in the cache, tesseract will download it first. But the log doesn't report any progress along the way, only 0 at the beginning and 1 at the end. I can only show a fake progress bar to imitate the real progress, which is obviously very inaccurate.

@Balearica
Member

There are several distinct issues brought up here, so I'll try to respond to each below.

Verbiage Changing with Progress ("Initializing" vs. "Initialized", "Loading" vs. "Loaded")

I agree that the verbiage should be consistent--using "initializing" when progress is 0 and then "initialized" when progress is 1 unnecessarily complicates things. This is proven by the fact that switching from consistent verbiage to inconsistent verbiage broke the loading bars in this repo's own demo site, which remain broken to this day.

Unfortunately, this issue was not introduced recently--it first appears in this commit from 2018 (in the alpha version of Tesseract.js v2). Therefore, "fixing" it will be a breaking change for people's existing code. I still think we should make this change, but it will need to happen in a major release (the next release will be v5).

Simplified Progress Reporting

I agree that a simplified progress reporting feature (whether at the worker or scheduler level) could be useful for new users trying to implement basic progress bars. I do not anticipate having the time to develop this; however, if somebody else were to implement an option for reporting simplified progress as you describe and it worked well, I would merge it in.

Language Data Loading Bar (@Mobbbb)

It is true that, at present, Tesseract.js loads a large amount of language data, and this can take a while and appear to stall any loading bar during that time. Unfortunately, I do not believe the Fetch API reports progress when downloading files, nor am I aware of any other way to implement this easily. However, I think this will largely become a moot point once we reduce the amount of language data downloaded by default. Once the changes described in #806 are implemented, the default English .traineddata will decrease from 10.4 MB to 2.95 MB (72% decrease) and the Chinese (simplified) .traineddata will decrease from 20.2 MB to 1.7 MB (94% decrease). Even without incremental progress reported for file downloads, files of this size should not produce a significant stall except on the slowest internet connections. This is another breaking change, so it will be implemented in Tesseract.js v5.
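
For anyone who wants to experiment with download progress: `fetch` has no progress event, but the response body can be read as a stream and the received bytes counted against the `Content-Length` header. The following is only a minimal sketch under those assumptions, not something Tesseract.js does; `readWithProgress` is a hypothetical helper name, and it assumes that `response.body` is available as a `ReadableStream` and that the server sends `Content-Length`.

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Reads a fetch Response incrementally and reports the fraction downloaded.
// Without a Content-Length header the total is unknown, and onProgress
// receives null instead of a fraction.
async function readWithProgress(response, onProgress) {
  const total = Number(response.headers.get('Content-Length')) || null;
  const reader = response.body.getReader();
  const chunks = [];
  let received = 0;

  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    received += value.length;
    // Clamp to 1 because Content-Length may describe the compressed size.
    onProgress(total ? Math.min(received / total, 1) : null);
  }

  // Concatenate the chunks into a single Uint8Array.
  const result = new Uint8Array(received);
  let offset = 0;
  for (const chunk of chunks) {
    result.set(chunk, offset);
    offset += chunk.length;
  }
  return result;
}
```

A caller would pass the result of `await fetch(url)` together with a callback that updates the bar. When the response is compressed, `Content-Length` may not match the number of bytes read, so the reported fraction is approximate at best.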

@Balearica Balearica added this to the v5.0 milestone Aug 30, 2023
@Balearica
Member

The language in progress logs has been standardized in v5. Now, the same verbiage is used for the entirety of each step (no "initializing" vs "initialized", "loading" vs. "loaded", etc.). Additionally, waiting for progress should be much less of an issue as v5 significantly reduced file sizes (50-75%).

Given the above changes, I am closing this issue. If anybody here upgrades to Tesseract.js v5 and still finds reporting progress problematic, they should open a new issue.
