This repository has been archived by the owner on Jul 26, 2024. It is now read-only.

question: is it true / on purpose that a text is "corrected" to modern standard Yiddish? #85

Open
mirjam-amsterdam opened this issue Aug 28, 2022 · 8 comments

Comments

@mirjam-amsterdam

I just went over a section of nybc200407, p. 11f. I noticed that some words were reproduced in a standardized form, whereas the Yiddish source clearly uses a spelling that differs from the modern YIVO klal spelling. This is precisely where it matters for a reader nowadays to know that even the great Max Weinreich's texts were originally printed in a different spelling. Two examples: first, vu is rendered with melupn-vov instead of the original tsvey-vovn - aleph - vov. (https://bit.ly/3wARw8A)
Second, I noticed some other instances where a letter gimel at the end of a word was "corrected" into a kuf, for example in the word באַװײַזנדיג
(https://bit.ly/3wDtEkW)
(There are other issues here too: « » is not recognized, Latin-lettered text is not recognized, and there are spaces before punctuation marks. And as for correcting in Jochre, it is tedious that the text cannot be scrolled forward, i.e. it is hardly possible to correct larger portions of text.)
see screenshot
melupn-vov
bavayzndig

@mirjam-amsterdam
Author

mirjam-amsterdam commented Aug 29, 2022

And now I found it in Zalmen Rejzen's Leksikon, Vol. 2, col. 314 as well! װוּנטש is written with aleph in the original!
If all the dots in the letters beys and kaf are kept as in the original, the alef has to be kept as well!

@mirjam-amsterdam
Author

I also stumbled over the 'correction' that adds a khirek under the second yud in the word yidish. Neither Rejzen nor Weinreich has it. This, too, is a correction from a modern point of view that does not reflect the original.

@mirjam-amsterdam
Author

At least it is not only Weinreich and Rejzen whose old-fashioned spelling gets corrected; it happens to Der Pinkes as well.

image

@markhdavid

markhdavid commented Mar 30, 2023

I came here to report this exact issue. I think it's absolutely wrong for anyone to change the orthography. I can see the need to correct where the OCR has made a mistake, either in encoding or interpretation or both. But to change to some other orthography, modern standard or otherwise, is absolutely wrong. The whole purpose is to recognize text as written accurately. I've come across examples of this and found them very disconcerting. (I'll try to send one of my own if I find it.) Where is there a set of conventions and rules for editing the OCR output? Who monitors it?

@urieli
Owner

urieli commented Apr 26, 2023

It is true and it is wrong to do so.

This will be corrected in the next version of Jochre.

The original text will remain exactly as it was.
We will make an attempt to guess what was meant in YIVO spelling, and store this as a hidden synonym, to facilitate the search mechanism (so that a search for "װוּ" will still return "װאו").
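
A minimal sketch of the hidden-synonym idea described above, assuming a simple in-memory index. The function names (`guess_yivo`, `build_index`, `search`) and the two rewrite rules are hypothetical illustrations taken from the examples in this thread, not Jochre's actual implementation. The stored text is never modified; the guessed YIVO form is only an extra lookup key.

```python
# Code points used below (written as escapes to avoid ambiguity):
TSVEY_VOVN = "\u05F0"        # װ  Yiddish double vov ligature
ALEF = "\u05D0"              # א
VOV = "\u05D5"               # ו
MELUPM_VOV = "\u05D5\u05BC"  # וּ  vov followed by a dagesh point
YUD_GIMEL = "\u05D9\u05D2"   # יג
KUF = "\u05E7"               # ק


def guess_yivo(word: str) -> str:
    """Guess a YIVO-style spelling for search purposes only.

    Both rules are illustrative assumptions and far from complete.
    """
    # Rule 1: tsvey-vovn + alef + vov  ->  tsvey-vovn + melupm-vov
    guess = word.replace(TSVEY_VOVN + ALEF + VOV, TSVEY_VOVN + MELUPM_VOV)
    # Rule 2: word-final yud + gimel  ->  yud + kuf
    if guess.endswith(YUD_GIMEL):
        guess = guess[:-1] + KUF
    return guess


def build_index(words):
    """Map every original word AND its guessed synonym to word positions."""
    index = {}
    for position, word in enumerate(words):
        for key in {word, guess_yivo(word)}:
            index.setdefault(key, []).append(position)
    return index


def search(index, words, query):
    """Return the words exactly as printed, whichever spelling was queried."""
    return [words[i] for i in index.get(query, [])]
```

With this layout, a query in either spelling finds the word, and the result is always the original spelling from the page.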

The set of conventions and rules is simple: you write exactly what is on the page, including typographical errors (if there are any), and including full niqqud as it appears on the printed page.

The fixes are automatically applied, but we can easily undo them (including all fixes from a given user) if we find the user is over-fixing or fixing wrongly.

However, in the current case, it's Jochre itself that was over-fixing.

@markhdavid

Just read #85 (comment), after I just noticed this in Yudel Mark's Heft far Yidish (https://archive.org/details/nybc204715/page/20/mode/2up). I was moved to make one correction, וווּ => וואו, but it would be a slog to go through each case. Will this book, or ones like it, ever be automatically rescanned? Can that be done? Of course, I could see it being a huge waste of work if books get rescanned and actually valid corrections get thrown away. On the other hand, it's too onerous to go through by hand to make all these corrections.

@urieli
Owner

urieli commented Sep 10, 2023

@markhdavid All of the books will be re-analyzed using the new version of Jochre (currently being written). We've made good progress, but it isn't yet ready.

I say "re-analyzed" and not "re-scanned", since there is no plan to re-digitize the books, only to re-analyze the digital content using the OCR software.

The plan is to re-OCR everything, and then to re-apply the user corrections. So no: there is no need to manually correct everything.
We will also try to learn from the manual user corrections, but that's a later phase.
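
One way the re-apply step could work, sketched under the assumption that each user correction is stored with the page number and pixel rectangle of the word it fixes, so that it can be matched to the re-analyzed output by position rather than by text. The `Box` type, `reapply_corrections`, and the 0.5 overlap threshold are hypothetical; the thread does not say how Jochre stores or matches corrections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical word location: page number plus pixel rectangle."""
    page: int
    left: int
    top: int
    right: int
    bottom: int


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 if on different pages."""
    if a.page != b.page:
        return 0.0
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    if width <= 0 or height <= 0:
        return 0.0
    intersection = width * height
    area_a = (a.right - a.left) * (a.bottom - a.top)
    area_b = (b.right - b.left) * (b.bottom - b.top)
    return intersection / (area_a + area_b - intersection)


def reapply_corrections(new_ocr, corrections, threshold=0.5):
    """Overlay stored user corrections on freshly re-analyzed words.

    new_ocr and corrections are lists of (Box, text) pairs. A word takes
    the text of the best-overlapping correction if the overlap reaches
    the threshold; otherwise the new OCR text is kept.
    """
    result = []
    for box, text in new_ocr:
        best = max(corrections,
                   key=lambda c: overlap_ratio(box, c[0]),
                   default=None)
        if best is not None and overlap_ratio(box, best[0]) >= threshold:
            text = best[1]
        result.append((box, text))
    return result
```

Matching by position means a correction survives even when the new OCR engine reads the same word image differently from the old one.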

@mirjam-amsterdam
Author

mirjam-amsterdam commented Sep 10, 2023 via email
