New hyphenation patterns #383

Merged
merged 6 commits into from
Oct 3, 2020

roshavagarga
Contributor

@roshavagarga roshavagarga commented Sep 22, 2020

Proposed changes:

  • Added Armenian
  • Added Brazilian Portuguese
  • Added Friulian
  • Added Piedmontese
  • Added Romansh
  • Added Zulu
  • Updated languages.json
  • Updated textlang.cpp

@cramoisi @poire-z Friulian, Piedmontese and Romansh each had 10-20 lines at the beginning, all containing ' or '' in some form, which could be added back if they're not duplicates of other lines. I also have a fairly complete pattern for Brazilian Portuguese, but I'm not sure how it'll be handled in textlang and the json.

Also, should I add hyphenmin's and aliases to all entries in languages.json?



@poire-z
Contributor

poire-z commented Sep 22, 2020

You really don't want to make friends with git? Because this branch is your roshavagarga:master, I again won't be able to edit it.
(Unless I clone it and push to it as if it were an independent repository, as suggested by @NiLuJe in the other PR? Gonna try that.)

Also, should I add hyphenmin's and aliases to all entries in languages.json?

I'd say it's not necessary, as it's no longer used. But I used it as the reference when I updated frontend's readertypography.lua for alternative lang tags (which you can't put anywhere else, except in frontend and in this languages.json).
So, dunno. But if we don't update it, we should trash it. And you would list the alt lang tags to add in the PR post.

Do you plan on adding hyphenation for all the languages of the world? :) How many are left?
Because by doing so, we may double the size of koreader :/
(I would consider these languages "little languages" - and I feel ashamed for that :) - but do we really need to have them?)

@poire-z
Contributor

poire-z commented Sep 22, 2020

(Unless I clone it and push to it as if it were an independent repository, gonna try that.)

That worked :)
So ok, you can continue updating via github web UI - and I'll be able to clean/squash/rebase it if needed before merging.

@cramoisi
Contributor

cramoisi commented Sep 22, 2020

Do you plan on adding hyphenation for all the languages of the world? :) How many are left?
Because by doing so, we may double the size of koreader :/
(I would consider these languages "little languages" - and I feel ashamed for that :) - but do we really need to have them?)

Could you find an EPUB for each of these languages? :) 😅😱😂

@roshavagarga
Contributor Author

@poire-z I'll have some more time to sit down and go through your git cheatsheet in the next few weeks. Once again, sorry, my mental acuity hasn't been up to learning these past few weeks.

I think the only ones I'd like to fix are the ones in #373 and maybe a select few others I might have a small interest in adding from the usual 2 sources - Afrikaans and Belarusian that I can think of. Honestly the size changes were bothering me a bit too in the long run, and I was curious whether these could somehow be handled in a more discreet manner - having a base pack so to speak, and just offering smaller languages as a bonus download. This is part of the reason why I haven't considered adding the hyphenation patterns available for languages spoken in India, since if I remember correctly some of those were rather large.

Example: Friulian pattern isn't available by default in koreader, but if a document tagged for it is opened - an update/download is offered. Another avenue for the same would be switching to a hyphenation that isn't prepackaged.

If I'd have to point out files that I never understood why we had - the Russian+English patterns seem pointless to me, though I'm sure there's some use for them, but I believe they were rather large, which seemed extravagant to me when I noticed them.

@roshavagarga
Contributor Author

@cramoisi Armenian ebooks, Friulian ebooks, Piedmontese ebook, Romansch ebook, Zulu ebooks.

@cramoisi
Contributor

@poire-z
Contributor

poire-z commented Sep 22, 2020

Afrikaans and Belarusian

Are these different from Dutch and Russian/Ukrainian? :) Or can't they use the NL and RU hyph dicts? (Thanks for the cultural insight to come :)
The question is rather whether these have a decent users+books base (many books in Afrikaans?)
Dunno about Zulu. Armenian seems fine.
For Friulian, Piedmontese, Romansh, spoken in a few regions in/near Switzerland, is there really a U+B base (local newspaper downloadable as EPUB :) ? Or a fallback hyph dict from another neighbour language that could work (it might not need to be perfect)? I would wait for someone to come up and ask for one of these. Unless you're this person?

Anyway, I can be fine with the 5 from this PR (very small file sizes, sure they are enough/worth it?) and the ones you'd like added, even with the Indian ones (large books and users base I guess :) even if we never had much feedback from India... but you'd have to test how crengine hyphenation works with Indic letters, and whether harfbuzz correctly draws now-terminating glyphs at the hyphenation cut (if that's a thing) - it's possible there's something limited to Latin/Cyrillic in crengine in what it considers words... would have to be tested).

My question was whether you might add 30 or 40 more of them :) and then, it might be a problem.

Well, translator.lua lists 105 languages, so we have our languages count :) Dunno about the additional file size they would take, but we already have a menu with 10 pages there - so it could still be ok in the Typography language menu.

I was curious whether these could somehow be handled in a more discreet manner - having a base pack so to speak, and just offering smaller languages as a bonus download.
Example: Friulian pattern isn't available by default in koreader, but if a document tagged for it is opened - an update/download is offered. Another avenue for the same would be switching to a hyphenation that isn't prepackaged.

Well, currently, frontend is not aware of what languages cre is meeting (we could meet languages in lang= tags, not only in the book metadata known to frontend). Adding new/unknown language callbacks + ability to auto-download stuff when needed - is quite a lot of work, again for these little languages/users base (days of work, for something that might be used by one person once a year :)

We could have a zip of additional hyph dicts to download and unzip in koreader dir, but I don't know how crengine deals with absent hyph filenames (I'm not looking at the code until it's needed :)

If I'd have to point out files that I never understood why we had - the Russian+English patterns seem pointless to me, though I'm sure there's some use for them, but I believe they were rather large, which seemed extravagant to me when I noticed them.

I guess it's from the crengine Russian and FB2 heritage: FB2 probably doesn't have any lang= tag to tell that a section is in English, and I guess FB2 books might include bits in English more than in any other language. Having only Russian hyphenation would cause all these English sections to not be hyphenated, so possibly ugly to some readers. So, their need for Russian+English hyphenation.
We still have a FB2 user base, pinging @hius07: what typography language/hyph dict are you using? Do you distinguish between Russian and Ukrainian? Do FB2 books come with different and correct RU/UK book metadata language codes?

@roshavagarga
Contributor Author

@cramoisi You're welcome, sorry I couldn't find any free options for the Amazon ones, but they might be available somewhere else with some more in-depth searching from your side.

@poire-z Afrikaans can be considered a daughter language of Dutch that started off from the Holland dialect a few hundred years ago, so barring a miracle it's probably diverged too much for hyphenation patterns to be that close to one another. @Frenzie could take a look and give his opinion as a Dutch speaker, I'm not fluent in either sadly. I'd say it's worth adding, since it covers a large amount of people and countries at the same time, but whether they have a ton of epubs is something that would have to be looked into.

As far as Belarusian, it falls within the same language group, that of East Slavic languages, but the language, while sharing a lot of characteristics with Ukrainian and Russian which fall into the same group, has drifted well enough to make hyphenation patterns unique I'd say. I'm looking at this from the outside, mind you, since I speak Bulgarian which isn't in that Slavic language group. I do know that Belarusian and Ukrainian were both influenced by Polish, unlike Russian, so those two might be closer to one another. Personally I'm guesstimating that it's something akin to how I feel when I read or hear Macedonian - a language that is by far the closest to Bulgarian you can get, and I still don't understand a lot of their grammatical constructs, hyphenation rules and so on, even though I can get the gist of what I read or hear. Here is an example of an eBook seller for their country.

For Friulian, Piedmontese and Romansh - I did check the hyphenation patterns and they do share some commonalities, but of the three I'd say Romansh is a must-have, since it covers territories in both Italy and Switzerland. I'm not sure whether they can be covered by the same hyphenation pattern, though at a glance they share enough. If we had a system where we could tell koreader to use file X as the base hyphenation rules and then other smaller files for small differences, that would be a way to save space. I'm personally not asking for these - I just saw that there were readily available tex files that were small enough to not be a gigantic burden.

If it would take days to do that, it's not worth it in my opinion either. If cre can handle an extra download being unpacked, that seems like a far easier way.

Ah, that makes sense, I rarely if ever read something that handles 2 languages in the same document, hence my lack of need for such a combination, I thought there was some other reason for those files.

And yeah, I don't plan on adding that many more files - a few others maybe, but there aren't that many sources to get tex or dic files that are clean enough to use the script I have handy on, so that would definitely limit me even if I wanted to, hahah. Ideally I'd like to have hyphenation patterns for most of Europe and a few major languages outside of it, but I think Afrikaans and Belarusian would just about cover what I'd planned on adding for now. All in all I agree that size is a constraint and that if there's a way to lessen the load on koreader's size by offering some of these languages as a bonus pack, I'd be all for that.

@hius07
Member

hius07 commented Sep 22, 2020

So, their need for Russian+English hyphenation.

Thanks to CoolReader creators for that, it's useful.

Do you distinguish between Russian and Ukrainian?

Extremely yes, as well as Belarusian.

has drifted well enough to make hyphenation patterns unique

Exactly.

Do FB2 books come with different and correct RU/UK book metadata language codes?

Mostly yes.

@Frenzie
Member

Frenzie commented Sep 22, 2020

The basic rules behind Afrikaans hyphenation are likely similar (as they are for German) but the spelling is different. After all, most of these patterns aren't about what a human would do but about how to get a machine to approximate what we'd do intuitively. ;-)

@Frenzie could take a look and give his opinion as a Dutch speaker

Take a look at what, the patterns? At least link them then, although I probably won't give them more than a cursory glance regardless. :-P If you mean Afrikaans itself, I can understand that well enough (slowly and clearly) spoken and of course in writing.

@poire-z
Contributor

poire-z commented Sep 22, 2020

Maybe an idea for the future, to remove the 2 Russian+English*:
Add an option (frontend + crengine), toggleable only on languages that don't use the latin alphabet (so, the ones in Cyrillic or Indic if we add some), to have English_US.pattern merged with it (no conflict because not the same alphabets?).

Also, dunno why size differences between GB and US, and if there's really a need to differentiate them:

  217146 Aug 10  2019 English_GB.pattern
  126141 Aug 10  2019 English_US.pattern
  354276 Aug 10  2019 Russian_EnGB.pattern
  262985 Aug 10  2019 Russian_EnUS.pattern

@hius07 : which one of Russian_EnGB or Russian_EnUS do you use?

Any reason for Hungarian to be 10x the size of most others ? Bug/crap, or really needed by this complex language?

1722495 Aug 10  2019 Hungarian.pattern
 206357 Sep 17 15:41 Bulgarian.pattern

@hius07
Member

hius07 commented Sep 22, 2020

Russian_EnUS

Personally this one.

@roshavagarga
Contributor Author

@poire-z Keep in mind that some hyphenation patterns are probably more optimized than others. As far as Hungarian, the only thing that comes to mind is that it has an absurd amount of grammatical cases - I think Basque and Finnish are the same on that part. Another part of it comes down to language complexity - does it use genders, does it have grammatical cases, does it depend on conjugation to express tenses, etc. There's a ton of other things I'm definitely missing, since I'm not that into linguistics, but you get the drift of it.

Bulgarian, for instance, relies on verb conjugations to express tenses. On top of that it has a few remnants of grammatical cases here and there, loanwords from different languages which use different rules, etc. All of this theoretically could lead to more complex hyphenation rules and a bigger pattern file, as long as a lot of work was put into it. I'm pretty sure a lot of our patterns could be a lot bigger had there been more people volunteering to make tex patterns, which just isn't the case for languages linked to smaller native speaker populations.

@hius07
Member

hius07 commented Sep 23, 2020

I've been using Algorithmic hyphenation also, it is good for russian texts (don't know about other languages).

@roshavagarga
Contributor Author

Waiting on a quick review whenever you have the time @cramoisi - especially on Friulian, Piedmontese and Romansh - check the source files and note that all 3 files start with 10-20 lines that I removed. I'm not sure if I should just add those back with a dot instead of the '.

@roshavagarga
Contributor Author

@poire-z If you can figure out how we're supposed to hook up Brazilian Portuguese, I've got a file pattern for it fairly ready :)

@poire-z
Contributor

poire-z commented Sep 24, 2020

I guess this should be enough (uppercase is fine here, order is important: longer first so they match before shorter):

--- a/crengine/src/textlang.cpp
+++ b/crengine/src/textlang.cpp
@@ -62,2 +62,3 @@ static struct {
     { "pl",    "Polish",        "Polish.pattern",        2, 2 },
+    { "pt-BR", "Portuguese_BR", "Portuguese_BR.pattern", 2, 3 },
     { "pt",    "Portuguese",    "Portuguese.pattern",    2, 3 },

We have this in the quotes specs (should be lowercased):

    { "pt-pt",    L"\x00ab", L"\x00bb", L"\x201c", L"\x201d" },
    { "pt",       L"\x201c", L"\x201d", L"\x2018", L"\x2019" },

dunno why pt-pt, and what "pt" standalone means - but it should be the catch-all pt*

In frontend, we have:

frontend/ui/language.lua:        pt_PT = "Portugues",
frontend/ui/language.lua:        pt_BR = "Portugues do Brasil",
readertypography.lua    { "pt",  {"por"}, "HB  ",  _("Portuguese"), "Portuguese.pattern" },

https://en.wikipedia.org/wiki/Language_localisation#Language_tags_and_codes

tag where
pt-PT European Portuguese (as written and spoken in Portugal)
pt-BR Brazilian Portuguese
pt-AO Angolan Portuguese
pt-MZ Mozambican Portuguese

I'll let you do more Wikipedia digging if needed :)

@roshavagarga
Contributor Author

@poire-z Should be good now. Quotation marks were off, as can be read here.

Basically:
pt = Portuguese in general
pt-pt = European Portuguese
pt-br = Brazilian Portuguese

Since only Brazil follows a different quotation mark system, I changed pt to match pt-pt, so that the other 9 or so countries that have Portuguese as an official language are okay, since they follow the rules/habits of European Portuguese, but don't have a hyphenation pattern themselves.

@poire-z
Contributor

poire-z commented Sep 24, 2020

don't have a hyphenation pattern themselves

So, they will match "pt" and should use the european pt hyph dict.

@poire-z
Contributor

poire-z commented Sep 24, 2020

Still a CI warning:

cr3gui/data/hyph/Portuguese_BR.pattern:8: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xE1 0x3C 0x2F 0x70
  <pattern>a3�</pattern>
             ^

@roshavagarga
Contributor Author

@poire-z Uh, that looks normal to me in Notepad++? The original .dic file I ran through my script had "ISO8859-1" noted down as the preferred encoding, but none of that should matter?

As far as the dialects:
They have no iso codes and are basically dialects of Portuguese, so they should default to Portuguese. Although there might be significant differences in some cases, if nobody makes patterns, nobody makes 'em :)

Not sure if we should do something to make sure anything that isn't pt-br is handled by Portuguese.pattern
Some of the dialects have IETF language codes:
Angolan Portuguese - pt-AO
Cape Verdean Portuguese - pt-CV
East Timorese Portuguese - pt-TL
Equatorial Guinean Portuguese - none
Goan Portuguese - none
Guinean Portuguese - pt-GW
Macanese Portuguese - none
Mozambican Portuguese - pt-MZ
São Toméan Portuguese - pt-ST
Uruguayan Portuguese - none

@poire-z
Contributor

poire-z commented Sep 24, 2020

Not sure if we should do something to make sure anything that isn't pt-br is handled by Portuguese.pattern

That's what I said above: they will match startswith("pt") and will be handled just like "pt". So, nothing to do (that's what happens for fr-CH).
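
The "longer first" ordering and the startswith("pt") fallback can be sketched as follows. This is only an illustration in Python, not crengine's actual C++ lookup: the table copies the three entries from the textlang.cpp diff earlier in the thread, and `find_hyph_entry` is a hypothetical helper name.

```python
# Illustrative table; entries copied from the textlang.cpp diff in this thread:
# (lang tag prefix, name, pattern file, left hyphenmin, right hyphenmin)
HYPH_TABLE = [
    ("pl",    "Polish",        "Polish.pattern",        2, 2),
    ("pt-BR", "Portuguese_BR", "Portuguese_BR.pattern", 2, 3),
    ("pt",    "Portuguese",    "Portuguese.pattern",    2, 3),
]


def find_hyph_entry(lang_tag, table=HYPH_TABLE):
    """Return the first entry whose tag is a case-insensitive prefix of lang_tag.

    Hypothetical helper: the first match wins, so longer tags must be
    listed before the shorter tags they start with.
    """
    tag = lang_tag.lower()
    for entry in table:
        if tag.startswith(entry[0].lower()):
            return entry
    return None


if __name__ == "__main__":
    for tag in ("pt-BR", "pt-AO", "pt", "zu"):
        print(tag, "->", find_hyph_entry(tag))
```

With this ordering, pt-BR gets its own pattern file while pt-AO, pt-MZ and plain pt all fall through to the generic entry. If "pt" were listed before "pt-BR", the generic entry would shadow the Brazilian one.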

The original .dic file I ran through my script had "ISO8859-1" noted down as the preferred encoding, but none of that should matter?

You still uploaded it with 1-byte chars (so, latin1) - even if the XML header says it's UTF-8, it contains invalid utf8: you should convert it from iso8859 to utf8 in notepad++.
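
For anyone who prefers a script over Notepad++, the same conversion can be sketched in Python. This is a minimal sketch that assumes the source really is ISO-8859-1, as the original .dic file claimed; `ensure_utf8` is a hypothetical helper name.

```python
def ensure_utf8(raw: bytes, fallback_encoding: str = "iso-8859-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it.

    Assumption: anything that fails UTF-8 decoding is in fallback_encoding.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding).encode("utf-8")


if __name__ == "__main__":
    # The byte sequence from the CI warning: 0xE1 followed by "</p..."
    broken = b"<pattern>a3\xe1</pattern>"
    print(ensure_utf8(broken).decode("utf-8"))
```

The byte 0xE1 is "á" in ISO-8859-1, but in UTF-8 it announces a three-byte sequence, which is why the parser choked when "<" followed it.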

@roshavagarga
Contributor Author

Welp, that should fix it. As long as nothing pops up for the other files and the ' lines, this one's ready.

@poire-z
Contributor

poire-z commented Sep 24, 2020

Not adding Belarus ?
(You have time with this PR, unless you want it bumped and merged quickly.)

@roshavagarga
Contributor Author

@poire-z I might handle that this weekend or in a week or so. Afrikaans has a lot of lines with - and ', Belarusian seems a bit easier but I'm not sure whether I'll have the time in the coming days. As long as everything else checks out as far as the lines I removed, this can be merged though I'm not in that much of a hurry.

@cramoisi
Contributor

cramoisi commented Sep 24, 2020

I'll look into it this weekend - no time before that :/

But I can say that the rules in Zulu are made for a right hyphenmin at 1.

@roshavagarga
Contributor Author

Need a tiny bit of advice for Belarusian - @cramoisi @poire-z
Source is here.

Basically I removed the comments and removed all of the lines with a -, since they seemed to be equivalent to dot lines around them. There are 41 lines with a ' left that I'm not sure what to do with - do they cover something not covered by the other lines and need to be fixed for cre or should I just get rid of them and move on? (Some of the comments in source give insight into these lines). I also got rid of a line in the beginning which was basically 8-1, line 79 in the source with a small explanation above it :)

hyph-be-test.zip

@poire-z
Contributor

poire-z commented Sep 25, 2020

I also got rid of a line in the beginning which was basically 8-1, line 79 in the source with a small explanation above it :)

It said we can hyphenate after a hyphen, which is the default behaviour of crengine: a break is normally allowed after a real hyphen, and both the part before and the part after are considered independent words and given (without any info about the other part) to the hyphenation algo if hyphenation is needed on one of these parts.
Can't answer much to the question about ' (I guess a ' behaves as a hyphen and makes a word 2 independent words).
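
A rough Python illustration of that behaviour, assuming (as guessed here) that both - and ' act as word boundaries. `split_into_hyphenation_units` is a hypothetical name, and crengine's real word splitting is more involved than a single regular expression.

```python
import re


def split_into_hyphenation_units(word):
    """Split a word at hyphens and apostrophes.

    Each returned part would be handed to the hyphenation algorithm on its
    own, with no knowledge of the neighbouring parts.
    """
    return [part for part in re.split(r"[-']", word) if part]


if __name__ == "__main__":
    print(split_into_hyphenation_units("well-known"))
    print(split_into_hyphenation_units("l'anno"))
```

This is why a pattern that contains - or ' can never match: the hyphenation algorithm only ever sees the pieces on either side.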

@cramoisi
Contributor

cramoisi commented Sep 26, 2020

@poire-z is right: ' is considered an end of word, so words containing it are two distinct words for crengine and they are hyphenated accordingly. It's the same for - or '', and words with - can be split at the - by crengine as they should be (crengine won't add another hyphen after the first visible one).
You should just remove the lines with -, ' or '' and not worry too much about that, as you won't be able to deal with them anyway.
There are some empty <pattern></pattern> lines at the end of each of your files. They must be deleted. Perhaps your conversion script should be checked.
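
The cleanup rules above could be folded into a conversion script roughly like this. It is a sketch only: the actual script used for this PR isn't shown in the thread, `clean_patterns` is a hypothetical name, and a real .pattern file needs more than bare <pattern> lines (XML header, wrapper element and so on).

```python
from xml.sax.saxutils import escape


def clean_patterns(tex_source):
    """Turn whitespace-separated TeX patterns into <pattern> lines.

    Drops TeX comments, drops any pattern containing - or ' (which crengine
    already treats as word boundaries) and never emits an empty element.
    """
    kept = []
    for line in tex_source.splitlines():
        line = line.split("%", 1)[0]  # strip TeX comments
        for token in line.split():    # split() never yields empty tokens
            if "-" in token or "'" in token:
                continue
            kept.append("<pattern>" + escape(token) + "</pattern>")
    return kept


if __name__ == "__main__":
    sample = ".\u0432'8 -\u0432'8\n.\u0430\u04311 % a comment\n\n1\u0432\u0430\n"
    print("\n".join(clean_patterns(sample)))
```

Because tokens come from `str.split()`, blank lines and trailing whitespace in the source can't produce empty <pattern></pattern> elements.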

Anyway, I've the feeling I've already made this comment 2 or 3 times ;-) nothing new about what I've just written.

(I've looked into your Belarusian hyph-be-test.zip and it seems OK to me :) )

No need to worry about lines like these, as you just have to set left/right hyphenmin at 2 to deal with them. If the template is not designed for 2,2, then you are out of luck ;)

.в'8 -в'8
.г'8 -г'8
.ґ'8 -ґ'8
.д'8 -д'8
.ж'8 -ж'8
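
To illustrate what left/right hyphenmin does, here is a small Python sketch. It assumes a break position p means "break after the first p characters"; `apply_hyphenmin` is a hypothetical helper, not crengine code.

```python
def apply_hyphenmin(word, break_positions, left_min=2, right_min=2):
    """Keep only the breaks that leave enough characters on each side.

    A position p is kept when at least left_min characters come before it
    and at least right_min characters come after it.
    """
    return [p for p in break_positions
            if left_min <= p <= len(word) - right_min]


if __name__ == "__main__":
    candidates = [1, 2, 4, 6]
    print(apply_hyphenmin("example", candidates, 2, 2))
    print(apply_hyphenmin("example", candidates, 2, 1))
```

With a left hyphenmin of 2, a break after a single leading letter is never taken, so dropping patterns such as the ones listed above loses nothing. A right hyphenmin of 1, as mentioned for Zulu, additionally allows a break before the last letter.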

@poire-z
Contributor

poire-z commented Oct 1, 2020

Took the liberty to rebase and make them nice commits.
You can still add/fix stuff, it will just be easier for me to re-rebase them when you feel this PR is ready.

If it is ready as is, just tell me and I'll merge and bump it in a coming up bump.

@roshavagarga
Contributor Author

Count this one to my bad memory in general, hah :D Feel free to merge, yeah, if anything pops up it can be fixed on its own or with the Afrikaans + Belarusian PR.

@poire-z
Contributor

poire-z commented Jan 1, 2021

Regarding my comment #383 (comment) :

Maybe an idea for the future, to remove the 2 Russian+English*:
Add an option (frontend + crengine), toggleable only on languages that don't use the latin alphabet (so, the ones in Cyrillic or Indic if we add some), to have English_US.pattern merged with it (no conflict because not the same alphabets?).

Maybe we wouldn't need any option: we could mark some languages/hyphenation dicts as being "english/latin-orthogonal" when their hyph dict contains only non-latin words (no a-z).
It feels like we could concatenate the words from English_US to them without any issue (except maybe more memory usage & cpu time to look for matches). So, any latin words embedded (without any lang= tag) in text in these languages would be hyphenated as English - as English is probably the most obvious language for foreign words in such texts :) (For languages with no hyphenation like Indic, Chinese, Japanese, we use English_US as the default anyway.)

Quickly looking at the *.pattern, these languages could be candidates for that:

Armenian
Bulgarian
Georgian
Greek
Macedonian
Russian
Serbian
Ukrainian
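
The "no a-z" test and the concatenation idea could be prototyped along these lines. This is a sketch of the proposal only, with hypothetical helper names; nothing like it exists in crengine as far as this thread shows.

```python
import string

LATIN_LETTERS = set(string.ascii_lowercase)


def is_latin_orthogonal(patterns):
    """True if no pattern contains a basic Latin letter (a-z, any case)."""
    return not any(ch in LATIN_LETTERS
                   for pattern in patterns
                   for ch in pattern.lower())


def merge_patterns(native, english):
    """Concatenate two pattern lists, refusing when alphabets could clash."""
    if not is_latin_orthogonal(native):
        raise ValueError("native patterns contain a-z; merging could conflict")
    return list(native) + list(english)


if __name__ == "__main__":
    native = [".\u0430\u04311", "1\u0432\u0430"]  # Cyrillic-only patterns
    english = ["a1b", ".ex1"]
    print(merge_patterns(native, english))
```

Digits and dots in the patterns are ignored by the check, since only the letters a-z can collide with English patterns.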

which one of Russian(only), Russian_EnGB or Russian_EnUS do you use?

@hius07 answered "Russian_EnUS". @virxkane @pkb : which one do you use?

Dunno how much all these Russian* dicts are up to date - but if Russian.pattern and English_US.pattern are fine, we could get rid of the others and avoid having non-standard ru-GB and ru-US lang tags that would never be found in books.
So, I guess that when reading Russian books, you have to disable "respects EPUB and HTML lang tags" (dunno how this is called in CoolReader) so you can force Russian_EnUS to be used instead of Russian.pattern ?

@roshavagarga : would that work/be welcomed with Bulgarian?
@ichnilatis-gr @noembryo : would that work/be welcomed with Greek?
@strn : would that work/be welcomed with Serbian?

Or can some users find it preferable to have latin words non-hyphenated when they happen in text in these languages - or better non-hyphenated than wrongly hyphenated as english when we don't know their language?

Thoughts ?

@roshavagarga
Contributor Author

@poire-z Typically that should work for some of these. As far as Bulgarian goes, the only use of the latin alphabet I can think of is medical (Latin), mathematics (maybe?) and brands (some people use the latin alphabet original, others transliterate, no norm I think?). Another option, which I highly doubt but is possible, is if somebody decided to use shlyokavitsa, which is the informal way people transliterate Bulgarian when messaging each other online - typically because they're lazy, are used to doing so, can't install or use the original national keyboard standard (or know that there's a qwerty-friendly equivalent) or for numerous other reasons. I can see an author maybe using that in a young adult novel, though I'll admit I've never seen it done, which may be due to me not being into that genre. Either way, the English pattern should be good, since I doubt that if somebody does use shlyokavitsa it'll be a common thing. Maybe use the GB one, since that's the type taught in schools here, rather than the US norm? Personally, I'd be more annoyed at things not being hyphenated, but that might just be my preference.

You might want to have a look at Macedonian, since their alphabet has J and S, but I'm guessing those are marked differently in UTF-8? Ukrainian also has i, so might be worth looking into that too.
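
On the look-alike letters: Unicode encodes the Cyrillic ј, ѕ and і separately from Latin j, s and i, so they are distinct characters in UTF-8 too. A quick check with Python's standard `unicodedata` module (`describe` is just a local helper name):

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # Cyrillic je, dze and Byelorussian-Ukrainian i, then their Latin look-alikes
    for ch in ("\u0458", "\u0455", "\u0456", "j", "s", "i"):
        print(describe(ch))
```

So a check for a-z in a Macedonian or Ukrainian pattern file would not be fooled by these letters, as long as the file really uses the Cyrillic code points.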

Serbian is a special case, and while we only have the one hyphenation pattern and a long talk when it was being added, I will just remind you that Serbian has a Cyrillic and a Latin alphabet, both of which are legally equivalent and it's the only country in Europe that does that, so technically we should have separate hyphenation patterns for both of those, though I'll admit I have no idea how ebooks work in Serbia and whether only one or the other alphabet is used, or if books get published in both and how common each one is? I'd love to hear more about it from @strn.

@noembryo

noembryo commented Jan 1, 2021

Or can some users find it preferable to have latin words non-hyphenated when they happen in text in these languages - or better non-hyphenated than wrongly hyphenated as english when we don't know their language?

For my personal use, it doesn't really matter, but for people that read translations of old greek text it might be handy to have the right hyphenation for both languages.
And if this feature could be turned off in case it's distracting, then everybody would be even happier.

My main problem with hyphenation is that it is not so good (for both greek and english), and I was wondering if I could find a way to use a file from somewhere else (e.g. an office suite).
But that is something for another issue ;o)

@strn
Contributor

strn commented Jan 1, 2021

@roshavagarga asked, so @strn responded:

that Serbian has a Cyrillic and a Latin alphabet, both of which are legally equivalent

The statement is not correct. Serbian Cyrillic alphabet has absolute pre-eminence and pre-precedence over any other writing system in Serbian language. It is ensured by Serbian constitution, article 10. Hence "Serbian language having Latin alphabet" cannot be true or "legal equivalent" in any way. What is thought to be "Serbian Latin" alphabet, is in fact Croatian Latin alphabet. It was used for writing Serbian language during time of Yugoslav state (1918-1992) and is a political construct since at that time Serbo-Croatian language existed (ISO code: sh). Since Yugoslavia dissolved, a need for Croatian Latin alphabet in Serbian language ceased to exist. Serbian language written by Croatian Latin alphabet is equivalent to Bulgarian language written in shlyokavitsa .

A similar mess exists (existed?) for Moldavian language - it can be written in Moldavian Cyrillic or Romanian Latin alphabet. This is just for those wanting to draw comparisons.

so technically we should have separate hyphenation patterns for both of those

Croatian Latin hyphenation patterns will work perfectly for those ebooks written in so-called "Serbian Latin" alphabet.

though I'll admit I have no idea how ebooks work in Serbia and whether only one or the other alphabet is used

Ebooks work in Serbia as follows: if ebook is in Serbian language, it can be written (printed?) in either Serbian Cyrillic or Croatian Latin alphabets. Mixing alphabets is strictly forbidden by grammar rules. Only exception of Latin script appearing among Serbian Cyrillic text would be writing foreign (non-Serbian) personal names for the first time when they appear in text, writing chemical formulae, measurement units etc.

or if books get published in both

Official ebook publishing in Serbia is almost non-existent. What exists of ebooks in Serbian language are 99% pirated editions - scans of printed books used for creation of EPUBs. Paper editions are either in Serbian Cyrillic or Croatian Latin alphabet.

@roshavagarga
Contributor Author

@strn Did some checking, turns out I was mistaken on the legal aspect and it's just a linguistic quirk (digraphia if you're curious). Wasn't meant to offend you in any way, sorry if that was the case. Bulgarian does have an official transliteration scheme or whatever you'd like to call it, so shlyokavitsa is more along the lines of internet-speak or jargon, but I get the point you're making.

99% pirated editions - scans of printed books used for creation of EPUBs

Could you give examples of some (not copyrighted, of course)? My thinking is what codes are used in ePubs that use Gaj's Latin Alphabet and what are used for those in Serbian Cyrillic - especially if somebody might have made a pirated version of, let's say, a book from the Yugoslav era maybe? I'd use classic books or popular modern novels as the baseline myself :)

There's a tiny segment of paid epubs here, and a centralized free library (chitanka), which exists because it's legal to recreate written books in any form as long as there's no profit and they're already available in a library for instance.

@noembryo Feel free to look at the last few comments in #373 and chime in, I'll try and offer a replacement pattern file this week, been a bit busy.

@strn
Contributor

strn commented Jan 1, 2021

Wasn't meant to offend you in any way, sorry if that was the case.

No offence taken, do not worry. I just know that situation around Serbian language is complex, even more because it was in unnatural union with Croatian language. It is not easy for foreigners to understand ;-)

Could you give examples of some (not copyrighted, of course)? My thinking is what codes are used in ePubs that use Gaj's Latin Alphabet and what are used for those in Serbian Cyrillic - especially if somebody might have made a pirated version of, let's say, a book from the Yugoslav era maybe? I'd use classic books or popular modern novels as the baseline myself :)

Since EPUBs in the Serbian language are mostly produced by amateurs (myself included), the EPUB tags <dc:language/> and <html ... lang... xml:lang...> are wrong in 99% of cases. I have encountered these values, regardless of script (Serbian Cyrillic or Croatian Latin): none, sr, en, hr. This applies to books from both Yugoslav and contemporary times.

Hence, the hyphenation rules for the Serbian language I contributed to koreader will work only if an ebook is in Serbian Cyrillic and has the correct language code sr. That cannot be ensured or enforced across the Serbian EPUB space.

I have yet to find an ebook written in the so-called "Serbian Latin" alphabet that uses the correct language code sr@Latn. My humble self corrects language tags in books I keep for myself.
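
For anyone wanting to audit their own library, here is a minimal Python sketch (standard library only) that reads the declared dc:language values from an OPF package document and compares them with the script that dominates a text sample. The helper names `opf_languages` and `dominant_script` and the sample OPF are made up for illustration. Taking the first word of the Unicode character name as the script is a rough heuristic, not how KOReader or crengine detects anything.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import Counter

# Dublin Core namespace used by <dc:language> in EPUB package documents.
DC_NS = "http://purl.org/dc/elements/1.1/"


def opf_languages(opf_xml):
    """Return every non-empty <dc:language> value found in an OPF string."""
    root = ET.fromstring(opf_xml)
    return [
        el.text.strip()
        for el in root.iter(f"{{{DC_NS}}}language")
        if el.text and el.text.strip()
    ]


def dominant_script(text):
    """Guess the main script of a text sample (heuristic: first word of the
    Unicode character name, e.g. 'CYRILLIC' or 'LATIN')."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical sample: Cyrillic text shipped with a wrong language tag.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Пример</dc:title>
    <dc:language>en</dc:language>
  </metadata>
</package>"""

declared = opf_languages(SAMPLE_OPF)
script = dominant_script("Пример текста")
```

In this sample the declared tag is en while the text is Cyrillic, which is the kind of mismatch described above.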

@virxkane
Contributor

virxkane commented Jan 2, 2021

So, I guess that when reading Russian books, you have to disable "respects EPUB and HTML lang tags" (dunno how this is called in CoolReader) so you can force Russian_EnUS to be used instead of Russian.pattern ?

Sometimes yes.

"respects EPUB and HTML lang tags" (dunno how this is called in CoolReader)

We call it "Support for multilingual documents".

Maybe we wouldn't need any option: we could mark some languages/hyphenation dicts as being "english/latin-orthogonal" when their hyph dict contains only non-latin words (no a-z).

It seems to me that the option is needed. Firstly, there are not always English words in the text, secondly, as others have already written, not everyone needs the hyphenation of English words in the Cyrillic text, and thirdly, the option should probably not be a boolean, we should choose "English US" or "English GB".

@hius07 answered "Russian_EnUS". @virxkane @pkb : which one do you use?

There is no definite answer here. In one non-English-language book, there may be inserts in "English US", in another - in "English GB". It would probably be right to let the user pick a specific variant in this hypothetical new option.

@poire-z
Contributor

poire-z commented Jan 2, 2021

It seems to me that the option is needed.

Well, I just don't want to have to add any more UI hyphenation option stuff :)
I'd just like to have by default what would make the most sense.
The current situation is we can have RU | RU+EnUS | RU+EnGB - which I guess is fine for FB2 books, which are/were CoolReader's original target and are mostly Russian text.

But we can't have RU+FR, BG+EnUS. I'm not saying we should have them :) I'm just wondering if having only RU+EnUS | BG+EnUS wouldn't be a better alternative.

Firstly, there are not always English words in the text

In that case (except for possible performance/memory usage), having english patterns in the hyphenation data used would not hurt and just not have any effect.

secondly, as others have already written, not everyone needs the hyphenation of English words in the Cyrillic text

That's the "not everyone" I'd like to estimate :) I guess one reason to not want them is if these English words are mostly person names, which usually should not be hyphenated.
But on the other hand, if one chooses Russian hyphenation, it's also to have less whitespace in justified lines - and he would be best served by having any English/Latin words hyphenated too, even if a little badly (i.e. French words in Russian text hyphenated as English).
Dunno.

and thirdly, the option should probably not be a boolean, we should choose "English US" or "English GB".

Dunno much about the difference between EnglishUS and EnglishGB hyphenation - but I think these are the same language :) The difference in hyphenation might be only stylistic ones that only native US/GB snobs might care about :) and hoping the set of such snobs and the set of Russian FB2 book readers do not intersect :)

@cramoisi
Contributor

cramoisi commented Jan 2, 2021 via email

@poire-z
Contributor

poire-z commented Jan 2, 2021

What do you mean by "uk books"? Books published in the UK, or books whose author is English?
Or do you mean you just can't read any English (including books published in the US or by American authors) because English_US sucks and English_GB is better - or more suited to your reading of generic English?

(And what when you don't know the book origin ? Can you guess its origin depending on how you can or can't read it with English_US ? :)

@cramoisi
Contributor

cramoisi commented Jan 2, 2021 via email

@Frenzie
Member

Frenzie commented Jan 2, 2021

Incidentally, can you remember any words that mess up? I admit I haven't checked but I'd expect spelling differences like leveller vs leveler to more or less automatically result in different hyphenation.

@virxkane
Contributor

virxkane commented Jan 2, 2021

Well, I just don't want to have to add any more UI hyphenation option stuff :)

OK, why did you ask then?

But we can't have RU+FR

Why not? Russian + French: https://en.wikipedia.org/wiki/War_and_Peace

In that case (except for possible performance/memory usage), having english patterns in the hyphenation data used would not hurt and just not have any effect.

OK.

That's the "not everyone" I'd like to estimate :) I guess one reason to not want them is if these English words are mostly person names, which usually should not be hyphenated.
But on the other hand, if one chooses Russian hyphenation, it's also to have less whitespace in justified lines - and he would be best served by having any English/Latin words hyphenated too, even if a little badly (i.e. French words in Russian text hyphenated as English).
Dunno.

They are different people, one needs one thing, the other needs another.

Dunno much about the difference between EnglishUS and EnglishGB hyphenation - but I think these are the same language :)

OK, let's combine the English US and English UK hyphenation dictionaries. Of course a joke.

@cramoisi
Contributor

cramoisi commented Jan 2, 2021 via email

@Frenzie
Member

Frenzie commented Jan 2, 2021

@cramoisi Remember, I have a degree in Dutch & English. I'm well aware. ;-) But hyphenation isn't really different — my point is that it follows more or less automatically from the spelling.

In the example above:

lev·el·er
lev·el·ler

There's no difference there in how to correctly hyphenate a word. If it were spelled leveller in American English, it would be hyphenated as lev·el·ler too.

@cramoisi
Contributor

cramoisi commented Jan 2, 2021 via email

@noembryo

noembryo commented Jan 2, 2021

Well, I just don't want to have to add any more UI hyphenation option stuff :)

It could be done by letting the user select a second (or even third) checkbox with a long press (popup maybe?)

Edit: OK, I know there is the addition of the popup, but.. :o)

@poire-z
Contributor

poire-z commented Jan 2, 2021

But we can't have RU+FR

Why not?

We can't currently.

Well, I just don't want to have to add any more UI hyphenation option stuff :)

OK, why did you ask then?

OK, I know there is the addition of the popup, but.. :o)

This UI stuff would be the most fun stuff - but there's the whole interface/passing these settings from frontend to crengine - and the internal handling of all that by crengine itself - which I really don't want to get into :)
What I asked for opinions about could just work with Russian, Greek and the few other languages. I don't want to add more complexity to this already complex handling of hyphenation/typography languages - just wondering if we could get better default behaviour for these languages with a simple merging of non-interfering, non-overlapping hyph dicts - instead of having "hand made" dedicated Russian_EnGB.pattern and Russian_EnUS.pattern that may never be updated when Russian.pattern or English_US.pattern is. (And that don't fit well in the lang tag paradigm, even if it may be a good solution in the Russian FB2 publishing ecosystem.)
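
To illustrate why merging non-overlapping pattern sets should be harmless, here is a toy Liang-style hyphenator in Python. It is only a sketch: the patterns below are invented for the demo (they are not taken from Russian.pattern or English_US.pattern), the function names are hypothetical, and crengine's real implementation is in C++ and works differently in the details. The point is that a pattern only fires on substrings made of its own letters, so Latin patterns never touch a Cyrillic word and vice versa.

```python
import re


def parse_pattern(pattern):
    """Split a Liang pattern such as 'о1к' into its letters ('ок') and the
    list of weights for the gaps before/between/after those letters."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return the word split at every gap whose highest matching weight is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."
    scores = [0] * (len(padded) + 1)
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], weight)
    parts, last = [], 0
    # The gap before word[k] is the gap before padded[k + 1].
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:
            parts.append(word[last:k])
            last = k
    parts.append(word[last:])
    return parts


# Toy patterns, invented for this demo only.
TOY_RU = ["о1л", "о1к"]
TOY_EN = ["n1t", "s1t"]
MERGED = TOY_RU + TOY_EN

print(hyphenate("молоко", TOY_RU))     # ['мо', 'ло', 'ко']
print(hyphenate("молоко", MERGED))     # ['мо', 'ло', 'ко'] (unchanged)
print(hyphenate("fantastic", TOY_RU))  # ['fantastic'] (no Latin patterns)
print(hyphenate("fantastic", MERGED))  # ['fan', 'tas', 'tic']
```

With the merged list, the Cyrillic word is split exactly as before, and the Latin word only gains break points. The cost is a larger pattern table, which matches the performance/memory caveat mentioned earlier in the thread.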

@noembryo

noembryo commented Jan 2, 2021

This UI stuff would be the most fun stuff -

That's why I'm so eager to come up with suggestions about these..

but there's the whole interface/passing these settings from frontend to crengine - and the internal handling of all that by crengine itself - which I really don't want to get into :)

... (he whistles looking at the ceiling) .. ;o)

@virxkane
Contributor

virxkane commented Jan 2, 2021

@poire-z It's a bad idea to add "English US" (or GB) hyphenation dictionary by default for Russian books - what happens to the hyphenation if the Russian book contains fragments in French? See "War and Peace". We cannot decide for the user which second hyphenation language to use!

Therefore, I am in favor of the hypothetical option "additional hyphenation dictionary".

We can't currently.

But we can discuss hypothetical variants? :)

@poire-z
Contributor

poire-z commented Jan 2, 2021

Sure, we can discuss that, as long as I don't have to implement it :)
But it just feels nightmarish, from the crengine low level implementation up to the frontend setting implementation (and there, as you would do the CoolReader one, we might both agree to stop discussing it :)

Also, it's not only Russian+FR that would be needed to be supported. It's the multiple combinations of "orthogonal alphabets" - and preventing combinations of same alphabets hyph dicts.
Russian + FR would be OK
Ukrainian + FR would be OK.
Russian + Ukrainian is not.
Russian + Bulgarian is probably not
Russian + Greek + FR could work.
Russian + Greek + Armenian + Georgian + FR might also work :)
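
The "orthogonal alphabets" rule in the list above could be checked mechanically. Below is a small Python sketch, under the assumption that each hyph dict is available as a list of pattern strings; `scripts_of`, `can_combine` and the toy pattern lists are hypothetical names and data for illustration, and using the first word of the Unicode character name as the script is only a heuristic.

```python
import unicodedata


def scripts_of(patterns):
    """Collect the scripts used by the letters of a pattern list
    (heuristic: first word of the Unicode name, e.g. 'CYRILLIC')."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for pattern in patterns
        for ch in pattern
        if ch.isalpha()
    }


def can_combine(*dicts):
    """True when no two pattern lists share a script."""
    seen = set()
    for patterns in dicts:
        scripts = scripts_of(patterns)
        if seen & scripts:
            return False
        seen |= scripts
    return True


# Toy pattern lists, invented for this demo only.
RU = ["о1к", "а1л"]
UK = ["ї1в", "о1к"]
FR = ["é1t", "a1l"]
EL = ["α1λ"]
HY = ["ա1բ"]
KA = ["ა1ბ"]

print(can_combine(RU, FR))              # True
print(can_combine(RU, UK))              # False
print(can_combine(RU, EL, HY, KA, FR))  # True
```

This only answers whether a combination is safe. It says nothing about which secondary language a user would actually want, which is the harder question raised in this thread.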

And have that working (and clear to the user in the settings) with lang tags...
A book with main lang=fr - and some inner section with lang=ru could use ru+enUS if the user has set en_US as the secondary hyph lang for ru - but not for the main FR lang. Spaghettings soup :)

I think the only proper solution to have all that done correctly is for publishers to properly set lang= attributes in the HTML - and KOReader/CoolReader would support them perfectly.
I don't know if FB2 or FB3 have support for such lang= tags.
I'd be curious if you can find some russian War and Peace ebooks, in FB2/FB3 and EPUBs, to see if they have proper lang=fr for the french sections. I hope recent EPUBs have them.

@virxkane
Contributor

virxkane commented Jan 2, 2021

@poire-z Ok, let's leave it as it is :)

I think the only proper solution to have all that done correctly is for publishers to properly set lang= attributes in the HTML - and KOReader/CoolReader would support them perfectly.

I agree.

I don't know if FB2 or FB3 have support for such lang= tags.

As far as I know, it doesn't support it, that's the problem.

I'd be curious if you can find some russian War and Peace ebooks, in FB2/FB3 and EPUBs, to see if they have proper lang=fr for the french sections. I hope recent EPUBs have them.

Most likely they do. So what? Can I think a little about the future? :)

@poire-z poire-z mentioned this pull request Aug 26, 2021
8 participants