
Punctuation difference between languages (Quotation marks) #133

Open · Godnoken opened this issue Jun 9, 2023 · 8 comments

Comments

@Godnoken

Godnoken commented Jun 9, 2023

Hi there,

I've just come to realize that there are a lot of different types of quotation marks depending on where you're from.

Do the language models used by translateLocally take quotation marks into account at all, or do they only follow rules for uppercase letters, commas, and dots?

For example, let's say we have a French sentence like this:
Vous avez dit «Je vais manger une crêpe», mais au lieu de cela, vous avez mangé un hamburger.

Would it be fair to assume that replacing the French quotation marks with plain " or ' characters could affect the translation outcome?

I'm not sure where to look for this type of information, and I'm hoping someone can bring some clarity to how it works.

Thank you very much.

@jelmervdl
Collaborator

> Would it be fair to assume that replacing the French quotation marks with plain " or ' characters could affect the translation outcome?

Yes. Generally, we train the models on data where we don't change the quotation marks. So if the French side of a sentence pair uses «» but the English side uses "", the model will learn to translate between the two. However, the data isn't perfect, and there will be plenty of examples where both languages use "". So, depending a bit on which parts of the dataset use authentic quotes and which use normalised ones, you might get different translations depending on the quotes you use.

I noticed that our models are pretty bad at translating sentences that are entirely upper case, for example, because we don't have that many sentences like that in our training data¹. And there's no rule or knowledge in the model that understands that a sentence entirely in upper case is often just a normal sentence, uppercased. I'm also thinking that the uppercase examples it does see will mostly be website navigation and headlines. So, again, translations will come out more like those.

You can play around with normalising the translation input in a certain way (e.g. lower-casing text, replacing all quotation marks, etc.) to see whether it helps with quality. But there's no list of things that is guaranteed to work; it all depends on how the model was trained. And that's often not documented at all 😞
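
A minimal sketch of that kind of input normalisation (the character map is an assumption; extend it as needed):

```python
import re

# Map language-specific quotation marks to plain ASCII quotes.
# The character set here is an assumption; extend it as needed.
QUOTE_MAP = {
    "«": '"', "»": '"',            # French guillemets
    "„": '"', "“": '"', "”": '"',  # German/English curly quotes
    "‘": "'", "’": "'",            # curly apostrophes
}
QUOTE_RE = re.compile("|".join(map(re.escape, QUOTE_MAP)))

def normalise(text: str, lowercase: bool = False) -> str:
    text = QUOTE_RE.sub(lambda m: QUOTE_MAP[m.group(0)], text)
    return text.lower() if lowercase else text

sentence = "Vous avez dit «Je vais manger une crêpe», mais au lieu de cela, vous avez mangé un hamburger."
print(normalise(sentence))
# Vous avez dit "Je vais manger une crêpe", mais au lieu de cela, vous avez mangé un hamburger.
```

Comparing translations of the raw and normalised input is the quickest way to see whether your model prefers one form over the other.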

Footnotes

  1. We're working on improving this in future models by just applying toUpper() to random sentence pairs in our training data, so the model will see more of these examples.
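
For what it's worth, a rough sketch of what that augmentation could look like (the 10% rate and the list-based handling are my assumptions, not the actual training pipeline):

```python
import random

# Uppercase a random subset of sentence pairs so the model sees
# fully upper-cased text during training. The 10% rate is an assumption.
def augment(pairs: list[tuple[str, str]], rate: float = 0.1, seed: int = 0) -> list[tuple[str, str]]:
    rng = random.Random(seed)
    out = list(pairs)  # originals are kept, augmented copies are appended
    for src, tgt in pairs:
        if rng.random() < rate:
            out.append((src.upper(), tgt.upper()))
    return out

pairs = [("Je vais manger une crêpe.", "I am going to eat a crêpe.")]
print(augment(pairs, rate=1.0))
# [('Je vais manger une crêpe.', 'I am going to eat a crêpe.'),
#  ('JE VAIS MANGER UNE CRÊPE.', 'I AM GOING TO EAT A CRÊPE.')]
```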

@Godnoken
Author

> Yes. Generally, we train the models on data where we don't change the quotation marks. So if the French side of a sentence pair uses «» but the English side uses "", the model will learn to translate between the two. However, the data isn't perfect, and there will be plenty of examples where both languages use "". So, depending a bit on which parts of the dataset use authentic quotes and which use normalised ones, you might get different translations depending on the quotes you use.

Very interesting! Just a thought, and maybe it isn't really an issue in practice, but what if you "cleaned" a language model's training data by converting quotation marks to each language's preferred style, and then used a regex at runtime to make sure the text you want to translate always uses that same style? Or, as a more universal solution, convert all quotation marks to "" before training and apply the same kind of regex at runtime.
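
Something like this, per language (the style table is just a guess at each language's convention):

```python
# The style table is just a guess at each language's convention.
STYLES = {
    "fr": ("«", "»"),
    "de": ("„", "“"),
    "en": ('"', '"'),
}

def restyle(text: str, lang: str) -> str:
    """Rewrite straight double quotes into the language's preferred pair."""
    open_q, close_q = STYLES[lang]
    parts = text.split('"')
    out = [parts[0]]
    # Alternate opening/closing marks for successive straight quotes.
    for i, part in enumerate(parts[1:]):
        out.append(open_q if i % 2 == 0 else close_q)
        out.append(part)
    return "".join(out)

print(restyle('You said "I will eat a crêpe".', "fr"))
# You said «I will eat a crêpe».
```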

> I noticed that our models are pretty bad at translating sentences that are entirely upper case, for example, because we don't have that many sentences like that in our training data¹. And there's no rule or knowledge in the model that understands that a sentence entirely in upper case is often just a normal sentence, uppercased. I'm also thinking that the uppercase examples it does see will mostly be website navigation and headlines. So, again, translations will come out more like those.

> You can play around with normalising the translation input in a certain way (e.g. lower-casing text, replacing all quotation marks, etc.) to see whether it helps with quality. But there's no list of things that is guaranteed to work; it all depends on how the model was trained. And that's often not documented at all 😞

I have noticed the uppercase issue too, and I did opt for lowercasing everything and then using a regex to uppercase the first letter where there should be one. But of course that ended up breaking some translations too, because names were no longer interpreted as names and such.

What I've gone with now is to lowercase all words that contain more than two uppercase letters. Unfortunately, that solution isn't perfect either. For example, in a translation from English to Spanish of "Empires Ascendant", "Ascendant" wouldn't get translated because both words are title-cased. I don't think there's much that can be done in that case, though…
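
For reference, a minimal sketch of the heuristic I'm using (reconstructed for illustration):

```python
def soften_case(text: str) -> str:
    """Lowercase words with more than two uppercase letters,
    leaving Title Case words (a single capital) untouched."""
    out = []
    for word in text.split():
        uppers = sum(1 for c in word if c.isupper())
        out.append(word.lower() if uppers > 2 else word)
    return " ".join(out)

print(soften_case("EMPIRES Ascendant"))  # empires Ascendant
```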

> Footnotes
>
> 1. We're working on improving this in future models by just applying toUpper() to random sentence pairs in our training data, so the model will see more of these examples.

Awesome! Would that mean that the data is "doubled"/trained twice? Just wondering whether randomly applying uppercase like that would affect the translation quality at all.

@jelmervdl
Collaborator

> Just a thought, and maybe it isn't really an issue in practice, but what if you "cleaned" a language model's training data by converting quotation marks to each language's preferred style, and then used a regex at runtime to make sure the text you want to translate always uses that same style? Or, as a more universal solution, convert all quotation marks to "" before training and apply the same kind of regex at runtime.

We try not to do too much language-specific editing of the translation output (often called post-editing), because then we'd need to implement it in the translator for each language we try to support. Ideally the software is language-agnostic, and only the model cares about language specifics.

We're also trying not to make too many assumptions about how people will use translations. If we clean our data in a specific way, but someone then tries to translate text containing the thing we just removed, the output will be worse. So our approach is to just hope we have enough training data to train the model to do the right thing.

> What I've gone with now is to lowercase all words that contain more than two uppercase letters. Unfortunately, that solution isn't perfect either. For example, in a translation from English to Spanish of "Empires Ascendant", "Ascendant" wouldn't get translated because both words are title-cased. I don't think there's much that can be done in that case, though…

Hehe, this is a good example of why we try to just let the model figure it out from the training data. When you come up with rules, there are always many exceptions. And again, the rules will be language-specific (and context-specific! E.g. what about paper titles that are in Title Case?).

> Would that mean that the data is "doubled"/trained twice? Just wondering whether randomly applying uppercase like that would affect the translation quality at all.

It would. Because of how the whole sentence -> tokens -> vocabulary pipeline works, you get different tokenisations and vocabulary entries for "for", "For" and "FOR". The model can learn that they are related to some degree, but there is no explicit knowledge in there that encodes "For" as "for with a capital F".

The SentencePiece tokenisation further messes with expectations: "for" will likely occur in the vocabulary, so it will translate into one token. "FOR", as a single word, might not occur in the vocabulary, and might be split into "F" + "OR" or even "F" + "O" + "R": three tokens.
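
You can check this with the sentencepiece Python package against one of the model's vocabulary files; a small sketch, where "vocab.enfr.spm" is a placeholder file name:

```python
import sentencepiece as spm

# "vocab.enfr.spm" is a placeholder for whichever vocabulary file
# the language pair actually ships with.
sp = spm.SentencePieceProcessor(model_file="vocab.enfr.spm")

for word in ("for", "For", "FOR"):
    print(word, "->", sp.encode(word, out_type=str))
# Typically something like:
#   for -> ['▁for']
#   For -> ['▁For']
#   FOR -> ['▁F', 'OR']   (the exact split depends on the vocabulary)
```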

OpenAI has a tool to visualise their tokenizer, which works in a similar way. We use a small custom vocabulary per language pair, so the actual tokens will be different, but the idea is the same. Try for example "fork" and "FORK" and see how they differ.
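
The same experiment can be done locally with OpenAI's tiktoken package (using one of their public encodings; our own vocabularies are separate from this):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's public encodings

for word in ("fork", "FORK"):
    ids = enc.encode(word)
    print(word, "->", ids, [enc.decode([i]) for i in ids])
# "FORK" usually ends up as more tokens than "fork".
```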

@Godnoken
Author

Godnoken commented Jun 27, 2023

> We try not to do too much language-specific editing of the translation output (often called post-editing), because then we'd need to implement it in the translator for each language we try to support.

Ah… I didn't think about that, hehe.

> We're also trying not to make too many assumptions about how people will use translations. If we clean our data in a specific way, but someone then tries to translate text containing the thing we just removed, the output will be worse. So our approach is to just hope we have enough training data to train the model to do the right thing.

> Hehe, this is a good example of why we try to just let the model figure it out from the training data. When you come up with rules, there are always many exceptions. And again, the rules will be language-specific (and context-specific! E.g. what about paper titles that are in Title Case?).

That most definitely sounds like the right path to take. Hopefully it will be possible to cover most edge cases just by having more data to work with in the future, then.

> It would. Because of how the whole sentence -> tokens -> vocabulary pipeline works, you get different tokenisations and vocabulary entries for "for", "For" and "FOR". The model can learn that they are related to some degree, but there is no explicit knowledge in there that encodes "For" as "for with a capital F".

> The SentencePiece tokenisation further messes with expectations: "for" will likely occur in the vocabulary, so it will translate into one token. "FOR", as a single word, might not occur in the vocabulary, and might be split into "F" + "OR" or even "F" + "O" + "R": three tokens.

How/where do you gather the vocabulary that is in use? Is there a good reason why you wouldn't make sure every word exists in both capitalised and non-capitalised form in that list? Speed, worse quality, etc.?

> OpenAI has a tool to visualise their tokenizer, which works in a similar way. We use a small custom vocabulary per language pair, so the actual tokens will be different, but the idea is the same. Try for example "fork" and "FORK" and see how they differ.

That is super cool! Thank you for that and the explanation above, that really does help me understand how it all comes together. 😃

@XapaJIaMnu
Owner

> How/where do you gather the vocabulary that is in use?

Vocabulary is gathered from clean training data that we believe offers an accurate word distribution for the languages we translate.
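
Since our models use SentencePiece, building such a vocabulary looks roughly like this (file names and vocabulary size are illustrative only):

```python
import sentencepiece as spm

# Train a joint subword vocabulary from a cleaned corpus.
# "clean.en-fr.txt" and the 32k size are illustrative values only.
spm.SentencePieceTrainer.train(
    input="clean.en-fr.txt",
    model_prefix="vocab.enfr",
    vocab_size=32000,
)
# Produces vocab.enfr.model and vocab.enfr.vocab.
```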

> Is there a good reason why you wouldn't make sure every word exists in both capitalised and non-capitalised form in that list? Speed, worse quality, etc.?

Many reasons:

  1. Users hate to see all upper/lowercase stuff.
  2. There used to be a truecaser model that ran as a pre-/postprocessing script and tried to guess the correct case for every word. But that approach requires pre-/postprocessing of user input, which we want to avoid, and it is not very accurate: if a user inputs a new name, the truecaser model wouldn't know what to do with it; German is a nightmare with its habit of capitalising every noun; and tHeN wHaT aBoUt pEOPle tYpInG lIkE tHiS… (a toy sketch follows below)
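
Here's that toy sketch: a frequency-based truecaser, showing both the idea and why it breaks on unseen names (entirely illustrative, not the script that was actually used):

```python
from collections import Counter, defaultdict

def train_truecaser(corpus: list[str]) -> dict[str, str]:
    """Learn the most frequent surface form of each word."""
    forms: dict[str, Counter] = defaultdict(Counter)
    for line in corpus:
        for word in line.split():
            forms[word.lower()][word] += 1
    return {lower: c.most_common(1)[0][0] for lower, c in forms.items()}

def truecase(text: str, model: dict[str, str]) -> str:
    # Unseen words (e.g. new names) fall through unchanged:
    # the model simply has no opinion about them.
    return " ".join(model.get(w.lower(), w) for w in text.split())

model = train_truecaser(["Berlin is big", "I like Berlin"])
print(truecase("i visited berlin and quxford", model))
# I visited Berlin and quxford   <- the unseen "name" stays lowercase
```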

In general, letting our model handle casing produces results that humans like much more, and that are more accurate. We are aware of the limitations and will be releasing models that handle those degenerate cases.

@Godnoken
Author

> Vocabulary is gathered from clean training data that we believe offers an accurate word distribution for the languages we translate.

Cool!

> 1. Users hate to see all upper/lowercase stuff.
> 2. There used to be a truecaser model that ran as a pre-/postprocessing script and tried to guess the correct case for every word. But that approach requires pre-/postprocessing of user input, which we want to avoid, and it is not very accurate: if a user inputs a new name, the truecaser model wouldn't know what to do with it; German is a nightmare with its habit of capitalising every noun; and tHeN wHaT aBoUt pEOPle tYpInG lIkE tHiS…
>
> In general, letting our model handle casing produces results that humans like much more, and that are more accurate. We are aware of the limitations and will be releasing models that handle those degenerate cases.

Thank you for the info; all the work you guys have done and are doing is fascinating to me.


Another question, related to the title casing mentioned further above: do you have any realistic expectation of the models eventually being able to handle "Very Cool Stuff Here" well?
Is it just a matter of having enough data and time on your hands, or are there technical difficulties that would prevent the models from handling lowercase, uppercase and title case at the same time?

@XapaJIaMnu
Owner

Yes, we will definitely be able to translate "Very Cool Stuff Here" eventually. This is a solved problem; we just need the time and manpower to rebuild the existing models, and we are lacking both.

@Godnoken
Author

That's great to hear! I hope you'll get more support in the future; it's well deserved.
