
Punctuation difference between languages (Quotation marks) #133

Open · Godnoken opened this issue Jun 9, 2023 · 8 comments

Comments

@Godnoken

Godnoken commented Jun 9, 2023

Hi there,

I've just come to realize that there are a lot of different types of quotation marks depending on where you're from.

Do the language models used by translateLocally take quotation marks into account at all, or do they only follow rules for uppercase letters, commas, and dots?

For example, let's say we have a French sentence like this:
Vous avez dit «Je vais manger une crêpe», mais au lieu de cela, vous avez mangé un hamburger.

Would it be fair to assume that replacing the French quotation marks with plain " or ' characters could affect the translation outcome?

I'm not sure where to look for this type of information, and I'm hoping someone can bring some clarity to how it works.

Thank you very much.

@jelmervdl
Collaborator

> Would it be fair to assume that replacing the French quotation marks with plain " or ' characters could affect the translation outcome?

Yes. Generally, we train the models on data where we don't change the quotation marks. So if the French side of a sentence pair uses «» but the English side uses "", the model will learn to translate between the two. However, the data isn't perfect, and there will be plenty of examples where both languages use "". So, depending a bit on which parts of the dataset use authentic quotes and which use normalised ones, you might get different translations depending on the quotes you use.

I noticed that our models are pretty bad at translating sentences that are entirely upper case, for example, because we don't have that many sentences like that in our training data¹. And there's no rule or knowledge in the model that understands that a sentence entirely in upper case is often just a normal sentence, uppercased. I'm also thinking that the uppercase examples it does see will mostly be website navigation and headlines. So, again, translations will come out more like those.

You can play around with normalising the translation input in a certain way (e.g. lower-casing text, replacing all quotation marks, etc.) to see whether it helps with quality. But there's no list of things that is guaranteed to work; it all depends on how the model was trained. And that's often not documented at all 😞
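
A minimal sketch of that kind of input normalisation (the character map is an assumption; extend it as needed):

```python
import re

# Map language-specific quotation marks to plain ASCII quotes.
# The character set here is an assumption; extend it as needed.
QUOTE_MAP = {
    "«": '"', "»": '"',            # French guillemets
    "„": '"', "“": '"', "”": '"',  # German/English curly quotes
    "‘": "'", "’": "'",            # curly apostrophes
}
QUOTE_RE = re.compile("|".join(map(re.escape, QUOTE_MAP)))

def normalise(text: str, lowercase: bool = False) -> str:
    text = QUOTE_RE.sub(lambda m: QUOTE_MAP[m.group(0)], text)
    return text.lower() if lowercase else text

sentence = "Vous avez dit «Je vais manger une crêpe», mais au lieu de cela, vous avez mangé un hamburger."
print(normalise(sentence))
# Vous avez dit "Je vais manger une crêpe", mais au lieu de cela, vous avez mangé un hamburger.
```

Comparing translations of the raw and normalised input is the quickest way to see whether your model prefers one form over the other.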

Footnotes

  1. We're working on improving this in future models by just applying toUpper() to random sentence pairs in our training data, so the model will see more of these examples.
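
For what it's worth, a rough sketch of what that augmentation could look like (the 10% rate and the list-based handling are my assumptions, not the actual training pipeline):

```python
import random

# Uppercase a random subset of sentence pairs so the model sees
# fully upper-cased text during training. The 10% rate is an assumption.
def augment(pairs: list[tuple[str, str]], rate: float = 0.1, seed: int = 0) -> list[tuple[str, str]]:
    rng = random.Random(seed)
    out = list(pairs)  # originals are kept, augmented copies are appended
    for src, tgt in pairs:
        if rng.random() < rate:
            out.append((src.upper(), tgt.upper()))
    return out

pairs = [("Je vais manger une crêpe.", "I am going to eat a crêpe.")]
print(augment(pairs, rate=1.0))
# [('Je vais manger une crêpe.', 'I am going to eat a crêpe.'),
#  ('JE VAIS MANGER UNE CRÊPE.', 'I AM GOING TO EAT A CRÊPE.')]
```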

@Godnoken
Author

> Yes. Generally, we train the models on data where we don't change the quotation marks. So if the French side of a sentence pair uses «» but the English side uses "", the model will learn to translate between the two. However, the data isn't perfect, and there will be plenty of examples where both languages use "". So, depending a bit on which parts of the dataset use authentic quotes and which use normalised ones, you might get different translations depending on the quotes you use.

Very interesting! Just a thought, and maybe it isn't really an issue in practice, but what if you "cleaned" a language model's training data by converting quotation marks to each language's preferred style, and then used a regex at runtime to make sure the text you want to translate always uses that same style? Or, as a more universal solution, convert all quotation marks to "" before training and apply the same kind of regex at runtime.
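
Something like this, per language (the style table is just a guess at each language's convention):

```python
# The style table is just a guess at each language's convention.
STYLES = {
    "fr": ("«", "»"),
    "de": ("„", "“"),
    "en": ('"', '"'),
}

def restyle(text: str, lang: str) -> str:
    """Rewrite straight double quotes into the language's preferred pair."""
    open_q, close_q = STYLES[lang]
    parts = text.split('"')
    out = [parts[0]]
    # Alternate opening/closing marks for successive straight quotes.
    for i, part in enumerate(parts[1:]):
        out.append(open_q if i % 2 == 0 else close_q)
        out.append(part)
    return "".join(out)

print(restyle('You said "I will eat a crêpe".', "fr"))
# You said «I will eat a crêpe».
```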

> I noticed that our models are pretty bad at translating sentences that are entirely upper case, for example, because we don't have that many sentences like that in our training data¹. And there's no rule or knowledge in the model that understands that a sentence entirely in upper case is often just a normal sentence, uppercased. I'm also thinking that the uppercase examples it does see will mostly be website navigation and headlines. So, again, translations will come out more like those.

> You can play around with normalising the translation input in a certain way (e.g. lower-casing text, replacing all quotation marks, etc.) to see whether it helps with quality. But there's no list of things that is guaranteed to work; it all depends on how the model was trained. And that's often not documented at all 😞

I have noticed the uppercase issue too, and I did opt for lowercasing everything and then using a regex to uppercase the first letter where there should be one. But of course that ended up breaking some translations too, because names were no longer interpreted as names and such.

What I've gone with now is to lowercase all words that contain more than two uppercase letters. Unfortunately, that solution isn't perfect either. For example, in a translation from English to Spanish of "Empires Ascendant", "Ascendant" wouldn't get translated because both words are title-cased. I don't think there's much that can be done in that case, though…
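
For reference, a minimal sketch of the heuristic I'm using (reconstructed for illustration):

```python
def soften_case(text: str) -> str:
    """Lowercase words with more than two uppercase letters,
    leaving Title Case words (a single capital) untouched."""
    out = []
    for word in text.split():
        uppers = sum(1 for c in word if c.isupper())
        out.append(word.lower() if uppers > 2 else word)
    return " ".join(out)

print(soften_case("EMPIRES Ascendant"))  # empires Ascendant
```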

> Footnotes
>
> 1. We're working on improving this in future models by just applying toUpper() to random sentence pairs in our training data, so the model will see more of these examples.

Awesome! Would that mean that the data is "doubled"/trained twice? Just wondering whether randomly applying uppercase like that would affect the translation quality at all.

@jelmervdl
Collaborator

> Just a thought, and maybe it isn't really an issue in practice, but what if you "cleaned" a language model's training data by converting quotation marks to each language's preferred style, and then used a regex at runtime to make sure the text you want to translate always uses that same style? Or, as a more universal solution, convert all quotation marks to "" before training and apply the same kind of regex at runtime.

We try not to do too much language-specific editing of the translation output (often called post-editing), because then we'd need to implement it in the translator for each language we try to support. Ideally the software is language-agnostic, and only the model cares about language specifics.

We're also trying not to make too many assumptions about how people will use translations. If we clean our data in a specific way, but someone then tries to translate text containing the thing we just removed, the output will be worse. So our approach is to just hope we have enough training data to train the model to do the right thing.

> What I've gone with now is to lowercase all words that contain more than two uppercase letters. Unfortunately, that solution isn't perfect either. For example, in a translation from English to Spanish of "Empires Ascendant", "Ascendant" wouldn't get translated because both words are title-cased. I don't think there's much that can be done in that case, though…

Hehe, this is a good example of why we try to just let the model figure it out from the training data. When you come up with rules, there are always many exceptions. And again, the rules will be language-specific (and context-specific! E.g. what about paper titles that are in Title Case?).

> Would that mean that the data is "doubled"/trained twice? Just wondering whether randomly applying uppercase like that would affect the translation quality at all.

It would. Because of how the whole sentence -> tokens -> vocabulary pipeline works, you get different tokenisations and vocabulary entries for "for", "For" and "FOR". The model can learn that they are related to some degree, but there is no explicit knowledge in there that encodes "For" as "for with a capital F".

The SentencePiece tokenisation further messes with expectations: "for" will likely occur in the vocabulary, so it will translate into one token. "FOR", as a single word, might not occur in the vocabulary, and might be split into "F" + "OR" or even "F" + "O" + "R": three tokens.
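
You can check this with the sentencepiece Python package against one of the model's vocabulary files; a small sketch, where "vocab.enfr.spm" is a placeholder file name:

```python
import sentencepiece as spm

# "vocab.enfr.spm" is a placeholder for whichever vocabulary file
# the language pair actually ships with.
sp = spm.SentencePieceProcessor(model_file="vocab.enfr.spm")

for word in ("for", "For", "FOR"):
    print(word, "->", sp.encode(word, out_type=str))
# Typically something like:
#   for -> ['▁for']
#   For -> ['▁For']
#   FOR -> ['▁F', 'OR']   (the exact split depends on the vocabulary)
```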

OpenAI has a tool to visualise their tokenizer, which works in a similar way. We use a small custom vocabulary per language pair, so the actual tokens will be different, but the idea is the same. Try for example "fork" and "FORK" and see how they differ.
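
The same experiment can be done locally with OpenAI's tiktoken package (using one of their public encodings; our own vocabularies are separate from this):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's public encodings

for word in ("fork", "FORK"):
    ids = enc.encode(word)
    print(word, "->", ids, [enc.decode([i]) for i in ids])
# "FORK" usually ends up as more tokens than "fork".
```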

@Godnoken
Author

Godnoken commented Jun 27, 2023

> We try not to do too much language-specific editing of the translation output (often called post-editing), because then we'd need to implement it in the translator for each language we try to support.

Ah… I didn't think about that, hehe.

> We're also trying not to make too many assumptions about how people will use translations. If we clean our data in a specific way, but someone then tries to translate text containing the thing we just removed, the output will be worse. So our approach is to just hope we have enough training data to train the model to do the right thing.

> Hehe, this is a good example of why we try to just let the model figure it out from the training data. When you come up with rules, there are always many exceptions. And again, the rules will be language-specific (and context-specific! E.g. what about paper titles that are in Title Case?).

That most definitely sounds like the right path to take. Hopefully it will be possible to cover most edge cases just by having more data to work with in the future, then.

> It would. Because of how the whole sentence -> tokens -> vocabulary pipeline works, you get different tokenisations and vocabulary entries for "for", "For" and "FOR". The model can learn that they are related to some degree, but there is no explicit knowledge in there that encodes "For" as "for with a capital F".

> The SentencePiece tokenisation further messes with expectations: "for" will likely occur in the vocabulary, so it will translate into one token. "FOR", as a single word, might not occur in the vocabulary, and might be split into "F" + "OR" or even "F" + "O" + "R": three tokens.

How/where do you gather the vocabulary that is in use? Is there a good reason why you wouldn't make sure every word exists in both capitalised and non-capitalised form in that list? Speed, worse quality, etc.?

> OpenAI has a tool to visualise their tokenizer, which works in a similar way. We use a small custom vocabulary per language pair, so the actual tokens will be different, but the idea is the same. Try for example "fork" and "FORK" and see how they differ.

That is super cool! Thank you for that and the explanation above, that really does help me understand how it all comes together. 😃

@XapaJIaMnu
Owner

> How/where do you gather the vocabulary that is in use?

Vocabulary is gathered from clean training data that we believe offers an accurate word distribution for the languages we translate.
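
Since our models use SentencePiece, building such a vocabulary looks roughly like this (file names and vocabulary size are illustrative only):

```python
import sentencepiece as spm

# Train a joint subword vocabulary from a cleaned corpus.
# "clean.en-fr.txt" and the 32k size are illustrative values only.
spm.SentencePieceTrainer.train(
    input="clean.en-fr.txt",
    model_prefix="vocab.enfr",
    vocab_size=32000,
)
# Produces vocab.enfr.model and vocab.enfr.vocab.
```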

> Is there a good reason why you wouldn't make sure every word exists in both capitalised and non-capitalised form in that list? Speed, worse quality, etc.?

Many reasons:

  1. Users hate to see all upper/lowercase stuff.
  2. There used to be a truecaser model that ran as a pre-/postprocessing script and tried to guess the correct case for every word. But that approach requires pre-/postprocessing of user input, which we want to avoid, and it is not very accurate: if a user inputs a new name, the truecaser model wouldn't know what to do with it; German is a nightmare with its habit of capitalising every noun; and tHeN wHaT aBoUt pEOPle tYpInG lIkE tHiS… (a toy sketch follows below)
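
Here's that toy sketch: a frequency-based truecaser, showing both the idea and why it breaks on unseen names (entirely illustrative, not the script that was actually used):

```python
from collections import Counter, defaultdict

def train_truecaser(corpus: list[str]) -> dict[str, str]:
    """Learn the most frequent surface form of each word."""
    forms: dict[str, Counter] = defaultdict(Counter)
    for line in corpus:
        for word in line.split():
            forms[word.lower()][word] += 1
    return {lower: c.most_common(1)[0][0] for lower, c in forms.items()}

def truecase(text: str, model: dict[str, str]) -> str:
    # Unseen words (e.g. new names) fall through unchanged:
    # the model simply has no opinion about them.
    return " ".join(model.get(w.lower(), w) for w in text.split())

model = train_truecaser(["Berlin is big", "I like Berlin"])
print(truecase("i visited berlin and quxford", model))
# I visited Berlin and quxford   <- the unseen "name" stays lowercase
```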

In general, letting our model handle casing produces results that humans like much more, and that are more accurate. We are aware of the limitations and will be releasing models that handle those degenerate cases.

@Godnoken
Author

> Vocabulary is gathered from clean training data that we believe offers an accurate word distribution for the languages we translate.

Cool!

> 1. Users hate to see all upper/lowercase stuff.
> 2. There used to be a truecaser model that ran as a pre-/postprocessing script and tried to guess the correct case for every word. But that approach requires pre-/postprocessing of user input, which we want to avoid, and it is not very accurate: if a user inputs a new name, the truecaser model wouldn't know what to do with it; German is a nightmare with its habit of capitalising every noun; and tHeN wHaT aBoUt pEOPle tYpInG lIkE tHiS…
>
> In general, letting our model handle casing produces results that humans like much more, and that are more accurate. We are aware of the limitations and will be releasing models that handle those degenerate cases.

Thank you for the info; all the work you guys have done and are doing is fascinating to me.


Another question, related to the title casing mentioned further above: do you have any realistic expectation of the models eventually being able to handle "Very Cool Stuff Here" well?
Is it just a matter of having enough data and time on your hands, or are there technical difficulties that would prevent the models from handling lowercase, uppercase and title case at the same time?

@XapaJIaMnu
Owner

Yes, we will definitely be able to translate "Very Cool Stuff Here" eventually. This is a solved problem; we just need the time and manpower to rebuild the existing models, and we are lacking both.

@Godnoken
Author

That's great to hear! I hope you'll get more support in the future; it's well deserved.
