Punctuation difference between languages (Quotation marks) #133
Yes. Generally, we train the models with data where we don't change the quotation marks. So if the French translation of a sentence uses « guillemets », that is what the model has learned to expect and to produce.

I noticed that our models are pretty bad at translating sentences that are entirely upper case, for example, because we don't have that many sentences like that in our training data. And there's no rule or knowledge in the model that understands that a sentence entirely in upper case is often the same thing as a normal sentence, but uppercased. I'm also thinking that the examples of uppercase it does see will mostly be website navigation and headlines, so again, translations will be more like those.

You can play around with normalizing the translation input in a certain way (e.g. lower-casing text, replacing all quotation marks, etc.) to see whether it helps with the quality. But there's no list of things that is guaranteed to work; it all depends on how the model was trained. And that's often not documented at all 😞
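A minimal sketch of the kind of input normalization mentioned above (unifying quotation marks and optionally lower-casing before translation). The character list and helper name are made up for illustration, and whether any of this actually helps depends entirely on how the model was trained, so treat it as an experiment rather than a fix:

```python
import re

# Map common language-specific quotation marks to a plain ASCII quote.
# This list is illustrative, not exhaustive.
QUOTE_CHARS = "«»„“”‹›‚‘’"

def normalize_input(text: str, lowercase: bool = True) -> str:
    """Normalize text before sending it to the translator (experimental)."""
    text = re.sub(f"[{QUOTE_CHARS}]", '"', text)
    if lowercase:
        text = text.lower()
    return text

print(normalize_input("Vous avez dit «Je vais manger une crêpe», mais..."))
# -> vous avez dit "je vais manger une crêpe", mais...
```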
Very interesting! Just a thought, and maybe it isn't really an issue in practice, but what if you'd "clean" a language model's data by converting quotation marks to their respective country preferences, and then use regex in real time to make sure that the text you want to translate from is always using the same quotation marks? Or, as a more universal solution, convert all quotation marks to a single standard mark before training?
I have noticed the uppercase issue too, and I opted for lowercasing everything and then using regex to uppercase the first letter where there should be one, but of course that ended up breaking some translations too because of names not being interpreted as names and such. What I've gone with now is to lowercase all words that contain more than two uppercase letters. Unfortunately, that solution isn't perfect either. For example, a translation from English to Spanish of "Empires Ascendant" wouldn't translate "Ascendant", because both words are title cased. I don't think there's much that can be done in that case, though.
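A rough sketch of the heuristic described above, as I understand it; the function is hypothetical and not part of translateLocally. It also shows the limitation mentioned: ordinary Title Case words are left alone.

```python
def normalize_casing(sentence: str) -> str:
    """Lowercase words with more than two uppercase letters (likely ALL CAPS),
    leaving names and Title Case words alone, then restore a leading capital."""
    words = []
    for word in sentence.split():
        if sum(1 for ch in word if ch.isupper()) > 2:
            word = word.lower()
        words.append(word)
    result = " ".join(words)
    # Re-capitalize the first letter of the sentence.
    return result[:1].upper() + result[1:] if result else result

print(normalize_casing("READ THIS Now about Empires Ascendant"))
# -> "Read this Now about Empires Ascendant"
# "Empires Ascendant" stays title cased, so the model may still treat it as a name.
```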
Awesome! Would that mean that the data is 'doubled'/trained on twice? Just wondering if only randomly applying uppercase would affect the translation quality at all.
We try not to do too much language-specific editing of the translation output (often called post-editing), because then we would need to implement that in the translator for each language we try to support. Ideally the software is language agnostic, and only the model cares about language specifics. We're also trying not to make too many assumptions about how people will use translations. If we clean our data in a specific way, but someone then tries to translate text containing the thing we just removed, the output will be worse. So our approach is to just hope we have enough training data to train the model to do the right thing.
Hehe, this is a good example of why we try to just let the model figure it out from the training data. When you come up with rules, there are always many exceptions. And again, the rules will be language-specific (and context-specific! E.g. what about paper titles that are in Title Case?).
It would. Because of how the whole sentence -> tokens -> vocabulary thing works, you have different tokenisations and vocabulary entries for "for", "For" and "FOR". The model can learn that they are related to some degree, but there is no explicit knowledge in there that encodes "For" as "for with a capital F". The SentencePiece tokenisation further messes with the expectations: "for" will likely occur in the vocabulary, so it will translate into one token. "FOR" might not occur in the vocabulary as a single word, and might be split into "F" + "OR" or even "F" + "O" + "R": three tokens. OpenAI has a tool to visualise their tokenizer, which works in a similar way. We use a small custom vocab per language pair, so the actual tokens will be different, but the idea is the same. Try for example "fork" and "FORK" and see how they differ.
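To see the effect concretely, here is a small example using OpenAI's tiktoken library. This is a different tokenizer from the per-language-pair SentencePiece vocabularies described above, so the exact splits will differ, but the behaviour it illustrates is the same idea:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["fork", "FORK", "for", "FOR"]:
    tokens = enc.encode(word)
    print(word, "->", len(tokens), "token(s):", [enc.decode([t]) for t in tokens])

# Lower-case words are typically a single token, while their all-caps
# variants are often split into several pieces.
```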
Ah.. I didn't think about that hehe.
That definitely sounds like the right path to take. Hopefully it will be possible to cover most edge cases by just having more data to work with in the future, then.
How/where do you gather the vocabulary that is in use? Is there a good reason why you wouldn't make sure every word exists in that list both capitalized and non-capitalized? Speed, worse quality, etc.?
That is super cool! Thank you for that and the explanation above; that really does help me understand how it all comes together. 😃
Vocabulary is gathered from clean training data that we believe offers an accurate word distribution of the languages we translate.
In general, letting our model handle cases produces results that are much more liked by humans, and are more accurate. We are aware of the limitations and will be releasing models that handle those degenerate cases.
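For reference, a subword vocabulary like this is typically built with SentencePiece directly from the training corpus. The snippet below is a generic sketch only: the file name, model prefix and vocab size are placeholders, not the project's actual settings or pipeline.

```python
import sentencepiece as spm

# Train a subword vocabulary from clean parallel training text.
# "corpus.en-fr.txt", "enfr" and the vocab size are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.en-fr.txt",
    model_prefix="enfr",
    vocab_size=32000,
    character_coverage=1.0,
)

# Load the trained model and inspect how a sentence gets split into tokens.
sp = spm.SentencePieceProcessor(model_file="enfr.model")
print(sp.encode("You said «I am going to eat a crêpe»", out_type=str))
```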
Cool!
Thank you for the info; all of the work you guys have done and are doing is very fascinating to me. Another question, relevant to the title casing mentioned further above: do you have any decent expectation of the models eventually being able to handle "Very Cool Stuff Here" well?
Yes, we will definitely eventually be able to translate "Very Cool Stuff Here". This is a solved problem; we just need the time and manpower to rebuild the existing models, and we are lacking that.
That's great to hear! I hope you'll get more support in the future; it's duly deserved.
Hi there,
I've just come to realize that there are a lot of different types of quotation marks depending on where you're from.
Do the language models used by translateLocally take quotation marks into account at all, or do they only follow rules for uppercase letters, commas and full stops?
For example, let's say we have a French sentence going something like this:
Vous avez dit «Je vais manger une crêpe», mais au lieu de cela, vous avez mangé un hamburger.
Would it be fair to assume that replacing the French quotation marks with straight " or ' marks could affect the translation outcome?
I'm not sure where to look for this type of information, and I'm hoping someone could bring some clarity to how it works.
Thank you very much.