Unicode normalization? #224

Open
davidweichiang opened this issue Feb 7, 2023 · 1 comment
@davidweichiang

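Currently no Unicode normalization (e.g., NFKC or NFKD) is done, so (say) für written with a precomposed ü (U+00FC) and für written as u followed by a combining diaeresis (U+0308) would not count as a match, even though they render identically. Would it be possible to add this, or would it break too many things?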
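To make the mismatch concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The variable names are illustrative, and nothing here reflects how the tool itself compares strings; it only shows why two visually identical spellings differ and how normalization reconciles them.

```python
import unicodedata

# Two spellings of the same word that render identically.
composed = "f\u00fcr"     # ü as the single code point U+00FC
decomposed = "fu\u0308r"  # u followed by combining diaeresis U+0308

# Without normalization the strings are different code point sequences.
raw_match = composed == decomposed

# After normalizing both sides to the same form, they compare equal.
nfkc_match = (
    unicodedata.normalize("NFKC", composed)
    == unicodedata.normalize("NFKC", decomposed)
)
nfkd_match = (
    unicodedata.normalize("NFKD", composed)
    == unicodedata.normalize("NFKD", decomposed)
)

print(raw_match, nfkc_match, nfkd_match)

# The compatibility forms (NFKC/NFKD) also rewrite characters such as
# ligatures, which the canonical forms (NFC/NFD) leave alone.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", ligature) == ligature)
print(unicodedata.normalize("NFKC", ligature) == "fi")
```

The last two lines hint at the "would it break things" part of the question: NFKC and NFKD are lossy for compatibility characters, so applying them can change tokens beyond just merging composed and decomposed accents.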

@ZJaume
Contributor

ZJaume commented Feb 16, 2023

I had similar issues with special Unicode symbols. It may not suit every scenario, but I worked around it with --tok spm, since SentencePiece already applies NFKC normalization by default. The drawback is that BLEU scores tend to be higher (there are more tokens), so you cannot compare them with BLEU scores computed under the default tokenization unless you recompute those with spm as well.
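A possible alternative to switching tokenizers, sketched here as an assumption rather than anything the tool provides, is to normalize hypotheses and references yourself before scoring. The helper name `normalize_for_scoring` and the data layout (one list of hypotheses, plus a list of reference streams) are hypothetical; only the standard `unicodedata` module is used.

```python
import unicodedata
from typing import List, Tuple


def normalize_for_scoring(
    hypotheses: List[str],
    references: List[List[str]],
    form: str = "NFKC",
) -> Tuple[List[str], List[List[str]]]:
    """Apply the same Unicode normalization form to both sides.

    Hypothetical helper: `references` is assumed to be a list of
    reference streams, each a list of segments.
    """
    norm_hyps = [unicodedata.normalize(form, h) for h in hypotheses]
    norm_refs = [
        [unicodedata.normalize(form, r) for r in stream]
        for stream in references
    ]
    return norm_hyps, norm_refs


# Usage: decomposed ü in the hypothesis, precomposed ü in the reference.
hyps = ["fu\u0308r alle"]
refs = [["f\u00fcr alle"]]
norm_hyps, norm_refs = normalize_for_scoring(hyps, refs)
print(norm_hyps[0] == norm_refs[0][0])
```

Scores computed on pre-normalized text are still not strictly comparable with scores computed on the raw text, so the normalization form used should be reported alongside them.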
