Unicode normalization? #224

davidweichiang · 2023-02-07T16:17:16Z

Currently no Unicode normalization (e.g., NFKC or NFKD) is done, so that (say) für and für would not count as a match. Would it be possible to add this or would it break too many things?
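
To illustrate the mismatch, here is a minimal Python sketch using only the standard-library `unicodedata` module. The two strings are built from explicit escapes, and they are an assumption about what the example above intends: a precomposed ü versus a plain u followed by a combining diaeresis.

```python
import unicodedata

composed = "f\u00fcr"      # ü as a single code point (U+00FC)
decomposed = "fu\u0308r"   # u followed by a combining diaeresis (U+0308)

# The strings render identically but differ code point by code point.
raw_match = composed == decomposed

# After NFKC normalization both collapse to the same precomposed form.
nfkc_match = (
    unicodedata.normalize("NFKC", composed)
    == unicodedata.normalize("NFKC", decomposed)
)

print(raw_match, nfkc_match)
```

Without normalization the comparison fails, so n-grams containing such a word would not be counted as matching.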


ZJaume · 2023-02-16T14:48:54Z

I had similar issues with special Unicode symbols. It may not suit every scenario, but I worked around it with --tok spm, since SentencePiece already applies NFKC normalization by default. The drawback is that BLEU scores tend to be higher (there are more tokens), so you cannot compare them to BLEU scores computed with the default tokenization unless you recompute those with spm as well.
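
Another option that keeps the default tokenization is to normalize both sides yourself before scoring. This is only a sketch: `normalize_lines` is a hypothetical helper, not part of sacreBLEU, and picking NFKC is an assumption (it also folds compatibility characters such as ligatures and full-width forms, which may or may not be wanted).

```python
import unicodedata
from typing import Iterable, List


def normalize_lines(lines: Iterable[str], form: str = "NFKC") -> List[str]:
    """Apply Unicode normalization to each line (hypothetical helper)."""
    return [unicodedata.normalize(form, line) for line in lines]


# Example inputs that differ only in how ü is encoded.
hyps = normalize_lines(["fu\u0308r dich"])
refs = normalize_lines(["f\u00fcr dich"])

# hyps and refs now agree and can be handed to the scorer as usual.
```

The same normalization form has to be applied to hypotheses and references alike, otherwise the mismatch is only moved around.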

ozancaglayan added the question label May 8, 2023
