Fix for loading tokenizers using non-utf8 strings #55

ThomasProg · 2025-01-13T14:23:56Z

Since Tokenizers 0.20.0, BPE tokenizers can produce strings that contain non-UTF-8 bytes.
As a result, the current version of tokenizers-cpp cannot load a tokenizer created with a newer version of Tokenizers.

This pull request fixes that by replacing std::str::from_utf8(), which previously failed with an error on such input, with String::from_utf8_lossy(), which accepts non-UTF-8 byte sequences and substitutes the replacement character U+FFFD for each invalid sequence.
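
To illustrate the difference between the two calls, here is a minimal standalone sketch. It is not the code from this PR: the helper names `decode_strict` and `decode_lossy` are hypothetical and exist only for the example. Only `std::str::from_utf8` and `String::from_utf8_lossy` are real standard-library functions.

```rust
use std::borrow::Cow;

/// Strict decoding (hypothetical helper): returns an error
/// as soon as the input contains an invalid UTF-8 sequence.
fn decode_strict(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding (hypothetical helper): never fails; each invalid
/// sequence is replaced by U+FFFD (the replacement character).
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

For the bytes `[0x68, 0x69, 0xFF]`, `decode_strict` returns `Err(..)` because `0xFF` is never valid in UTF-8, while `decode_lossy` returns `"hi\u{FFFD}"`. Note that the lossy call returns a `Cow<str>` rather than a `&str`, so call sites that expected `&str` may need a small type adjustment. It is also a one-way conversion: the original invalid bytes cannot be recovered from the resulting string.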

ThomasProg added 3 commits January 13, 2025 23:03

std::str::from_utf8() replaced by String::from_utf8_lossy() to fix non utf8 data error in case of BPE use

0658eef

Merge remote-tracking branch 'upstream/main' into utf8fix

873f641

fixed invalid type

085a8ca
