As mentioned in #1859, there are many places in our code that encode strings into `UTF-16` and then specify their length as `2 * (number of characters)`.
This is NOT true. `UTF-16` encodes Unicode code points from `U+0000` to `U+FFFF` as 2 bytes, but anything from `U+010000` to `U+10FFFF` is encoded as 4 bytes (a surrogate pair).
Some noteworthy code-points in the higher range are:
We should identify every place in the code where we assume the length of `UTF-16` encoded strings and just replace them with `len(utf16_encoded_bytes)`.
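A minimal sketch of that replacement, assuming Python and little-endian `UTF-16`. The helper name `utf16le_field` and the sample strings are hypothetical and are not taken from our code base; the point is only that the length is measured from the encoded bytes rather than derived from the character count.

```python
from typing import Tuple


def utf16le_field(text: str) -> Tuple[int, bytes]:
    """Encode text as UTF-16LE and return (byte_length, encoded_bytes).

    Hypothetical helper for illustration: the length comes from the
    encoded buffer, never from 2 * len(text).
    """
    utf16_encoded_bytes = text.encode("utf-16-le")
    return len(utf16_encoded_bytes), utf16_encoded_bytes


if __name__ == "__main__":
    # "abc" stays inside U+0000..U+FFFF; the second sample contains
    # U+1F600, which UTF-16 encodes as a 4-byte surrogate pair.
    for sample in ("abc", "a\U0001F600c"):
        assumed = 2 * len(sample)          # the assumption criticised above
        actual, _ = utf16le_field(sample)  # measured from the encoded bytes
        print(f"{sample!a}: assumed={assumed} actual={actual}")
```

For the second sample the assumed length is 6 bytes while the encoded buffer is 8 bytes, so a length field filled from the assumption would cut off the last character.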
Here's an example of why we shouldn't assume the length of a `UTF-16` encoded string:
This is an incomplete list of places where we're assuming the length of `UTF-16` encoded strings (credits to @rtpt-romankarwacik for most of them):