Encoding issue related to UTF-16 conversion #2607

tomngo · 2017-10-31T17:27:06Z

Summary

If a message containing a high-order (non-ASCII) character is posted to a Restcomm number, the payload sent to the Restcomm client is encoded with extra null characters.

Steps

Good case:

SMS to 14153014887 this text: What's available today? (note straight single quote U+0027).
See that the Restcomm client receives a payload like this (note the Body): SmsSid=SMde237ddbef704a89b2c2e77b5d019377&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=What%27s+available+today%3F

Bad case:

SMS to 14153014887 this text: What’s available today? (note curly single quote U+2019).
See that the Restcomm client receives a payload like this (note the Body): SmsSid=SMf8dd33d6666a4113a9f51ddd62543b60&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=%00W%00h%00a%00t+%19%00s%00+%00a%00v%00a%00i%00l%00a%00b%00l%00e%00+%00t%00o%00d%00a%00y%00%3F

Reproducibility and Age

100%. We observed this same thing on 27 Jan 2017. We didn't pay more attention to it back then because we didn't have a customer deal that might be affected by it.

Theory

The bad-case payload is strong circumstantial evidence that, somewhere along the path from source to Restcomm client (and therefore possibly outside of Restcomm):

One component is making a binary decision to encode as UTF-16BE rather than as a mostly-single-byte encoding such as UTF-8.
A later component is assuming that its input is in the latter encoding.

Here is why. In this discussion, I'll pretend that we know the latter encoding is UTF-8.

The character W (U+0057) is encoded in UTF-8 as 57 but in UTF-16BE as 00 57.
The character ’ (U+2019) is encoded in UTF-8 as E2 80 99 but in UTF-16BE as 20 19.
The character Null (U+0000) is encoded in UTF-8 as 00.
The character Space (U+0020) is encoded in UTF-8 as 20.
The character End of Medium (U+0019) is encoded in UTF-8 as 19.
A component expecting UTF-8 but receiving 00 57 would interpret Null then W, which would be percent-plus encoded as %00W (W is an unreserved character, so it is passed through unescaped).
A component expecting UTF-8 but receiving 20 19 would interpret Space then End of Medium, which would be percent-plus encoded as +%19.
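
The theory above can be checked with a short sketch. It assumes the downstream component percent-plus encodes whatever raw bytes it receives; `simulate_body` is a hypothetical helper written for this illustration, not anything taken from Restcomm.

```python
from urllib.parse import quote_plus

def simulate_body(text: str, sender_encoding: str) -> str:
    """Encode text the way the upstream component is assumed to,
    then percent-plus encode the raw bytes without reinterpreting them."""
    raw = text.encode(sender_encoding)
    return quote_plus(raw)

# Good case: straight quote U+0027, sent as UTF-8.
good = simulate_body("What's available today?", "utf-8")

# Bad case: curly quote U+2019, sent as UTF-16BE but treated as single bytes.
bad = simulate_body("What\u2019s available today?", "utf-16-be")

print(good)
print(bad)
```

Under these assumptions, both printed strings match the Body values in the two payloads quoted under Steps.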

ivelin · 2017-10-31T20:13:10Z

Hi Tom,

What happens when you use basic string functions to normalize messages to UTF-8 and replace non-alphanumeric characters with blank space?

Ivelin

tomngo · 2017-10-31T20:20:25Z

Ivelin, good question. We could recognize this situation and conditionally fix the encoding on our end. We will do that if the problem can't be fixed upstream from us. We strongly prefer that the encoding be fixed upstream, for a couple of reasons (the recognizer would have to be 100% reliable, and we don't want components in the system to co-adapt to each other's special behaviors).

tomngo · 2017-10-31T22:00:52Z

Here are a couple more observations.

Q. Can we distinguish a UTF-16BE percent plus encoded stream from a UTF-8 percent plus encoded stream with 100% reliability?
A. Yes, if the stream starts with the BOM (U+FEFF). But these streams don't. I think that means we could use really good heuristics that are right 99.9% of the time, but I don't think we can guarantee 100%.
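
To make the heuristic idea concrete, here is a minimal sketch of one possible recognizer. The function names and the 0.5 threshold are assumptions chosen for illustration, not a tested or recommended design.

```python
from urllib.parse import unquote_to_bytes

def body_bytes(body: str) -> bytes:
    """Undo percent-plus encoding and return the raw bytes."""
    # unquote_to_bytes does not translate '+', so do that first.
    return unquote_to_bytes(body.replace("+", " "))

def looks_like_utf16be(body: str, threshold: float = 0.5) -> bool:
    """Guess UTF-16BE when at least `threshold` of the high-order bytes are 00.
    The threshold is an arbitrary assumption."""
    raw = body_bytes(body)
    if len(raw) < 2 or len(raw) % 2 != 0:
        return False
    high_bytes = raw[0::2]
    return high_bytes.count(0) / len(high_bytes) >= threshold

def repair_body(body: str) -> str:
    """Decode as UTF-16BE when the heuristic fires, otherwise as UTF-8."""
    raw = body_bytes(body)
    encoding = "utf-16-be" if looks_like_utf16be(body) else "utf-8"
    return raw.decode(encoding)

bad_body = ("%00W%00h%00a%00t+%19%00s%00+%00a%00v%00a%00i%00l%00a%00b%00l%00e"
            "%00+%00t%00o%00d%00a%00y%00%3F")
print(repair_body(bad_body))
print(repair_body("What%27s+available+today%3F"))
```

This recovers both sample messages, but it also shows why 100% is out of reach: a UTF-16BE message written entirely in CJK characters has almost no 00 high-order bytes, so a check like this one would miss it.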

Q. What happens if we make a mistake, e.g., try to read a UTF-16BE stream as UTF-8?
A. Certain characters will cause the UTF-8 decoder to fail. For instance, anything in the U+00C0 to U+00FF range, which includes legal and very common characters such as à and é, will typically cause a UTF-8 decoding error. In UTF-16BE, those characters have byte sequences 00 C0 through 00 FF. A UTF-8 decoder will see U+0000 followed by a byte that cannot stand alone: C0, C1, and F5 through FF are never valid in UTF-8, and C2 through F4 are lead bytes that must be followed by continuation bytes in the range 80 to BF. In a UTF-16BE stream the next byte is usually 00 or some other non-continuation byte, so decoding fails.
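
A small sketch of both outcomes, silent garbage versus a hard decoding error. The helper name is hypothetical, and the sample strings are chosen only to illustrate the two cases.

```python
from typing import Optional

def decode_utf16be_bytes_as_utf8(text: str) -> Optional[str]:
    """Encode text as UTF-16BE, then misread the bytes as UTF-8.
    Returns the (wrong) decoded string, or None if the decoder rejects the bytes."""
    raw = text.encode("utf-16-be")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

# Every byte is below 0x80, so the misread succeeds and yields NUL-padded text.
silent = decode_utf16be_bytes_as_utf8("What\u2019s")

# The byte E9 (from U+00E9) is a lead byte followed by 00, so the decoder fails.
loud = decode_utf16be_bytes_as_utf8("caf\u00e9 au lait")

print(repr(silent))
print(loud)
```

The first case is the one seen in the payload above: nothing fails, the text is just wrong. The second case is the failure mode described in this answer.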

ivelin · 2017-11-01T17:41:00Z

@tomngo that makes sense. We should try to apply this normalization at Restcomm level. @deruelle WDYT?

tomngo · 2017-11-01T20:35:57Z

I understand now that Restcomm is probably supplying UCS-2, which is a subset of UTF-16BE.

tomngo · 2017-11-10T21:32:25Z

@deruelle: Any news?

tomngo · 2018-07-27T15:18:00Z

Heads up @ivelin @deruelle: Something similar to this issue still exists. It's downstream from Restcomm, but I predict that it will affect many Telestax partners other than Lumin. I believe it's not precisely the UCS-2/UTF-8 mismatch that I described above. Here's an example.

Lumin sent this internal diagnostic message via SMS: Issue: Abstract pleasantry.hello-again has no variant with args [] (This happens to be a diagnostic message, but the brackets are not uncommon [])
Restcomm logged it correctly (SID = SM5eea3a9d79244ac6bb2a5a2606dc63f1)
That account (tom+rchook@lumin.ai, SID = AC11338a793e5113bb4adb9871e667a8ce) is tied to Hook Mobile
It arrived at Scott Barstow's handset with the last characters garbled; see screenshot

For reference:

[ is U+005B
] is U+005D
Ä is U+00C4
Ñ is U+00D1
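
One possible reading of these four code points, offered only as a hypothesis that nothing in this thread confirms: in the GSM 03.38 default 7-bit alphabet, code 0x5B is Ä and code 0x5D is Ñ. If the handset showed ÄÑ where [] was sent, a hop that passes ASCII byte values through to a GSM 7-bit reader would produce that result. The sketch below assumes exactly that, and its table is deliberately partial.

```python
# Partial GSM 03.38 default-alphabet table: only the two codes relevant here.
# Other codes also differ from ASCII but are left out of this sketch.
GSM_DEFAULT_PARTIAL = {0x5B: "\u00c4", 0x5D: "\u00d1"}  # Ä, Ñ

def read_ascii_as_gsm(text: str) -> str:
    """Reinterpret each ASCII code as a GSM default-alphabet code (hypothetical hop)."""
    return "".join(GSM_DEFAULT_PARTIAL.get(ord(ch), ch) for ch in text)

garbled = read_ascii_as_gsm("has no variant with args []")
print(garbled)
```

If this hypothesis holds, the mismatch would be ASCII versus GSM 7-bit rather than UCS-2 versus UTF-8, which would fit the remark above that this is not precisely the same problem.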

tomngo · 2018-11-09T18:22:53Z

I have filed a new ticket that describes today's behavior, which is somewhat different. It's as if someone put in a fix for specific characters but not a comprehensive fix. See #2994.

tomngo mentioned this issue Nov 8, 2018

Double-byte encoded incoming SMS message gets corrupted #2994

Open
