Encoding issue related to UTF-16 conversion #2607

Open
tomngo opened this issue Oct 31, 2017 · 8 comments

tomngo commented Oct 31, 2017

Summary

If a message containing a high-order character is posted to a Restcomm number, the payload sent to a Restcomm client is coded with extra null characters.

Steps

Good case:

  1. SMS to 14153014887 this text: What's available today? (note straight single quote U+0027).
  2. See that the Restcomm client receives a payload like this (note the Body): SmsSid=SMde237ddbef704a89b2c2e77b5d019377&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=What%27s+available+today%3F

Bad case:

  1. SMS to 14153014887 this text: What’s available today? (note curly single quote U+2019).
  2. See that the Restcomm client receives a payload like this (note the Body): SmsSid=SMf8dd33d6666a4113a9f51ddd62543b60&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=%00W%00h%00a%00t+%19%00s%00+%00a%00v%00a%00i%00l%00a%00b%00l%00e%00+%00t%00o%00d%00a%00y%00%3F

Reproducibility and Age

100%. We observed this same thing on 27 Jan 2017. We didn't pay more attention to it back then because we didn't have a customer deal that might be affected by it.

Theory

This is strong circumstantial evidence that, somewhere along the path from source to Restcomm client (and therefore possibly outside of Restcomm):

  • One component is making a binary decision whether to encode as UTF-16BE instead of a mostly-single-byte encoding such as UTF-8
  • A later component is assuming that its input is in the latter encoding.

Here is why. In this discussion, I'll pretend that we know the latter encoding is UTF-8.

  • The character W (U+0057) is encoded in UTF-8 as 57 but in UTF-16BE as 00 57.
  • The character ’ (U+2019) is encoded in UTF-8 as E2 80 99 but in UTF-16BE as 20 19.
  • The character Null (U+0000) is encoded in UTF-8 as 00.
  • The character Space (U+0020) is encoded in UTF-8 as 20.
  • The character End of Medium (U+0019) is encoded in UTF-8 as 19.
  • A component expecting UTF-8 but receiving 00 57 would interpret Null then W, which would be percent-plus encoded as %00W (W is an unreserved character, so it is left unescaped).
  • A component expecting UTF-8 but receiving 20 19 would interpret Space then End of Medium, which would be percent-plus encoded as +%19.
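
The theory above can be checked with a minimal sketch. It assumes the later component percent-plus encodes the raw bytes one at a time, which Python's `urllib.parse.quote_plus` does for `bytes` input. The helper name `simulate_mismatch` is made up for illustration, and nothing here is taken from Restcomm's code.

```python
from urllib.parse import quote_plus


def simulate_mismatch(text: str) -> str:
    """Encode text as UTF-16BE, then percent-plus encode each byte
    as if the bytes were a single-byte-oriented stream."""
    return quote_plus(text.encode("utf-16-be"))


# Good case: straight quote (U+0027), encoded as UTF-8 by default.
good_body = quote_plus("What's available today?")

# Bad case: curly quote (U+2019), bytes produced as UTF-16BE.
bad_body = simulate_mismatch("What\u2019s available today?")

print(good_body)
print(bad_body)
```

Under these assumptions, both printed strings match the Body values in the two payloads in Steps, which is consistent with the theory but does not identify which component is responsible.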

ivelin commented Oct 31, 2017

Hi Tom,

What happens when you use basic string functions to normalize messages to UTF-8 and replace non-alphanumeric characters with blank space?

Ivelin


tomngo commented Oct 31, 2017

Ivelin, good question. We could recognize this situation and conditionally fix the encoding on our end. We will do that if the problem can't be fixed upstream from us. We strongly prefer that the encoding be fixed upstream, for a couple of reasons (the recognizer would have to be 100% reliable, and we don't want components in the system to co-adapt to each other's special behaviors).


tomngo commented Oct 31, 2017

Here are a couple more observations.

Q. Can we distinguish a UTF-16BE percent-plus encoded stream from a UTF-8 percent-plus encoded stream with 100% reliability?
A. Yes, if the stream starts with the BOM (U+FEFF). But these streams don't. I think that means we could use really good heuristics that are right 99.9% of the time, but I don't think we can guarantee 100%.
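
As a rough sketch of what such a heuristic might look like on the receiving side: the function name `decode_body` and the "even length and contains a NUL byte" rule are assumptions chosen for illustration, not something Restcomm or Lumin is known to use.

```python
from urllib.parse import unquote_to_bytes


def decode_body(raw_body: str) -> str:
    """Hypothetical recognizer: guess whether a percent-plus encoded
    Body is UTF-16BE or UTF-8 and decode it accordingly."""
    # Undo the "+" for space convention, then the percent escapes.
    data = unquote_to_bytes(raw_body.replace("+", " "))

    # Heuristic (assumption): NUL bytes are very unlikely in UTF-8 SMS
    # text, but every Latin character in UTF-16BE has a 00 high byte.
    if len(data) % 2 == 0 and b"\x00" in data:
        try:
            return data.decode("utf-16-be")
        except UnicodeDecodeError:
            pass  # Fall through and treat the bytes as UTF-8.

    return data.decode("utf-8", errors="replace")
```

This recovers both bodies from Steps, but it is not 100% reliable. For example, the UTF-16BE bytes of 你好 (4F 60 59 7D) contain no NUL and are also valid ASCII, so `decode_body("O%60Y%7D")` returns the four ASCII characters O, backtick, Y, } rather than the original text.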

Q. What happens if we make a mistake, e.g., try to read a UTF-16BE stream as UTF-8?
A. Certain characters will cause the UTF-8 decoder to fail. For instance, anything in the U+00C0 to U+00FF range, which covers legal and often very common characters such as à and é, will almost always cause a UTF-8 decoding error. In UTF-16BE, those characters have byte streams like 00 C0 through 00 FF. A UTF-8 decoder will see U+0000 followed by a byte that is either never valid in UTF-8 (C0, C1, and F5 through FF) or a lead byte (C2 through F4) that must be followed by continuation bytes in the 80 to BF range, which the next UTF-16BE byte usually is not.
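
A small sketch of that failure mode, using Python's built-in codecs. The helper name `utf8_decode_outcome` is hypothetical; it only reports whether UTF-16BE bytes survive a strict UTF-8 decode.

```python
def utf8_decode_outcome(text: str) -> tuple[str, str]:
    """Encode text as UTF-16BE, then try to read the bytes as UTF-8.

    Returns ("ok", decoded_text) or ("error", decoder_reason).
    """
    data = text.encode("utf-16-be")
    try:
        return ("ok", data.decode("utf-8"))
    except UnicodeDecodeError as exc:
        return ("error", exc.reason)


# ASCII-only text decodes "successfully", but with NULs interleaved.
print(utf8_decode_outcome("W"))

# Characters in U+00C0..U+00FF put a byte >= C0 right after the NUL.
for ch in ("à", "é", "Ä", "ÿ"):
    print(ch, utf8_decode_outcome(ch + "b"))
```

The ASCII case is the dangerous one for a recognizer, because the decoder raises no error and the extra NULs pass through silently.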


ivelin commented Nov 1, 2017

@tomngo that makes sense. We should try to apply this normalization at Restcomm level. @deruelle WDYT?


tomngo commented Nov 1, 2017

I understand now that Restcomm is probably supplying UCS-2, which is a subset of UTF-16BE.


tomngo commented Nov 10, 2017

@deruelle: Any news?


tomngo commented Jul 27, 2018

Heads up @ivelin @deruelle : Something similar to this issue still exists. It's downstream from Restcomm, but I predict that it will affect many Telestax partners other than Lumin. I believe it's not precisely the UCS-2/UTF-8 mismatch that I described above. Here's an example.

  • Lumin sent this internal diagnostic message via SMS: Issue: Abstract pleasantry.hello-again has no variant with args [] (This happens to be a diagnostic message, but the brackets are not uncommon [])
  • Restcomm logged it correctly (SID = SM5eea3a9d79244ac6bb2a5a2606dc63f1)
  • That account (tom+rchook@lumin.ai, SID = AC11338a793e5113bb4adb9871e667a8ce) is tied to Hook Mobile
  • It arrived at Scott Barstow's handset with the last characters garbled; see screenshot

For reference:

  • [ is U+005B
  • ] is U+005D
  • Ä is U+00C4
  • Ñ is U+00D1
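
One possible reading of these code points, which is an assumption and not something confirmed in this thread: in the GSM 03.38 default 7-bit alphabet, position 0x5B is Ä and position 0x5D is Ñ, while [ and ] live in the extension table behind an escape. A component that passes ASCII byte values through as if they were GSM default-alphabet codes would therefore turn [] into ÄÑ. The sketch below uses a deliberately partial table (only positions 0x5B through 0x5F), and the function name is made up for illustration.

```python
# Partial GSM 03.38 default-alphabet table: only positions 0x5B-0x5F.
# Other positions that differ from ASCII are intentionally omitted.
GSM_DEFAULT_PARTIAL = {
    0x5B: "Ä",  # ASCII "["
    0x5C: "Ö",  # ASCII "\"
    0x5D: "Ñ",  # ASCII "]"
    0x5E: "Ü",  # ASCII "^"
    0x5F: "§",  # ASCII "_"
}


def misread_ascii_as_gsm(text: str) -> str:
    """Show what text would look like if its ASCII code values were
    interpreted as GSM default-alphabet positions (partial table)."""
    return "".join(GSM_DEFAULT_PARTIAL.get(ord(ch), ch) for ch in text)


print(misread_ascii_as_gsm("has no variant with args []"))
```

If the garbled characters on the handset were Ä and Ñ in place of the brackets, that would fit this reading, but the thread does not state what the screenshot showed.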


tomngo commented Nov 9, 2018

I have filed a new ticket that describes today's behavior, which is somewhat different. It's as if someone put in a fix for specific characters but not a comprehensive fix. See #2994.
