Encoding issue related to UTF-16 conversion #2607
Comments
Hi Tom, what happens when you use basic string functions to normalize messages to UTF-8 and replace non-alphanumeric characters with blank space? Ivelin
Ivelin, good question. We could recognize this situation and conditionally fix the encoding on our end. We will do that if the problem can't be fixed upstream from us. We strongly prefer that the encoding be fixed upstream, for a couple of reasons (the recognizer would have to be 100% reliable, and we don't want components in the system to co-adapt to each other's special behaviors).
Here are a couple more observations. Q. Can we distinguish a UTF-16BE percent-plus encoded stream from a UTF-8 percent-plus encoded stream with 100% reliability? Q. What happens if we make a mistake, e.g., try to read a UTF-16BE stream as UTF-8?
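One way to probe the second question is a minimal sketch like the one below. It checks whether UTF-16BE bytes happen to be well-formed UTF-8. The helper name `decodes_as_utf8` and the sample strings are illustrative assumptions, not anything taken from Restcomm.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# UTF-16BE bytes for text with a curly quote (U+2019 -> 20 19):
# every byte is below 0x80, so the stream is also valid UTF-8.
curly = "What\u2019s".encode("utf-16-be")

# UTF-16BE bytes for text with e-acute (U+00E9 -> 00 E9):
# the stream ends with a lone 0xE9 lead byte, which is invalid UTF-8.
accented = "caf\u00e9".encode("utf-16-be")

print(decodes_as_utf8(curly))
print(decodes_as_utf8(accented))
```

In this sketch the curly-quote sample decodes as UTF-8 without any error, so a validity check alone would not flag the mismatch. The misread only shows up as embedded nulls and control characters in the resulting text.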
I understand now that Restcomm is probably supplying UCS-2, which is a subset of UTF-16BE.
@deruelle: Any news?
Heads up @ivelin @deruelle: Something similar to this issue still exists. It's downstream from Restcomm, but I predict that it will affect many Telestax partners other than Lumin. I believe it's not precisely the UCS-2/UTF-8 mismatch that I described above. Here's an example.
For reference:
I have filed a new ticket that describes today's behavior, which is somewhat different. It's as if someone put in a fix for specific characters but not a comprehensive fix. See #2994.
Summary
If a message containing a high-order character is posted to a Restcomm number, the payload sent to a Restcomm client is coded with extra null characters.
Steps
Good case:
- Text sent to `14153014887`: `What's available today?` (note straight single quote, `U+0027`).
- Payload sent to the Restcomm client: `SmsSid=SMde237ddbef704a89b2c2e77b5d019377&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=What%27s+available+today%3F`

Bad case:
- Text sent to `14153014887`: `What’s available today?` (note curly single quote, `U+2019`).
- Payload sent to the Restcomm client: `SmsSid=SMf8dd33d6666a4113a9f51ddd62543b60&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=%00W%00h%00a%00t+%19%00s%00+%00a%00v%00a%00i%00l%00a%00b%00l%00e%00+%00t%00o%00d%00a%00y%00%3F`
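To make the bad case easier to inspect, here is a small diagnostic sketch using Python's standard `urllib.parse`. It shows what a client sees if it parses the form body as UTF-8, and what the same bytes yield when read as UTF-16BE. The UTF-16BE reading is an assumption being tested here, not a confirmed fact about Restcomm, and the variable names are illustrative.

```python
from urllib.parse import parse_qs, unquote_to_bytes

# Payload copied verbatim from the bad case above.
bad_payload = (
    "SmsSid=SMf8dd33d6666a4113a9f51ddd62543b60"
    "&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd"
    "&From=18016969866&To=14153014887"
    "&Body=%00W%00h%00a%00t+%19%00s%00+%00a%00v%00a%00i%00l%00a%00b%00l%00e"
    "%00+%00t%00o%00d%00a%00y%00%3F"
)

# What a client sees when it parses the body the usual way (UTF-8 assumed).
seen = parse_qs(bad_payload)["Body"][0]
print(repr(seen))

# Recover the raw bytes: "+" means space in form encoding, then undo %XX.
raw_body = dict(pair.split("=", 1) for pair in bad_payload.split("&"))["Body"]
body_bytes = unquote_to_bytes(raw_body.replace("+", " "))

# Assumption under test: the bytes are really UTF-16BE.
recovered = body_bytes.decode("utf-16-be")
print(recovered)
```

If the UTF-16BE reading is right, `recovered` is the original text with the curly quote intact, while `seen` carries a null character before almost every letter.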
Reproducibility and Age
100%. We observed this same thing on 27 Jan 2017. We didn't pay more attention to it back then because we didn't have a customer deal that might be affected by it.
Theory
This is strong circumstantial evidence that, somewhere along the path from source to Restcomm client (and therefore possibly outside of Restcomm):
Here is why. In this discussion, I'll pretend that we know the latter encoding is UTF-8.
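- `W` (`U+0057`) is encoded in UTF-8 as `57` but in UTF-16BE as `00 57`.
- `’` (`U+2019`) is encoded in UTF-8 as `E2 80 99` but in UTF-16BE as `20 19`.
- Null (`U+0000`) is encoded in UTF-8 as `00`.
- Space (`U+0020`) is encoded in UTF-8 as `20`.
- End of Medium (`U+0019`) is encoded in UTF-8 as `19`.
- A UTF-8 reader given `00 57` would interpret Null then `W`, which would be percent-plus encoded as `%00%57` (or `%00W` when the unreserved `W` is left unescaped, as in the bad-case payload).
- A UTF-8 reader given `20 19` would interpret Space then End of Medium, which would be percent-plus encoded as `+%19`.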
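The byte-level reasoning above can be checked with a short sketch: encode the bad-case text as UTF-16BE, percent-plus encode the resulting bytes, and compare with the observed `Body` value. Percent-encoding the raw bytes directly is a simplifying assumption that stands in for "read as UTF-8, then re-encode"; the two agree here because every byte in this sample is below `0x80`.

```python
from urllib.parse import quote_plus

text = "What\u2019s available today?"  # bad-case text with curly quote U+2019

# Hypothesis: the text was encoded as UTF-16BE somewhere upstream...
utf16_bytes = text.encode("utf-16-be")

# ...and those bytes were then percent-plus encoded one byte at a time.
reproduced = quote_plus(utf16_bytes)

# Body value copied verbatim from the bad-case payload.
observed = (
    "%00W%00h%00a%00t+%19%00s%00+%00a%00v%00a%00i%00l%00a%00b%00l%00e"
    "%00+%00t%00o%00d%00a%00y%00%3F"
)
print(reproduced == observed)

# For comparison: the same text percent-plus encoded from UTF-8.
expected_utf8 = quote_plus(text)
print(expected_utf8)
```

The reproduced string matches the observed `Body` exactly, which is consistent with the UTF-16BE theory, although it does not show where along the path the mismatch occurs.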