Double-byte encoded incoming SMS message gets corrupted #2994

Open
tomngo opened this issue Nov 8, 2018 · 0 comments

Summary

Certain higher-order characters from a user's handset to a Restcomm-connected bot get corrupted at the Restcomm level. Not all higher-order characters exhibit this problem.

Related Tickets

There are many tickets related to double-byte messages.

Scope of Impact

Every Restcomm-connected bot that can accept arbitrary natural-language input will be affected. Obviously non-US users will be more affected than US users.

There is no reliable workaround. As discussed in #2607, the recipient can reliably distinguish between encodings only if a BOM (U+FEFF) is present. Otherwise only heuristics are possible, and in many cases the information is simply not recoverable even if the sequence of decoding errors is known.
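To illustrate why the BOM matters, here is a minimal sketch of what a recipient could do. The `sniff_bom` helper and the sample payloads are hypothetical and are not part of Restcomm; the sketch only covers UTF-8 and UTF-16 and ignores UTF-32.

```python
import codecs

def sniff_bom(data: bytes):
    """Return a codec name implied by a leading BOM, or None if there is none.

    Hypothetical helper for illustration only.
    """
    for bom, name in (
        (codecs.BOM_UTF8, "utf-8-sig"),   # EF BB BF
        (codecs.BOM_UTF16_BE, "utf-16"),  # FE FF
        (codecs.BOM_UTF16_LE, "utf-16"),  # FF FE
    ):
        if data.startswith(bom):
            return name
    return None

# Same text, same bytes, with and without the leading U+FEFF.
with_bom = codecs.BOM_UTF16_BE + "José".encode("utf-16-be")
without_bom = "José".encode("utf-16-be")

print(sniff_bom(with_bom))     # codec can be chosen from the BOM
print(sniff_bom(without_bom))  # nothing to go on; only guessing is left
```

With the BOM, the payload decodes unambiguously (`with_bom.decode("utf-16")` yields the original string). Without it, the recipient has to guess the encoding from the byte pattern.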

Isolated to Restcomm

I've changed every variable outside of Restcomm, and the behavior is identical:

  • The same thing happens when the message is sent from my handset (on T-Mobile) through a Restcomm instance, whether that Restcomm instance is tied to Teli (tom+rcteli@lumin.ai), or to Hook (tom+rchook@lumin.ai).
  • The same thing happens when the message is sent from my Google Voice line through a Restcomm instance, whether that Restcomm instance is tied to Teli (tom+rcteli@lumin.ai), or to Hook (tom+rchook@lumin.ai).
  • A message carrying an identical string arrives intact if sent from my handset to my Google Voice line without going through Restcomm, or vice versa.
  • The corruption is visible in the Restcomm logs, i.e., before reaching our platform.

Affected Characters

Here are some characters that are affected:

  • é (U+00E9): Latin Small Letter E with Acute
  • ñ (U+00F1): Latin Small Letter N with Tilde
  • [ (U+005B): Left Square Bracket
  • ] (U+005D): Right Square Bracket
  • @ (U+0040): Commercial At
  • 😀 (U+1F600): Grinning Face

Here are some characters that are not affected:

  • e (U+0065): Latin Small Letter E
  • n (U+006E): Latin Small Letter N
  • ‘ (U+2018): Left Single Quotation Mark
  • ’ (U+2019): Right Single Quotation Mark
  • “ (U+201C): Left Double Quotation Mark
  • ” (U+201D): Right Double Quotation Mark

Strangely, some characters that are not affected are higher order than some that are affected.
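For anyone comparing these lists against raw log bytes, the following sketch prints each character's code point alongside its UTF-8 and UTF-16 (big-endian) encodings. The `describe` helper is hypothetical; the character lists are copied from above. It does not explain the corruption, but it makes the observation concrete: é needs two bytes in UTF-8 and is affected, while ‘ needs three and is not.

```python
affected = ["é", "ñ", "[", "]", "@", "😀"]
unaffected = ["e", "n", "\u2018", "\u2019", "\u201c", "\u201d"]

def describe(ch: str) -> dict:
    """Return the code point and hex-encoded byte forms of one character."""
    return {
        "char": ch,
        "code_point": f"U+{ord(ch):04X}",
        "utf8": ch.encode("utf-8").hex(),
        "utf16be": ch.encode("utf-16-be").hex(),
    }

for label, chars in (("affected", affected), ("not affected", unaffected)):
    for ch in chars:
        d = describe(ch)
        print(f"{label:13} {d['code_point']:8} "
              f"utf8={d['utf8']:8} utf16be={d['utf16be']}")
```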

Examples

My name is José Peña.

  • Restcomm via Hook: SmsSid SMa117bca5a48843ada30f545c8964134a (from T-Mobile) and SM8ab768363a8d44178fc8ff7a642d24be (from Google Voice)
  • Restcomm via Teli: SmsSid SMd933d6eff7a946c4adef1932db99debf (from T-Mobile) and SM02389664fedb4aa3afff6a79d28aa7d1 (from Google Voice)
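The corrupted text for these SmsSids is not reproduced here. As a diagnostic aid, the sketch below generates a few common mis-decodings of the example sentence so they can be compared against what appears in the Restcomm logs. The encoding pairs are guesses chosen for illustration, not a claim about what Restcomm actually does, and `misdecode` is a hypothetical helper.

```python
original = "My name is José Peña."

def misdecode(text: str, encode_as: str, decode_as: str) -> str:
    """Encode with one codec, then decode with another, replacing bad bytes."""
    return text.encode(encode_as).decode(decode_as, errors="replace")

# Hypothetical failure modes; compare each against the logged message body.
candidates = {
    "utf-8 read as latin-1": misdecode(original, "utf-8", "latin-1"),
    "latin-1 read as utf-8": misdecode(original, "latin-1", "utf-8"),
    "utf-16-be read as latin-1": misdecode(original, "utf-16-be", "latin-1"),
}

for label, text in candidates.items():
    print(f"{label}: {text!r}")
```

If none of the candidates match the logged text, that rules these particular mismatches out and points toward a different transformation.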