What is Unicode?

Unicode is an international character encoding standard that assigns a unique number to every character across different languages and scripts, ensuring that almost all characters are accessible across various platforms, programs, and devices.

Unicode History

SMS defines a 160-character limit for a single text message — but what exactly is a character? Here’s a simplified explanation. We won't delve too deeply into the technical details, but a few Google searches can lead you down some interesting rabbit holes.

1. GSM-7 Character Encoding

There are two common ways of sending alphabetic characters over SMS: GSM-7 and UCS-2. The GSM-7 character encoding standard was developed by the GSM standards group in the 1980s for messaging with pagers and later adopted for SMS. It encodes the most commonly used letters and symbols in English and many Western European languages in seven bits each for use on GSM networks. Seven bits give 128 code positions, one of which is reserved as an escape to an extension table, which is why GSM-7 is variously described as supporting 127 or 128 characters. The original standard called for a payload of 128 8-bit octets, or bytes. It was lengthened to 140 bytes, or 1,120 bits, which holds exactly 160 7-bit GSM-7 characters (140 × 8 ÷ 7 = 160).

2. The Need for More Characters: UCS-2

Of course, 128 characters aren’t enough to represent all the characters used in languages around the world. To address this, a working group called Unicode created UCS-2, a two-byte form of the Universal Coded Character Set. The standard was developed in 1987 and defines each character in two 8-bit bytes. Sixteen bits give 2^16, or 65,536, possible characters. UCS-2 was adopted as part of the ISO/IEC 10646 standard in 1990 and was informally called Unicode.

3. Transition to UTF-8

UCS-2 offered more characters than GSM-7 but was a temporary solution. In 1991, the Unicode working group founded Unicode, Inc., a nonprofit organization whose mission was to “enable people around the world to use computers in any language, by providing freely available specifications and data to form the foundation for software internationalization.” The consortium issued a better encoding standard, UTF-8, in 1993. (UTF stands for Unicode Transformation Format.) UCS-2 is now obsolete as an independent standard, and UTF-8 has become the most common encoding standard for characters on electronic devices.

Understanding Unicode and Code Points

Unicode assigns a unique code, called a code point, to each character. A code point is written as U+ followed by a hexadecimal number of at least four digits. For example, U+0041 is the code point for “A.”
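As a quick illustration of the U+ notation, here is a minimal Python sketch that converts between characters and code points using the built-in ord() and chr() functions. The helper name code_point_label is made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; format it as at least
    # four uppercase hexadecimal digits, the usual way of writing code points.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))           # U+0041
print(code_point_label("\u00e9"))      # U+00E9 (e with acute accent)
print(code_point_label("\U0001F600"))  # U+1F600 (grinning face emoji)

# chr() goes the other way, from a code point back to the character.
print(chr(0x41))                       # A
```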

1. Efficiency of UTF-8

UTF-8 is efficient in its use of bits. It uses variable-length encoding, taking between one and four bytes per code point. If a character can be represented with a single byte, as every ASCII character can, that’s all the space UTF-8 will use. If a character needs two or more bytes, UTF-8 will use as many as necessary.
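To make the variable-length behavior concrete, the following sketch encodes four sample characters with Python's built-in str.encode and reports how many bytes each one needs. The sample characters and the helper name utf8_length are chosen for illustration only.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for one character (hypothetical helper)."""
    return len(ch.encode("utf-8"))


samples = {
    "A": "Latin capital A",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001F600": "grinning face emoji",
}

for ch, name in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s)")

# Byte counts for the samples above, in order: 1, 2, 3, 4
```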

2. Introduction of UTF-16

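UTF-8 was followed by UTF-16 in 1996. The current Unicode standard defines a possible 1,114,112 code points, grouped into 17 planes of 65,536 code points each. UCS-2 covers only the first of these, formally called the Basic Multilingual Plane, so it lives on there. UTF-16 reaches the other 16 planes by using a pair of 16-bit units, known as a surrogate pair, for each character.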
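The plane arithmetic and the difference between one and two 16-bit units can be checked with a short sketch. The constant and function names below are assumptions made for this example; the only library call is Python's built-in str.encode.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

# 17 planes of 65,536 code points each
print(PLANES * CODE_POINTS_PER_PLANE)  # 1114112


def plane(ch: str) -> int:
    """Plane number of a character: 0 is the Basic Multilingual Plane."""
    return ord(ch) >> 16


def utf16_units(ch: str) -> int:
    """How many 16-bit units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-be")) // 2


for ch in ["A", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: plane {plane(ch)}, {utf16_units(ch)} UTF-16 unit(s)")

# "A" and the euro sign sit in plane 0 and need one unit each;
# the emoji sits in plane 1 and needs two units (a surrogate pair).
```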

3. Current Unicode Standard

We won’t dive any deeper into the technology and history of Unicode, other than to say version 15.0, the current version as of September 2022, defines 149,186 characters, including not only the characters for every language represented electronically but also a wide assortment of emojis. Every new version adds more emojis.

4. Unicode SMS Messages

"Unicode SMS" refers to SMS messages sent and received containing characters not found in the GSM-7 character set. An SMS allows up to 160 characters from the GSM-7 character set, which includes all Latin characters A-Z, digits 0-9, plus a few special characters. Unicode handles any known character but also takes up more SMS space than GSM's 7-bit binary code. Therefore, Unicode SMS messages are limited to 70 characters, and messages longer than this will be segmented. See more about UCS-2 character encoding, used for SMS messages which aren't encoded in GSM-7.

5. Impact on Text Messaging

But let’s bring our characters back into text messaging. In the early 1990s, the telecom industry was developing the Short Message Service (SMS). At the time, UCS-2 was the best alternative to GSM-7, so it was incorporated into SMS standards, and it remains there today.

Why Do Encoding Standards Matter? It's the Cost!

Why do you need to care about any of this technical underpinning? In a word, cost.

UCS-2 characters take 16 bits to encode instead of the seven bits used by the GSM character set. When a message includes any character outside GSM-7, the whole message is encoded in UCS-2 and can have a maximum of 70 characters (1,120 bits ÷ 16 = 70). Just as with GSM characters, you can send messages longer than the limit, but SMS has to chop them into multiple segments, send them separately, and concatenate them back together at the receiving end. Segments are typically billed individually, so the more segments a message needs, the more costs you incur.
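The effect on segment count can be estimated with a small calculation. The sketch below assumes the usual limits for concatenated SMS, which are not stated in this article: each segment of a multi-part message gives up a few bytes to a concatenation header, leaving 153 GSM-7 characters or 67 UCS-2 characters per segment. The function name segment_count and the LIMITS table are hypothetical, and actual billing depends on your provider.

```python
import math

# (limit for a single message, per-segment limit once the message is split)
# The 153 and 67 figures are assumptions based on standard concatenation overhead.
LIMITS = {
    "GSM-7": (160, 153),
    "UCS-2": (70, 67),
}


def segment_count(length: int, encoding: str) -> int:
    """Estimate the number of SMS segments for a message of `length` units.

    `length` is counted in 7-bit units for GSM-7 and 16-bit units for UCS-2.
    An empty message is treated as one segment.
    """
    single, multi = LIMITS[encoding]
    if length <= single:
        return 1
    return math.ceil(length / multi)


print(segment_count(160, "GSM-7"))  # 1
print(segment_count(161, "GSM-7"))  # 2
print(segment_count(161, "UCS-2"))  # 3
```

In this estimate, the same 161-character message costs two segments in GSM-7 but three once a single Unicode character forces UCS-2.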

Managing Unicode with Plivo

Plivo makes sending and receiving SMS containing Unicode characters easy by handling character recognition and encoding for you. By default, SMS messages sent with Plivo support Unicode via UCS-2 character encoding to accurately represent global languages as they are sent between different geographic locations and across carriers.

Smart Encoding, built into Plivo's messaging API, can help you avoid easy-to-miss Unicode characters by checking for common ones, such as smart quotes or Unicode whitespace characters, and replacing them with similar GSM-7 characters.
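To show the general idea of this kind of substitution, here is a minimal sketch using Python's built-in str.maketrans and str.translate. It is not Plivo's implementation: the replacement table and the function name to_gsm_friendly are assumptions made for this example, and a real service may handle a different set of characters.

```python
# Hypothetical replacement table: common "lookalike" Unicode characters
# mapped to GSM-7 equivalents.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
    "\u2009": " ",    # thin space
}

_TABLE = str.maketrans(REPLACEMENTS)


def to_gsm_friendly(text: str) -> str:
    """Replace common Unicode punctuation and spaces with GSM-7 characters."""
    return text.translate(_TABLE)


message = "\u201cHi\u201d \u2013 it\u2019s\u00a0ok\u2026"
print(to_gsm_friendly(message))  # "Hi" - it's ok...
```

Characters with no close GSM-7 equivalent, such as Kanji or emoji, are left untouched by this sketch, so the message would still be sent as Unicode.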

Whether you're trying to avoid unintentional Unicode characters sneaking into your carefully crafted SMS messages or sending messages written in Kanji, Plivo SMS has you covered.