
6/25/2011

Unicode

If you work with Web programming, you have probably dealt with character encodings. In the beginning, computers used ASCII, which provides the basic symbols of the English language plus some special control characters. Each character was stored in 1 byte, which can hold 256 different values. In practice, there were only 128 characters, because the first bit was not used.

When other countries started to use computers, they needed to work with their own languages, but ASCII did not provide all of their symbols. To solve that, mapping tables were created that associate each symbol with a specific byte. One common charset was ISO-8859-1, which provides Latin symbols. However, many different charsets were created, incompatible with each other, and it was hard to write programs that support many languages (many charsets).

To solve this new problem, Unicode was created. This article shows what Unicode is, what its benefits are and how to work with it.


What is a character encoding?

A character encoding is a way to store symbols in binary form in a computer. This is usually done with a mapping table of numeric values that represent symbols. So the "encoding" specifies how to convert a symbol into a code (a number, stored as binary) and vice versa.

The ASCII table is one of the oldest character encodings in computing. In this table, each character is represented by a number from 0 to 127. A number between 0 and 127 can be represented in binary with 7 bits. But since the basic unit of modern computers is the byte, which has 8 bits, ASCII specifies that each character is kept in one byte and the leftmost bit is always zero. That way, the leftmost bit can be used, for example, to check the integrity of a transmitted character sequence: if it is not zero, the data may have arrived corrupted.

Some examples of ASCII symbols:

Example of ASCII symbols
Int Binary Symbol
... ... ...
40 00101000 (
41 00101001 )
42 00101010 *
43 00101011 +
... ... ...
48 00110000 0
49 00110001 1
50 00110010 2
... ... ...
65 01000001 A
66 01000010 B
... ... ...
97 01100001 a
98 01100010 b
... ... ...

It has basic mathematical symbols, some general symbols, some special control characters (not printable), digits, and uppercase and lowercase letters. Note that the symbol "1" is not represented by the integer 1 (binary 00000001), but by the integer 49.
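
We can check this with PHP itself: ord() returns the code of a byte and decbin() shows its binary form (a quick sketch; str_pad is used only to display all 8 bits):

<?php
// The symbol "1" is stored as its ASCII code 49, not as the number 1
echo ord('1');                                          // 49
echo str_pad(decbin(ord('1')), 8, '0', STR_PAD_LEFT);   // 00110001
?>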

The ASCII charset is useful for the English language, which does not use accented letters. For other countries (like Brazil), ASCII is not so friendly. The ISO-8859-1 (Latin 1) charset is similar to ASCII, but it is made of printable characters only; it is a people-oriented charset rather than a computer-oriented one. This charset has 191 printable symbols that can each be represented by one byte (a byte can hold values from 0 to 255, or 00000000 to 11111111 in binary). Some examples of ISO-8859-1 symbols are below:

Examples of ISO-8859-1 characters
Int Binary Symbol
... ... ...
40 00101000 (
41 00101001 )
42 00101010 *
43 00101011 +
... ... ...
48 00110000 0
49 00110001 1
50 00110010 2
... ... ...
65 01000001 A
66 01000010 B
... ... ...
97 01100001 a
98 01100010 b
... ... ...
231 11100111 ç
232 11101000 è
233 11101001 é
... ... ...

Note that many ASCII symbols have the same code in the ISO-8859-1 table, but this table has additional symbols. Moreover, ISO-8859-1 uses the leftmost bit, which can be 1 or 0, since codes above 127 cannot be represented with only 7 bits.


What is Unicode?

Unicode is a proposed universal symbol table. Its objective is to map all symbols, especially those used in the writing systems of the modern world. Beyond those, there are mathematical symbols, geometric shapes, etc. Unicode has more than 100,000 mapped symbols.

You may be asking yourself: if one byte can hold only 256 different values, how can the Unicode table map each symbol to a binary code?

The answer is simple: a character encoding does not need to use only one byte to represent each symbol. This is where the encoding/decoding algorithms come in. They specify how to organize the bits and bytes that represent each symbol. Some examples of encodings that use the Unicode table are UTF-8, UTF-16, UTF-32, UCS, UCS-2, UCS-4, etc.

Let's clarify each part of the picture using the ASCII table as an example. The ASCII table defines characters and their respective codes (integers). The ASCII encoding defines that each character is converted into 8 bits, where the leftmost bit is always zero and the others are the binary representation of the decimal code. The table is one thing; the encoding is another. We could create an alternative encoding, ASCII-2, that also uses 8 bits but where the rightmost bit is always 1 and the other bits keep the binary representation of the code.

So Unicode is just a TABLE that specifies one number (or codepoint) for each symbol. For example:

Example of Unicode symbols
Int Symbols
... ...
40 (
41 )
42 *
43 +
... ...
48 0
49 1
50 2
... ...
65 A
66 B
... ...
97 a
98 b
... ...
231 ç
... ...
8592 ← (left arrow)
... ...
28381 滝 ("taki" Japanese kanji)
... ...

At the following link, you can see the set of symbols and their codepoints: [http://www.unicodetables.com/].

Note that the printable ASCII symbols are maintained in the Unicode table. The ISO-8859-1 symbols are also maintained with the same codepoints in Unicode. But Unicode provides many other symbols, like the "left arrow" (codepoint 8592) or the Japanese kanji "taki" (codepoint 28381).


Charset Encoding UTF-32

Well, now that we know what Unicode is and what a character encoding is, let's talk about the simplest Unicode-based encoding: UTF-32. UTF means "Unicode Transformation Format", and there are different UTFs to encode/decode symbols from the Unicode table into computational form.

UTF-32 is very simple because each symbol is always represented by 4 bytes. With 4 bytes, it is possible to represent more than 4 billion different values (2^32 = 4,294,967,296).

Although simple, this encoding is not often used because it spends many bytes on every single character. On the other hand, since the size of each character is fixed, it is easy to get the N-th symbol in a sequence of many symbols: we just jump to byte 4 × N of the sequence.

To write "AB" using ASCII, it would need only 2 bytes:

01000001 01000010

Using UTF-32, we would need 8 bytes to write the same sequence:

00000000 00000000 00000000 01000001
00000000 00000000 00000000 01000010

Note that the waste is big, but the encoding/decoding logic is very simple.
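
If the mbstring extension is available, we can observe this waste in PHP (a sketch; mb_convert_encoding is used here only to produce the raw UTF-32 bytes):

<?php
$utf32 = mb_convert_encoding('AB', 'UTF-32BE', 'ASCII');
echo strlen($utf32);                       // 8 bytes for 2 symbols
echo chunk_split(bin2hex($utf32), 8, ' '); // 00000041 00000042
?>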


Charset Encoding UTF-8

Finally we come to the UTF-8 encoding. This encoding uses a variable number of bytes to represent each symbol. The size depends on the value of the codepoint, and can be from 1 up to 4 bytes.

The symbols that use 1 byte are identical to ASCII. This gives us compatibility between the two encodings. In other words, the symbols with codepoints between 0 and 127 are represented in the same way as the ASCII symbols. So, if a byte captured from a UTF-8 sequence begins with 0 (zero), that byte alone represents an ASCII symbol. If the byte begins with 1 (one), it is part of a multi-byte sequence that represents a symbol from the Unicode table.

The symbols with 2 bytes are those which have this bit mask:

110xxxxx 10xxxxxx

Where you see "x" should have the significative bits. If we catch these bits and put them in sequence, the binary number represent the decimal codepoint of UNICODE.

So the Unicode codepoint 128 (the first one after 127) is represented by two bytes like this:

11000010 10000000
   ^^^^^   ^^^^^^
   00010   000000

Note that if we take the bits from the "x" positions and put them in sequence, we get the binary sequence "00010000000", which is 128 in decimal notation.

The symbol "ç", wich codepoint is 231 ("11100111" in binary) will be represented with this way:

11000011 10100111
   ^^^^^   ^^^^^^
   00011   100111
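
We can reproduce this bit arrangement with a small PHP function. This is a minimal sketch that handles only the 1- and 2-byte cases (codepoints up to 2047); the function name codepoint_to_utf8 is just illustrative:

<?php
function codepoint_to_utf8($cp) {
    if ($cp < 128) {
        return chr($cp);              // 0xxxxxxx: plain ASCII
    }
    return chr(0xC0 | ($cp >> 6))     // 110xxxxx: the 5 high bits
         . chr(0x80 | ($cp & 0x3F));  // 10xxxxxx: the 6 low bits
}

echo bin2hex(codepoint_to_utf8(231)); // c3a7 (the bytes of "ç")
echo bin2hex(codepoint_to_utf8(128)); // c280
?>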

You have probably seen charset problems before. If we write the following PHP code and save the file as UTF-8, we will have a problem:

<?php
header('Content-type: text/html; charset=ISO-8859-1');
echo 'ç';
?>

The code above will send a file with 2 bytes (the same two bytes we have just seen representing the symbol "ç"), but the HTTP header sent to the browser says the content is encoded in ISO-8859-1. We know that each ISO-8859-1 symbol is kept in 1 byte, so the browser will take each byte and decode it separately. The first byte, "11000011", represents "Ã" (codepoint 195), and the second byte, "10100111", represents "§" (codepoint 167). So the browser will show "ç" instead of the expected "ç".

As we can see, the symbol "ç" has the same codepoint in Unicode and in ISO-8859-1, but the UTF-8 encoding arranges the bits in a form whose binary result is different from ISO-8859-1.
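
You can reproduce this mix-up without a browser (a sketch assuming the mbstring extension): take the two UTF-8 bytes of "ç" and deliberately interpret them as ISO-8859-1:

<?php
$bytes = "\xC3\xA7"; // the UTF-8 encoding of "ç"
// Decode the bytes as ISO-8859-1 and re-encode them for a UTF-8 output
echo mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'); // ç
?>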

The symbols with 3 bytes use the following bit mask:

1110xxxx 10xxxxxx 10xxxxxx
(16 effective bits)

And the symbols with 4 bytes use the following bit mask:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
(21 effective bits)

Note that the bit masks have a logic behind them:

  • The symbols with 1 byte always begin with the bit 0.
  • The symbols with 2 bytes always begin with bits 110xxxxx.
  • The symbols with 3 bytes always begin with bits 1110xxxx.
  • The symbols with 4 bytes always begin with bits 11110xxx.

And for symbols with 2, 3 or 4 bytes, every byte after the first always begins with "10xxxxxx", as the sketch below shows.
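
Thanks to this logic, the first byte alone tells us how long a sequence is. Here is a hypothetical helper, just to illustrate the masks above:

<?php
// Return the length of the UTF-8 sequence that starts with $byte
function utf8_sequence_length($byte) {
    $b = ord($byte);
    if (($b & 0x80) === 0x00) return 1; // 0xxxxxxx
    if (($b & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($b & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($b & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0; // 10xxxxxx is a continuation byte, not a sequence start
}

echo utf8_sequence_length('A');    // 1
echo utf8_sequence_length("\xC3"); // 2 (first byte of "ç")
?>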

Since UTF-8 uses 1 or 2 bytes to represent Latin characters, a text encoded in UTF-8 is usually much smaller than the same text in UTF-32.


Charset Encoding UTF-16

UTF-16 uses 2 or 4 bytes to represent each character. There is a specific calculation that determines whether a symbol (codepoint) needs 2 or 4 bytes. It is a less common encoding, so I think it is not worth explaining exactly how UTF-16 works here. If you want to understand it, you can read this article: [UTF-16].


PHP and UNICODE

PHP 5 has no native support for Unicode encodings. But one of the most anticipated promises of PHP 6 is splitting the string datatype in two: string and binary (the names may change). One of them will represent texts encoded in a Unicode encoding, and the other will keep raw binary values (charset independent).

Currently, if we make the following PHP code and save the file as UTF-8...

<?php
$text = 'açAEIOU';
$letter = $text[3];
$sub   = substr($text, 0, 3);
$len   = strlen($text);

... the variable $letter would receive the value "A", the variable $sub would receive the value "aç", and the variable $len would receive the size 8, even though we wrote only seven symbols.

This happens because the bracket operator ("[]") gets the N-th byte of the string, not the N-th symbol. As we know, in UTF-8 a symbol may be represented by more than one byte. In the same way, the function strlen returns the number of bytes, not the number of symbols. Since the symbol "ç" uses two bytes in UTF-8, the meaning of these operations is affected.

So, to work with UTF-8, we need tools that support Unicode. One simple alternative for obtaining substrings is the PCRE extension, because it supports UTF-8.
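
For example, with the "u" modifier PCRE matches symbols instead of bytes (a sketch; the subject must really be valid UTF-8 for the modifier to work):

<?php
$text = 'açAEIOU';

preg_match('/^.{3}/u', $text, $m);
echo $m[0]; // "açA": the first 3 symbols, not the first 3 bytes

preg_match_all('/./u', $text, $all);
echo count($all[0]); // 7 symbols
?>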

When using UTF-8, PHP must inform the browser about it. You should call the header function (before printing anything) and set the "Content-Type" header with the right value, like this:

<?php
header('Content-Type: text/html; charset=UTF-8');
...

Another way is to simulate the HTTP behavior by using the HTML tag <meta>:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    ...

Databases and Unicode

Many database systems support UTF-8. To create a database in MySQL or PostgreSQL using UTF-8 for text datatypes, you can use these commands:

MySQL:

CREATE DATABASE dbname CHARACTER SET UTF8;

PostgreSQL:

CREATE DATABASE dbname ENCODING 'UNICODE';

Moreover, the PHP application needs to declare that the data travels in the UTF-8 encoding. Each database has a specific command for that. With the native MySQL and PostgreSQL PHP functions, these calls do it:

MySQL:

mysql_set_charset('UTF8', $connection);

PostgreSQL:

pg_set_client_encoding($connection, 'UNICODE');

Using PDO, you need to set it with SQL, which can be different in each RDBMS:

MySQL:

SET NAMES UTF8;

PostgreSQL:

SET NAMES 'UNICODE';
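
In practice, with PDO this means running the statement right after connecting (a sketch; the DSN, user and password are placeholders):

<?php
// MySQL example; for PostgreSQL, run "SET NAMES 'UNICODE'" instead
$pdo = new PDO('mysql:host=localhost;dbname=dbname', 'user', 'password');
$pdo->exec('SET NAMES UTF8');
?>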

Note that, at this time, MySQL supports only Unicode characters of up to 3 bytes.


Conclusion

Note that the use of Unicode is growing among newer technologies, with a special effort around UTF-8: the Semantic Web, Ajax, XML, etc. The main purpose of Unicode is application internationalization.

But Unicode is still under development: there are characters yet to be added, and the technology is on its way to becoming the standard. What about starting to work with Unicode now?
