Sunday, April 16, 2006

Unicode Character Encoding Model

Character Encoding Series (Part 2)

Unicode is an Open Character Repertoire, meaning new characters can still be added to it. Until 2000, it had fewer than 65,000 characters in its repertoire. But with the inclusion of a large set of Chinese (CJK) ideographs, the number of characters is now more than 90,000.

The Universal Character Set (UCS) has room for over a million abstract characters. The number assigned to each character is called its CODE POINT. The code points range from 0x0000 to 0x10FFFF (1,114,112 values), so it takes 21 bits to represent any Code Point.
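As a small illustration of code points being plain numbers, here is a sketch in Python, whose built-in ord() and chr() convert between a character and its code point. The sample characters are my own choice and are written as escapes so the snippet does not depend on the file's encoding.

```python
# A code point is just an integer; ord() and chr() convert both ways.
# The sample characters are illustrative choices, written as escapes.
samples = ["A", "\u00e9", "\u4e2d", "\U00020000"]

code_points = [ord(ch) for ch in samples]
for cp in code_points:
    # Show each code point in the usual U+XXXX notation and its size in bits.
    print(f"U+{cp:04X} needs {cp.bit_length()} bits")

# The largest code point is 0x10FFFF, which is where the 21 bits come from.
MAX_CODE_POINT = 0x10FFFF
max_bits = MAX_CODE_POINT.bit_length()
print("bits for the largest code point:", max_bits)
```

The last sample, U+20000, is one of the Chinese characters that does not fit in 16 bits.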

Unicode has two types of Code Set Encoding - (1) UCS Encoding (2) UTF (Unicode Transformation Format) Encoding. Some of the encodings in use are UCS2, UTF16, UTF8, UCS4 and UTF32.

UCS2 - With the original Unicode Repertoire, when the character set had fewer than 65,000 characters, all the Code Points could be mapped to 16 bit values. UCS2 mapped every character to a fixed 16 bit value in the range 0x0000 to 0xFFFF. A part of this range, 2048 values from 0xD800 to 0xDBFF and from 0xDC00 to 0xDFFF, was reserved for future expansion. Each 16 bit value is known as a CODE UNIT in the Unicode terminology.
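The fixed 16 bit mapping can be sketched as below. Python has no UCS2 codec of its own, so to_ucs2_code_unit is a hypothetical helper written only for this post; it simply rejects anything that does not fit in one code unit or that falls in the reserved range.

```python
# The two reserved blocks inside the 16 bit range.
FIRST_BLOCK = range(0xD800, 0xDC00)   # 0xD800 to 0xDBFF
SECOND_BLOCK = range(0xDC00, 0xE000)  # 0xDC00 to 0xDFFF
reserved_count = len(FIRST_BLOCK) + len(SECOND_BLOCK)
print("reserved values:", reserved_count)


def to_ucs2_code_unit(ch):
    """Return the single 16 bit code unit for one character (hypothetical helper)."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("does not fit in one 16 bit code unit")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("reserved value, not a character")
    return cp


unit_a = to_ucs2_code_unit("A")
unit_cjk = to_ucs2_code_unit("\u4e2d")
print(f"0x{unit_a:04X} 0x{unit_cjk:04X}")
```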

UTF16 - With the inclusion of the Chinese Characters, UCS2 did not suffice to encode all the Code Points. The code space was extended by adding CODE PLANES of 65,536 Code Points each. In total, 17 Code Planes make up the Unicode code space. The original UCS2 range forms what is called the BASIC MULTILINGUAL PLANE (BMP). Every character in this plane is still encoded as a single 16 bit Code Unit. The other 16 planes are called SUPPLEMENTARY PLANES. Their characters are encoded as SURROGATE PAIRS, that is two Code Units, with the first unit (the high surrogate) in the range 0xD800 to 0xDBFF and the second unit (the low surrogate) in the range 0xDC00 to 0xDFFF. So, a character in UTF16 takes either 16 bits or 32 bits.

For transmission and storage of UTF16 text, the BOM (Byte Order Mark) Code Point is used. This is a special character with value 0xFEFF (AKA zero width no-break space) placed at the start of the data. If the first two bytes are FE FF, the 16 bit values are BIG ENDIAN; if they are FF FE, the values are LITTLE ENDIAN. If no BOM is present, the Unicode Standard says to assume BIG ENDIAN unless a higher-level protocol states otherwise.
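Here is a rough Python sketch of how a code point above 0xFFFF turns into a pair of code units, and how the BOM bytes signal the byte order. The function to_surrogate_pair is a hypothetical helper written for this post; the codecs constants and the "utf-16" codecs are part of the Python standard library. U+1F600 is just an example code point I picked.

```python
import codecs


def to_surrogate_pair(cp):
    """Split a code point above 0xFFFF into two 16 bit code units (hypothetical helper)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a pair")
    offset = cp - 0x10000            # leaves a 20 bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the first unit
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the second unit
    return high, low


pair = to_surrogate_pair(0x1F600)
print(f"0x{pair[0]:04X} 0x{pair[1]:04X}")

# Python's own UTF16 encoder produces the same two code units (big endian here).
encoded = "\U0001F600".encode("utf-16-be")
print(encoded.hex())

# The BOM bytes at the start tell the decoder which byte order follows.
be_bytes = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")  # FE FF 00 41
le_bytes = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")  # FF FE 41 00
decoded_be = be_bytes.decode("utf-16")
decoded_le = le_bytes.decode("utf-16")
print(decoded_be, decoded_le)
```

Both byte strings decode to the same text because the generic "utf-16" decoder reads the BOM first and then removes it from the result.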

UTF8 - In this encoding scheme, a Unicode Code Point is encoded as a sequence of one to four octets (8 bits to 32 bits). The advantages are that it is relatively compact and ASCII compatible: the 128 ASCII characters are encoded as the same single bytes as in ASCII. UTF8 text files can also begin with the BOM (the bytes EF BB BF) to indicate that the contents are Unicode text. In UTF8 the BOM says nothing about byte order.
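The variable length, the ASCII compatibility and the BOM can be checked with a few lines of Python. This is only a sketch using the standard "utf-8" and "utf-8-sig" codecs; the sample characters are my own and are written as escapes.

```python
import codecs

# One sample from each length class: ASCII, Latin, CJK, beyond 16 bits.
samples = ["A", "\u00e9", "\u4e2d", "\U00020000"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print("octets per character:", lengths)

# Plain ASCII text gives identical bytes in ASCII and in UTF8.
ascii_same = "hello".encode("utf-8") == "hello".encode("ascii")
print("ASCII compatible:", ascii_same)

# The "utf-8-sig" codec writes the BOM on encoding and removes it on decoding.
with_bom = "A".encode("utf-8-sig")
starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)  # EF BB BF
round_trip = with_bom.decode("utf-8-sig")
print(with_bom.hex(), starts_with_bom, round_trip)
```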

UCS4/UTF32 - Here every Code Point is represented as a fixed 32 bit value. This is simple to process but takes four bytes even for an ASCII character, so it is not very frequently used.
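To get a feel for the space cost, the sketch below encodes the same four characters in UTF8, UTF16 and UTF32 with Python's standard codecs and compares the sizes. The sample text is an arbitrary choice of mine, and the big endian variants are used so that no BOM is added to the counts.

```python
# Four characters: ASCII, Latin, CJK, and one beyond 16 bits.
text = "A\u00e9\u4e2d\U00020000"

size_utf8 = len(text.encode("utf-8"))       # 1 + 2 + 3 + 4 octets
size_utf16 = len(text.encode("utf-16-be"))  # 2 + 2 + 2 + 4 octets
size_utf32 = len(text.encode("utf-32-be"))  # always 4 octets per character
print(size_utf8, size_utf16, size_utf32)

# In UTF32 the four bytes are simply the code point itself.
raw = "\u4e2d".encode("utf-32-be")
value = int.from_bytes(raw, "big")
print(f"0x{value:08X}")
```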