Unicode Character Encoding Model
Character Encoding Series (Part 2)
Unicode is an open character repertoire, meaning new characters can still be added to it. Until 2000, the repertoire held fewer than 65,000 characters. With the inclusion of a large set of additional Chinese characters, the count passed 90,000, and it has continued to grow since.
The Universal Character Set (UCS), the ISO/IEC 10646 repertoire that Unicode is kept in step with, can represent hundreds of thousands of abstract characters. The number assigned to each character is called its CODE POINT. Code points run from 0x0000 to 0x10FFFF, so with the Chinese characters included it takes 21 bits to represent every code point.
Unicode encodings fall into two families: (1) UCS encodings and (2) UTF (Unicode Transformation Format) encodings. The ones covered here are UCS2, UTF16, UTF8, UCS4 and UTF32.
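To make the idea of a code point concrete, here is a minimal Python sketch. It relies only on the built-in ord() and int.bit_length(); the sample characters are arbitrary picks for illustration and do not come from any particular data set.

```python
# A code point is just an integer; ord() returns it for a one-character string.
samples = [
    "A",           # U+0041, Latin capital letter A
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a common Chinese character
    "\U00020000",  # U+20000, a Chinese character beyond the first 65,536 code points
]

for ch in samples:
    cp = ord(ch)
    print(f"U+{cp:04X} needs {cp.bit_length()} bits")

# The highest code point Unicode defines, and the bits needed to hold it.
MAX_CODE_POINT = 0x10FFFF
max_bits = MAX_CODE_POINT.bit_length()
print(f"largest code point U+{MAX_CODE_POINT:X} needs {max_bits} bits")
```

The last line is where the 21 bit figure comes from: 0x10FFFF is just above 2 to the power 20.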
UCS2 - With the original Unicode repertoire of fewer than 65,000 characters, every code point could be mapped to a 16 bit value. UCS2 maps each character to a fixed 16 bit value in the range 0x0000 to 0xFFFF. Part of this range, the 2048 values from 0xD800 to 0xDBFF and from 0xDC00 to 0xDFFF, was reserved for future expansion. Each 16 bit value is known as a CODE UNIT in Unicode terminology.
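As a rough illustration of the fixed-width mapping, the sketch below packs each code point into one big-endian 16 bit code unit and counts the reserved values. The helper name ucs2_encode is hypothetical (Python has no built-in UCS2 codec), and the choice of big-endian byte order is an assumption made for the example.

```python
import struct

# The two reserved sub-ranges described above.
HIGH_RESERVED = range(0xD800, 0xDC00)  # 0xD800..0xDBFF
LOW_RESERVED = range(0xDC00, 0xE000)   # 0xDC00..0xDFFF
reserved_count = len(HIGH_RESERVED) + len(LOW_RESERVED)

def ucs2_encode(text):
    """Hypothetical helper: one big-endian 16 bit code unit per character."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:X} does not fit in a single 16 bit code unit")
        units.append(cp)
    return struct.pack(f">{len(units)}H", *units)

encoded = ucs2_encode("A\u4e2d")
print(reserved_count, encoded.hex())
```

Any character above 0xFFFF makes the helper raise, which is exactly the limitation that led to UTF16.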
UTF16 - With the inclusion of the Chinese characters, UCS2 could no longer encode every code point. The code space was extended by dividing it into CODE PLANES of 65,536 code points each, 17 planes in total. The original UCS2 range forms the first plane, called the BASIC MULTILINGUAL PLANE (BMP), and every character in it is still encoded as a single 16 bit code unit. The other 16 planes are called SUPPLEMENTARY PLANES. Their characters are encoded as SURROGATE PAIRS: two code units, the first (the high surrogate) in the range 0xD800 to 0xDBFF and the second (the low surrogate) in the range 0xDC00 to 0xDFFF. So a character in UTF16 takes either 16 bits or 32 bits. For transmission and storage of UTF16 text, a BOM (Byte Order Mark) can be placed at the start of the data. This is the special code point 0xFEFF (also known as zero width no-break space). If its bytes arrive as FE FF, the 16 bit values are BIG ENDIAN; if they arrive as FF FE, the values are LITTLE ENDIAN. When no BOM is present, the Unicode standard says to assume big endian unless a higher-level protocol says otherwise.
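The surrogate arithmetic and the BOM check can be sketched in a few lines of Python. The function names to_surrogate_pair and utf16_byte_order are made up for this illustration; the BOM constants come from the standard codecs module, and U+1F600 is just a convenient test character from a supplementary plane.

```python
import codecs

def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000      # leaves a 20 bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

def utf16_byte_order(data):
    """Return 'big', 'little', or None depending on the leading BOM bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None

high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")      # prints "U+1F600 -> D83D DE00"

# Python's own encoder should agree with the hand computation.
python_bytes = "\U0001F600".encode("utf-16-be")
print(python_bytes.hex())

print(utf16_byte_order(b"\xfe\xff\x00A"))      # BOM followed by "A", big endian
print(utf16_byte_order(b"\xff\xfeA\x00"))      # BOM followed by "A", little endian
```

Each surrogate carries 10 bits of the offset, which is why the two reserved ranges of 1024 values each are enough to reach the 16 supplementary planes.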
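UTF8 - In this encoding scheme, each Unicode code point is encoded as a sequence of one to four octets (8 to 32 bits). The advantages are that it is relatively compact for Latin-script text and that it is ASCII compatible: the code points 0x00 to 0x7F are encoded as the same single bytes as in ASCII. UTF8 text files can also start with the BOM (the bytes EF BB BF), but here it only signals that the contents are Unicode text, since UTF8 has no byte order to indicate.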
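A short Python sketch shows the variable length in practice. It uses only the built-in str.encode(), the codecs module and the standard "utf-8-sig" codec; the sample characters are arbitrary choices, one for each possible length.

```python
import codecs

samples = ["A", "\u00e9", "\u4e2d", "\U00020000"]  # U+0041, U+00E9, U+4E2D, U+20000
lengths = []

for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# ASCII text is byte-for-byte identical in UTF8.
ascii_same = "hello".encode("utf-8") == "hello".encode("ascii")

# The UTF8 BOM is the three bytes EF BB BF; the "utf-8-sig" codec strips it on decode.
with_bom = codecs.BOM_UTF8 + "hi".encode("utf-8")
decoded = with_bom.decode("utf-8-sig")
print(lengths, ascii_same, decoded)
```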
UCS4/UTF32 - Here every code point is represented as a fixed 32 bit value. It is simple to index but takes four bytes per character, so it is not used very often for storage or transmission.
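To compare the fixed-width form with the others, the sketch below encodes one short string three ways using Python's built-in codecs. The big-endian codec names are chosen so that no BOM is added to the output; the three-character sample string is an arbitrary choice.

```python
import struct

text = "A\u4e2d\U00020000"  # U+0041, U+4E2D, U+20000: three code points

utf32 = text.encode("utf-32-be")
utf16 = text.encode("utf-16-be")
utf8 = text.encode("utf-8")

# Every code point takes exactly four bytes in UTF32.
print(len(text), len(utf32), len(utf16), len(utf8))

# Because the width is fixed, the code points can be read back as plain 32 bit integers.
code_points = struct.unpack(f">{len(text)}I", utf32)
print([f"U+{cp:04X}" for cp in code_points])
```

For this string UTF32 needs 12 bytes while UTF16 and UTF8 each need 8, which hints at why the fixed-width form is rarely chosen for files or network traffic.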