Sunday, April 16, 2006

Character Encoding Model

Character Encoding Series (Part 1)

Character Encoding Model - The encoding model has four constituents: (1) Character Repertoire (2) Coded Character Set (3) Character Encoding Form (4) Character Encoding Scheme

Character Repertoire - This represents which characters are available in the model. There are two types of repertoires: (1) closed repertoires, where the set of characters is fixed and cannot be added to (for example, ASCII), and (2) open repertoires, which are extensible (for example, Unicode and the Windows code pages).

Coded Character Set - This assigns a numerical value to each character in the character repertoire. The same character can have different numerical values in different models. In Unicode terminology, this value is called a CODE POINT and is written with a U+ prefix followed by hexadecimal digits. For example, U+0041 is the character 'A'.
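To make the idea of a code point concrete, here is a small Python sketch. The helper name code_point_label is my own and not part of any library, and the two Windows/DOS code pages are just convenient examples of "other models" that number the same character differently.

```python
def code_point_label(ch: str) -> str:
    """Format the Unicode code point of a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"

# ord() gives the code point as an integer; chr() goes the other way.
print(code_point_label("A"))      # U+0041
print(chr(0x41))                  # A

# The same character (e with acute accent, U+00E9) has a different
# numerical value in two other coded character sets.
e_acute = "\u00e9"
print(e_acute.encode("cp1252")[0])  # value in Windows code page 1252
print(e_acute.encode("cp437")[0])   # value in DOS code page 437
```

The last two lines print different numbers for the same character, which is exactly the "same character, different numerical value" point above.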

Character Encoding Form - This specifies how the numerical value of a character is converted to fixed bit-width values, called CODE VALUES (code units in current Unicode terminology), for manipulation by computers. This conversion can be as simple as in ASCII, where the 7-bit ASCII codes are mapped directly to code values (in practice stored one per 8-bit byte), or as complex as in Unicode, where several conversions are possible, such as UCS-2, UTF-16, UTF-8, UCS-4 and UTF-32.
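The difference between the Unicode encoding forms is easiest to see by listing the code values each one produces for the same code point. The following is only a sketch: code_units is a hypothetical helper of mine, and it uses Python's little-endian codecs purely as a way to get at the 8-, 16- and 32-bit values.

```python
import struct

# (bytes per code value, struct format character, Python codec name)
_FORMS = {
    "utf-8": (1, "B", "utf-8"),
    "utf-16": (2, "H", "utf-16-le"),
    "utf-32": (4, "I", "utf-32-le"),
}

def code_units(text: str, form: str) -> list:
    """Return the code values of text in the given encoding form."""
    size, fmt, codec = _FORMS[form]
    data = text.encode(codec)
    # Reassemble the bytes into integers of the form's bit width.
    return list(struct.unpack(f"<{len(data) // size}{fmt}", data))

for ch in ("A", "\u20ac", "\U0001F600"):
    for form in ("utf-8", "utf-16", "utf-32"):
        units = [hex(u) for u in code_units(ch, form)]
        print(f"U+{ord(ch):04X} {form}: {units}")
```

Note how U+0041 is a single code value in every form, while U+1F600 needs four 8-bit values in UTF-8, two 16-bit values (a surrogate pair) in UTF-16 and one 32-bit value in UTF-32.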

Character Encoding Scheme - This specifies how the code values are serialized into a sequence of bytes for storage and transmission. It covers details such as byte order and the BOM (Byte Order Mark) for UTF-16.
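As a rough illustration of the encoding scheme, the sketch below serializes the same UTF-16 code values in both byte orders and shows how a BOM lets the reader detect which one was used. The function serialize_utf16 is a name I made up for this example. The codecs constants and codec names are from the Python standard library.

```python
import codecs

def serialize_utf16(text: str, big_endian: bool, with_bom: bool = True) -> bytes:
    """Serialize text as UTF-16 bytes in the requested byte order."""
    if big_endian:
        bom, codec = codecs.BOM_UTF16_BE, "utf-16-be"
    else:
        bom, codec = codecs.BOM_UTF16_LE, "utf-16-le"
    payload = text.encode(codec)
    return bom + payload if with_bom else payload

be = serialize_utf16("A", big_endian=True)
le = serialize_utf16("A", big_endian=False)
print(be.hex())  # feff0041
print(le.hex())  # fffe4100

# The generic "utf-16" decoder reads the BOM to pick the byte order.
print(be.decode("utf-16") == le.decode("utf-16") == "A")  # True
```

The code value 0x0041 is the same in both cases. Only the order of the bytes on disk or on the wire differs, and that is what the encoding scheme pins down.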