A character in a specified language can be coded as a string of bits (binary digits).
In the early days of computing, English was the primary language used for communications in most parts of the world. The English alphabet consists of 26 letters, so 5 bits (which can represent 32 symbols) are enough to represent either (a) the upper-case characters or (b) the lower-case characters; covering both cases together requires 6 bits (64 symbols).
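As a quick check of this arithmetic, here is a small Python sketch (the helper bits_needed is illustrative, not a standard function):

```python
import math

def bits_needed(symbol_count: int) -> int:
    # Smallest n such that 2**n distinct bit strings cover the symbols.
    return math.ceil(math.log2(symbol_count))

print(bits_needed(26))  # 5 -> one case of the English alphabet fits in 5 bits
print(bits_needed(52))  # 6 -> both cases together need 6 bits
```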
5-bit, 6-bit, 7-bit and 8-bit character codes were developed and used. Some early standards include (a) 7-bit ASCII (American Standard Code for Information Interchange) (b) 8-bit EBCDIC (Extended Binary Coded Decimal Interchange Code).
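For example, Python's ord() returns a character's code point, which for these characters coincides with its 7-bit ASCII value:

```python
# 7-bit ASCII assigns each character a value in the range 0..127.
for ch in "Aa0":
    print(ch, ord(ch), format(ord(ch), "07b"))
# A 65 1000001
# a 97 1100001
# 0 48 0110000
```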
With the widespread use of computer technology, character sets for many other languages were developed. Some languages (notably Chinese) have thousands of characters and therefore require longer bit strings.
Fixed-length coding gave way to variable-length coding: the most frequently used characters in a language are represented as one byte, and less frequently used characters are represented as two or more bytes.
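UTF-8 (covered below) is a widely used example of this idea; a short Python sketch shows the varying byte lengths:

```python
# In UTF-8, common (ASCII) characters take one byte;
# other characters take two to four bytes.
for ch in ("e", "é", "中"):
    data = ch.encode("utf-8")
    print(ch, len(data), data.hex())
# e 1 65
# é 2 c3a9
# 中 3 e4b8ad
```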
Unicode aims to provide a standard character code for every language. Formal and informal institutions (most notably the Unicode Consortium) help develop, propose, and approve new Unicode character sets.
A UTF (Unicode Transformation Format) transforms Unicode characters into sequences of code units of a specified length, e.g. 8-bit units in UTF-8 and 16-bit units in UTF-16.
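A minimal sketch of the same text transformed by two UTF formats, using Python's built-in codecs:

```python
# The same characters become different byte sequences under different UTFs.
text = "A中"
print(text.encode("utf-8").hex())      # 41e4b8ad -> 1 byte + 3 bytes
print(text.encode("utf-16-be").hex())  # 00414e2d -> 2 bytes + 2 bytes (16-bit units)
```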
Standards may be (a) de jure (set by law) or (b) de facto (set by common usage). Standards must be followed for compliance.
Recommendations, which should be followed but are not mandatory, can lead to variations among implementations.