Updated: August 5, 2019
A character in a specified language can be coded as a string of bits (binary digits).
Coding the English alphabet
In the early days of computing, English was the primary language used for communication in most parts of the world.
- The English alphabet consists of 26 letters.
- 26 symbols are needed to code upper-case characters OR lower-case characters.
  5 bits are needed for a fixed-length code, since 2^5 = 32 ≥ 26.
- 52 symbols are needed to code BOTH upper-case and lower-case characters.
  6 bits are needed for a fixed-length code, since 2^6 = 64 ≥ 52 (see the sketch below).
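As a minimal sketch (purely illustrative, not an actual standard), the Python snippet below maps the 26 upper-case letters to fixed 5-bit codes; the function name encode_letter is hypothetical.

    # Illustrative only: map 'A'..'Z' to the 5-bit codes 00000..11001.
    # 2**5 = 32 combinations are available, which covers all 26 letters.
    def encode_letter(ch):
        index = ord(ch) - ord('A')      # 'A' -> 0, ..., 'Z' -> 25
        return format(index, '05b')     # fixed width of 5 bits

    print(encode_letter('A'))  # 00000
    print(encode_letter('Z'))  # 11001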
Fixed-length coding
All the characters in a character set are coded using a fixed number of bits.
5-bit, 6-bit, 7-bit and 8-bit character codes were developed and used.
Some early standards include:
- 7-bit ASCII (American Standard Code for Information Interchange)
- 8-bit EBCDIC (Extended Binary Coded Decimal Interchange Code).
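The snippet below sketches fixed-length coding with 7-bit ASCII, using Python's built-in 'ascii' codec; the helper name to_ascii_bits is just for illustration.

    # Fixed-length coding: every character occupies exactly 7 bits,
    # so a string of n characters always takes 7*n bits.
    def to_ascii_bits(text):
        return ' '.join(format(b, '07b') for b in text.encode('ascii'))

    print(to_ascii_bits('Hi'))  # 1001000 1101001  ('H' = 72, 'i' = 105)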
Coding for other languages
- With the widespread use of computer technology, character sets for various languages were developed.
- Some languages (notably Chinese) have far more characters and therefore require longer bit strings.
- Fixed-length coding gave way to variable-length coding.
Variable-length coding
- The characters in a character set are coded using a variable number of bits.
- The most frequently used characters in a language are represented with one byte, and less frequently used characters are represented with two or more bytes (see the sketch below).
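For example, Python's built-in UTF-8 codec shows the variable length directly: an ASCII letter encodes to one byte, while accented and Chinese characters encode to two or three bytes.

    # Variable-length coding in practice: UTF-8 byte count per character.
    for ch in ['A', 'é', '中']:
        encoded = ch.encode('utf-8')
        print(ch, len(encoded), encoded.hex())
    # prints:
    # A 1 41
    # é 2 c3a9
    # 中 3 e4b8ad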
Standards and Recommendations
- De Jure Standards are set by law.
- De Facto Standards are set by common usage.
- Standards must be followed for compliance.
- Recommendations, on the other hand, should be followed, but because they are not mandatory they can lead to variations among implementations.
Unicode and UTF
- Unicode aims to assign a standard code to every character in each supported language, to facilitate processing.
- Formal and informal institutions help develop, propose and approve new Unicode character sets.
- A UTF (Unicode Transformation Format) transforms Unicode characters into code units of a specified size, e.g. UTF-8 (8-bit units) and UTF-16 (16-bit units); see the sketch below.
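As a small sketch using Python's built-in codecs, the same code point (the Euro sign, U+20AC) is transformed into different byte sequences by UTF-8 and UTF-16 (big-endian is used here to omit the byte-order mark).

    # One code point, two transformation formats.
    euro = '\u20ac'                          # Euro sign, U+20AC
    print(euro.encode('utf-8').hex())        # e282ac (three 8-bit units)
    print(euro.encode('utf-16-be').hex())    # 20ac   (one 16-bit unit)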