
Thursday, September 4, 2008

What is UTF?

UTF is an abbreviation for UCS Transformation Format.
UCS is an abbreviation for Universal Character Set.
The Universal Character Set is synchronized with the Unicode standard.
There are three commonly used UTF encoding forms, namely UTF-8, UTF-16 and UTF-32.
UTF-8 encodes Unicode characters into a sequence of 8-bit values known as code units; in UTF-8 the code unit is 8 bits long. Similarly, UTF-16 and UTF-32 use 16-bit and 32-bit code units respectively to encode Unicode characters.
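
As a rough illustration (a minimal sketch; the sample character is arbitrary, and the optional UTF-32BE charset may not be present in every Java runtime), the same character occupies 1, 2 and 4 bytes under the three encoding forms:

import java.nio.charset.Charset;

public class CodeUnitSizes {
    public static void main(String[] args) {
        String s = "A"; // U+0041, a single character
        // Each encoding form uses a different code unit size, so the
        // byte counts printed below are 1, 2 and 4 respectively.
        System.out.println("UTF-8   : " + s.getBytes(Charset.forName("UTF-8")).length + " byte(s)");
        System.out.println("UTF-16BE: " + s.getBytes(Charset.forName("UTF-16BE")).length + " byte(s)");
        System.out.println("UTF-32BE: " + s.getBytes(Charset.forName("UTF-32BE")).length + " byte(s)");
    }
}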

There are over a million code points available in the latest version of the Unicode Standard (v5.1.0). The range of code points for Unicode characters is from 0 - 10FFFF (hex). Within this range, the values from D800 - DFFF are reserved for surrogate pairs and are not assigned to any abstract characters. The range D800 - DBFF is for high surrogates and DC00 - DFFF is for low surrogates. The surrogates are used for encoding supplementary characters in UTF-16, as will be discussed later in this article. Let us first look at how UTF-8 encoding is done.
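
For reference, these boundaries are also exposed as constants on java.lang.Character (available since Java 5), so they can be printed directly:

public class UnicodeRanges {
    public static void main(String[] args) {
        // Full Unicode code point range: 0 .. 10FFFF
        System.out.printf("Code points    : %04X - %04X%n",
                Character.MIN_CODE_POINT, Character.MAX_CODE_POINT);
        // Surrogate ranges reserved inside the BMP
        System.out.printf("High surrogates: %04X - %04X%n",
                (int) Character.MIN_HIGH_SURROGATE, (int) Character.MAX_HIGH_SURROGATE);
        System.out.printf("Low surrogates : %04X - %04X%n",
                (int) Character.MIN_LOW_SURROGATE, (int) Character.MAX_LOW_SURROGATE);
    }
}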

In UTF-8 encoding, a single Unicode character is encoded into one or more octets depending on the value of the character being encoded. The following table shows the number of bytes (code units) used for encoding characters in the different code point ranges:


Code Point Values     No. of Code Units (bytes)
0 - 7F (ASCII)        1
80 - 7FF              2
800 - FFFF            3
10000 - 10FFFF        4
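
One way to confirm these counts is to encode a sample character from each row with Java's built-in UTF-8 charset (a minimal sketch; the sample characters are arbitrary):

import java.io.UnsupportedEncodingException;

public class Utf8Lengths {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // one sample character from each row of the table above
        String[] samples = {
            "A",                                    // U+0041 (ASCII)
            "\u00E9",                               // U+00E9
            "\u20AC",                               // U+20AC
            new String(Character.toChars(0x1D11E))  // U+1D11E (supplementary)
        };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n",
                    s.codePointAt(0), s.getBytes("UTF-8").length);
        }
    }
}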


For 7-bit ASCII values the code point is stored directly in a single byte. For code points in the second range (up to 11 bits), the character is encoded into 2 bytes: the first byte has its initial 3 bits set to 110 to indicate that it is the first byte of a 2-byte sequence, and the second byte has its initial 2 bits set to 10 to indicate that it is a continuation byte. The 11 bits to be encoded are laid out as follows:

1st byte = 110mmmmm where mmmmm are the most significant 5 bits (bits 10 - 6) of the 11 bits to be encoded
2nd byte = 10nnnnnn where nnnnnn are the remaining 6 bits (bits 5 - 0) to be encoded.
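
For example, U+00E9 falls in this second range; a small sketch of the bit manipulation described above (the variable names are mine, not part of any standard API):

public class TwoByteUtf8 {
    public static void main(String[] args) {
        int cp = 0x00E9;                    // code point in the range 80 - 7FF (11 bits: 000 1110 1001)
        int b1 = 0xC0 | (cp >> 6);          // 110mmmmm : the top 5 of the 11 bits
        int b2 = 0x80 | (cp & 0x3F);        // 10nnnnnn : the remaining 6 bits
        System.out.printf("%02X %02X%n", b1, b2);   // prints C3 A9, the UTF-8 form of U+00E9
    }
}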

For code points in the third range (up to 16 bits), the character is encoded into 3 bytes: the first byte has its initial 4 bits set to 1110 to indicate that it is the first byte of a 3-byte sequence, and the next two bytes have their initial 2 bits set to 10 to indicate that they are continuation bytes. The 16 bits to be encoded are laid out as follows:

1st byte = 1110wwww where wwww are the most significant 4 bits (bits 15 - 12) of the 16 bits to be encoded
2nd byte = 10xxxxxx where xxxxxx are the next 6 bits (bits 11 - 6) to be encoded
3rd byte = 10yyyyyy where yyyyyy are the remaining 6 bits (bits 5 - 0) to be encoded.

For code points in the fourth range (up to 21 bits), the character is encoded into 4 bytes: the first byte has its initial 5 bits set to 11110 to indicate that it is the first byte of a 4-byte sequence, and the next three bytes have their initial 2 bits set to 10 to indicate that they are continuation bytes. The 21 bits to be encoded are laid out as follows:

1st byte = 11110www where www are the most significant 3 bits (bits 20 - 18) of the 21 bits to be encoded
2nd byte = 10xxxxxx where xxxxxx are the next 6 bits (bits 17 - 12) to be encoded
3rd byte = 10yyyyyy where yyyyyy are the next 6 bits (bits 11 - 6) to be encoded
4th byte = 10zzzzzz where zzzzzz are the remaining 6 bits (bits 5 - 0) to be encoded.
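
Putting all four ranges together, a minimal encoder following the scheme above might be sketched like this (an illustration only, with no validation of surrogates or out-of-range values; it is not how the JDK itself implements UTF-8):

public class Utf8Encoder {
    // Encodes a single Unicode code point (0 .. 0x10FFFF) into UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp <= 0x7F) {                     // 1 byte : 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {             // 2 bytes: 110mmmmm 10nnnnnn
            return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {            // 3 bytes: 1110wwww 10xxxxxx 10yyyyyy
            return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        } else {                              // 4 bytes: 11110www 10xxxxxx 10yyyyyy 10zzzzzz
            return new byte[] { (byte) (0xF0 | (cp >> 18)),
                                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // one sample from each range: prints 41, C3 A9, E2 82 AC, F0 9D 84 9E
        for (int cp : new int[] { 0x41, 0xE9, 0x20AC, 0x1D11E }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(cp)) hex.append(String.format("%02X ", b));
            System.out.printf("U+%04X -> %s%n", cp, hex.toString().trim());
        }
    }
}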

This shows how Unicode characters are encoded using the standard UTF-8 encoding.

There is a variation of standard UTF-8, called modified UTF-8. This variation is used in Java by the writeUTF and readUTF methods of the DataOutputStream, DataInputStream and RandomAccessFile classes. In this variation the null character (U+0000) is encoded using the 2-byte (11-bit) form, which results in the two bytes 11000000 10000000 (C0 80 hex). This is done so that an encoded string never contains an all-zero byte. The other change concerns the values in the fourth range: in Java a supplementary character is represented as two char values (a high surrogate followed by a low surrogate), and modified UTF-8 encodes each surrogate separately using the 3-byte form, so such characters take six bytes each (3 bytes for the high surrogate and 3 bytes for the low surrogate).
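
A small sketch to observe both differences through DataOutputStream.writeUTF (note that writeUTF prefixes the data with a 2-byte length, which is skipped when printing):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    static void dump(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF(s);
        byte[] bytes = buf.toByteArray();
        StringBuilder hex = new StringBuilder();
        for (int i = 2; i < bytes.length; i++) {   // skip the 2-byte length prefix
            hex.append(String.format("%02X ", bytes[i]));
        }
        System.out.println(hex.toString().trim());
    }

    public static void main(String[] args) throws IOException {
        dump("\u0000");                               // prints C0 80, not 00
        dump(new String(Character.toChars(0x1D11E))); // prints six bytes (two 3-byte surrogates)
    }
}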

Let us now look at the UTF-16 encoding.
In UTF-16 encoding, a Unicode character is encoded as either one or two 16-bit code units depending on the value of the code point. We know that the range of code points for Unicode characters is 0 - 10FFFF (hex). Characters in the range 0 - FFFF (hex), which make up the BMP (Basic Multilingual Plane), are encoded into a single 16-bit code unit that holds the value of the code point as it is. The supplementary characters, which are in the range 10000 (hex) - 10FFFF (hex), are encoded into two 16-bit code units as follows:
The first step is to subtract the value 10000 (hex) from the value of the code point, which brings the value into the range 0 - FFFFF (hex), i.e. a 20-bit number. These 20 bits are then encoded into two 16-bit values as follows:

1st 16-bit unit = 110110xxxxxxxxxx where xxxxxxxxxx are the most significant 10 bits (bits 19 - 10) of the 20 bits to be encoded.
2nd 16-bit unit = 110111yyyyyyyyyy where yyyyyyyyyy are the remaining 10 bits (bits 9 - 0) of the 20 bits to be encoded.

So the first 16-bit unit will be in the range 1101100000000000 - 1101101111111111 (D800 - DBFF), which is the high surrogate range, and the second 16-bit unit will be in the range 1101110000000000 - 1101111111111111 (DC00 - DFFF), which is the low surrogate range.
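
A sketch of this calculation, checked against the library's own Character.toChars:

public class SurrogatePairDemo {
    public static void main(String[] args) {
        int cp = 0x1D11E;                           // a supplementary code point
        int v = cp - 0x10000;                       // 20-bit value in the range 0 .. FFFFF
        char high = (char) (0xD800 | (v >> 10));    // 110110xxxxxxxxxx : top 10 bits
        char low  = (char) (0xDC00 | (v & 0x3FF));  // 110111yyyyyyyyyy : low 10 bits
        System.out.printf("%04X %04X%n", (int) high, (int) low);  // prints D834 DD1E

        // The standard library performs the same transformation:
        char[] units = Character.toChars(cp);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
    }
}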

This shows how Unicode characters are encoded using the UTF-16 encoding.

