UCS is an abbreviation for Universal Character Set.
The Universal Character Set is synchronized with the Unicode standard.
There are three commonly used UTF encodings: UTF-8, UTF-16 and UTF-32.
UTF-8 encodes Unicode characters into a sequence of 8-bit values known as code units; its encoding unit is 8 bits long. Similarly, UTF-16 and UTF-32 use 16-bit and 32-bit code units respectively.
The latest version of the Unicode Standard (v5.1.0) defines a code space of just over a million code points, ranging from 0 to 10FFFF (hex). Within this range, the values from D800 to DFFF are reserved for surrogate pairs and are not assigned to any abstract characters: D800 - DBFF is the high-surrogate range and DC00 - DFFF is the low-surrogate range. Surrogates are used for encoding supplementary characters in UTF-16, as discussed later in this article. Let us first look at how UTF-8 encoding is done.
In UTF-8 a single Unicode character is encoded into one to four octets, depending on the value of the character being encoded. The following table shows the number of bytes (code units) used for encoding characters in the different code point ranges:
Code Point Values      No. of Code Units (bytes)
0 - 7F (ASCII)         1
80 - 7FF               2
800 - FFFF             3
10000 - 10FFFF         4
For 7-bit ASCII values the code point is stored directly in a single byte. For code points in the second range (up to 11 bits) the character is encoded into 2 bytes, where the first byte has its initial 3 bits set to 110 to indicate that it is the lead byte of a 2-byte sequence, and the second byte has its initial 2 bits set to 10 to indicate that it is a continuation byte. The 11 bits to be encoded are laid out as follows:
1st byte = 110mmmmm, where mmmmm are the most significant 5 bits (bits 10 - 6) of the 11 bits to be encoded
2nd byte = 10nnnnnn, where nnnnnn are the remaining 6 bits (bits 5 - 0) to be encoded.
For code points in the third range (up to 16 bits) the character is encoded into 3 bytes, where the first byte has its initial 4 bits set to 1110 to indicate that it is the lead byte of a 3-byte sequence, and the next two bytes have their initial 2 bits set to 10 to indicate that they are continuation bytes. The 16 bits to be encoded are laid out as follows:
1st byte = 1110wwww, where wwww are the most significant 4 bits (bits 15 - 12) of the 16 bits to be encoded
2nd byte = 10xxxxxx, where xxxxxx are the next 6 bits (bits 11 - 6) to be encoded
3rd byte = 10yyyyyy, where yyyyyy are the remaining 6 bits (bits 5 - 0) to be encoded.
For code points in the fourth range (up to 21 bits) the character is encoded into 4 bytes, where the first byte has its initial 5 bits set to 11110 to indicate that it is the lead byte of a 4-byte sequence, and the next three bytes have their initial 2 bits set to 10 to indicate that they are continuation bytes. The 21 bits to be encoded are laid out as follows:
1st byte = 11110www, where www are the most significant 3 bits (bits 20 - 18) of the 21 bits to be encoded
2nd byte = 10xxxxxx, where xxxxxx are the next 6 bits (bits 17 - 12) to be encoded
3rd byte = 10yyyyyy, where yyyyyy are the next 6 bits (bits 11 - 6) to be encoded
4th byte = 10zzzzzz, where zzzzzz are the remaining 6 bits (bits 5 - 0) to be encoded.
This is how Unicode characters are encoded using the standard UTF-8 encoding.
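The bit layout above translates directly into code. Below is a minimal sketch in Java (the class and method names are just for illustration); in practice you would simply call String.getBytes("UTF-8"), but writing the encoder out makes the byte layout explicit:

public class Utf8Sketch {

    // Encodes a single code point (0 - 10FFFF hex, excluding the surrogate
    // range D800 - DFFF) into its UTF-8 byte sequence, per the table above.
    static byte[] encodeUtf8(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw new IllegalArgumentException("invalid code point: " + cp);
        if (cp <= 0x7F)              // 1 byte: 0xxxxxxx
            return new byte[] { (byte) cp };
        if (cp <= 0x7FF)             // 2 bytes: 110mmmmm 10nnnnnn
            return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        if (cp <= 0xFFFF)            // 3 bytes: 1110wwww 10xxxxxx 10yyyyyy
            return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        // 4 bytes: 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return new byte[] { (byte) (0xF0 | (cp >> 18)),
                            (byte) (0x80 | ((cp >> 12) & 0x3F)),
                            (byte) (0x80 | ((cp >> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        // U+0041 -> 41, U+00E9 -> C3 A9, U+20AC -> E2 82 AC, U+1D11E -> F0 9D 84 9E
        for (int cp : new int[] { 0x41, 0xE9, 0x20AC, 0x1D11E }) {
            StringBuilder sb = new StringBuilder();
            for (byte b : encodeUtf8(cp))
                sb.append(String.format("%02X ", b & 0xFF));
            System.out.printf("U+%04X -> %s%n", cp, sb.toString().trim());
        }
    }
}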
There is a variation of the standard UTF-8 encoding, called modified UTF-8. It is used in Java by the writeUTF and readUTF methods of the DataOutputStream, DataInputStream and RandomAccessFile classes. In this variation the character U+0000 is encoded into 2 bytes using the 2-byte form, resulting in the bytes 11000000 10000000 (C0 80 hex). This guarantees that a byte with all bits zero never appears in the encoded output. The other change concerns the fourth range: in Java a supplementary character is represented as two char values (a high surrogate followed by a low surrogate), and modified UTF-8 encodes each surrogate independently using the 3-byte form. As a result each supplementary character is encoded as six bytes (3 bytes for the high surrogate and 3 bytes for the low surrogate) instead of four.
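As a quick check of this behavior, here is a small demo (the class name is hypothetical) that writes a NUL character and the supplementary character U+1D11E with DataOutputStream.writeUTF and dumps the resulting bytes; note that writeUTF prefixes the data with a 2-byte length:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);

        // "\u0000" is a NUL character; "\uD834\uDD1E" is the surrogate
        // pair representing the supplementary character U+1D11E.
        out.writeUTF("\u0000\uD834\uDD1E");

        byte[] bytes = buf.toByteArray();
        for (int i = 2; i < bytes.length; i++)   // skip the 2-byte length prefix
            System.out.printf("%02X ", bytes[i] & 0xFF);
        System.out.println();
        // Prints: C0 80 ED A0 B4 ED B4 9E
        //   C0 80    -> NUL in the 2-byte form (no all-zero byte)
        //   ED A0 B4 -> high surrogate D834 in the 3-byte form
        //   ED B4 9E -> low surrogate DD1E in the 3-byte form
    }
}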
Let us now look at the UTF-16 encoding.
In UTF-16 a Unicode character is encoded as one or two 16-bit code units, depending on the value of the code point. Recall that the code points for Unicode characters range from 0 to 10FFFF (hex). Characters in the range 0 - FFFF (hex) are encoded into a single 16-bit code unit that holds the value of the code point directly; these are the characters of the BMP (Basic Multilingual Plane). The supplementary characters, in the range 10000 (hex) - 10FFFF (hex), are encoded into two 16-bit code units as follows:
The first step is to subtract the value 10000 (hex) from the value of the code point, which brings the value into the range 0 - FFFFF (hex), a 20-bit number. These 20 bits are then encoded into two 16-bit values as follows:
1st 16-bit value = 110110xxxxxxxxxx, where xxxxxxxxxx are the most significant 10 bits (bits 19 - 10) of the 20 bits to be encoded.
2nd 16-bit value = 110111yyyyyyyyyy, where yyyyyyyyyy are the remaining 10 bits (bits 9 - 0) of the 20 bits to be encoded.
So the first 16-bit value will lie in the range 1101100000000000 - 1101101111111111 (D800 - DBFF), which is the high-surrogate range, and the second 16-bit value will lie in the range 1101110000000000 - 1101111111111111 (DC00 - DFFF), which is the low-surrogate range.
This is how Unicode characters are encoded using the UTF-16 encoding.
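Here is a minimal sketch of this arithmetic in Java (class and method names are just for illustration); the standard library performs the same computation in Character.toChars:

public class Utf16Sketch {
    // Splits a supplementary code point (10000 - 10FFFF hex) into a
    // UTF-16 surrogate pair using the steps described above.
    static char[] toSurrogatePair(int cp) {
        int v = cp - 0x10000;                      // 20-bit value in 0 - FFFFF
        char high = (char) (0xD800 | (v >> 10));   // 110110xxxxxxxxxx
        char low  = (char) (0xDC00 | (v & 0x3FF)); // 110111yyyyyyyyyy
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1D11E;                          // MUSICAL SYMBOL G CLEF
        char[] pair = toSurrogatePair(cp);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
        char[] lib = Character.toChars(cp);        // library equivalent
        System.out.printf("%04X %04X%n", (int) lib[0], (int) lib[1]);   // D834 DD1E
    }
}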