class<JAVA extends class<JAVA>>: September 2008

Friday, September 5, 2008

Non-English letters in Java source code

Rules for defining identifiers in Java:
1. An identifier is composed of a sequence of letters, digits, the currency symbol $ and the separator character _.
2. An identifier cannot start with a digit.
3. An identifier cannot be a reserved word of the Java language.

These rules are no different from those of the C programming language. The difference lies in the character set used by the language. Java uses the Unicode character set, and hence the letters and digits allowed here are not restricted to those available in the ASCII character set, which is used in the C programming language. In Java, letters and digits from any language can be used. For example, anyone can use अ (the Devanagari letter A, Unicode code point U+0905) in an identifier.

The question arises of how we can use these characters in Java source code while working in an editor that supports only ASCII files. In Java source code one can use a Unicode escape to specify any Unicode character. A Unicode escape is written as \uhhhh, where hhhh are the four hexadecimal digits of the UTF-16 code unit. The Java compiler translates these Unicode escapes before it identifies the lines and tokens of the source code. The following are a few interesting examples:

char \u0905 = '\u0905'; is valid code, where the variable named अ (Devanagari letter A) is assigned the value 0905 (hex).

char ch = '\u000A'; is not valid, since the Unicode escape \u000A is a line feed character and the compiler would see the statement as follows:

char ch = '
';

This results in a compilation error. Anyone can easily obfuscate source code by using Unicode escapes for the space and semicolon characters, which makes the code very hard to read.

Currently most programmers use letters and digits from the English language when defining identifiers. It may be worth trying to create obfuscators that change identifiers to non-English letters and digits.
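The identifier rules and the escape handling described above can be tried out with a small sketch. The class name UnicodeIdentifierSketch and the helper isValidIdentifier are made up for this illustration; the helper only checks the character rules (rules 1 and 2) and does not reject reserved words.

```java
class UnicodeIdentifierSketch {
    // The escape on the next line is translated before tokenizing, so the
    // field's name is the single Devanagari letter A and its value is 0x0905.
    static char \u0905 = '\u0905';

    // A line feed has to be written with the escape sequence \n, because a
    // Unicode escape for it would be turned into a real line break first.
    static char lineFeed = '\n';

    // Hypothetical helper: checks rules 1 and 2 only (not reserved words).
    static boolean isValidIdentifier(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(\u0905));       // prints 905
        System.out.println(isValidIdentifier("\u0905_1"));     // prints true
        System.out.println(isValidIdentifier("9lives"));       // prints false
    }
}
```

Character.isJavaIdentifierStart and Character.isJavaIdentifierPart are the JDK's own view of which Unicode characters may appear in an identifier, so the helper accepts non-ASCII letters and digits as well.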

Thursday, September 4, 2008

What is UTF?

UTF is an abbreviation for UCS Transformation Format.
UCS is an abbreviation for Universal Character Set.

The Universal Character Set is synchronized with the Unicode standard.
There are three commonly known UTF encodings, namely UTF-8, UTF-16 and UTF-32.
UTF-8 encodes Unicode characters into a sequence of 8-bit values known as code units; that is, in UTF-8 the code unit is 8 bits long. Similarly, UTF-16 and UTF-32 use 16-bit and 32-bit code units respectively for encoding Unicode characters.

The Unicode code space contains over a million code points, ranging from 0 to 10FFFF (hex). The latest version of the Unicode Standard (v5.1.0) assigns characters to just over 100,000 of them. Within this range, the code points from D800 to DFFF are reserved for creating surrogate pairs and are not assigned to any abstract characters. The range D800 - DBFF is for high surrogates and DC00 - DFFF is for low surrogates. The surrogates are used for encoding supplementary characters in UTF-16, as will be discussed later in this article. Let us first look at how UTF-8 encoding is done.

In UTF-8 a single Unicode character is encoded into one or more octets, depending on the value of the character being encoded. The following table shows the number of code units (bytes) used for encoding characters in the different code point ranges:


Code Point Values     No. of Code Units (bytes)
    0  - 7F (ASCII)              1
   80  - 7FF                     2
  800  - FFFF                    3
 10000 - 10FFFF                  4

For 7-bit ASCII values the code point is encoded and stored in a single byte holding the value of the code point. For code points in the second range (11 bits) the character is encoded into 2 bytes, where the first byte has its initial 3 bits set to 110 to indicate that it is the first byte of a 2-byte encoding, and the second byte has its initial 2 bits set to 10 to indicate that it is a continuation byte. The 11 bits are encoded as follows:

1st byte = 110mmmmm where mmmmm are the most significant 5 bits (bits 10 - 6) of the 11 bits to be encoded
2nd byte = 10nnnnnn where nnnnnn are the remaining 6 bits (bits 5 - 0) to be encoded.

For code points in the third range (16 bits) the character is encoded into 3 bytes, where the first byte has its initial 4 bits set to 1110 to indicate that it is the first byte of a 3-byte encoding, and the next two bytes have their initial 2 bits set to 10 to indicate that they are continuation bytes. The 16 bits are encoded as follows:

1st byte = 1110wwww where wwww are the most significant 4 bits (bits 15 - 12) of the 16 bits to be encoded
2nd byte = 10xxxxxx where xxxxxx are the next 6 bits (bits 11 - 6) to be encoded
3rd byte = 10yyyyyy where yyyyyy are the remaining 6 bits (bits 5 - 0) to be encoded.

For code points in the fourth range (21 bits) the character is encoded into 4 bytes, where the first byte has its initial 5 bits set to 11110 to indicate that it is the first byte of a 4-byte encoding, and the next three bytes have their initial 2 bits set to 10 to indicate that they are continuation bytes. The 21 bits are encoded as follows:

1st byte = 11110www where www are the most significant 3 bits (bits 20 - 18) of the 21 bits to be encoded
2nd byte = 10xxxxxx where xxxxxx are the next 6 bits (bits 17 - 12) to be encoded
3rd byte = 10yyyyyy where yyyyyy are the next 6 bits (bits 11 - 6) to be encoded
4th byte = 10zzzzzz where zzzzzz are the remaining 6 bits (bits 5 - 0) to be encoded.

This shows how unicode characters are encoded using the standard UTF-8 encoding.
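The four bit layouts above can be written out as a short sketch. The class name Utf8Sketch and the method encode are names chosen for this illustration, not part of any library; the main method compares the hand-written result against the JDK's own UTF-8 encoder as a sanity check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one code point by hand, following the four ranges in the table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };                 // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                   // 110mmmmm
                (byte) (0x80 | (cp & 0x3F))                  // 10nnnnnn
            };
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                  // 1110wwww
                (byte) (0x80 | ((cp >> 6) & 0x3F)),          // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))                  // 10yyyyyy
            };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                  // 11110www
                (byte) (0x80 | ((cp >> 12) & 0x3F)),         // 10xxxxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),          // 10yyyyyy
                (byte) (0x80 | (cp & 0x3F))                  // 10zzzzzz
            };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x905, 0x10400 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```

For the Devanagari letter A (U+0905), which falls in the third range, the sketch produces the three bytes E0 A4 85.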

There is a variation of the standard UTF-8 encoding, called modified UTF-8. This variation is used in Java by the writeUTF and readUTF methods of the DataOutputStream, DataInputStream and RandomAccessFile classes. In this variation the value zero is encoded into 2 bytes using the 11-bit encoding, which results in the 2 bytes 11000000 10000000 (C0 80 hex). This is done to ensure that a byte with all bits set to 0 never appears in the encoded output. The other change concerns the values in the fourth range. A supplementary character is available in Java as 2 char values (a high surrogate followed by a low surrogate), and each of the two surrogates is encoded separately. As a result, each supplementary character is encoded as six bytes (3 bytes for the high surrogate and 3 bytes for the low surrogate).
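The two differences can be observed directly with DataOutputStream.writeUTF. This is a minimal sketch; the class ModifiedUtf8Sketch and its helpers are hypothetical names. Note that writeUTF also writes a 2-byte length before the encoded bytes, which the helper strips so that only the character data is shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ModifiedUtf8Sketch {
    // Returns the bytes written by writeUTF, minus the 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The zero character takes two bytes instead of one.
        System.out.println(hex(modifiedUtf8("\u0000")));        // prints C0 80
        // U+10400 as a surrogate pair: each surrogate takes three bytes.
        System.out.println(hex(modifiedUtf8("\uD801\uDC00")));  // prints ED A0 81 ED B0 80
    }
}
```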

Let us now look at the UTF-16 encoding.
In UTF-16 a Unicode character is encoded as either one or two 16-bit code units, depending on the value of the code point. We know that the range of code points for Unicode characters is from 0 - 10FFFF (hex). The characters in the range 0 - FFFF (hex), which are the characters of the BMP (Basic Multilingual Plane), are encoded into a single 16-bit code unit that holds the value of the character as it is. The supplementary characters, which are in the range 10000 (hex) - 10FFFF (hex), are encoded into two 16-bit values as follows:
The first step is to subtract 10000 (hex) from the value of the code point, which brings the value into the range 0 - FFFFF (hex), a 20-bit number. These 20 bits are then encoded into two 16-bit values as follows:

1st 16-bit = 110110xxxxxxxxxx where xxxxxxxxxx are the most significant 10 bits (bits 19 - 10) of the 20 bits to be encoded.
2nd 16-bit = 110111yyyyyyyyyy where yyyyyyyyyy are the remaining 10 bits (bits 9 - 0) of the 20 bits to be encoded.

So the 1st 16-bit value will be in the range 1101100000000000 - 1101101111111111 (D800 - DBFF), which is the range for high surrogates, and the 2nd 16-bit value will be in the range 1101110000000000 - 1101111111111111 (DC00 - DFFF), which is the range for low surrogates.

This shows how unicode characters are encoded using the UTF-16 encoding.
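The surrogate computation can be sketched in a few lines. Utf16Sketch and encode are hypothetical names for this illustration; the main method checks the result against Character.toChars, which performs the same conversion in the JDK.

```java
import java.util.Arrays;

class Utf16Sketch {
    // Encodes one code point into one or two 16-bit code units.
    static char[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0xFFFF) {
            return new char[] { (char) cp };       // BMP: value kept as it is
        }
        int v = cp - 0x10000;                      // now a 20-bit number
        char high = (char) (0xD800 | (v >> 10));   // 110110 + bits 19 - 10
        char low = (char) (0xDC00 | (v & 0x3FF));  // 110111 + bits 9 - 0
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int[] samples = { 0x905, 0x10400, 0x1D11E, 0x10FFFF };
        for (int cp : samples) {
            char[] units = encode(cp);
            StringBuilder hex = new StringBuilder();
            for (char c : units) {
                hex.append(String.format("%04X ", (int) c));
            }
            System.out.printf("U+%04X -> %s(matches JDK: %b)%n",
                    cp, hex, Arrays.equals(units, Character.toChars(cp)));
        }
    }
}
```

For example, U+1D11E becomes D11E after the subtraction, which splits into the high surrogate D834 and the low surrogate DD1E.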

Wednesday, September 3, 2008

Is Jane an Object or a Class?

Very often, people giving examples of inheritance confuse the relationship between an object and its class with the relationship between a subclass and its superclass. For example, I recently came across an article at http://java.sun.com/developer/technicalArticles/wombat_world/ which gives an example of inheritance as

public class Jane extends Person {...}

The relation between Jane and Person is more naturally seen as Jane being an instance of Person, not as Jane being a subclass of Person, as the example implies. A class is like a template from which instances can be created. A subclass is also a class, and so it is also a template, one which either inherits or overrides the methods defined in the superclass. A better example to show inheritance would be something like

public class ShopKeeper extends Person {...}
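A minimal sketch of the distinction, with made-up fields and methods (name, shopName, describe) that are not taken from the cited article: Jane is created as an instance of Person, while ShopKeeper is a subclass that refines Person.

```java
// Hypothetical Person class used only for this illustration.
class Person {
    private final String name;

    Person(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    String describe() {
        return name + " is a person";
    }
}

// A subclass is itself a template: it inherits getName and overrides describe.
class ShopKeeper extends Person {
    private final String shopName;

    ShopKeeper(String name, String shopName) {
        super(name);
        this.shopName = shopName;
    }

    @Override
    String describe() {
        return getName() + " runs " + shopName;
    }
}

class JaneExample {
    public static void main(String[] args) {
        Person jane = new Person("Jane");                    // an object
        Person sam = new ShopKeeper("Sam", "Corner Store");  // also a Person
        System.out.println(jane.describe());  // prints Jane is a person
        System.out.println(sam.describe());   // prints Sam runs Corner Store
    }
}
```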

Tuesday, September 2, 2008

Object class in Java should have been abstract

The Object class, which is the superclass of all classes in Java, should have been abstract. Going by the principles of inheritance, a subclass has more functionality than its superclass. This does not always hold in Java, since it allows anyone to create an abstract subclass of a concrete class, which means that one cannot create instances of the subclass even though the superclass allows instantiation.

So three changes would be required:

1. An abstract class should not be allowed to subclass a concrete class.
2. Non-abstract methods should not be allowed to be overridden as abstract.
3. The Object class should be abstract.

This is not a problem with Java alone; other OOP languages suffer from this drawback as well.

Whenever I raised this with Java developers, their initial reaction was "how can Object be abstract, since it does not have abstract methods?". Well, an abstract class need not have any abstract methods. Apart from not being directly instantiable, the only difference between abstract and non-abstract classes is that an abstract class can have every kind of member that a non-abstract class can have, and can additionally have abstract methods. An abstract class can have constructors too; they cannot be used directly to create instances of the abstract class, but they are still invoked by the constructors of its subclasses.

The change could easily be made in Java by imposing the restrictions in the compiler. As far as existing code that uses instances of the Object class directly is concerned, it could be updated to use "new Object(){}" instead of "new Object()" wherever it appears.
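The points above can be illustrated with a short sketch of what Java accepts today. All class names here (Shape, Circle, Concrete, AbstractChild, AbstractSketch) are invented for the illustration.

```java
// An abstract class with a constructor and no abstract methods.
abstract class Shape {
    private final String label;

    protected Shape(String label) {   // invoked only through subclass constructors
        this.label = label;
    }

    String label() {
        return label;
    }
}

class Circle extends Shape {
    Circle() {
        super("circle");
    }
}

// A concrete class that can be instantiated.
class Concrete {
    String greet() {
        return "hello";
    }
}

// Java currently allows both things the post argues against: an abstract
// subclass of a concrete class, and an abstract override of a concrete method.
abstract class AbstractChild extends Concrete {
    @Override
    abstract String greet();
}

class AbstractSketch {
    // The suggested replacement for "new Object()": an anonymous subclass.
    static Object newLock() {
        return new Object() {};
    }

    public static void main(String[] args) {
        System.out.println(new Circle().label());                   // prints circle
        System.out.println(newLock().getClass() == Object.class);   // prints false
    }
}
```

The anonymous-subclass form keeps working whether or not Object is abstract, which is why it is a workable migration path for code that only needs a plain object, for example as a lock.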

Monday, September 1, 2008

Worthiness of Java Certification

I have been conducting technical interviews for a software company in Vadodara. While conducting the interviews, I found that there are a lot of experienced Java programmers who have good scores in the Java certification exam (SCJP) but do not know the basics. A few of them, when questioned about their certification, simply replied that they had prepared from the Kathy Sierra and Bert Bates book and that this was the secret of their good score. Most of them failed to answer questions about checked and unchecked exceptions, and had no clue about threads. They also had wrong ideas about the use of the synchronized keyword. What is the use of such a certification if it can be obtained by just memorizing the questions and answers from a book?
