
Friday, September 5, 2008

Non-English letters in Java source code

Rules for defining identifiers in Java:
1. An identifier is composed of a sequence of letters, digits, the currency symbol $ and the separator character _.
2. An identifier cannot start with a digit.
3. An identifier cannot be a reserved word of the Java language.
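The three rules above can be checked in code. What follows is only a minimal sketch: the class name IdentifierRules and the method isValidIdentifier are made up for illustration. It relies on the JDK methods Character.isJavaIdentifierStart, Character.isJavaIdentifierPart and javax.lang.model.SourceVersion.isKeyword, which accept a slightly broader set of characters than the short list in rule 1.

```java
import javax.lang.model.SourceVersion;

final class IdentifierRules {
    /** Hypothetical helper: true if s satisfies the three rules listed above. */
    static boolean isValidIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        // Rule 2: the first character may be a letter, $ or _, but never a digit.
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Rule 1: every later character must be acceptable inside an identifier
        // (letters, digits, $ and _, using the JDK's own Unicode-aware test).
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        // Rule 3: reserved words (and the literals true, false, null) are excluded.
        return !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        System.out.println(isValidIdentifier("total_$1"));     // true
        System.out.println(isValidIdentifier("1total"));       // false: starts with a digit
        System.out.println(isValidIdentifier("class"));        // false: reserved word
        System.out.println(isValidIdentifier("\u0905\u0967")); // true: Devanagari letter and digit
    }
}
```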

These rules are much the same as in the C programming language. The difference lies in the character set used by the language. Java uses the Unicode character set, and hence the letters and digits in an identifier are not restricted to those of the ASCII character set traditionally used in C. In Java, letters and digits from any language can be used; for example, anyone can use अ (the Devanagari letter A, Unicode code point U+0905) in an identifier.

The question arises of how to use such characters in Java source code when the editor supports only ASCII files. In Java source code, a Unicode escape can stand for any Unicode character. A Unicode escape is written as \uhhhh, where hhhh are the four hexadecimal digits of a UTF-16 code unit. The Java compiler translates these Unicode escapes first, before it splits the source code into lines and tokens. The following are a few interesting examples:

char \u0905 = '\u0905'; is valid code: the variable named अ (Devanagari letter A) is assigned the value 0905 (hex).
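To try this out in a pure ASCII file, something like the following small class should do. The class and method names are invented for the illustration; only the declaration itself comes from the example above.

```java
final class UnicodeIdentifierDemo {
    static int valueOfEscapedVariable() {
        // Both the variable name and the character literal are written as the
        // escape for U+0905; the compiler translates them before tokenizing.
        char \u0905 = '\u0905';
        return \u0905; // reads the variable whose name is the Devanagari letter A
    }

    public static void main(String[] args) {
        // Prints 905, the hexadecimal value stored in the variable.
        System.out.println(Integer.toHexString(valueOfEscapedVariable()));
    }
}
```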

char ch = '\u000A'; is not valid, since the Unicode escape \u000A is the newline character and the compiler would see the statement as follows:

char ch = '
';

This causes a compilation error. In the same way, anyone can easily obfuscate source code by writing Unicode escapes in place of the space and semicolon characters, which makes the code very hard to read.
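Both points can be seen in one small sketch (the class and method names are made up for the illustration). The first method uses the ordinary escape sequence for a line feed, which is safe because it is interpreted only when the literal is read. The second writes every space and semicolon of two declarations as a Unicode escape, and it still compiles.

```java
final class EscapeTranslationDemo {
    static int lineFeed() {
        // The backslash-n escape sequence is handled when the literal itself is
        // read, so it works where the Unicode escape for U+000A does not.
        char ch = '\n';
        return ch;
    }

    static int obfuscatedSum() {
        // Each space is written as the escape for U+0020 and each semicolon as
        // the escape for U+003B. The compiler sees: int a = 2; int b = 3;
        int\u0020a\u0020=\u00202\u003B\u0020int\u0020b\u0020=\u00203\u003B
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(lineFeed());      // 10, i.e. 000A in hex
        System.out.println(obfuscatedSum()); // 5
    }
}
```

One caution when experimenting: Unicode escapes are translated inside comments too, so a line-feed escape placed in a // comment ends that comment early.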

Currently, most programmers use letters and digits from the English language when defining identifiers. It may be worth trying to create obfuscators that rename identifiers using non-English letters and digits.
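As a rough idea of what the renaming step of such an obfuscator might look like, here is a sketch that produces a distinct identifier made only of Devanagari letters for each index. Everything in it (the class DevanagariNames, the method nameFor, the choice of the range U+0905 to U+0939) is an assumption made for the illustration and is not taken from any existing obfuscator.

```java
final class DevanagariNames {
    // Assumed alphabet: the Devanagari letters from U+0905 to U+0939.
    static final char FIRST = '\u0905';
    static final char LAST = '\u0939';
    static final int COUNT = LAST - FIRST + 1; // 53 letters

    /** Hypothetical helper: maps 0, 1, 2, ... to distinct letter-only names. */
    static String nameFor(int index) {
        if (index < 0) {
            throw new IllegalArgumentException("index must be non-negative");
        }
        StringBuilder sb = new StringBuilder();
        int n = index;
        // Same scheme as spreadsheet column names (A..Z, AA, AB, ...),
        // but with 53 Devanagari letters instead of 26 Latin ones.
        do {
            sb.append((char) (FIRST + n % COUNT));
            n = n / COUNT - 1;
        } while (n >= 0);
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        for (int i : new int[] {0, 1, 52, 53}) {
            String name = nameFor(i);
            // Every generated name starts with a character Java accepts there.
            System.out.println(i + " -> length " + name.length() + ", valid start: "
                    + Character.isJavaIdentifierStart(name.charAt(0)));
        }
    }
}
```

A real obfuscator would also have to rewrite every use of each renamed identifier consistently; this sketch covers only the generation of names.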

2 comments:

Chandrakant said...

Sir,

Does this mean that someone who is keen to protect his/her code from pirates who reverse engineer it should use the Unicode escapes for space and semicolon characters?
"anyone can easily obfuscate the source code by using the unicode escapes for space and semicolon characters, this will make the code totally unreadable."

Pravin Jain said...

No, this will not protect against reverse engineering, since when the code is decompiled using a decompiler, the original spaces and semicolons will appear in the decompiled code.
However, most of the obfuscators available for protecting code replace identifiers with letters from the Latin character set. It is only a suggestion that these obfuscators need not restrict themselves to Latin characters and may use letters and digits from other languages as well.