Hashcode Of A String In Java

Many of the Java programmers know what 'Hashcode' means, but don't really know how exactly it is calculated and why 31 is used to calculate the hashcode. Below is the code snippet from Java 1.6, which calculates the hashcode for a string:

public int hashCode() {
int h = hash;
if (h == 0) {
int off = offset;
char val[] = value;
int len = count;

for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}

return h;
}
Even if someone knows why 31 is used, there is a lot of stuff to know about 'Hashing', 'Hash Collisions' and multiple algorithms related to calculating hash values. First off, its a known fact that there is no perfect hashing algorithm, for which there are no collisions. But there are several algorithms, which minimize the collisions and are good enough to use. Now, coming to why 31 is used in calculating hashcode, this is the reason given by Joshua Bloch, in the book 'Effective Java':
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
This wasn't sufficient for me, to understand why 31 is used. I did a bit of research and found some good links, providing some info about why 31 is used. Here are some links with very good info: Stack Overflow - Why does java hashcode use 31 as a multiplier Apart from the above link, there are a couple of other links with pretty good info about hashing, hash collisions and performance of hashing algorithms: Consistency Of Hashcode On A Java String Best Hashing Algos, In Terms Of Collisions and Performance Java Array Hashcode Implementation

After reading up a bit, I wrote a sample test Java program, to find the hashcode of a string by multiplying by 31 (which is the same as shifting left (bitwise) by 5 times and subtracting, as in (i << 5) - i). Below is the sample test program:
public class TestHash {
public static void main(String[] args) {
String str1 = "What the heck?";

int hashcode1 = 0;
int hashcode2 = 0;

for(int i=0;i<str1.length();i++) {
hashcode1 = 31*hashcode1 + str1.charAt(i);
hashcode2 = (hashcode2 << 5) - hashcode2 + str1.charAt(i);
}

System.out.println("Hashcode1 : " + hashcode1);
System.out.println("Hashcode2 : " + hashcode2);
}
}


The output for this program is:

Hashcode1 : 277800975
Hashcode2 : 277800975


Apart from the above info, I want to share some info from the recent article on java.sun.com, written by Joseph Darcy. It was an interesting case of 'Unhashing' - Reverse Engineering Hashcode, To Find A String That Collides With The Actual String. This was an interesting way of finding 'Hash Collisions'. I then tested the code from Joseph Darcy, by writing a sample program, as below:

public class TestHash {
/**
* @author - Babji P, Chetty
*/
public static void main(String[] args) {
String str1 = "what the heck?";

int hashcode1 = 0;
int hashcode2 = 0;

for(int i=0;i<str1.length();i++) {
hashcode1 = 31*hashcode1 + str1.charAt(i);
hashcode2 = (hashcode2 << 5) - hashcode2 + str1.charAt(i);
}

System.out.println("Hashcode1 : " + hashcode1);
System.out.println("Hashcode2 : " + hashcode2);

String str2 = unhash(hashcode1);
System.out.println("Unhashed String From Hashcode : " + str2);

int hashcode3 = 0;
int hashcode4 = 0;

for(int i=0;i<str2.length();i++) {
hashcode3 = 31*hashcode3 + str2.charAt(i);
hashcode4 = (hashcode4 << 5) - hashcode4 + str2.charAt(i);
}

System.out.println("Hashcode3 : " + hashcode3);
System.out.println("Hashcode4 : " + hashcode4);

}

/**
* Returns a string with a hash equal to the argument.
* @return string with a hash equal to the argument.
* @author - Joseph Darcy
*/
public static String unhash(int target) {
StringBuilder answer = new StringBuilder();
if (target < 0) {
// String with hash of Integer.MIN_VALUE, 0x80000000
answer.append("\u0915\u0009\u001e\u000c\u0002");

if (target == Integer.MIN_VALUE)
return answer.toString();
// Find target without sign bit set
target = target & Integer.MAX_VALUE;
}

unhash0(answer, target);
return answer.toString();
}

/**
* 
* @author - Joseph Darcy
*/
private static void unhash0(StringBuilder partial, int target) {
int div = target / 31;
int rem = target % 31;

if (div <= Character.MAX_VALUE) {
if (div != 0)
partial.append((char)div);
partial.append((char)rem);
} else {
unhash0(partial, div);
partial.append((char)rem);
}
}
}



The output for the above program is:

Hashcode1 : 1279794159
Hashcode2 : 1279794159
Unhashed String From Hashcode : ?☻§◄
Hashcode3 : 1279794159
Hashcode4 : 1279794159


So the strings "what the heck?" and "?☻§◄" have the same hashcode. What the heck?!?!

6 comments:

  1. The program is good but how do we get back the orignal string "what the heck?" from the hash code 1279794159

    ReplyDelete
  2. There's no way to get back the original string. That's the mind of an hash code...

    You only can find collisions in hash functions, so, two strings will follow in the same hash code, such as 1279794159 in the example above.

    ReplyDelete
  3. 31*i = (32-1)*i = ((2^5)-1)*i = (i<<5) - i

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. brother,can you explain about relation in between hashcode and byte array hashcode,i wanna convert byte array hashcode to vb6,i have vb6 function for hashcode,so i think if u explain about byte array hashcode output process it may helpful for me

    ReplyDelete

Note: Only a member of this blog may post a comment.