An introduction to optimising a hashing strategy
Overview
The strategy that's used for hashing keys, can have a direct impact on the performance of a hashed collections such as a HashMap or HashSet.The builtin hashing functions are designed to be generic and work well in a wide range of use cases. Can we do better, especially if you have a good idea of the use case?
Testing a hashing strategy
In a previous article I looked at a number of ways to test hashing strategies and in particular looked at a hashing strategy which had been optimised for "Orthogonal Bits" which looked at making sure each hash result was as different as possible based on just one bit changing.However, if you have a known set of elements/keys to hash you can optimise for that specific use case, rather trying to find a generic solution.
Minimising Collisions
One of the main things you want to avoid in a hashed collection is collisions. This is when two or more keys map to the same bucket. These collisions mean you have to do more work to check the key is the one you expected as there is now multiple keys in the same bucket. Ideally there is at most 1 key in each bucket.
I just need unique hash codes don't I?
A common misconception is that to avoid collisions all you need to have unique hash code. While unique hash codes is highly desirable, it is not enough.
Say you have a set of keys and all of them have unique 32bit hash codes. If you then have an array of 4 billion buckets, each key will have its own bucket, and there is no collisions. It is generally undesirable to have such large arrays for all hash collections. In fact HashMap and HashSet are limited by the largest power of 2 size you can have for an array which is 2^30 or just over one billion.
What happens when you have a more realistically sized hashed collection? The number of buckets needs to be smaller and the hash codes are moduloed to the number of buckets. If the number of buckets is a power of two, you can use a mask of the lowest bits.
Lets look at an example, ftse350.csv If we take the first column as a key or element, we get 352 strings. These strings have unique String.hashCode()s, but say we take the lower bits of these hash code. Do we see collisions?
Mask  String.hashCode() masked  HashMap.hash(
String.hashCode()) masked 
32 bits  No collisions  No collisions 
16 bits  1 collision  3 collisions 
15 bits  2 collisions  4 collisions 
14 bits  6 collisions  6 collisions 
13 bits  11 collisions  9 collisions 
12 bits  17 collisions  15 collisions 
11 bits  29 collisions  25 collisions 
10 bits  57 collisions  50 collisions 
9 bits  103 collisions  92 collisions 
The size of the HashMap for a load factor of 0.7 (the default) is 512 which uses a mask of the lower 9 bits. As you can see around 30% of keys have a collision even though we started with unique hash codes.
The code for HashTesterMain is here.
To reduce the impact of a poor hashing strategy, the HashMap uses an agitating function. In Java 8 it is fairly simple.
From the source for HashMap.hash You can read the Javadoc for more details
static final int hash(Object key) { int h; return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16); }
This mixes the high bits of the hash code with the low bits, to improve the randomness of the lower bits. For the case above where there is a high collision rate, there is an improvement. See the third column.
A look at the hash function for String
The code for String.hashCode()
public int hashCode() { int h = hash; if (h == 0 && value.length > 0) { char val[] = value; for (int i = 0; i < value.length; i++) { h = 31 * h + val[i]; } hash = h; } return h; }
Note: the implementation for String is defined in the Javadoc so there is little chance we can change it but we could define a new hashing strategy.
Components of a hashing strategy.
There is two parts I look at in a hashing strategy.
 The magic numbers. You can try different numbers to find the best result.
 The structure of the code. You want a structure where you get a good result for any sane choice of magic number.
While magic numbers matter, the reason you don't want them to be too important is that there is always a chance that your choice of magic number wasn't right for a given use case. This is why you also want a code structure which has a low worst case outcome even for a poorly chosen magic number.
Lets try some different multiplying factors instead of 31.
Multiplier  Collisions 
1

230

2

167

3

113

4

99

5

105

6

102

7

93

8

90

9

100

10

91

11

91

You can see that the choice of a magic number matters, but also there is lots of numbers to try. We need to write a test to try out a good random selection. The source for HashSearchMain
Hash function  Best multiplier  Lowest collisions  Worst multiplier  Highest Collisions 
hash() 
130795

81 collisions 
126975

250 collisions 
xorShift16(hash()) 
2104137237

68 collisions 
1207975937

237 collisions 
addShift16(hash()) 
805603055

68 collisions 
1040130049

243 collisions 
xorShift16n9(hash()) 
841248317

69 collisions 
467648511

177 collisions 
The key code to look at is
public static int hash(String s, int multiplier) { int h = 0; for (int i = 0; i < s.length(); i++) { h = multiplier * h + s.charAt(i); } return h; } private static int xorShift16(int hash) { return hash ^ (hash >> 16); } private static int addShift16(int hash) { return hash + (hash >> 16); } private static int xorShift16n9(int hash) { hash ^= (hash >>> 16); hash ^= (hash >>> 9); return hash; }
As you can see the repeated multiplication of each hash plus the next character is reasonable if you provide a good multiplier, or a multiplier which happens to work well with your key set. If you compare 130795 as a multiplier instead of 31, you get only 81 collisions instead of 103 collisions for the key set tested.
If you use the agitation function as well you can get around 68 collisions. This is getting close to the same collision rate as doubling the size of the array. i.e. an improved collision rate without using more memory.
But what happens when we add new keys to the hash collection, will our magic number still be good for us? This is where I look at the worst collision rates to determine which structure is likely to produce good results for a wider range of possible inputs. The worst case for hash() is 250 collisions, That is 70% of keys colliding which is pretty bad. The agitation function improves this a little however it's still not great. Note: if we add the shifted value instead of xoring it we get a worse result in this case.
However if we do two shifts, to mix not just the top and bottom bits, but bits from four different portions of the hash code generated, we find that the worst case collision rate is much lower. This indicates to me that should the selection of keys change, we are less likely to get a bad result as the structure is better and the choice of magic number or choice of inputs matters less.
What if we have add instead of xor in the hash function?
In the agitation function using xor was perhaps better than using add. What happens if we change this
h = multiplier * h + s.charAt(i);
with
h = multiplier * h ^ s.charAt(i);
Hash function  Best multiplier  Lowest collisions  Worst Score  Highest Collisions 
hash() 
1724087

78 collisions 
247297

285 collisions 
xorShift16(hash()) 
701377257

68 collisions 
369082367

271 collisions 
addShift16(hash()) 
1537823509

67 collisions 
1409310719

290 collisions 
xorShift16n9(hash()) 
1638982843

68 collisions 
1210040321

206 collisions 
The best case numbers are slightly better, however the worst case collision rate are notably higher. This indicates to me that the choice of magic number matters more, but it also means that choice of keys will matter more. This would seem a risky choice as you have to consider that the keys may change over time.
Why do we chose odd multipliers?
When you multiply by an odd number the lower bit of the result has an equal chance of being 0 or 1. This is because 0 * 1 = 0 and 1 * 1 = 1. However, if you multiply by an even number the lower bit is always going to 0. i.e. it is no longer random. Say we repeat the earlier test but only using even numbers, how does this look?
Hash function  Best multiplier  Lowest collisions  Worst Score  Highest Collisions 
hash() 
82598

81 collisions 
290816

325 collisions 
xorShift16(hash()) 
1294373564

68 collisions 
1912651776

301 collisions 
addShift16(hash()) 
448521724

69 collisions 
872472576

306 collisions 
xorShift16n9(hash()) 
1159351160

66 collisions 
721551872

212 collisions 
If you are lucky and have the right input for your magic number the results are just as good as for odd numbers, however if you are unlucky, the results can be pretty bad. 325 collisions means that only 27 out of 512 buckets are being used.
How do more advanced hashing strategies differ?
For the hashing strategies we use based on City, Murmur, XXHash and Vanilla Hash (our own)
 The hashing strategy reads 64bits at a time which is faster than reading bytebybyte.
 The working value computed is two 64bit values.
 The working value is reduced to a 64bit long.
 More multiplying constants are used as a result.
 The agitation function is more complex.
We use long hash codes in our implementation as;
 we optimise for 64bit processors,
 the longest primitive data type is 64bit in Java, and
 if you have large hash collections (i.e. millions) 32bit hashes are unlikely to be unique.
In summary
By exploring how we generate the hash code, we have found ways to reduce the number of collisions for 352 keys down from 103 collisions to 68 collisions but also have some confidence than should the key set change, we have reduced the impact this might have had.
This is without using more memory, or even much more processing power.
We still have the option of utilising more memory.
For comparison, you can see that doubling the size of the array can improve the best case, but you still have the problem that a miss match between the key set and the magic number can still have a high collision rate.
Hash function  Best multiplier  Lowest collisions  Worst Score  Highest Collisions 
hash() 
2924091

37 collisions 
117759

250 collisions 
xorShift16(hash()) 
543157075

25 collisions 
 469729279

237 collisions 
addShift16(hash()) 
1843751569

25 collisions 
 1501097607

205 collisions 
xorShift16n9(hash()) 
2109862879

27 collisions 
2082455553

172 collisions 
Conclusion
In situations where you have a stable key set you can get a significant improvement in the rate of collisions by tuning the hashing strategy used.
You also need tests which indicate how bad things are likely to get if the key set changes without reoptimisation.
Using these two in combination you can develop new hashing strategies to improve performance without having to use more memory or much more CPU.
Interesting study, thanks Peter.
ReplyDelete"325 collisions means that only 27 out of 512 buckets are being used". Surely it means only 27 buckets have a single entry? For only 27 to be used, all collisions must hash to the same bucket.
Also pertinent is the "Birthday Problem", which led me to this: http://matt.might.net/articles/countinghashcollisions/. Assuming a good hash is equally likely to assign an element to any of the buckets, with 352 elements and 512 buckets you might expect 352  512 + 512.(511/512)^352 = 97 collisions. You've done better than this by playing with a fixed input set. With a larger unknown set, I'd say its a good check that a hash function under tests that generates around this many collisions.
It means that 325 keys are in the same bucket as another keys. It could be that 26 buckets have 1 keys and 1 has all the others. In any case, it is a poor utilisation.
DeleteGreat Article on Hashing
ReplyDeleteOnline Java Training