An introduction to optimising a hashing strategy

Overview

The strategy that's used for hashing keys, can have a direct impact on the performance of a hashed collections such as a HashMap or HashSet.

The built-in hashing functions are designed to be generic and work well in a wide range of use cases.  Can we do better, especially if you have a good idea of the use case?

Testing a hashing strategy

In a previous article I looked at a number of ways to test hashing strategies and in particular looked at a hashing strategy which had been optimised for "Orthogonal Bits" which looked at making sure each hash result was as different as possible based on just one bit changing.

However, if you have a known set of elements/keys to hash you can optimise for that specific use case, rather trying to find a generic solution.

Minimising Collisions

One of the main things you want to avoid in a hashed collection is collisions.  This is when two or more keys map to the same bucket.  These collisions mean you have to do more work to check the key is the one you expected as there is now multiple keys in the same bucket.  Ideally there is at most 1 key in each bucket.

I just need unique hash codes don't I?

A common misconception is that to avoid collisions all you need to have unique hash code.  While unique hash codes is highly desirable, it is not enough.

Say you have a set of keys and all of them have unique 32-bit hash codes.  If you then have an array of 4 billion buckets, each key will have its own bucket, and there is no collisions.  It is generally undesirable to have such large arrays for all hash collections. In fact HashMap and HashSet are limited by the largest power of 2 size you can have for an array which is 2^30 or just over one billion.

What happens when you have a more realistically sized hashed collection? The number of buckets needs to be smaller and the hash codes are modulo-ed to the number of buckets. If the number of buckets is a power of two, you can use a mask of the lowest bits.

Lets look at an example, ftse350.csv If we take the first column as a key or element, we get 352 strings.  These strings have unique String.hashCode()s, but say we take the lower bits of these hash code. Do we see collisions?

Mask String.hashCode() masked HashMap.hash(
String.hashCode()) masked
32 bits No collisions No collisions
16 bits 1 collision 3 collisions
15 bits 2 collisions 4 collisions
14 bits 6 collisions 6 collisions
13 bits 11 collisions 9 collisions
12 bits 17 collisions 15 collisions
11 bits 29 collisions 25 collisions
10 bits 57 collisions 50 collisions
9 bits 103 collisions 92 collisions

The size of the HashMap for a load factor of 0.7 (the default) is 512 which uses a mask of the lower 9 bits.  As you can see around 30% of keys have a collision even though we started with unique hash codes.


To reduce the impact of a poor hashing strategy, the HashMap uses an agitating function. In Java 8 it is fairly simple.

From the source for HashMap.hash You can read the Javadoc for more details

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This mixes the high bits of the hash code with the low bits, to improve the randomness of the lower bits.  For the case above where there is a high collision rate, there is an improvement. See the third column.

A look at the hash function for String

The code for String.hashCode()

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

Note: the implementation for String is defined in the Javadoc so there is little chance we can change it but we could define a new hashing strategy.

Components of a hashing strategy.

There is two parts I look at in a hashing strategy.
  • The magic numbers.  You can try different numbers to find the best result.
  • The structure of the code.  You want a structure where you get a good result for any sane choice of magic number.
While magic numbers matter, the reason you don't want them to be too important is that there is always a chance that your choice of magic number wasn't right for a given use case.  This is why you also want a code structure which has a low worst case outcome even for a poorly chosen magic number.

Lets try some different multiplying factors instead of 31.

Multiplier Collisions
1
230
2
167
3
113
4
99
5
105
6
102
7
93
8
90
9
100
10
91
11
91

You can see that the choice of a magic number matters, but also there is lots of numbers to try. We need to write a test to try out a good random selection. The source for HashSearchMain

Hash function Best multiplier Lowest collisions Worst multiplier Highest Collisions
hash()
130795
81 collisions
126975
250 collisions
xorShift16(hash())
2104137237
68 collisions
-1207975937
237 collisions
addShift16(hash())
805603055
68 collisions
-1040130049
243 collisions
xorShift16n9(hash())
841248317
69 collisions
467648511
177 collisions

The key code to look at is

public static int hash(String s, int multiplier) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = multiplier * h + s.charAt(i);
    }
    return h;
}

private static int xorShift16(int hash) {
    return hash ^ (hash >> 16);
}

private static int addShift16(int hash) {
    return hash + (hash >> 16);
}

private static int xorShift16n9(int hash) {
    hash ^= (hash >>> 16);
    hash ^= (hash >>> 9);
    return hash;
}

As you can see the repeated multiplication of each hash plus the next character is reasonable if you provide a good multiplier, or a multiplier which happens to work well with your key set. If you compare 130795 as a multiplier instead of 31, you get only 81 collisions instead of 103 collisions for the key set tested.

If you use the agitation function as well you can get around 68 collisions.  This is getting close to the same collision rate as doubling the size of the array. i.e. an improved collision rate without using more memory.

But what happens when we add new keys to the hash collection, will our magic number still be good for us?  This is where I look at the worst collision rates to determine which structure is likely to produce good results for a wider range of possible inputs.  The worst case for hash() is 250 collisions, That is 70% of keys colliding which is pretty bad.  The agitation function improves this a little however it's still not great.  Note: if we add the shifted value instead of xor-ing it we get a worse result in this case.

However if we do two shifts, to mix not just the top and bottom bits, but bits from four different portions of the hash code generated, we find that the worst case collision rate is much lower.  This indicates to me that should the selection of keys change, we are less likely to get a bad result as the structure is better and the choice of magic number or choice of inputs matters less.

What if we have add instead of xor in the hash function?

In the agitation function using xor was perhaps better than using add. What happens if we change this

h = multiplier * h + s.charAt(i);


with


h = multiplier * h ^ s.charAt(i);

Hash function Best multiplier Lowest collisions Worst Score Highest Collisions
hash()
1724087
78 collisions
247297
285 collisions
xorShift16(hash())
701377257
68 collisions
-369082367
271 collisions
addShift16(hash())
-1537823509
67 collisions
-1409310719
290 collisions
xorShift16n9(hash())
1638982843
68 collisions
1210040321
206 collisions

The best case numbers are slightly better, however the worst case collision rate are notably higher. This indicates to me that the choice of magic number matters more, but it also means that choice of keys will matter more.  This would seem a risky choice as you have to consider that the keys may change over time.

Why do we chose odd multipliers?

When you multiply by an odd number the lower bit of the result has an equal chance of being 0 or 1. This is because 0 * 1 = 0 and 1 * 1 = 1.  However, if you multiply by an even number the lower bit is always going to 0. i.e. it is no longer random. Say we repeat the earlier test but only using even numbers, how does this look?

Hash function Best multiplier Lowest collisions Worst Score Highest Collisions
hash()
82598
81 collisions
290816
325 collisions
xorShift16(hash())
1294373564
68 collisions
1912651776
301 collisions
addShift16(hash())
448521724
69 collisions
872472576
306 collisions
xorShift16n9(hash())
1159351160
66 collisions
721551872
212 collisions

If you are lucky and have the right input for your magic number the results are just as good as for odd numbers, however if you are unlucky, the results can be pretty bad. 325 collisions means that only 27 out of 512 buckets are being used.

How do more advanced hashing strategies differ?

For the hashing strategies we use based on City, Murmur, XXHash and Vanilla Hash (our own)
  • The hashing strategy reads 64-bits at a time which is faster than reading byte-by-byte.
  • The working value computed is two 64-bit values.
  • The working value is reduced to a 64-bit long.
  • More multiplying constants are used as a result.
  • The agitation function is more complex.
We use long hash codes in our implementation as;
  • we optimise for 64-bit processors, 
  • the longest primitive data type is 64-bit in Java, and 
  • if you have large hash collections (i.e. millions) 32-bit hashes are unlikely to be unique.

In summary

By exploring how we generate the hash code, we have found ways to reduce the number of collisions for 352 keys down from 103 collisions to 68 collisions but also have some confidence than should the key set change, we have reduced the impact this might have had.

This is without using more memory, or even much more processing power.
We still have the option of utilising more memory.

For comparison, you can see that doubling the size of the array can improve the best case, but you still have the problem that a miss match between the key set and the magic number can still have a high collision rate.

Hash function Best multiplier Lowest collisions Worst Score Highest Collisions
hash()
2924091
37 collisions
117759
250 collisions
xorShift16(hash())
543157075
25 collisions
- 469729279
237 collisions
addShift16(hash())
-1843751569
25 collisions
- 1501097607
205 collisions
xorShift16n9(hash())
-2109862879
27 collisions
-2082455553
172 collisions

Conclusion

In situations where you have a stable key set you can get a significant improvement in the rate of collisions by tuning the hashing strategy used.  
You also need tests which indicate how bad things are likely to get if the key set changes without re-optimisation. 
Using these two in combination you can develop new hashing strategies to improve performance without having to use more memory or much more CPU.




Comments

  1. Interesting study, thanks Peter.

    "325 collisions means that only 27 out of 512 buckets are being used". Surely it means only 27 buckets have a single entry? For only 27 to be used, all collisions must hash to the same bucket.


    Also pertinent is the "Birthday Problem", which led me to this: http://matt.might.net/articles/counting-hash-collisions/. Assuming a good hash is equally likely to assign an element to any of the buckets, with 352 elements and 512 buckets you might expect 352 - 512 + 512.(511/512)^352 = 97 collisions. You've done better than this by playing with a fixed input set. With a larger unknown set, I'd say its a good check that a hash function under tests that generates around this many collisions.

    ReplyDelete
    Replies
    1. It means that 325 keys are in the same bucket as another keys. It could be that 26 buckets have 1 keys and 1 has all the others. In any case, it is a poor utilisation.

      Delete

Post a Comment

Popular posts from this blog

Java is Very Fast, If You Don’t Create Many Objects

System wide unique nanosecond timestamps

Comparing Approaches to Durability in Low Latency Messaging Queues