Order of elements in a hash collection

Overview

While is it generally understood that keys or elements in a HashMap or HashSet occur in a pseudo random order, what is not obvious is that two collections with the same elements can be in different orders. This is because the capacity of the collection also determines the order the elements appear.

This can be important if you have only an Iterator to a Set. It is easy to assume the order will be the same and most of the time it will work (you can write unit tests with this assumption which will pass) However, if the Set has a different capacity (something you don't normally know or worry about) the order will change.

Find what order elements can appear in

The following test adds the same 11 elements to a HashSet, each time with a different collection and prints out all the combinations it finds.

@Test
public void testSetOrder() {
  Set<String> order = new HashSet<String>();
  Collection<String> elements = Arrays.asList(
                     "zero, one, two, three, four, five, six, seven, eight, nine, ten".split(", "));

  try {
  for(int i=1;i > 0;i*=2) {
    Set set = new HashSet<String>(i, 100.0f);
    set.addAll(elements);
    String str = set.toString();
    if (order.add(str))
      System.out.println("HashSet("+i+") order was "+str);
  }
  } catch(OutOfMemoryError ignored) { }
}

prints this output

HashSet(1) order was [ten, nine, eight, seven, six, five, four, three, two, one, zero]
HashSet(2) order was [eight, three, zero, ten, nine, seven, six, five, four, two, one]
HashSet(4) order was [three, zero, ten, two, eight, nine, seven, six, five, four, one]
HashSet(8) order was [three, ten, two, seven, five, four, zero, eight, nine, six, one]
HashSet(16) order was [ten, two, seven, five, nine, one, three, four, zero, eight, six]
HashSet(32) order was [ten, five, nine, one, zero, eight, six, two, seven, three, four]
HashSet(64) order was [ten, nine, one, zero, eight, six, three, four, five, two, seven]
HashSet(128) order was [ten, nine, zero, eight, three, one, six, four, five, two, seven]
HashSet(256) order was [nine, zero, six, ten, eight, three, one, four, five, two, seven]
HashSet(512) order was [six, four, five, seven, nine, zero, ten, eight, three, one, two]
HashSet(1024) order was [six, five, nine, eight, two, four, seven, zero, ten, three, one]
HashSet(2048) order was [eight, seven, zero, six, five, nine, two, four, ten, three, one]
HashSet(4096) order was [eight, seven, zero, six, five, nine, three, one, two, four, ten]
HashSet(8192) order was [eight, seven, zero, six, nine, four, five, three, one, two, ten]
HashSet(16384) order was [eight, zero, nine, five, three, two, ten, seven, six, four, one]
HashSet(32768) order was [zero, six, one, eight, nine, five, three, two, ten, seven, four]
HashSet(65536) order was [zero, eight, five, seven, four, six, one, nine, three, two, ten]
HashSet(131072) order was [nine, zero, eight, five, seven, four, six, one, three, two, ten]
HashSet(262144) order was [nine, five, seven, six, one, two, ten, zero, eight, four, three]
HashSet(524288) order was [nine, seven, six, one, two, ten, zero, four, five, eight, three]
HashSet(1048576) order was [nine, seven, six, one, two, ten, four, eight, three, zero, five]
HashSet(2097152) order was [seven, six, one, two, ten, five, nine, four, eight, three, zero]
HashSet(4194304) order was [six, one, two, ten, eight, seven, five, nine, four, three, zero]
HashSet(8388608) order was [six, one, two, ten, eight, five, nine, four, zero, seven, three]
HashSet(16777216) order was [six, one, two, ten, five, nine, four, zero, eight, seven, three]
HashSet(33554432) order was [six, one, two, ten, five, nine, four, zero, seven, three, eight]

As you can see almost every initial capacity has a different order (even for a relatively small number of elements)

Single note: with a capacity of 1 and a large load factor (to prevent it resizing the capacity), the HashSet becomes a linked list which shows object in reverse order. i.e. as there is 100% collisions.


As @Alex Zhang, notes, if you have small positive numbers (Byte, Short, Integer, Long) and you allow the capacity to grow naturally, you will get those numbers in order. However if you vary this a little the pattern breaks up.
Collection<Integer> col = Arrays.asList(-1, 1, 10, 100, 1000, 10000, 100000, 1000000);
for (int i = 1; i < 5000000; i *= 2) {
    Set<Integer> set = new HashSet<Integer>(i, 100f);
    set.addAll(col);
    System.out.println("HashSet(" + i + ") " + set);
}
prints
HashSet(1) [1000000, 100000, 10000, 1000, 100, 10, 1, -1]
HashSet(2) [1000000, 100000, 100, 10, 10000, 1000, 1, -1]
HashSet(4) [10000, 1000, 1, 1000000, 100000, 100, 10, -1]
HashSet(8) [1000, 1, 1000000, 100, 10, 10000, 100000, -1]
HashSet(16) [1000, 1, 100, 1000000, 10, 10000, 100000, -1]
HashSet(32) [1, 100, 10, 10000, 1000, 1000000, 100000, -1]
HashSet(64) [1, 10, 1000, 1000000, 100000, -1, 100, 10000]
HashSet(128) [1, 10, 1000000, -1, 10000, 1000, 100000, 100]
HashSet(256) [1, 10, 1000000, -1, 10000, 100, 1000, 100000]
HashSet(512) [1, 10, 1000000, 100, -1, 10000, 1000, 100000]
HashSet(1024) [1, 10, 1000000, 100, 10000, 100000, -1, 1000]
HashSet(2048) [1, 10, 1000000, 100, 1000, 10000, 100000, -1]
HashSet(4096) [1, 10, 100, 1000, 10000, 1000000, 100000, -1]
HashSet(8192) [1, 10, 100, 1000, 10000, 1000000, -1, 100000]
HashSet(16384) [1, 10, 100, 1000, 100000, 10000, 1000000, -1]
HashSet(32768) [1, 10, 100, 1000, 100000, 10000, 1000000, -1]
HashSet(65536) [1, 10, 100, 1000, 10000, 100000, 1000000, -1]
HashSet(131072) [1, 10, 100, 1000, 10000, -1, 100000, 1000000]
HashSet(262144) [1, 10, 100, 1000, 10000, -1, 100000, 1000000]
HashSet(524288) [1, 10, 100, 1000, 10000, -1, 100000, 1000000]
HashSet(1048576) [1, 10, 100, 1000, 10000, -1, 100000, 1000000]
HashSet(2097152) [1, 10, 100, 1000, 10000, 100000, 1000000, -1]
HashSet(4194304) [1, 10, 100, 1000, 10000, 100000, 1000000, -1]

The Code

Comments

  1. The order of elements largely depends on the hash function though. In the JDK's default impl, it is the hashcode % capacity. so if your elements are sorted in the hashcode, then you can expect an ordered list regardless of the collection size.

    private void elementOrderInHashCollection() {
    Collection col = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
    for(int i = 1; i<1000000; i*=2){
    Set set = new HashSet(i);
    set.addAll(col);
    System.out.println("HashSet(" + i + ") " + set );
    }
    }

    HashSet(1) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(2) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(4) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(8) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(16) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(32) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(64) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(128) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(256) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(512) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(1024) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(2048) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(4096) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(8192) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(16384) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(32768) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(65536) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(131072) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(262144) [1, 2, 3, 4, 5, 6, 7, 8, 9]
    HashSet(524288) [1, 2, 3, 4, 5, 6, 7, 8, 9]

    ReplyDelete
  2. @Alex Zhang, Thank you for your comment. I have replied in the main body (for the sake of formatting)

    ReplyDelete
  3. One of the many reasons you should avoid writing code like this is that hashing algorithms can change, as they did between Java 5 and 6 - in java 6 an extra hash is applied upon the object supplied of (if memory serves right) to somewhat alleviate the problem of poor hashes. Thus sets in Java 5 and 6 will contain elements in different order, but this should be a non-issue, since their public contract doesn't specify any particular order!

    As a sidenode: Perl goes even further and at every run a pseudo-random number is generated which influences the iteration order of hashes (which are like maps) to discourage programmers from writing code which is based on the iteration order.

    ReplyDelete

Post a Comment

Popular posts from this blog

Java is Very Fast, If You Don’t Create Many Objects

System wide unique nanosecond timestamps

Comparing Approaches to Durability in Low Latency Messaging Queues