Collections Library for millions of elements

August 09, 2011

When to use this library

If you want to efficiently store large collections of data in memory. This library can dramatically reduce Full GC times and reduce memory consumption as well.

When you have a data type which can be represented by an interface and you want a List for this type.

List<Iinterfacetype> list = new HugeArrayBuilder<Interfacetype>() {}.create();

The type needs to be described using an interface so its represented can be changed. (Using generated byte code)

The HugeArrayBuilder builds generated classes for the InterfaceType on demand. The builder can be configured to change how the HugeArrayList is created.

A more complex example is

HugeArrayList<Mutabletypes> hugeList = new HugeArrayBuilder<Mutabletypes>() {{
        allocationSize = 1024*1024;
        classLoader = myClassLoader;
    }}.create();

How does the library differ

Uses long for sizes and indecies.
Uses column based data making the per element overhead minimal and speed up scans over a single or small set of attributes. This reduces memory usages by 2x or more.
Stores attributes in direct memory as much as possible, reducing heap usage dramatically, 10x or more.
Allocates blocks of data for consistent add() times, rather than exponentially growing an underlying store (with exponentially increasing delays on a grow of capacity)

Performance comparison

This test compares using HugeCollection vs ArrayList of JavaBeans. The class has 12 fields of various types which can be converted to primitives in different ways. By storing these primitives (or objects encoded as primitives), the time to perform GCs is almost eliminated.

With 20 GB of free memory, only 250 million JavaBeans could be created and a HugeCollection can store 500 million elements.

Classes with fields which cannot be encoded as primitives are supported in version 0.1.1 and still show an improvement in GC times.

In both cases, the amount of memory used was halved.

The project web site

HugeCollections on Google Code

The source

The best way to get the whole source, including tests is to use subversion.

svn co http://vanilla-java.googlecode.com/svn/trunk/collections/ vanilla-java-collections
cd vanilla-java-collections
mvn test

Comments

ido14 August 2011 at 08:48
This sounds great. How is access time and how is append time compared to ArrayList? How does it compare in access time to an Array of the same size?
ReplyDelete
Replies
Peter Lawrey15 August 2011 at 12:23
@Ido, The write time is higher but the read times lower. If you do a write and a read, the totla time is about 10% higher than for an ArrayList or an Array.
ReplyDelete
Replies
Parwinder Sekhon16 August 2011 at 20:55
Have you seen http://ehcache.org/documentation/user-guide/bigmemory
ReplyDelete
Replies
Peter Lawrey17 August 2011 at 11:07
@Parwinder, The difference is that BigMemory uses Serialization/Deserialization. This means you can use polymorphism and regular data value objects. The typical access time for 8 GB is 200 us.

My approach avoids the need for any serialization which makes the amount of CPU and memory used much lower. Also no garbage is produced. Typical access time is around 0.5 us to perform a update (read+write) a random object!

You can iterate and update billions of entries without creating significant garbage. ;) (I have a unit test which does exactly that, it creates one Iterator) If you use indexed access, you can avoid any object creation.
ReplyDelete
Replies
Peter Lawrey17 August 2011 at 11:19
Another difference is that objects in HugeCollections can occupy less memory, typically about half. The overhead of serialization means the objects in BigMemory can be 4x larger for a typical object. Memory is cheap, however cache is limited and more compact data structures can improve performance.
ReplyDelete
Replies
Peter Lawrey17 August 2011 at 11:20
What is missing? A Map implementation, eviction policies and disk storage. All non-trivial but planned.
ReplyDelete
Replies

Add comment

Vanilla Java