Showing posts from August, 2011

Java can be significantly faster than C

Preface
When I wrote this article, it was to comment on how implementing the same algorithm in a different way can make a major difference. I now believe this is not the point of the web site, and perhaps shouldn't be the point of this article either. I have written a follow up article: The importance of innovation

Overview
Whether you use Java or C is not always as important as the approach you use. By "approach" I mean algorithms not specified in the requirements, or "the how" to do something. You might not see this as a fair test of Java vs C, but in the real world human factors matter. There is no point saying C is faster in theory if there isn't anyone available to make it so. It's like getting a very cheap price on an item which is not in stock. ;)

Benchmark Shootout
Java is temporarily the fastest for this particular k-nucleotide benchmark. It is quite likely that if the algorithm I used were translated to C it would be faster again. But for

MethodHandle performance in Java 7

Overview
A new feature in Java 7 provides a MethodHandle, which "is a typed, directly executable reference to an underlying method, constructor, field, or similar low-level operation, with optional transformations of arguments or return values". This supports currying methods and many other features lambda-based languages take for granted. Note: this article compares invoke() in reflection and MethodHandles. I couldn't get invokeExact to work when returning a primitive (it got confused between int and Integer). However, another benchmark, Using Method Handles by Andrew Till, indicated that using invokeExact is much faster than using Method.invoke() with reflection. Thank you wensveen for the link.

Example
Say you have two methods
public static int multiply(int i, int j) { return i * j; }
public static int subtract(int i, int j) { return i - j; }
With MethodHandle you can not only get a reference to the Method (like in reflection) but you can construct new MethodH
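As a sketch of how the two methods above can be used through the MethodHandle API (the class name and the curried handle are my own, not from the article), a handle can be looked up, invoked exactly, and curried with insertArguments:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MHExample {
    public static int multiply(int i, int j) { return i * j; }
    public static int subtract(int i, int j) { return i - j; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType mt = MethodType.methodType(int.class, int.class, int.class);
        MethodHandle multiply = lookup.findStatic(MHExample.class, "multiply", mt);

        // invokeExact requires the exact static types; the result must be cast explicitly
        int r = (int) multiply.invokeExact(6, 7);
        System.out.println(r); // 42

        // insertArguments curries the handle, fixing the first argument to 10
        MethodHandle multiplyBy10 = MethodHandles.insertArguments(multiply, 0, 10);
        System.out.println((int) multiplyBy10.invokeExact(5)); // 50
    }
}
```

The same lookup works for subtract; only the method name changes.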

Added Memory Mapped support to HugeCollections

Overview
The Huge Collections library is designed to support large collections of data in memory efficiently, without GC impact. It does this using heap-less memory and generated code for efficiency. One of the benefits of this approach is that memory mapped files can be larger than the available memory, while using trivial heap and no direct memory.

Memory mapping
Loading large amounts of data is time consuming, so being able to re-use a store persisted to disk can improve efficiency. With the use of SSD drives, the amount of "memory" a Java system can access is effectively extended, provided it can be used.

How does it perform
Adding an object with 12 fields and reading objects sequentially takes about 110 ns for one billion elements. It takes less than two minutes to add all the entries, and the same to read all the elements. Random access is much more expensive until all the data is in memory; with a fast SSD drive it takes about 600 ns when most of the data is in cache, slower i
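This is not the HugeCollections API itself, but a minimal sketch of the memory-mapping technique it builds on (file name and sizes are arbitrary): a MappedByteBuffer gives file-backed storage addressed by offset, so reads and writes create no garbage.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStore {
    public static void main(String[] args) throws Exception {
        // Map a 1 MB region of a file; the pages live outside the heap,
        // so accessing them has no GC impact.
        RandomAccessFile raf = new RandomAccessFile("store.dat", "rw");
        FileChannel fc = raf.getChannel();
        MappedByteBuffer bb =, 0, 1 << 20);

        // Store element i at a fixed offset, like a persistent long[]
        for (int i = 0; i < 1000; i++)
            bb.putLong(i * 8, (long) i * i);

        System.out.println(bb.getLong(10 * 8)); // 100
        raf.close();
    }
}
```

Re-running against the same file re-uses the persisted data, which is the basis of the re-usable store described above.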

Writing to stdin

Overview or stdin is usually used as an InputStream. However, on Linux you can access this stream in other ways.

Accessing file descriptor 0
In Linux, each file descriptor is accessible via /proc/{process-id}/fd/{file-id}. You can use this to see what files a process has open, but also to see the contents of a file.

Writing and memory mapping
Getting the process id in Java is obscure, but once you have it you can re-open existing file descriptors such as stdin, file descriptor 0.
int processId = Integer.parseInt(ManagementFactory.getRuntimeMXBean().getName().split("@")[0]);
RandomAccessFile raf = new RandomAccessFile("/proc/" + processId + "/fd/0", "rw");
final FileChannel fc = raf.getChannel();
final MappedByteBuffer bb =, 0, fc.size());
bb.putLong(System.nanoTime());
bb.force();
raf.close();
This opens stdin in read-write mode and memory maps it before changing its contents.

Avoiding Java Serialization to increase performance

Overview
Many frameworks for storing objects in an off-line or cached manner use standard Java Serialization to encode the object as bytes which can be turned back into the original object. Java Serialization is generic and can serialise just about any type of object.

Why avoid it
The main problem with Java Serialization is performance and efficiency. Java Serialization is much slower than using in-memory stores and tends to significantly expand the size of the object. Java Serialization also creates a lot of garbage.

Access performance
Say you have a collection and you want to update a field of many elements. Something like
for (MutableTypes mt : mts) { mt.setInt(mt.getInt()); }
If you update one million elements repeatedly for about five seconds, how long does each update take?

Collection                         Fast PC      Laptop
Huge Collection                    5.1 ns       33 ns
List<JavaBean>                     6.5 ns       54 ns
List<byte[]> with Externalizable   5,841 ns     17,508 ns
List<byte[]> with Serializable     23,217 ns    60
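A sketch of the Externalizable alternative measured above (MutablePoint is a hypothetical bean, not from the article): hand-written writeExternal/readExternal replaces the reflective field walking of default serialization, which is why it is so much faster than Serializable in the table.

```java
import java.io.*;

public class MutablePoint implements Externalizable {
    int x, y;

    public MutablePoint() { }                       // Externalizable requires a public no-arg constructor
    public MutablePoint(int x, int y) { this.x = x; this.y = y; }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x);                            // write only the raw fields, no class metadata per field
        out.writeInt(y);
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(new MutablePoint(3, 4));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            MutablePoint p = (MutablePoint) ois.readObject();
            System.out.println(p.x + "," + p.y);    // 3,4
        }
    }
}
```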

Generating ASCII Banners

Overview
Generating ASCII banners can be tedious. However, there is a simple shortcut: use an image to display the text.

Creating an invisible image
You don't have to have a GUI to create images. The following code creates an image in memory without a GUI.
BufferedImage image = new BufferedImage(144, 32, BufferedImage.TYPE_INT_RGB);
Graphics g = image.getGraphics();
g.setFont(new Font("Dialog", Font.PLAIN, 24));
Graphics2D graphics = (Graphics2D) g;
graphics.setRenderingHint(RenderingHints.KEY_TEXT_ANTIALIASING, RenderingHints.VALUE_TEXT_ANTIALIAS_ON);
graphics.drawString("Hello World!", 6, 24);
ImageIO.write(image, "png", new File("text.png"));
for (int y = 0; y < 32; y++) {
    StringBuilder sb = new StringBuilder();
    for (int x = 0; x < 144; x++)
        sb.append(image.getRGB(x, y) == -16777216 ? " " : image.getRGB(x, y) == -1 ? "#" : "*");
    if (sb.toString().trim().isEmp

Double your money again

Overview
A long time ago I wrote an article on using double for money. However, it is still a common fear for many developers, when the solution is fairly simple.

Why not use BigDecimal
@myfear (Markus) asks a good question: doesn't BigDecimal do rounding already (and with more options)? IMHO, there are two reasons you may want to avoid BigDecimal.
Clarity. This may be a matter of opinion, but I find x * y / z clearer than x.multiply(y).divide(z, 10, BigDecimal.ROUND_HALF_UP)
Performance. BigDecimal is often 100x slower.
Took an average of 6 ns for rounding using cast
Took an average of 17 ns for rounding using Math.round
Took an average of 932 ns for rounding using BigDecimal.setScale
You might like to test the difference on your machine. The code

When to use BigDecimal
The better question is why would you ever use BigDecimal?
Arbitrary precision. If you want more than 15 decimal places of precision, use BigDecimal.
Precise rounding
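A minimal sketch of the Math.round approach (the helper name round2 is mine): round a double price to two decimal places, half up, instead of carrying representation error through a calculation.

```java
public class RoundMoney {
    // Round to two decimal places, half up, using plain double arithmetic
    public static double round2(double d) {
        return Math.round(d * 100) / 100.0;
    }

    public static void main(String[] args) {
        double x = 1.03, y = 0.42;
        System.out.println(x - y);         // 0.6100000000000001 due to binary representation
        System.out.println(round2(x - y)); // 0.61
    }
}
```

Rounding after each calculation keeps doubles exact enough for money, which is the point of the original article.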

Why testing code for thread safety is flawed

Overview
We know that ++ is not thread safe, even for a volatile field; however, there is a trick to proving this. The problem with testing code for thread safety is that it can happen to work repeatedly, but still be unsafe.

An example
I wrote a lock free ring buffer which passed multi-threaded tests of one billion entries repeatedly. However, a couple of days later the test consistently failed. What was the difference? There were a couple of CPU-intensive processes also running on the same box. When these processes finished, the test passed again.

The problem
If you try to prove that incrementing a volatile variable is not thread safe, you can get a result like this.
public static void main(String... args) throws InterruptedException {
    for (int nThreads = 1; nThreads <= 64; nThreads *= 2)
        doThreadSafeTest(nThreads);
}
static class VolatileInt {
    volatile int num = 0;
}
private static void doThreadSafeTest(final int nThreads) throws InterruptedException {
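A condensed sketch of this kind of test (not the article's exact code; thread and iteration counts are arbitrary): several threads increment a shared volatile int, and lost updates show that num++ is not atomic, yet on a lightly loaded machine the test can still reach the full count by luck.

```java
public class VolatileIntRace {
    static volatile int num = 0;

    public static void main(String[] args) throws InterruptedException {
        final int nThreads = 4, perThread = 1_000_000;
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++)
                    num++; // read-modify-write; not atomic even though num is volatile
            });
            threads[i].start();
        }
        for (Thread t : threads)
            t.join();
        // usually less than 4,000,000 because concurrent increments are lost,
        // but the test CAN pass, which is exactly why such tests are flawed
        System.out.println(num);
    }
}
```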

Collections Library for millions of elements

When to use this library
If you want to efficiently store large collections of data in memory, this library can dramatically reduce Full GC times and reduce memory consumption as well. When you have a data type which can be represented by an interface and you want a List for this type:
List<InterfaceType> list = new HugeArrayBuilder<InterfaceType>() {}.create();
The type needs to be described using an interface so its representation can be changed. (Using generated byte code) The HugeArrayBuilder builds generated classes for the InterfaceType on demand. The builder can be configured to change how the HugeArrayList is created. A more complex example is
HugeArrayList<MutableTypes> hugeList = new HugeArrayBuilder<MutableTypes>() {{
    allocationSize = 1024 * 1024;
    classLoader = myClassLoader;
}}.create();

How does the library differ
Uses long for sizes and indices.
Uses column-based data, making the per-element overhead minimal and s

Tuning buffer sizes

Overview
When the typical read/write size is small, using a buffer can make a significant improvement in performance. However, when reading/writing large amounts of data, additional buffering doesn't help and can add a small amount of overhead. Additionally, it can be tempting to assume that the larger the buffer, the better. However, it appears a more modest buffer size, around the size of your L1 cache, could be better, e.g. 8 KB to 32 KB.

The test
In the following test I compared using buffered and unbuffered reads and writes to a temporary file on a tmpfs filesystem (ram disk).

Size of read/write   Unbuffered Writes   Buffered Writes   Unbuffered Reads   Buffered Reads
1                    2 MB/s              86 MB/s           3 MB/s             72 MB/s
2                    4 MB/s              165 MB/s          6 MB/s             147 MB/s
4                    8 MB/s              333 MB/s          11 MB/s            291 MB/s
8                    17 MB/s             578 MB/s          24 MB/s            560 MB/s
16                   34 MB/s             920 MB/s          49 MB/s            983 MB/s
32                   67 MB/s             1,345 MB/s        99 MB/s            1,679 MB/s
64                   133 MB/s            1,746 MB/s        198 MB/s           2,518 MB/s
128                  254 MB/s            2,024 MB/s        391 MB/s           3,387 MB/s
2
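A sketch of this kind of comparison (iteration count and buffer size are my own choices): timing many one-byte writes with and without a BufferedOutputStream, using a 16 KB buffer in line with the L1-cache-sized recommendation above.

```java
import java.io.*;

public class BufferSizeTest {
    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("buffer", ".dat");
        file.deleteOnExit();
        byte[] one = {1};

        long unbuffered = time(new FileOutputStream(file), one);
        // 16 KB buffer: around L1 cache size, per the observation above
        long buffered = time(new BufferedOutputStream(new FileOutputStream(file), 16 * 1024), one);

        System.out.printf("unbuffered %,d ns, buffered %,d ns%n", unbuffered, buffered);
    }

    // Write the same byte 100,000 times and return the elapsed time
    static long time(OutputStream out, byte[] bytes) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < 100_000; i++)
            out.write(bytes);
        out.close();
        return System.nanoTime() - start;
    }
}
```

On most machines the buffered version is one to two orders of magnitude faster for one-byte writes, matching the table.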

Comparing Collections speeds

Overview
The speed of different collections tends to depend on their usage pattern; however, it can be useful to have an idea of their relative performance.

The test
In this test I compare both single-threaded and thread-safe collections of different sizes. Adding is from lowest to highest values. Removal is from start and end, finishing with the middle value. This disadvantages ArrayList and LinkedList relatively equally.

Collection      Size      Adding   Iterating   Removing
ArrayList       100,000   30.1     14.5        67,359
ArrayList       10,000    8.3      11.2        6,323
ArrayList       1,000     6.4      9.4         585
ArrayList       100       6.8      9.5         92.9
ArrayList       10        9.7      13.2        36.6
LinkedList      100,000   46.4     41.4        139,326
LinkedList      10,000    9.3      30.3        13,413
LinkedList      1,000     8.7      27.6        1,025
LinkedList      100       7.8      19.8        110
LinkedList      10        11.2     30.3        38.7
HashSet         100,000   28.6     115         69.5
HashSet         10,000    25.7     324         58.6
HashSet         1,000     26.0     2,521       59.2
HashSet         100       30.0     24,490      64.6
HashSet         10        51.7     244,045     131
LinkedHashSet   100,000   35.9     103         69.2
Li

Reading/writing GC-less memory

Overview
How you access data can make a difference to the speed. Whether you use manual loop unrolling or let the JIT do it for you can also make a difference to performance. I have included C++ and Java tests doing the same thing for comparison.

Tests
In each case, different approaches to storing 16 GB of data were compared. In the following tests I compared allocating, writing to and reading from the store, and total GC times, for byte[] (smallest primitive) and long[] (largest primitive) arrays, direct ByteBuffer and Unsafe, both JIT optimised and hand unrolled four times.

store            size                unrolled   allocate   writing      reading   GC time
C++ char[]       native 8-bit char   no         31 μs      12.0 s       8.7 s     N/A
C++ char[]       native 8-bit char   yes        5 μs       8.8 s        6.6 s     N/A
C++ long long[]  native 64-bit int   no         11 μs      4.6 s        1.4 s     N/A
C++ long long[]  native 64-bit int   yes        12 μs      4.2 s        1.2 s     N/A
byte[] heap      byte                no         4.9 s      20.7/7.8 s   7.4 s     51 ms
byte[] heap      byte                yes        4.9 s      7.1 s        8.5 s     44 ms
long[] heap long

50,000 pages views last month

Thank you for your interest. At the moment I am working on a few pet projects, including: A low latency, low GC, parsing, logging and writing library. A direct memory collection library for storing massive amounts of data efficiently with minimal heap. See some of my previous posts for teasers on how these will work.

A collection with billions of entries

Overview
There are a number of problems with having a large number of records in memory. One way around this is to use direct memory, but this is too low level for most developers. Is there a way to make this more friendly?

Limitations of large numbers of objects
The overhead per object is between 12 and 16 bytes for 64-bit JVMs. If the object is relatively small, this is significant and could be more than the data itself. The GC pause time increases with the number of objects; pause times can be around one second per GB of objects. Collections and arrays only support two billion elements.

Huge collections
One way to store more data and still follow object-oriented principles is to have wrappers for direct ByteBuffers. This can be tedious to write, but very efficient. What would be ideal is to have these wrappers generated automatically.

Small JavaBean Example
This is an example of a JavaBean which would have far more overhead than the actual data it contains. interface MutableBy
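To make the wrapper idea concrete, here is a hand-written sketch (class and method names are mine, not generated code from the article) of a single flyweight object indexing into a direct ByteBuffer that holds many elements, so one small object fronts millions of records:

```java
import java.nio.ByteBuffer;

public class ByteIntView {
    private static final int SIZE = 4;   // bytes per element: one int
    private final ByteBuffer buffer;     // off-heap storage for all elements
    private int index;                   // which element this view points at

    public ByteIntView(int capacity) {
        buffer = ByteBuffer.allocateDirect(capacity * SIZE);
    }

    public void index(int index) { this.index = index; }

    public int getInt()        { return buffer.getInt(index * SIZE); }
    public void setInt(int i)  { buffer.putInt(index * SIZE, i); }

    public static void main(String[] args) {
        ByteIntView view = new ByteIntView(1000); // 1,000 elements, one heap object
        view.index(42);
        view.setInt(12345);
        System.out.println(view.getInt()); // 12345
    }
}
```

A generated wrapper would do the same for every field of the interface, one column of the buffer per field.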

Should server applications limit themselves to 4 GB?

Overview
A very interesting talk by the CTO of Azul, Gil Tene, raises many timely issues about GC performance and server memory sizes.

A typical new server should be 100 GB
One point he brings up is that with memory so cheap, a typical server should have around 100 GB (at around $7K); installing less memory is wasting data centre space. A typical JVM only uses 2-4 GB, as the GC pause hurts responsiveness beyond this point. However, if you are developing an application which only uses 2-4 GB, you are writing an application which will soon fit on a mobile phone. So how can we write server applications which can use the whole server in Java? Java without the GC Pauses: Keeping Up with Moore's Law and Living in a Virtualized World He raises questions about how best to use that memory, and not surprisingly Azul have a product which scales to that much memory efficiently. For those who are not planning to buy such a system, you could be thinking about how you can use more memory efficiently. My

Strange Hello World in Java

Overview
This is a strange Hello World program because it only needs to be compiled, not run. How does it work? Hint: it only works on MS-DOS/Windows. ;)

The Program
In a file with any name:
class Con { String hi = "\n\n\tHello World!\n\n"; }
when compiled, writes its output to the console.

How does this work
MS-DOS/Windows treats all files which start with CON (in any case and with any extension) as the console. So when javac writes the Con.class file, it is written to the console instead. In the MS world, it's not a bug, it's a feature. For more details: Use PRINT# to MS-DOS "CON" Device to Send ANSI Escape Codes

Comparing Java 7 Async NIO with NIO.

Overview
SDP (Sockets Direct Protocol) in Java 7 promises better integration with InfiniBand. This comes with new socket classes, AsynchronousSocketChannel and AsynchronousServerSocketChannel. Is there any advantage in using these classes if you don't have InfiniBand?

Creating a server socket
The API for AsynchronousServerSocketChannel is slightly different to ServerSocketChannel. The most notable difference is the use of Futures instead of always blocking.
final AsynchronousServerSocketChannel ssc =
    new InetSocketAddress("localhost", 9999));
Future<AsynchronousSocketChannel> accepted = ssc.accept();
AsynchronousSocketChannel sc2 = accepted.get();
ByteBuffer bb = ByteBuffer.allocateDirect(4096);
Future<Integer> readFuture =;

Creating a client socket
Similarly, the client can start a connection and later wait when the connection is really needed.
AsynchronousSocketChannel sc = Asynchron
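Putting the server and client pieces together, a minimal round trip over the asynchronous channels might look like this (host, port and buffer sizes are arbitrary choices, not from the article):

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.util.concurrent.Future;

public class AsyncRoundTrip {
    public static void main(String[] args) throws Exception {
        // server: bind and start an asynchronous accept
        AsynchronousServerSocketChannel ssc =
            new InetSocketAddress("localhost", 9999));
        Future<AsynchronousSocketChannel> accepted = ssc.accept();

        // client: start the connection, block on the Future only when needed
        AsynchronousSocketChannel client =;
        client.connect(new InetSocketAddress("localhost", 9999)).get();

        AsynchronousSocketChannel server = accepted.get();
        client.write(ByteBuffer.wrap("hello".getBytes())).get();

        // read on the server side; the Future blocks until data arrives
        ByteBuffer in = ByteBuffer.allocateDirect(64);
        int read =;
        System.out.println("read " + read + " bytes");

        client.close();
        server.close();
        ssc.close();
    }
}
```

Each blocking get() could instead be replaced with a CompletionHandler callback for a fully non-blocking design.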

Using tmpfs to speed up "disk" access

Overview
If you have an application which performs a lot of disk reads/writes on a modest amount of data you don't need to keep, e.g. unit tests or application builds, using tmpfs may be the answer to speeding up the process. In this article I compare using tmpfs, an SSD, a hard drive and an NFS drive.

Comparing creating a file and deleting it
Small files (256 bytes) are created repeatedly and an average time taken. These files are deleted in a second test.

File system type   machine   file creation   file deletion
tmpfs              fast      7.2 us          3.9 us
tmpfs              older     19.9 us         8.9 us
SSD ext4           fast      22.9 us         9.1 us
7200 RPM ext4      older     42.9 us         24.1 us
NFS                -         11,238 us       5,294 us

The "fast" machine was a 3.8 GHz i7 with 1600 MHz memory. The "older" machine was a 2.5 GHz AMD 4800 with 533 MHz memory. The NFS drive is a QNAP TS-210 with mirrored 7,200 RPM drives over a home plug network. The write cache for the 7,200 RPM disk is making a big difference. Without the disk cache