## Posts

Showing posts from 2013

### Computing units don't have to be confusing

Overview

Often programmers use units in a non-standard way, and usually you can work out what they meant from context. Many developers never realise there are inconsistencies in technical material, as the difference is usually small relative to the measurement error and the difficulty of reproducing the benchmarks reported. There are times when the difference is large enough to matter, and the writer has confused themselves through a lack of understanding, which makes your job of working out what they have done harder.

An array of measures

- b = bit
- B = byte, or 8 bits (8b)
- g = gram
- kb = kilobit, or 1000 bits; kilo being the SI prefix for 1000, as in kilometre
- kB = kilobyte, or 1000 bytes; sometimes used for storage
- kg = kilogram
- Kb = Kibit = kibibit, or 1024 bits
- KB = KiB = kibibyte, or 1024 bytes; sometimes used for memory
- Mb = megabit = 1000^2 bits, or 125 kB; Mb/s is used for network bandwidth
- MB = megabyte = 1000^2 bytes; MB/s is used for disk bandwidth
- MiB = mebibyte = 1024^2 bytes
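The distinctions above can be made concrete in code. This is a minimal sketch of my own (the class, constant and method names are mine, not from the post) showing the decimal/binary prefix gap and the bit/byte conversion used for bandwidth figures:

```java
// Sketch only: constants and a helper making the unit distinctions explicit.
public class Units {
    public static final long KB = 1_000;           // kilobyte, decimal prefix
    public static final long KiB = 1_024;          // kibibyte, binary prefix
    public static final long MB = 1_000_000;       // megabyte
    public static final long MiB = 1_024 * 1_024;  // mebibyte

    // Network bandwidth is usually quoted in megabits/s (Mb/s);
    // divide by 8 to get megabytes/s (MB/s).
    public static double mbpsToMBps(double megabitsPerSecond) {
        return megabitsPerSecond / 8;
    }

    public static void main(String[] args) {
        System.out.println("100 Mb/s = " + mbpsToMBps(100) + " MB/s");
        System.out.println("1 MiB exceeds 1 MB by " + (MiB - MB) + " bytes");
    }
}
```

Note that at the mega scale the binary and decimal prefixes already differ by almost 5%, which is enough to matter in a benchmark.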

### Unique hashCodes are not enough to avoid collisions.

There is a common misconception that if you have unique hashCode() values you won't have collisions. While unique, or almost unique, hashCodes are good, this is not the end of the story. The problem is that the size of a HashMap is not unlimited (or at least not 2^32 in size). This means the hashCode() has to be reduced to a smaller number of bits. The way HashMap, and thus HashSet and LinkedHashMap, work is to mutate the bits in the following manner

    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);

and then apply a mask for the lowest bits to select a bucket. The problem is that even with unique hashCode()s, as Integer has, there will be values with different hash codes that map to the same bucket. You can research how Integer.hashCode() works ;)

    public static void main(String[] args) {
        Set<Integer> integers = new HashSet<>();
        for (int i = 0; i <= 400; i++)
            if ((hash(i) & 0x1f) == 0)  // hash() is the bit-mutation above
                integers.add(i);
        System.out.println(integers);
    }

### Writing and Testing High Frequency Trading Engines talk at JavaOne

My first talk at JavaOne went as well as I could hope. We started 5 minutes early as the room was full. My apologies to those turned away; I wouldn't have minded a larger room. There were still many questions after we were moved out of the room. The slides for the presentation are here: Writing and Testing High Frequency Trading Engines in Java. The code for the libraries mentioned is here https://github.com/OpenHFT and in particular the demo source is available here https://github.com/OpenHFT/Java-Chronicle/tree/master/chronicle-demo To run the demo I started

    java -Xmx32m -verbose:gc vanilla.java.processingengine.PEMain

and then

    java -Xmx32m -verbose:gc vanilla.java.processingengine.GWMain 1 false

The presentation was recorded and I will let you know the link to it as soon as I have it.

### Update on Writing and Monitoring HFT systems

I gave a presentation at JAX Frankfurt on Thursday which went well, with about 50 attending. My public presentations are available on GitHub: PerfromanceJUG/Presentations. I plan to update my content for JavaOne 2013 as the presentation is very dense. I plan to cut some of the topics and focus on the aspects which got the audience's interest most. ;) The OpenHFT project is getting closer to its first releases and I will include a mention of these in my presentation. Chronicle 2.0 in this project appears to be about 3x faster than Chronicle 1.7. The initial release will have cut-down functionality and you may not be able to switch immediately. My hands-on tutorial on writing and monitoring low latency, high throughput systems in Java (3 hours) at JAX London has been split into an afternoon and a morning session, as the morning session is already full. I hope to be speaking at Devoxx Belgium and W-JAX in Munich. If so, I might see you there.

### Tutorial on writing and monitoring low latency, high throughput systems in Java.

I am providing a three-hour tutorial at JAX London this year. In a short presentation similar to the one I will be giving at JavaOne 2013, I will be covering such topics as:

- GC-free, lockless coding
- Low latency coding (less than 10 microseconds)
- High throughput (over 100K request/responses per second on a laptop)
- Using Chronicle for low latency persistence
- Using sun.misc.Unsafe
- Using System.nanoTime() between machines
- Use cases in finance

All using Apache 2.0 open source software, i.e. nothing to buy. About 80% of the time will be hands on, examining a demonstration program which you can run on a laptop, lifting the bonnet and learning how to extend it for your purposes. What you should get out of this is to see how being able to log everything you could want changes your architecture and the way you test and monitor your applications in production (in particular their performance). If you are going to JAX London and you want to attend, sign up quickly because this sessio

### C++ like Java for low latency

Overview

Previously I wrote an article on C-like Java. This is a term I had come across before. However, on reflection I thought C++-like Java is a better term, as you still use OOP practices (which are not C-like) but you put more work into managing and recycling memory yourself. The term I favour now is "low level" Java programming.

Low latency and "natural" Java

Many low latency systems are still programmed in what I call natural Java. In most cases, this is the simplest and most productive approach. The problems really start when you have large heap sizes and low latency requirements, as these don't work well together. If you have an established program and a rewrite is not reasonable, you need something like Zing, which has a truly concurrent collector. Although you may see worst-case pauses of 1-4 ms, this is about as good as a non real-time system will get. RedHat is rumoured to be producing a concurrent collector, but as Oracle have found with G1, writing an e
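The "manage and recycle memory yourself" style mentioned above can be sketched with a trivial object pool. This is my own illustration, not code from the article:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: reuse StringBuilder instances instead of creating garbage.
// Not thread-safe; a real low-latency pool would be more careful.
public class Recycler {
    private final Deque<StringBuilder> pool = new ArrayDeque<>();

    public StringBuilder acquire() {
        StringBuilder sb = pool.pollFirst();
        if (sb == null)
            sb = new StringBuilder(64);  // allocate only when the pool is empty
        sb.setLength(0);                 // reset state instead of allocating anew
        return sb;
    }

    public void release(StringBuilder sb) {
        pool.addFirst(sb);               // hand the instance back for reuse
    }
}
```

In natural Java you would just call `new StringBuilder()` each time; in C++-like Java the recycling is explicit, which is more work but produces no garbage on the hot path.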

### How do I calculate the remainder for extremely large exponential numbers using Java?

This StackOverflow question asks:

    How do I calculate the remainder for extremely large exponential numbers using Java? e.g. (48^26)/2401

I thought the answer was worth reproducing as there is a surprisingly simple solution. The answer is to take the modulus of the value in each iteration of calculating the power:

    (a * b) % n
    = ((A * n + AA) * (B * n + BB)) % n    where AA = a % n and BB = b % n
    = (A * B * n^2 + A * n * BB + AA * B * n + AA * BB) % n
    = AA * BB % n                          since x * n % n == 0
    = ((a % n) * (b % n)) % n

In your case, you can write 48^26 % 2401, or (48^2)^13 % 2401, as

    int n = 48;
    for (int i = 1; i < 26; i++)
        n = (n * 48) % 2401;
    System.out.println(n);

    int n2 = 48 * 48;
    for (int i = 1; i < 13; i++)
        n2 = (n2 * 48 * 48) % 2401;
    System.out.println(n2);

    System.out.println(BigInteger.valueOf(48).pow(26).mod(BigInteger.valueOf(2401)));
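The loop above can be wrapped into a general method and cross-checked against BigInteger.modPow. This is a sketch with my own method name, not code from the answer:

```java
import java.math.BigInteger;

// Sketch: reduce mod n at every step so intermediate values stay small,
// then verify against BigInteger's built-in modular exponentiation.
public class PowerMod {
    public static int powMod(int base, int exp, int mod) {
        int r = base % mod;
        for (int i = 1; i < exp; i++)
            r = (r * base) % mod;  // r < mod, so r * base must fit in an int for these sizes
        return r;
    }

    public static void main(String[] args) {
        int loop = powMod(48, 26, 2401);
        int big = BigInteger.valueOf(48)
                .modPow(BigInteger.valueOf(26), BigInteger.valueOf(2401))
                .intValue();
        System.out.println(loop + " " + big);  // the two values agree
    }
}
```

For larger bases or moduli the intermediate product can overflow an int, which is exactly when switching to long arithmetic or BigInteger.modPow becomes necessary.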

### OpenHFT Java Lang project

Overview

OpenHFT/Java Lang started as an Apache 2.0 library to provide the low level functionality used by Java Chronicle without the need to persist to a file. This allows serialization and deserialization of data and random access to memory in native space (off heap). It supports writing and reading enumerable types with object pooling, e.g. writing and reading a String without creating an object (if it has been pooled). It also supports writing and reading primitive types in binary and text without creating any garbage. Small messages can be serialized and deserialized in under a microsecond.

Recent additions

Java Lang supports a DirectStore, which is like a ByteBuffer but can be any size (up to the 40- to 48-bit address space on most systems). It supports 64-bit sizes and offsets. It supports compacted types and object serialization. It also supports thread safety features such as volatile reads, ordered (lazy) writes, CAS operations, and using an int (4 bytes) as a lock in native memory. Testi
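Java Lang's own API isn't shown here, but the underlying off-heap idea can be sketched using only the standard library. This is my rough illustration (class and method names are mine), not the DirectStore API:

```java
import java.nio.ByteBuffer;

// Sketch: reading and writing primitives in native (off-heap) memory
// via a direct ByteBuffer. DirectStore goes further, e.g. 64-bit sizes.
public class OffHeapSketch {
    public static long roundTrip(long value) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(64); // native memory, not on the Java heap
        bytes.putLong(0, value);   // absolute positional write, no garbage per call
        bytes.putInt(8, 42);       // primitives of different widths share the region
        return bytes.getLong(0);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(0xCAFEBABEL));
    }
}
```

A direct ByteBuffer is capped at 2 GB by its int-based indexing, which is one reason a library like Java Lang provides its own 64-bit-addressed abstraction.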

### ext4 slower than tmpfs

While you might expect ext4 to be slower than tmpfs, I found it is slower to make buffered writes to ext4 than to tmpfs, i.e. even when the writes are not actually made to disk in either case. I have a test which writes data/messages via a shared memory file as an IPC between two threads. The data written is smaller than the dirty data cache size. I have used these kernel parameters on a machine with 32 GB. The test only writes 8-12 GB, so it should fit within this space easily:

    vm.dirty_background_ratio = 50
    vm.dirty_ratio = 80

I have mounted the file system with the following. The disk is a PCI SSD, but this shouldn't make a difference.

    noatime,data=writeback,barrier=0,nobh

These are the results I get (using a beta version of Chronicle 2.0):

| filesystem | tiny, 4 bytes | small, 16 bytes | medium, 64 bytes | large, 256 bytes |
|------------|---------------|-----------------|------------------|------------------|
| tmpfs      | 185 M msg/sec | 96 M msg/sec    | 30.7 M msg/sec   | 10.9 M msg/sec   |
| ext4       | 148 M msg/sec | 71 M msg/sec    | 17.7 M msg/sec   | 7.2 M msg/sec    |
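The buffered-write pattern being measured can be sketched with a standard memory-mapped file. This is my simplified illustration, not the Chronicle benchmark; the class and method names are mine:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: writes to a memory-mapped file only dirty the page cache,
// so filesystem differences show up even before anything hits the disk.
public class MappedWrite {
    public static int writeAndReadBack(int size) throws IOException {
        File file = File.createTempFile("mapped", ".dat");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            MappedByteBuffer map = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, size);
            for (int i = 0; i < size / 4; i++)
                map.putInt(i * 4, i);   // buffered write: nothing forced to disk
            return map.getInt(size - 4); // read the last value back from the cache
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("last value: " + writeAndReadBack(1 << 20));
    }
}
```

Pointing the temporary file at a tmpfs mount (e.g. /dev/shm) versus an ext4 mount is enough to reproduce the shape of the comparison above, though not Chronicle's absolute numbers.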

### Micro jitter, busy waiting and binding CPUs

Performance profiling a new machine

When I work on a new machine, I like to get an understanding of its limitations. In this post I am looking at the jitter on the machine and the impact of busy waiting, for a new PC I built this weekend. The specs for the machine are interesting but not the purpose of the post. Nevertheless they are:

- i7-3970X six core running at 4.5 GHz (with HT turned on)
- 32 GB of PC-1600 memory
- An OCZ RevoDrive 3 PCI SSD (actual write bandwidth of 600 MB/s)
- Ubuntu 13.04

Note: the OCZ RevoDrive is not officially supported on Linux, but is much cheaper than their models which are.

Tests for jitter

My micro-jitter sampler looks at interrupts to a running thread. It is similar to jHiccup, but instead of measuring how delayed a thread is in waking up, it measures the delays a thread sees once it has started running. Surprisingly, how you run your threads impacts the sort of delays they will see once they wake up. This chart is a bit dense. I
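A much-simplified version of such a jitter sampler (my sketch, not the actual tool) busy-loops on System.nanoTime() and records the largest gap between consecutive reads:

```java
// Sketch: a busy-waiting thread should see back-to-back nanoTime() reads;
// a large gap means the thread was interrupted or descheduled.
public class JitterSample {
    public static long maxPauseNanos(long runMillis) {
        long end = System.currentTimeMillis() + runMillis;
        long last = System.nanoTime(), maxGap = 0;
        while (System.currentTimeMillis() < end) {
            long now = System.nanoTime();
            long gap = now - last;       // micro-jitter: time we were not running
            if (gap > maxGap)
                maxGap = gap;
            last = now;
        }
        return maxGap;
    }

    public static void main(String[] args) {
        System.out.println("max pause: " + maxPauseNanos(1000) + " ns");
    }
}
```

On an idle, CPU-bound core the gaps are tens of nanoseconds; binding the thread to an isolated CPU is what keeps the occasional millisecond-scale outliers away.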

### Easy to add but hard to remove

In rapid development projects, how easy and interesting it is to add new functionality is key. In this environment, exciting, new, terser languages are more productive. It helps to keep developers motivated and happier, which leads to more code. In a low latency environment, adding code is not the hard problem. Understanding the problem, the behaviour of the system, and what your program is really doing is the hard problem. This means removing code, layers and complexity is more important, and harder to achieve. To remove something you benefit from static analysis, a minimum of side effects and clear relationships. You also benefit from a clear understanding of what the code is really doing, all the way down to the CPU level. Without this you cannot remove things which don't need to be done, and your code will be needlessly slower. A simple example is the amount of garbage you are producing. One way to slow down a program is to produce lots of garbage. This not only causes GC pau
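The garbage point can be made concrete. This is my own example, not the author's code: the two methods compute the same sum, but the first allocates a boxed object on nearly every iteration while the second allocates nothing:

```java
// Sketch: removing needless allocation without changing behaviour.
public class GarbageExample {
    public static long sumBoxed(int count) {
        Long total = 0L;                  // boxing: a new Long for each addition
        for (Integer i = 0; i < count; i++)
            total += i;                   // unbox, add, re-box
        return total;
    }

    public static long sumPrimitive(int count) {
        long total = 0;                   // no per-iteration allocation at all
        for (int i = 0; i < count; i++)
            total += i;
        return total;
    }
}
```

The results are identical, which is precisely why the boxed version is hard to spot: nothing is wrong except the work being done that doesn't need to be done.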

### High resolution timings without the custom hardware

Overview

There is a wide range of vendors who support PTP and packet-capture time-stamping with microsecond timestamps or better. This works well in large, complex organisations where changing the software is not an easy option, and you need something which stands back and tries to make sense of it all. The biggest problem with this approach is that your analysis of the data is typically twice the effort of just capturing the data, if you are lucky and use standard protocols which are already supported. As software developers, is there an easy, simple, cheap way to get high resolution timings without the hardware complexity? Can we also find a simple way to ensure the data we capture will be used?

Bottom-up approach

Using hardware is a very bottom-up approach. You have to:

- Distribute accurate timing using something like PTP (requires special hardware)
- Add timestamps to the packets (more special hardware)
- Record all the packets on the

### Plans for Chronicle 2.0

Plans for Chronicle

I have a number of changes planned for Chronicle:

- Wider interest domain using openhft.net https://github.com/OpenHFT/
- More modular design, extracting serialization and parsing from Chronicle
- Take advantage of performance improvements using Java 7
- Support for FIX directly, to demonstrate its support for text writing and parsing (as an additional module)

Latest performance results

I recently bought a PCI SSD and this made it clear to me that there is room for performance improvements in Chronicle. Java 7 may provide improved performance in thread safety (in particular the Unsafe.putOrderedXxx methods).

Small messages (13 bytes):

| messages   | time to write and read individually | in batches of 10 |
|------------|-------------------------------------|------------------|
| 5 billion  | 5 mins, 6 sec                       | 2 mins, 6 sec    |
| 10 billion | 10 mins, 13 sec                     | 4 mins, 17 sec   |
| 15 billion | 15 mins, 23 sec                     | 6 mins, 28 sec   |
| 20 billion | 20 mins, 47 sec                     | 8 mins, 45 sec   |

The test ExampleSimpleWriteReadMain from Chroni

### The Java version of patenting the Wheel

Overview

In 2001, an Australian man successfully patented a round device to aid movement: Wheel patented in Australia. More recently, a company patented using native memory as a way of storing data and objects: OFF-HEAP DIRECT-MEMORY DATA STORES, METHODS OF CREATING AND/OR MANAGING OFF-HEAP DIRECT-MEMORY DATA STORES, AND/OR SYSTEMS INCLUDING OFF-HEAP DIRECT-MEMORY DATA STORE. The "off-heap" here is the Java terminology for direct or native memory: "computer systems that leverage an off-heap direct-memory data store that is massively scalable and highly efficient". It's such a good idea you might wonder, has this ever been done before? Well, I deployed a project called Yellow Pages Online in Australia in 2001 where the bulk of the data was in native memory. Why? Because the JVMs at that time were not as efficient as the code I could write in C++ to pack and search the data. The project used EJBs, but they were implemented in native methods.