Showing posts from 2013

Why we shouldn't use more threads than we need to

Overview

There is a common argument that because we have lots of cores, and will have even more in the future, we have to use them; we just need to find the best ways to use them. But just because we can doesn't mean we should.

What is our goal?

Good reasons to use multiple threads are:
- the performance of using one thread is not enough.
- you have profiled your application to ensure there is no low hanging fruit.
- multiple threads improve the throughput, latency or consistency.

At this point you should add a thread when you know it gets you closer to your goal.

A bad reason to use multiple threads

Just because we can use more threads doesn't mean we should. Multiple threads add complexity to the code. There are other ways to speed up an application. Your L1 cache is 10-20x faster than your L3 cache, and if you can spend more time in your L1 cache by optimising your memory usage and access, you can gain more performance than by using every CPU in your socket. …
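The cache point can be illustrated with a small sketch (my own, not from the post; the class and method names are illustrative). Both loops compute the same sum, but the sequential loop lets the hardware prefetcher keep data flowing through L1, while the strided loop touches one int per cache line and defeats locality:

```java
public class CacheLocalityDemo {
    static final int SIZE = 1 << 20; // 4 MB of ints, larger than most L1/L2 caches

    // Sequential access: the hardware prefetcher keeps data in the L1 cache.
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            sum += data[i];
        return sum;
    }

    // Strided access: touches one int per 64-byte cache line, defeating locality.
    static long sumStrided(int[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < data.length; i += stride)
                sum += data[i];
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[SIZE];
        for (int i = 0; i < SIZE; i++) data[i] = i;

        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumStrided(data, 16); // 16 ints = one 64-byte cache line
        long t2 = System.nanoTime();

        System.out.println("sequential: " + (t1 - t0) / 1000 + " us, strided: " + (t2 - t1) / 1000 + " us");
        System.out.println(a == b); // same result, very different memory behaviour
    }
}
```

The exact speed-up depends on the CPU, but the point stands: the single-threaded sequential version can beat a poorly laid-out version by more than the core count of the machine.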

Computing units don't have to be confusing

Overview

Often programmers will use units in a non-standard way, and usually you can work out what they meant from context. Many developers never realise there are inconsistencies in technical material, as the difference is small compared to the amount of error and the difficulty of reproducing the benchmarks reported. There are times when the difference is so large it does matter, and the writer has confused themselves by a lack of understanding, which makes your job of working out what they have done harder.

An array of measures
- b = bit.
- B = byte, or 8 bits (8b).
- g = gram.
- kb = kilobit, or 1000 bits; kilo being the scientific prefix for 1000, as in kilometre.
- kB = kilobyte, or 1000 bytes, sometimes used for storage.
- kg = kilogram.
- Kb = Kibit = kibibit, or 1024 bits.
- KB = KiB = kibibyte, or 1024 bytes, sometimes used for memory.
- Mb = megabit = 1000^2 bits, or 125 kB. Mb/s is used for network bandwidth.
- MB = megabyte = 1000^2 bytes. MB/s is used for disk bandwidth.
- MiB = …
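The gap between the decimal (SI) and binary (IEC) interpretations is easy to check in code. This small sketch (mine, not from the post) shows why a "500 GB" disk looks smaller when reported in binary units:

```java
public class UnitDemo {
    // Decimal (SI) prefixes vs binary (IEC) prefixes.
    static final long KB = 1000, KiB = 1024;
    static final long MB = KB * KB, MiB = KiB * KiB;
    static final long GB = MB * KB, GiB = MiB * KiB;

    public static void main(String[] args) {
        // The gap grows with each prefix: ~2.4% at kilo, ~4.9% at mega, ~7.4% at giga.
        System.out.printf("1 MB = %d bytes, 1 MiB = %d bytes%n", MB, MiB);
        System.out.printf("A \"500 GB\" disk holds %.1f GiB%n", 500.0 * GB / GiB);
    }
}
```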

Unique hashCodes are not enough to avoid collisions.

There is a common misconception that if you have a unique hashCode() you won't have collisions. While unique, or almost unique, hashCodes are good, this is not the end of the story. The problem is that a HashMap does not have an unlimited number of buckets (nor even 2^32 of them). This means the hashCode() number has to be reduced to a smaller number of bits.

The way HashMap, and thus HashSet and LinkedHashMap, work is to mutate the bits in the following manner

    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);

and then apply a mask of the lowest bits to select a bucket. The problem is that even with unique hashCode()s, as Integer has, there will be values with different hash codes which map to the same bucket. You can research how Integer.hashCode() works ;)

    public static void main(String[] args) {
        Set<Integer> integers = new HashSet<>();
        for (int i = 0; i <= 400; i++)
            if ((hash(i) & 0x1f) == 0)
                …
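A runnable sketch of the point: the hash() bit-spreading function and the loop are from the post, while the surrounding class and variable names are mine. Integer.hashCode() returns the value itself, so every hashCode here is unique, yet many values still land in bucket 0 of a 32-bucket table:

```java
import java.util.ArrayList;
import java.util.List;

public class BucketCollisionDemo {
    // Bit-spreading function from java.util.HashMap (JDK 6/7 era).
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        // Collect values 0..400 whose spread hash masks to bucket 0 of 32.
        List<Integer> sameBucket = new ArrayList<>();
        for (int i = 0; i <= 400; i++)
            if ((hash(i) & 0x1f) == 0)
                sameBucket.add(i);
        System.out.println("values sharing bucket 0: " + sameBucket);
    }
}
```

For example, 0 and 34 have the distinct hashCodes 0 and 34, but both mask to bucket 0: hash(34) == 34 ^ (34 >>> 4) == 32, and 32 & 0x1f == 0.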

Interning in High Frequency Trading systems

I am interested in providing internships developing open source code which can be used in high frequency trading systems, as well as systems which need near real time processing. These libraries should also be useful for applications without strict latency requirements.

What I would like to know is how to go about doing this. I don't have the answer, but if you can provide feedback please let me know. I have created a forum for the OpenHFT project in general: the OpenHFT forum, with a topic for Advice on conducting Internships, one for Advice on conducting grants, and another for Interest in Joining. If you are interested in seeing what has been done already: OpenHFT on GitHub.

The project already includes:
- Direct access to large (>> 2 GB) off heap memory with thread safe constructs.
- Low latency (sub micro-second) serialization, persistence and IPC.
- Thread affinity to minimise jitter.
- A simple library for compiling and loading generated Java code at runtime.

The cur…

Writing and Testing High Frequency Trading Engines talk at JavaOne

My first talk at JavaOne went as well as I could hope. We started 5 minutes early as the room was full. My apologies to those turned away; I wouldn't have minded a larger room. There were still many questions after we were moved out of the room.

The slides for the presentation are here: Writing and Testing High Frequency Trading Engines in Java. The code for the libraries mentioned is here, and in particular the demo source is available here.

To run the demo I started

    java -Xmx32m -verbose:gc

and then

    java -Xmx32m -verbose:gc 1 false

The presentation was recorded and I will let you know the link to it as soon as I have it.

Update on Writing and Monitoring HFT systems

I gave a presentation at JAX Frankfurt on Thursday which went well, with about 50 attending. My public presentations are available on GitHub: PerfromanceJUG/Presentations.

I plan to update my content for JavaOne 2013 as the presentation is very dense. I plan to cut some of the topics and focus on the aspects which got the audience's interest most. ;)

The OpenHFT project is getting closer to its first releases and I will be including mention of these in my presentation. Chronicle 2.0 in this project appears to be about 3x faster than Chronicle 1.7. The initial release will have cut-down functionality and you may not be able to switch immediately.

My hands-on tutorial on writing and monitoring low latency, high throughput systems in Java (3 hours) at JAX London has been split into an afternoon and a morning session, as the morning session is already full.

I hope to be speaking at Devoxx Belgium and W-JAX in Munich. If so, I might see you there.

Tutorial on writing and monitoring low latency, high throughput systems in Java.

I am providing a three hour tutorial at JAX London this year. In a short presentation similar to the one I will be giving at JavaOne 2013, I will be covering such topics as:
- GC free, lockless coding
- Low latency coding (less than 10 microseconds)
- High throughput (over 100K request/responses per second on a laptop)
- Using Chronicle for low latency persistence
- Using sun.misc.Unsafe
- Using System.nanoTime() between machines
- Use cases in finance

All using Apache 2.0 open source software, i.e. nothing to buy. About 80% of the time will be hands on, examining a demonstration program which you can run on a laptop, lifting the bonnet and learning how to extend it for your purposes. What you should get out of this is to see how being able to log everything you could want changes your architecture and the way you test and monitor your applications in production (in particular their performance). If you are going to JAX London and you want to attend, sign up quickly because this sessio…

C++ like Java for low latency

Overview

Previously I wrote an article on C-like Java. This is a term I had come across before. However, on reflection I thought C++-like Java is a better term, as you still use OOP practices (which are not C-like) but you put more work into managing and recycling memory yourself. The term I favour now is "low level" Java programming.

Low latency and "natural" Java

Many low latency systems are still programmed in what I call natural Java. In most cases, this is the simplest and most productive approach. The problems really start when you have large heap sizes and low latency requirements, as these don't work well together. If you have an established program and a rewrite is not reasonable, you need something like Zing, which has a truly concurrent collector. Although you may see worst pauses of 1-4 ms, this is about as good as a non real time system will get. RedHat is rumoured to be producing a concurrent collector, but as Oracle have found with G1, writing an e…

How do I calculate the remainder for extremely large exponential numbers using Java?

This StackOverflow question asks

    How do I calculate the remainder for extremely large exponential numbers using Java?
    e.g. (48^26) / 2401

I thought the answer was worth reproducing, as there is a surprisingly simple solution. The answer is to take the modulus of the value on each iteration of calculating the power:

    (a * b) % n
    = ((A * n + AA) * (B * n + BB)) % n                        where a = A * n + AA, b = B * n + BB, AA = a % n, BB = b % n
    = (A * B * n^2 + A * n * BB + AA * B * n + AA * BB) % n
    = AA * BB % n                                              since x * n % n == 0
    = (a % n) * (b % n) % n

In your case, you can write 48^26 % 2401 as (48^2)^13 % 2401:

    int n = 48;
    for (int i = 1; i < 26; i++)
        n = (n * 48) % 2401;
    System.out.println(n);

    int n2 = 48 * 48;
    for (int i = 1; i < 13; i++)
        n2 = (n2 * 48 * 48) % 2401;
    System.out.println(n2);

    System.out.println(BigInteger.valueOf(…
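A complete, self-contained version of the idea (the powMod helper name is mine, not from the post), cross-checked against BigInteger, which computes 48^26 in full:

```java
import java.math.BigInteger;

public class ModPowDemo {
    // (a * b) % n == ((a % n) * (b % n)) % n, so we can reduce after every
    // multiplication and the intermediate value never exceeds (mod - 1) * base.
    static long powMod(long base, int exp, long mod) {
        long result = 1;
        for (int i = 0; i < exp; i++)
            result = (result * (base % mod)) % mod;
        return result;
    }

    public static void main(String[] args) {
        System.out.println(powMod(48, 26, 2401));
        // Cross-check: BigInteger materialises the full 48^26 before reducing.
        System.out.println(BigInteger.valueOf(48).pow(26).mod(BigInteger.valueOf(2401)));
    }
}
```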

OpenHFT Java Lang project

Overview

OpenHFT/Java Lang started as an Apache 2.0 library to provide the low level functionality used by Java Chronicle without the need to persist to a file. This allows serialization and deserialization of data, and random access to memory in native space (off heap). It supports writing and reading enumerable types with object pooling, e.g. writing and reading a String without creating an object (if it has been pooled). It also supports writing and reading primitive types in binary and text without creating any garbage. Small messages can be serialized and deserialized in under a micro-second.

Recent additions

Java Lang supports a DirectStore, which is like a ByteBuffer but can be any size (up to 40 to 48 bits on most systems). It supports 64-bit sizes and offsets. It supports compacted types and object serialization. It also supports thread safety features such as volatile reads, ordered (lazy) writes, CAS operations, and using an int (4 bytes) as a lock in native memory.

Testi…
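To illustrate the underlying idea without the OpenHFT API itself, here is a sketch using the standard java.nio direct buffer (the class and helper names are mine): the data lives outside the heap, so reading and writing primitives creates no garbage for the GC to trace. Note that a direct ByteBuffer is limited to a 2 GB int index, which is exactly the limitation DirectStore's 64-bit sizes and offsets remove:

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    // Writes a long into native (off heap) memory and reads it back;
    // neither operation allocates an object on the heap.
    static long roundTrip(long value) {
        ByteBuffer store = ByteBuffer.allocateDirect(8); // native memory, not heap
        store.putLong(0, value);   // absolute write, no object allocation
        return store.getLong(0);   // absolute read, no object allocation
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(123456789L));
    }
}
```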

ext4 slower than tmpfs

While you might expect ext4 to be slower than tmpfs, I found that even buffered writes to ext4 are slower than to tmpfs, i.e. even though the writes are not actually made to disk in either case.

I have a test which writes data/messages via a shared memory file as an IPC between two threads. The data written is smaller than the dirty data cache size. I have used these kernel parameters on a machine with 32 GB. The test only writes 8 - 12 GB, so it should fit within that space easily:

    vm.dirty_background_ratio = 50
    vm.dirty_ratio = 80

I have mounted the file system with the following. The disk is a PCI SSD, but this shouldn't make a difference:

    noatime,data=writeback,barrier=0,nobh

These are the results I get (using a beta version of Chronicle 2.0):

    filesystem   tiny, 4 bytes    small, 16 bytes   medium, 64 bytes   large, 256 bytes
    tmpfs        185 M msg/sec    96 M msg/sec      30.7 M msg/sec     10.9 M msg/sec
    ext4         148 M msg/sec    71 M msg/sec      17.7 M msg/sec      7.2 M msg/sec
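A minimal sketch of the kind of test involved (not the Chronicle benchmark itself; class and method names are mine): write fixed-size messages through a memory-mapped file. The writes land in the page cache as dirty pages, so the tmpfs vs ext4 difference shows up even before anything reaches the disk:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWriteDemo {
    // Maps `size` bytes of `file` and fills it with 4-byte int "messages".
    static MappedByteBuffer writeMessages(File file, int size) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            MappedByteBuffer map = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, size);
            for (int i = 0; i < size / 4; i++)
                map.putInt(i * 4, i); // dirty pages; flushed later by the kernel
            return map;
        }
    }

    public static void main(String[] args) throws Exception {
        // Place the file on tmpfs (e.g. /dev/shm) or ext4 to compare the two.
        File file = File.createTempFile("demo", ".dat");
        file.deleteOnExit();
        MappedByteBuffer map = writeMessages(file, 1 << 20);
        System.out.println(map.getInt(0) + " " + map.getInt(400));
    }
}
```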

Micro jitter, busy waiting and binding CPUs

Performance profiling a new machine

When I work on a new machine, I like to get an understanding of its limitations. In this post I am looking at the jitter on the machine and the impact of busy waiting, for a new PC I built this weekend. The specs for the machine are interesting but not the purpose of the post. Nevertheless they are:
- i7-3970X six core running at 4.5 GHz (with HT turned on)
- 32 GB of PC-1600 memory
- An OCZ RevoDrive 3 PCI SSD (actual write bandwidth of 600 MB/s)
- Ubuntu 13.04

Note: the OCZ RevoDrive is not officially supported on Linux, but is much cheaper than their models which are.

Tests for jitter

My micro jitter sampler looks at interrupts to a running thread. It is similar to jHiccup, but instead of measuring how delayed a thread is in waking up, it measures the delays a thread sees once it has started running. Surprisingly, how you run your threads impacts the sort of delays they will see once they wake up.

This chart is a bit dense. I…
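The core of such a sampler can be sketched in a few lines (this is my illustration of the technique, not the actual micro jitter sampler from the post): busy-spin reading the clock and record the largest gap between consecutive reads. Any gap well above the clock's granularity means the running thread was interrupted or descheduled:

```java
public class JitterSampler {
    // Busy-waits for `runNanos` and returns the largest gap observed
    // between two consecutive System.nanoTime() calls.
    static long maxPauseNanos(long runNanos) {
        long start = System.nanoTime();
        long last = start, maxGap = 0;
        while (last - start < runNanos) {
            long now = System.nanoTime();
            long gap = now - last;
            if (gap > maxGap)
                maxGap = gap;
            last = now;
        }
        return maxGap;
    }

    public static void main(String[] args) {
        System.out.println("worst pause in 1 s of busy waiting: "
                + maxPauseNanos(1_000_000_000L) / 1000 + " us");
    }
}
```

Binding the sampling thread to an isolated CPU typically shrinks the worst pause dramatically, which is the effect the post measures.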

Easy to add but hard to remove

In rapid development projects, how easy and interesting it is to add new functionality is key. In this environment, exciting, new, terser languages are more productive. It helps to keep developers motivated and happier, which leads to more code.

In a low latency environment, adding code is not the hard problem. Understanding the problem, the behaviour of the system, and what your program is really doing is the hard problem. This means that removing code, layers and complexity is more important, and harder to achieve.

To remove something, you gain from static analysis, a minimum of side effects and clear relationships. You also benefit from a clear understanding of what the code is really doing, all the way down to the CPU level. Without this you cannot remove things which don't need to be done, and your code will be needlessly slower.

A simple example is the amount of garbage you are producing. One way to slow down a program is to produce lots of garbage. This not only causes GC pau…
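The garbage example can be made concrete with a small sketch (mine, not from the post). Both methods compute the same sum, but the boxed version allocates a new Long on every iteration, while the primitive version allocates nothing:

```java
public class GarbageDemo {
    // Each `sum += i` unboxes, adds, and boxes a brand new Long: one piece
    // of garbage per iteration.
    static long sumBoxed(int count) {
        Long sum = 0L;
        for (int i = 0; i < count; i++)
            sum += i;
        return sum;
    }

    // The primitive accumulator stays in a register: zero allocation.
    static long sumPrimitive(int count) {
        long sum = 0;
        for (int i = 0; i < count; i++)
            sum += i;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1_000) + " " + sumPrimitive(1_000));
    }
}
```

Spotting that the two are equivalent, and removing the boxing, is exactly the kind of removal that needs understanding of what the code really does rather than what it appears to do.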

High resolution timings without the custom hardware

Overview

There is a wide range of vendors who support PTP and packet-capture time-stamping with micro-second timestamps or better. This works well in large, complex organisations where changing the software to do this is not an easy option, and you need something which stands back and tries to make sense of it all. The biggest problem with this approach is that your analysis of the data is typically twice the effort of just capturing the data, and that is if you are lucky and use standard protocols which are already supported.

As software developers, is there an easy, simple, cheap way to get high resolution timings without the hardware complexity? Can we also find a simple way to ensure the data we capture will be used?

Bottom up approach

Using hardware is a very bottom up approach. You have to:
- Distribute accurate timing using something like PTP (requires special hardware).
- Add timestamps to the packets (more special hardware).
- Record all the packets on the…

Plans for Chronicle 2.0

Plans for Chronicle

I have a number of changes planned for Chronicle:
- A wider interest domain.
- A more modular design, extracting serialization and parsing from Chronicle.
- Take advantage of performance improvements using Java 7.
- Support for FIX directly, to demonstrate its support for text writing and parsing (as an additional module).

Latest performance results

I recently bought a PCI SSD and this made it clear to me that there is room for performance improvements in Chronicle. Java 7 may provide improved performance in thread safety (in particular the Unsafe.putOrderedXxx methods).

Small messages (13 bytes):

    messages      Time to write and read      Time to write and read
                  as individual messages      in batches of 10
     5 billion     5 mins,  6 sec              2 mins,  6 sec
    10 billion    10 mins, 13 sec              4 mins, 17 sec
    15 billion    15 mins, 23 sec              6 mins, 28 sec
    20 billion    20 mins, 47 sec              8 mins, 45 sec

The test ExampleSimpleWriteReadMain from Chroni…

The Java version of patenting the Wheel

Overview

In 2001, an Australian man successfully patented a round device to aid movement: Wheel patented in Australia. More recently, a company patented using native memory as a way of storing data and objects: OFF-HEAP DIRECT-MEMORY DATA STORES, METHODS OF CREATING AND/OR MANAGING OFF-HEAP DIRECT-MEMORY DATA STORES, AND/OR SYSTEMS INCLUDING OFF-HEAP DIRECT-MEMORY DATA STORE. The "off-heap" here is the Java terminology for direct or native memory: "computer systems that leverage an off-heap direct-memory data store that is massively scalable and highly efficient".

It's such a good idea you might wonder: has this ever been done before? Well, I deployed a project called Yellow Pages Online in Australia in 2001, where the bulk of the data was in native memory. Why? Because the JVMs at that time were not as efficient as the C++ code I could write to pack and search the data. The project used EJBs, but they were implemented in native methods.