C++ or Java, which is faster for high frequency trading?
Overview
There are conflicting views as to what the best solution for high frequency trading is. Part of the problem is that what counts as high frequency trading varies more than you might expect; another part is what is meant by faster.
My View
If you have a typical Java programmer and a typical C++ programmer, each with a few years' experience writing a typical object-oriented program, and you give them the same amount of time, the Java programmer is likely to have a working program earlier and will have more time to tweak the application. In this situation the Java application is likely to be faster, IMHO.
In my experience, Java is better than C++ at detecting code which doesn't need to be executed, especially in micro-benchmarks which don't do anything useful. ;) If you tune Java and C++ as far as they can go, given any amount of expertise and time, the C++ program will be faster. However, given limited resources and a changing environment, a more dynamic language will outperform, i.e. in real world applications.
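To illustrate the micro-benchmark point, here is a contrived example (not from the original post): the loop computes a value that is never used, so the JIT is free to optimise the work away once the method has been compiled, and the later runs report times that say little about the real cost of the arithmetic.

public class DeadCodeBenchmark {
    public static void main(String[] args) {
        for (int run = 0; run < 5; run++) {
            long start = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++)
                sum += (long) i * i; // the result is never read, so the JIT may eliminate the loop
            long time = System.nanoTime() - start;
            // later runs are often dramatically "faster" once the dead code has been removed
            System.out.printf("run %d: %,d ns%n", run, time);
        }
    }
}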
In the equities space, you need latencies of sub-10 us to be seriously high frequency. Java, and even standard OOP C++ on commodity hardware, is not an option. You need C, or a cut down version of C++, and specialist hardware such as FPGAs or GPUs.
In FX, high frequency means latencies of sub-100 us. In this space, C++ or cut down Java (low GC) with a kernel bypass network adapter is an option. Here, using one language or another will have pluses and minuses. Personally, I think Java gives more flexibility as the exchanges are constantly changing, assuming you believe you can use IT for competitive advantage.
In many cases, when people talk about high frequency, especially banks, they mean sub-1 ms or single digit ms latencies. In this space, I would say the flexibility and more dynamic programming of Java, Scala or C# etc. would give you time-to-market, maintainability and reliability advantages over C/C++ or an FPGA.
The problem Java faces
The problem is not in the language as such, but a lack of control over caches, context switches and interrupts. If you copy a block of memory, something which is performed in native code, but use a different delay between runs, that copy gets slower depending on what has happened between the copies.
The problem is not GC, nor Java, as neither plays much of a part. The problem is that part of the cache has been swapped out, so the copy itself takes longer. This is the same for any operation which accesses memory; e.g. accessing plain objects will also be slower.
private void doTest(Pauser delay) throws InterruptedException {
    int[] times = new int[1000 * 1000];
    byte[] bytes = new byte[32 * 1024];
    byte[] bytes2 = new byte[32 * 1024];
    // run for at most 5 seconds
    long end = System.nanoTime() + (long) 5e9;
    int i;
    for (i = 0; i < times.length; i++) {
        long start = System.nanoTime();
        // copy 32 KB between two byte arrays and record the latency
        System.arraycopy(bytes, 0, bytes2, 0, bytes.length);
        long time = System.nanoTime() - start;
        times[i] = (int) time;
        delay.pause();
        if (start > end) break;
    }
    Arrays.sort(times, 0, i);
    System.out.printf(delay + ": Copy memory latency 1/50/99%%tile %.1f/%.1f/%.1f us%n",
            times[i / 100] / 1e3,
            times[i / 2] / 1e3,
            times[i - i / 100 - 1] / 1e3);
}

The test does the same thing many times, with different delays between each run. The test spends most of its time in native methods, and no objects are created or discarded during the test.
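The Pauser passed to doTest() is not shown; below is a minimal sketch of what such an enum might look like, assuming the suffix on each name is the delay in milliseconds, the busy-wait variants spin on System.nanoTime() and the sleep variants call Thread.sleep(). The real implementation may differ.

enum Pauser {
    NO_WAIT(0, false), YIELD(0, false),
    BUSY_WAIT_1(1, true), BUSY_WAIT_3(3, true), BUSY_WAIT_10(10, true),
    SLEEP_1(1, false), SLEEP_3(3, false), SLEEP_10(10, false);

    private final long millis;
    private final boolean busy;

    Pauser(long millis, boolean busy) {
        this.millis = millis;
        this.busy = busy;
    }

    public void pause() throws InterruptedException {
        if (this == NO_WAIT) return;
        if (this == YIELD) { Thread.yield(); return; }
        if (busy) {
            // spin, keeping the thread on the CPU until the delay has passed
            long end = System.nanoTime() + millis * 1_000_000;
            while (System.nanoTime() < end) { /* busy wait */ }
        } else {
            // give up the CPU, letting other work disturb the caches
            Thread.sleep(millis);
        }
    }
}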
YIELD: Copy memory latency 1/50/99%tile 1.6/1.6/2.3 us
NO_WAIT: Copy memory latency 1/50/99%tile 1.6/1.6/1.6 us
BUSY_WAIT_10: Copy memory latency 1/50/99%tile 2.8/3.5/4.4 us
BUSY_WAIT_3: Copy memory latency 1/50/99%tile 2.7/3.0/4.0 us
BUSY_WAIT_1: Copy memory latency 1/50/99%tile 1.6/1.6/2.5 us
SLEEP_10: Copy memory latency 1/50/99%tile 2.2/3.4/5.1 us
SLEEP_3: Copy memory latency 1/50/99%tile 2.2/3.4/4.4 us
SLEEP_1: Copy memory latency 1/50/99%tile 1.8/3.4/4.2 us
With -XX:+UseLargePages on Java 7
YIELD: Copy memory latency 1/50/99%tile 1.6/1.6/2.7 us
NO_WAIT: Copy memory latency 1/50/99%tile 1.6/1.6/1.8 us
BUSY_WAIT_10: Copy memory latency 1/50/99%tile 2.7/3.6/6.6 us
BUSY_WAIT_3: Copy memory latency 1/50/99%tile 2.7/2.8/5.0 us
BUSY_WAIT_1: Copy memory latency 1/50/99%tile 1.7/1.8/2.6 us
SLEEP_10: Copy memory latency 1/50/99%tile 2.4/4.0/5.2 us
SLEEP_3: Copy memory latency 1/50/99%tile 2.3/3.9/4.8 us
SLEEP_1: Copy memory latency 1/50/99%tile 2.1/3.3/3.7 us

The best of three runs was used.
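For reference, large pages are enabled with a standard HotSpot flag on the command line. Assuming the test above is wrapped in a class called CopyMemoryLatencyTest (the class name here is hypothetical), a run would look something like:

java -XX:+UseLargePages CopyMemoryLatencyTest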
The typical time (the middle value) it takes to perform the memory copy varies between 1.6 and 4.6 us, depending on whether there was a busy wait or a sleep of 1 to 10 ms between copies. That is a ratio of about 3x (4.6 / 1.6 is roughly 2.9), which has nothing to do with Java, but is something Java has no real control over. Even the best times vary by about 2x.
It would be interesting to see some actual benchmarks backing this view. I've always heard that a typical latency margin between Java and C/C++ is a factor of 3. I'd say that's a very significant competitive advantage, even assuming both languages satisfy the application latency requirements.
Depending on the context, a factor of 3 is realistic comparing Java and C. This is what makes Java unrealistic for sub-10 us latencies.
The main problem is not in the language itself but the lack of control over the caches, context switches and interrupts, which are essential at this level.
My contention is that there is a big difference between standard OOP C++ coding and low level C. While C++ is essentially a superset of C, you have to be careful which portions of C++ you use to achieve low latency, i.e. the result looks much more like C, or even assembly, than typical C++.
However, C tends to require more human resources/expertise to implement the same application. If your engine is a small fraction of the overall problem, as it is in Spot FX, having free time to analyse the end-to-end system is more of an advantage.
What baffles me is when people insist on using C++ for performance in systems with latency requirements over 1 ms.
Even at around 100 us latency, choosing either low-level Java or C/C++ has pluses and minuses.
You speak of latency as if it is a single number. It is not. It is a distribution.
@Paul, that is true. It's a simplification, as is talking about broad ranges like 10 us, 100 us or 1 ms.
Latency will have a distribution in real applications, and that distribution tends to have a "fat tail".
Peter,
What do you mean by latency? In-process calculation time? Or the time it takes to react to an external event coming into your process, i.e., to send another event out?
Excellent question. The timings are so broad (10, 100, 1000 us) that you can take it either way. ;)
When measuring a system's latency, I would measure/estimate the latency at the point you connect to the exchange, including the networks in between, from a market data event in to an order out, i.e. the broadest time under your control.
You also want to measure the internal latency of your application, as this is likely to be the most variable part, especially in Java, and then add an estimate of the network/TCP/kernel latency based on previous tests.
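To make that concrete, here is a rough sketch (the class and method names are mine, not from the discussion) of how the internal tick-to-order latency could be recorded and reported as percentiles, in the same style as the copy test above. record() would be called with System.nanoTime() taken when the market data event arrives and again just after the order is written out.

import java.util.Arrays;

public class LatencyRecorder {
    private final long[] samples = new long[1_000_000];
    private int count;

    // record one tick-to-order latency sample in nanoseconds
    public void record(long tickNanos, long orderNanos) {
        if (count < samples.length)
            samples[count++] = orderNanos - tickNanos;
    }

    // print the 1/50/99 percentile latencies in microseconds
    public void report() {
        if (count == 0) return;
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        System.out.printf("tick-to-order latency 1/50/99%%tile %.1f/%.1f/%.1f us%n",
                sorted[count / 100] / 1e3,
                sorted[count / 2] / 1e3,
                sorted[count - count / 100 - 1] / 1e3);
    }
}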
Exactly... so it does not matter if your algo speed is 1 micro if the round trip to the exchange is 20 ms. That was my point.
My gut feeling is that it is a bad sign if your algo takes 1 us when the round trip to your exchange is 3 ms. An algo simple enough to be performed in 1 us is probably not unique enough to make money, and by focusing on getting the algo down to 1 us you are highly likely to have missed something simple in how you interact with the exchange which would save far more time.
On an FX exchange, even the most minor change in how you interact with it can save 25 to 800 us. If you miss one of these and have a sub-optimal interaction with the exchange, you can have a trading system which takes 0 ns and still miss fills.
Which OpenCL API do you use for Java?
HFT is a program trading platform. HFT mostly uses C++ for its projects, as it is cheap and offers good performance in the trading market, but Java shows better and faster results than C++, so Java has higher performance than C++.
I don't think it's black and white, but I do believe that Java may be the right solution more often than many in HFT might think.