Local variables inside a loop and performance
Overview
Sometimes a question comes up about how much work allocating a new local variable takes. My feeling has always been that the code is optimised to the point where this cost is static, i.e. paid once, not each time the loop runs. Recently Ishwor Gurung suggested considering moving some local variables outside a loop. I suspected it wouldn't make a difference, but I had never tested whether this was the case.
The test
This is the test I ran:

public static void main(String... args) {
    for (int i = 0; i < 10; i++) {
        testInsideLoop();
        testOutsideLoop();
    }
}

private static void testInsideLoop() {
    long start = System.nanoTime();
    int[] counters = new int[144];
    int runs = 200 * 1000;
    for (int i = 0; i < runs; i++) {
        int x = i % 12;
        int y = i / 12 % 12;
        int times = x * y;
        counters[times]++;
    }
    long time = System.nanoTime() - start;
    System.out.printf("Inside: Average loop time %.1f ns%n", (double) time / runs);
}

private static void testOutsideLoop() {
    long start = System.nanoTime();
    int[] counters = new int[144];
    int runs = 200 * 1000, x, y, times;
    for (int i = 0; i < runs; i++) {
        x = i % 12;
        y = i / 12 % 12;
        times = x * y;
        counters[times]++;
    }
    long time = System.nanoTime() - start;
    System.out.printf("Outside: Average loop time %.1f ns%n", (double) time / runs);
}
and the output ended with
Inside: Average loop time 3.6 ns
Outside: Average loop time 3.6 ns
Inside: Average loop time 3.6 ns
Outside: Average loop time 3.6 ns
Increasing the number of iterations to 100 million made little difference to the results.
Inside: Average loop time 3.8 ns
Outside: Average loop time 3.8 ns
Inside: Average loop time 3.8 ns
Outside: Average loop time 3.8 ns
Replacing the modulus and multiplication with >>, & and +

    int x = i & 15;
    int y = (i >> 4) & 15;
    int times = x + y;
prints
Inside: Average loop time 1.2 ns
Outside: Average loop time 1.2 ns
Inside: Average loop time 1.2 ns
Outside: Average loop time 1.2 ns
While modulus is relatively expensive, the resolution of the test is 0.1 ns, which is less than a third of a clock cycle, so any difference between the two versions would have shown up at that level of accuracy.
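For reference, here is a minimal sketch of how the inside-loop variant looks with the shift/mask arithmetic substituted in; the method name and the exact wiring are my assumption, not the code from the run above.

private static void testInsideLoopShiftMask() {
    long start = System.nanoTime();
    int[] counters = new int[144]; // x + y is at most 30, so 144 slots is still more than enough
    int runs = 200 * 1000;
    for (int i = 0; i < runs; i++) {
        int x = i & 15;         // replaces i % 12
        int y = (i >> 4) & 15;  // replaces i / 12 % 12
        int times = x + y;      // replaces x * y
        counters[times]++;
    }
    long time = System.nanoTime() - start;
    System.out.printf("Inside: Average loop time %.1f ns%n", (double) time / runs);
}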
Using Caliper
As @maaartinus comments, Caliper is a micro-benchmarking library, so I was interested in how much slower its results might be compared with timing the code by hand.

import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public static void main(String... args) {
    Runner.main(LoopBenchmark.class, args);
}

public static class LoopBenchmark extends SimpleBenchmark {
    public void timeInsideLoop(int reps) {
        int[] counters = new int[144];
        for (int i = 0; i < reps; i++) {
            int x = i % 12;
            int y = i / 12 % 12;
            int times = x * y;
            counters[times]++;
        }
    }

    public void timeOutsideLoop(int reps) {
        int[] counters = new int[144];
        int x, y, times;
        for (int i = 0; i < reps; i++) {
            x = i % 12;
            y = i / 12 % 12;
            times = x * y;
            counters[times]++;
        }
    }
}
The first thing to note is that the code is shorter, as it doesn't include the timing and printing boilerplate. Running this on the same machine as the first test I get:
0% Scenario{vm=java, trial=0, benchmark=InsideLoop} 4.23 ns; σ=0.01 ns @ 3 trials
50% Scenario{vm=java, trial=0, benchmark=OutsideLoop} 4.23 ns; σ=0.01 ns @ 3 trials
benchmark ns linear runtime
InsideLoop 4.23 ==============================
OutsideLoop 4.23 =============================
vm: java
trial: 0
Replacing the modulus with the >> and & operations as before gives
0% Scenario{vm=java, trial=0, benchmark=InsideLoop} 1.27 ns; σ=0.01 ns @ 3 trials
50% Scenario{vm=java, trial=0, benchmark=OutsideLoop} 1.27 ns; σ=0.00 ns @ 3 trials
benchmark ns linear runtime
InsideLoop 1.27 =============================
OutsideLoop 1.27 ==============================
vm: java
trial: 0
This is consistent with the first result, with Caliper coming out only about 0.4 to 0.6 ns slower for the modulus test (about two clock cycles) and showing next to no difference for the shift, and, plus test. This may be due to the way Caliper samples the data, but it doesn't change the outcome.
It is worth noting that when running real programs you typically get longer times than in a micro-benchmark, as the program will be doing more things, so the caching and branch prediction are not as ideal. A small overestimate of the time taken may therefore be closer to what you can expect to see in a real program.
Comments

I guess this is even true for interpreted code: all the local variables get their slots at the method start (some slots might be shared by multiple variables if their lifetimes don't overlap, I think).
However, if there was any cost, I wouldn't bet your benchmark would find it, as the modulus and division operations are very expensive. Moreover, their slow execution could overlap with other code, so that the slowdown would be exactly zero.
It looks like your benchmark ran for less than 1 millisecond - you shouldn't trust any such benchmark. Look at Caliper; its runs take much longer.
@maaartinus You raise some interesting points which I have tried to address in the posting. I think when you get to differences of a small fraction of a clock cycle you can say the results are much the same.
In Java, the local variables are all allocated as slots in the stack frame at compile time. This is the maxlocals value for the method. Declaration inside the loop or outside the loop makes no difference to when/where the storage is allocated. If you take a simple example and look at the byte code that is output, you can see that they compile almost identically. The only difference is that declaration inside vs. outside the loop will write a different stack map frame at the loop. This _could_ be used by the JIT for better optimization of the loop. For a trivial example, it shouldn't make any difference. For a more complicated loop, it could. I think the general rule is to not worry about such an optimization. However, good code practice, in general, dictates that you should limit the scope of any variable, which would prefer the declaration inside of the loop.
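To see this point for yourself, here is a minimal sketch (not from the original discussion; the class and method names are made up) that you can compile and disassemble to compare the bytecode. Apart from the local variable slot assignments and the stack map frames, the two loops compile almost identically.

// ScopeDemo.java - compile with "javac -g ScopeDemo.java" to keep the local variable table,
// then disassemble with "javap -c -l ScopeDemo" and compare the two methods.
public class ScopeDemo {
    static int insideLoop(int runs) {
        int sum = 0;
        for (int i = 0; i < runs; i++) {
            int x = i % 12; // declared inside the loop
            sum += x;
        }
        return sum;
    }

    static int outsideLoop(int runs) {
        int sum = 0;
        int x; // declared outside the loop
        for (int i = 0; i < runs; i++) {
            x = i % 12;
            sum += x;
        }
        return sum;
    }
}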
@philwb I agree that limiting the scope of variables improves clarity. Simple, clear code often gives you an efficient solution.