Local variables inside a loop and performance

Overview

Sometimes a question comes up about how much work allocating a new local variable takes.  My feeling has always been that the code becomes optimised to the point where this cost is static i.e. done once, not each time the code is run.

Recently Ishwor Gurung suggested considering moving some local variables outside a loop. I suspected it wouldn't make a difference but I had never tested to see if this was the case.

The test.

This is the test I ran

public static void main(String... args) {
    for (int i = 0; i < 10; i++) {
        testInsideLoop();
        testOutsideLoop();
    }
}

private static void testInsideLoop() {
    long start = System.nanoTime();
    int[] counters = new int[144];
    int runs = 200 * 1000;
    for (int i = 0; i < runs; i++) {
        int x = i % 12;
        int y = i / 12 % 12;
        int times = x * y;
        counters[times]++;
    }
    long time = System.nanoTime() - start;
    System.out.printf("Inside: Average loop time %.1f ns%n", (double) time / runs);
}

private static void testOutsideLoop() {
    long start = System.nanoTime();
    int[] counters = new int[144];
    int runs = 200 * 1000, x, y, times;
    for (int i = 0; i < runs; i++) {
        x = i % 12;
        y = i / 12 % 12;
        times = x * y;
        counters[times]++;
    }
    long time = System.nanoTime() - start;
    System.out.printf("Outside: Average loop time %.1f ns%n", (double) time / runs);
}


and the output ended with

Inside: Average loop time 3.6 ns
Outside: Average loop time 3.6 ns
Inside: Average loop time 3.6 ns
Outside: Average loop time 3.6 ns


Increasing the time the test takes to 100 million iterations made little difference to the results.

Inside: Average loop time 3.8 ns
Outside: Average loop time 3.8 ns
Inside: Average loop time 3.8 ns
Outside: Average loop time 3.8 ns


Replacing the modulus and multiplication with >>, &, + I got

            int x = i & 15;
            int y = (i >> 4) & 15;
            int times = x + y;

prints

Inside: Average loop time 1.2 ns
Outside: Average loop time 1.2 ns
Inside: Average loop time 1.2 ns
Outside: Average loop time 1.2 ns

While modulus is relatively expensive the resolution of the test is to 0.1 ns or less than 1/3 of a clock cycle. This would show any difference between the two tests to an accuracy of this.

Using Caliper

As @maaartinus comments, Caliper is a micro-benchmarking library so I was interested in how much slower it might be that doing the code by hand.

public static void main(String... args) {
    Runner.main(LoopBenchmark.class, args);
}

public static class LoopBenchmark extends SimpleBenchmark {

    public void timeInsideLoop(int reps) {
        int[] counters = new int[144];
        for (int i = 0; i < reps; i++) {
            int x = i % 12;
            int y = i / 12 % 12;
            int times = x * y;
            counters[times]++;
        }
    }

    public void timeOutsideLoop(int reps) {

        int[] counters = new int[144];
        int x, y, times;
        for (int i = 0; i < reps; i++) {
            x = i % 12;
            y = i / 12 % 12;
            times = x * y;
            counters[times]++;
        }
    }
}

The first thing to note is the code is shorter as it doesn't include timing and printing boiler plate code.  Running this I get on the same machine as the first test.

 0% Scenario{vm=java, trial=0, benchmark=InsideLoop} 4.23 ns; σ=0.01 ns @ 3 trials
50% Scenario{vm=java, trial=0, benchmark=OutsideLoop} 4.23 ns; σ=0.01 ns @ 3 trials

  benchmark   ns linear runtime
 InsideLoop 4.23 ==============================
OutsideLoop 4.23 =============================

vm: java
trial: 0


Replacing the modulus with shift and and

 0% Scenario{vm=java, trial=0, benchmark=InsideLoop} 1.27 ns; σ=0.01 ns @ 3 trials
50% Scenario{vm=java, trial=0, benchmark=OutsideLoop} 1.27 ns; σ=0.00 ns @ 3 trials

  benchmark   ns linear runtime
 InsideLoop 1.27 =============================
OutsideLoop 1.27 ==============================

vm: java
trial: 0



This is consistent with the first result and only about 0.4 - 0.6 ns slower for one test. (about two clock cycles), and next to no difference for the shift, and, plus test.  This may be due to the way calliper samples the data but doesn't change the outcome.

It is worth nothing that when running real programs, you typically get longer times than a micro-benchmark as the program will be doing more things so the caching and branch predictions is not as ideal.  A small over estimate of the time taken may be closer to what you can expect to see in a real program.

Conclusion

This indicated to me that in this case it made no difference.  I still suspect the cost of allocating local variables is don't once when the code is compiled by the JIT and there is no per-iteration cost to consider.

 

Comments

  1. I guess this is even true for interpreted code: All the local variables gets their slots at the method start (some slots might be shared by multiple variable if their lifetime doesn't overlap, I think).

    However, if there was any cost, I wouldn't bet your benchmark would find it as the modulus and division operations are very expensive. Moreover, they slow execution could overlap with other code, so that the slowdown would be exactly zero.

    It looks like your benchmark ran for less than 1 millisecond - you shouldn't trust any such benchmark. Look at caliper, its runs take much longer.

    ReplyDelete
  2. @maaartinus You raise some interesting points which I have tried to address in the posting. I think when you get to differences of a small fraction of a clock cycle you can say the results are much the same.

    ReplyDelete
  3. In Java, the local variables are all allocated as slots in the stack frame at compile time. This is the maxlocals value for the method. Declaration inside the loop or outside the loop makes no difference to when/where the storage is allocated. If you take a simple example and look at the byte code that is output, you can see that they compile almost identically. The only difference is that declaration inside vs. outside the loop will write a different stack map frame at the loop. This _could_ be used by the JIT for better optimization of the loop. For a trivial example, it shouldn't make any difference. For a more complicated loop, it could. I think the general rule is to not worry about such an optimization. However, good code practice, in general, dictates that you should limit the scope of any variable, which would prefer the declaration inside of the loop.

    ReplyDelete
  4. @philwb I agree with limiting the scope of variables improves clarity. Simple, clear code often gives you an efficient solution.

    ReplyDelete

Post a Comment

Popular posts from this blog

Java is Very Fast, If You Don’t Create Many Objects

System wide unique nanosecond timestamps

Comparing Approaches to Durability in Low Latency Messaging Queues