Measuring Branch Prediction Accuracy Using Linux Perf Tools

Continuing a little further the recent series of articles on Linux perf tools (1, 2) here is a little illustration how they can be used to identify programs which performance is limited by branch mis-prediction. This post is motivated by this Stack Overflow question about why the following code runs much faster if data re sorted then if they are not:

#include <algorithm>
#include <vector>
#include <iostream>

int main()
    // generate data
    const size_t arraySize = 32768;
    std::vector<int> data(arraySize);

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // If the data are sorted like shown here the program runs about
   // 6x faster (on my test machine, with -O2)
    std::sort(data.begin(), data.end());

    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
        for (unsigned c = 0; c < arraySize; ++c)
            if (data[c] >= 128)
                sum += data[c];
    std::cout << "sum = " << sum << std::endl;

The reason of course is branch prediction/mis-prediction and indeed the code can be much accelerated by the compiler itself when the -O3 option is used. Sticking with the -O2 setting however, here is the outputs of the perf stat command when the data are not sorted:

sum = 314931600000

 Performance counter stats for './t1':

      13141.967743 task-clock                #    0.997 CPUs utilized
             1,152 context-switches          #    0.000 M/sec
                40 CPU-migrations            #    0.000 M/sec
               354 page-faults               #    0.000 M/sec
    41,471,369,485 cycles                    #    3.156 GHz
    20,376,264,579 stalled-cycles-frontend   #   49.13% frontend cycles idle
    18,430,220,109 stalled-cycles-backend    #   44.44% backend  cycles idle
    22,975,946,466 instructions              #    0.55  insns per cycle
                                             #    0.89  stalled cycles per insn
     6,557,966,797 branches                  #  499.010 M/sec
     1,555,828,712 branch-misses             #   23.72% of all branches

      13.180519310 seconds time elapsed

And here is the output of perf stat when the sorting of the data is enabled:

Performance counter stats for './t1':

      2120.863322 task-clock                #    0.997 CPUs utilized
              182 context-switches          #    0.000 M/sec
                3 CPU-migrations            #    0.000 M/sec
              354 page-faults               #    0.000 M/sec
    6,719,583,415 cycles                    #    3.168 GHz
    1,786,885,720 stalled-cycles-frontend   #   26.59% frontend cycles idle
       17,443,785 stalled-cycles-backend    #    0.26% backend  cycles idle
   22,963,668,101 instructions              #    3.42  insns per cycle
                                            #    0.08  stalled cycles per insn
    6,556,000,606 branches                  # 3091.194 M/sec
          364,267 branch-misses             #    0.01% of all branches

      2.127129252 seconds time elapsed

The main thing to note is the very large difference in the branch-misses line: if the data are not sorted there are 1.5 billion branch misses in this sort program while if the data are sorted there are only about 300 thousand. This immediately shows the benefits of sorting the data for the ability of the CPU to take the correct branch.

In fact, the output of the slower version can be itself be used to find that it is likely that the branch prediction is a problem for this program. The first thing to note the branches line which shows that the program is doing 0.5 billion branches per second, or about one every 7th effective cycle. Clearly branching is very important for the performance of this program! The second thing to note is the 24% branch miss factor -- this is a large fraction, together with branches number shows that branches are important and that many are being missed, suggesting that this needs to be addressed in the algorithm.