Measuring Branch Prediction Accuracy Using Linux Perf Tools
Continuing the recent series of articles on Linux perf tools (1, 2), here is a small illustration of how they can be used to identify programs whose performance is limited by branch mis-prediction. This post is motivated by this Stack Overflow question about why the following code runs much faster if the data are sorted than if they are not:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>

int main()
{
    // generate data
    const size_t arraySize = 32768;
    std::vector<int> data(arraySize);
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // If the data are sorted as shown here, the program runs about
    // 6x faster (on my test machine, with -O2)
    std::sort(data.begin(), data.end());

    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    std::cout << "sum = " << sum << std::endl;
}
The reason, of course, is branch prediction/mis-prediction, and indeed the code can be sped up considerably by the compiler itself when the -O3 option is used. Sticking with the -O2 setting, however, here is the output of the perf stat command when the data are not sorted:
sum = 314931600000

 Performance counter stats for './t1':

      13141.967743 task-clock                #    0.997 CPUs utilized
             1,152 context-switches          #    0.000 M/sec
                40 CPU-migrations            #    0.000 M/sec
               354 page-faults               #    0.000 M/sec
    41,471,369,485 cycles                    #    3.156 GHz
    20,376,264,579 stalled-cycles-frontend   #   49.13% frontend cycles idle
    18,430,220,109 stalled-cycles-backend    #   44.44% backend cycles idle
    22,975,946,466 instructions              #    0.55  insns per cycle
                                             #    0.89  stalled cycles per insn
     6,557,966,797 branches                  #  499.010 M/sec
     1,555,828,712 branch-misses             #   23.72% of all branches

      13.180519310 seconds time elapsed
And here is the output of perf stat when the sorting of the data is enabled:
 Performance counter stats for './t1':

       2120.863322 task-clock                #    0.997 CPUs utilized
               182 context-switches          #    0.000 M/sec
                 3 CPU-migrations            #    0.000 M/sec
               354 page-faults               #    0.000 M/sec
     6,719,583,415 cycles                    #    3.168 GHz
     1,786,885,720 stalled-cycles-frontend   #   26.59% frontend cycles idle
        17,443,785 stalled-cycles-backend    #    0.26% backend cycles idle
    22,963,668,101 instructions              #    3.42  insns per cycle
                                             #    0.08  stalled cycles per insn
     6,556,000,606 branches                  # 3091.194 M/sec
           364,267 branch-misses             #    0.01% of all branches

       2.127129252 seconds time elapsed
The main thing to note is the very large difference in the branch-misses line: when the data are not sorted the program incurs about 1.5 billion branch misses, while with sorted data there are only about 360 thousand. This immediately shows how much sorting the data helps the CPU take the correct branch.
In fact, the output of the slower version can itself be used to discover that branch prediction is likely a problem for this program. The first thing to note is the branches line, which shows that the program executes about half a billion branches per second, or roughly one branch every six cycles. Clearly branching is very important for the performance of this program! The second thing to note is the 23.72% branch-miss rate: this is a large fraction which, together with the branch count, shows that branches are both frequent and frequently mispredicted, suggesting that this needs to be addressed in the algorithm.