Branch prediction in the wild


You must have previously heard of the coin change problem in some form or the other. I was revisiting this problem today and was reminded of an interesting answer I read on stackoverflow a long time back -

Why is it faster to process a sorted array than an unsorted array?

The source code (in C++) of the solution to the coin change problem goes as follows -

int coinChange(const vector<int>& coins, int amount) {
    // dp[i] tracks the min. number of coins required to create denomination i.
    vector<int> dp(amount + 1, amount + 1);
    dp[0] = 0;  // no coins are required to create a nil denomination
    for (int i = 1; i <= amount; ++i) {
        for (auto& c : coins) {
            if (c <= i) {
                dp[i] = min(dp[i], 1 + dp[i - c]);
            }
        }
    }
    return (dp[amount] > amount) ? -1 : dp[amount];
}

This happens to be the optimal solution (to my knowledge) of the coin change problem with a time complexity of O(n * amount) and a space complexity of O(amount) (n being the size of coins vector). There are tons of articles available which explain the above algorithm, however I won’t go into those details here. What I want to explore is whether it is possible to make this program run faster without making any semantic changes to the algorithm.

All tests were done on my machine which had the following specs at the time of testing -

  • Kernel version: 4.15.0-30-generic x86_64 GNU/Linux
  • gcc version: Ubuntu 7.3.0-16ubuntu3

Let’s collect some data on the speed of the current program. The following input was used as test data -

coins:  {3, 7, 405, 436, 4, 23}
amount: 88392175

Note that in this specific example, amount is significantly larger than coins.size().

The program was compiled with the default optimization flag (-O0) using g++. The running time recorded on the above input was -

14.83s user 0.13s system 99% cpu 14.961 total

As is obvious from the title of this post, we can probably make use of branch prediction to squeeze out a few milliseconds. The first step in that direction will be to sort the coins vector. The time complexity of this step is asymptotically smaller than our current algorithm’s time complexity.

     dp[0] = 0;
+    sort(coins.begin(), coins.end());
     for (int i = 1; i <= amount; ++i) {

On making the relevant changes and running the tests, the following result is recorded -

15.42s user 0.11s system 99% cpu 15.525 total

Well, that didn’t work out as we expected! In fact, the program is now slower than before. (need to inspect this more)

One possible reason behind this is that even though the coins vector is sorted, the processor is still switching branches quite frequently. To be more precise, we make a single pass through the coins vector in each iteration of the outer for-loop, and the distance between consecutive iterations where branch switch occurs (due to the if-statement) is coins.size() on average. All we need to do now is to make these switches lie farther apart.

One solution is to exchange the inner and outer for-loops. This won’t have any effect on the semantics of the program, but now the branch switches will be amount distance apart on average. amount is significantly larger than coins.size() (for this example), and now the processor will be switching branches a lot less frequently than before.

     sort(coins.begin(), coins.end());
-    for (int i = 1; i <= amount; ++i) {
-        for (auto& c : coins) {
+    for (auto& c : coins) {
+        for (int i = 1; i <= amount; ++i) {
             if (c <= i) {

Let’s run the tests again -

6.66s user 0.11s system 98% cpu 6.853 total

A reduction of 8.672 seconds - the gains are significant indeed!


Benchmarks

Below are a few benchmarks generated via perf(1). Note that branch misses are maximum in the first case and minimum in the last case.

Without sort and without loop exchange

  15559.449556      task-clock (msec)         #    0.999 CPUs utilized
           229      context-switches          #    0.015 K/sec
             6      cpu-migrations            #    0.000 K/sec
        86,438      page-faults               #    0.006 M/sec
40,868,138,499      cycles                    #    2.627 GHz
90,146,840,053      instructions              #    2.21  insn per cycle
13,718,299,799      branches                  #  881.670 M/sec
   121,424,551      branch-misses             #    0.89% of all branches

  15.570897771 seconds time elapsed

With sort and without loop exchange

  15790.581172      task-clock (msec)         #    1.000 CPUs utilized
           192      context-switches          #    0.012 K/sec
             5      cpu-migrations            #    0.000 K/sec
        86,436      page-faults               #    0.005 M/sec
41,377,099,491      cycles                    #    2.620 GHz
90,145,379,523      instructions              #    2.18  insn per cycle
13,719,167,993      branches                  #  868.820 M/sec
   124,073,050      branch-misses             #    0.90% of all branches

  15.796056471 seconds time elapsed

Without sort and with loop exchange

   7240.322581      task-clock (msec)         #    0.999 CPUs utilized
           132      context-switches          #    0.018 K/sec
             2      cpu-migrations            #    0.000 K/sec
        86,437      page-faults               #    0.012 M/sec
18,930,247,315      cycles                    #    2.615 GHz
48,527,363,072      instructions              #    2.56  insn per cycle
 6,937,998,117      branches                  #  958.244 M/sec
     7,821,022      branch-misses             #    0.11% of all branches

   7.249432807 seconds time elapsed

With sort and with loop exchange

   6775.111452      task-clock (msec)         #    0.999 CPUs utilized
           106      context-switches          #    0.016 K/sec
             4      cpu-migrations            #    0.001 K/sec
        86,439      page-faults               #    0.013 M/sec
17,840,286,625      cycles                    #    2.633 GHz
48,648,004,601      instructions              #    2.73  insn per cycle
 7,057,327,634      branches                  # 1041.655 M/sec
       255,889      branch-misses             #    0.00% of all branches

   6.779488807 seconds time elapsed

NOTE: there might be other factors at play apart from the one mentioned. Do let me know in case you find something missing!

Thanks for reading.

Updated:

Leave a Comment