Loop unrolling
To understand loop unrolling, we will look into the next two pieces of code.
Normal loop
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
// Generate data
const unsigned arraySize = 16384; //32768
int data[arraySize];
srand((int)123);
for (unsigned c = 0; c < arraySize; ++c)
data[c] = rand() % 256;
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i) {
// Primary loop
for (unsigned c = 0; c < arraySize; ++c) {
sum += data[c];
}
}
double elapsedTime = ((double)(clock() - start)) / CLOCKS_PER_SEC;
printf("elapsedTime = %f\n", elapsedTime);
printf("sum = %llu\n", sum);
}
After loop unrolling
Now we do loop unrolling on the primary for
loop. The modified program is as
following:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
// Generate data
const unsigned arraySize = 16384; //32768
int data[arraySize];
srand((int)123);
for (unsigned c = 0; c < arraySize; ++c)
data[c] = rand() % 256;
// Test
clock_t start = clock();
long long sum = 0;
long long sum1 = 0, sum2=0, sum3=0, sum4=0;
for (unsigned i = 0; i < 100000; ++i) {
// Primary loop
for (unsigned c = 0; c < arraySize; c+=4) {
sum1 += data[c+0];
sum2 += data[c+1];
sum3 += data[c+2];
sum4 += data[c+3];
}
}
sum = sum1 + sum2 + sum3 + sum4;
double elapsedTime = ((double)(clock() - start)) / CLOCKS_PER_SEC;
printf("elapsedTime = %f\n", elapsedTime);
printf("sum = %llu\n", sum);
}
Output
The output of normal loop using perf stat
is:
The output after loop unrolling using perf stat
is:
Explanation
- As a result of loop unrolling, the new program has to make only 25% of loops that the normal way has. Afterwards, only 25% of the jumps and conditional branches need to be taken. A potentially significant decrease in the loop administration overhead.
- Moreover, the dependency chain of the arguments in the program of normal loop is very short. If you get a stall because you have a cache-miss on the data-array the cpu cannot do anything but to wait. However, after loop unrolling, there are still three other dependency chains that don't depend on the stall if you get a cache miss or other stall in one calculation. This is why the latter runs faster with much less branches.
References: