Writing optimal code is sometimes very important - in some cases we should put performance above everything else and do whatever it takes to make the code run as fast as possible. In many cases the algorithm is the paramount factor, but it is not the only one. There are also ways to speed up a given implementation - even code built on the fastest algorithm can often be made to run in less time.

In this post we will discuss one of the possible options - vector operations. Optimizing code this way involves leveraging SIMD (Single Instruction, Multiple Data) instructions to perform parallel operations on multiple data elements simultaneously. Let’s analyze a basic example to understand how it works.


Basic algorithm

The function we are going to optimize is very simple - it receives an array and calculates the sum of its elements. It can look something like this:

double sum(double* array, int size) {
    double sum = 0.;
    for (int i = 0; i < size; ++i) {
        sum += array[i];
    }
    return sum;
}

It seems to be a reasonable way to solve this task. So is there anything we can do with this code?


SSE (Streaming SIMD Extensions)

Yes! As mentioned earlier, we can perform operations on multiple elements of the array at once. One might think of unrolling the loop to achieve that - and it is, indeed, a good idea. However, we will not just take bigger steps and execute code from a few iterations one after another. We will still work on only two elements per iteration - but in parallel. And that is exactly what SSE vector operations allow us to do. Let’s take a look at this code:

// header for SSE
#include <emmintrin.h>
double sum_sse(double* array, int size) {
    __m128d sum_vector = _mm_setzero_pd();
    double remaining_sum = 0.;

    for (int i = 0; i < size;) {
        if (i + 2 <= size) {
            __m128d data_vector = _mm_loadu_pd(&array[i]);
            sum_vector = _mm_add_pd(sum_vector, data_vector);
            i += 2;
        } else {
            remaining_sum += array[i];
            i++;
        }
    }

    double sum[2];
    _mm_storeu_pd(sum, sum_vector);
    return sum[0] + sum[1] + remaining_sum;
}

It creates __m128d sum_vector = _mm_setzero_pd(), which we can interpret as a two-element vector of doubles. Then we load numbers from array[i] and array[i + 1] into __m128d data_vector = _mm_loadu_pd(&array[i]) and add them to sum_vector using _mm_add_pd(sum_vector, data_vector). We also need to handle the case when the array size is not divisible by 2 - that is why the if-else is used inside the loop (we do not check this condition just once, outside the loop, to keep the code easy to modify for bigger vectors). The final step is to extract the elements from sum_vector and add them together (along with remaining_sum, which holds the values that never went into the vector).

We have not changed the logic of the algorithm, just implemented usage of vector operations. So is this everything we can do?


AVX (Advanced Vector Extensions)

Not really. We can replace 128-bit vectors with 256-bit ones. It allows us to perform addition on four elements at the same time. The code would be pretty similar to the previous one:

// header for AVX
#include <immintrin.h>
double sum_avx(double* array, int size) {
    __m256d sum_vector = _mm256_setzero_pd();
    double remaining_sum = 0.;

    for (int i = 0; i < size;) {
        if (i + 4 <= size) {
            __m256d data_vector = _mm256_loadu_pd(&array[i]);
            sum_vector = _mm256_add_pd(sum_vector, data_vector);
            i += 4;
        } else {
            remaining_sum += array[i];
            i++;
        }
    }

    double sum[4];
    _mm256_storeu_pd(sum, sum_vector);
    return sum[0] + sum[1] + sum[2] + sum[3] + remaining_sum;
}

In this case the vectors and functions have been replaced with their 256-bit equivalents, so there is no need to analyze the code step-by-step. Note, however, that this time we need to add the compiler option -mavx to make the code compile:

g++ -mavx optimization.cpp

Measuring performance

OK, we have modified the sum function, but how do we know that it has really been optimized? The simplest way is to calculate the sum many times using each function and measure the time taken. We can do it with a function like:

double measure_performance(
        std::function<double(double*, int)> function,
        double* array, int array_size, int counter = 1000000) {
    auto start = std::chrono::steady_clock::now();
    double result = 0.;
    while (counter--) result = function(array, array_size);
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    return elapsed_seconds.count();
}

So the full code for our measurements is:

#include <iostream>
#include <chrono>
#include <cstdlib>     // rand, RAND_MAX
#include <functional>
#include <emmintrin.h>
#include <immintrin.h>


#define print_var_value(x) {std::cout << #x << ":\t" << x << '\n';}


double sum(double* array, int size) {
    ...
}

double sum_sse(double* array, int size) {
    ...
}

double sum_avx(double* array, int size) {
    ...
}

double measure_performance(
    std::function<double(double*, int)> function,
    double* array, int array_size, int counter = 1000000) {
    ...
}


int main() {
    constexpr int array_size = 10000;
    double* array = new double[array_size];

    for (int i = 0; i < array_size; ++i) {
        array[i] = static_cast<double>(rand()) / RAND_MAX;
    }

    double time_sum = measure_performance(sum, array, array_size);
    double time_sum_sse = measure_performance(sum_sse, array, array_size);
    double time_sum_avx = measure_performance(sum_avx, array, array_size);

    print_var_value(time_sum);
    print_var_value(time_sum_sse);
    print_var_value(time_sum_avx);

    delete[] array;
    return 0;
}

Now we only need to compile and run it:

g++ -mavx optimization.cpp -o optimization && ./optimization

Results

The times (in seconds) I have got in a measurement run are:

time_sum:       24.4956
time_sum_sse:   20.1775
time_sum_avx:   11.7836

We can see that SSE made the code run somewhat faster than the original version. What is more, AVX made it run more than twice as fast! A fairly simple change doubled the performance in this case.

Everything seems amazing - now we can write code that runs faster. However, there is one problem - the optimizations we have implemented are architecture-dependent. There is no guarantee that we will get the same performance results on a different machine - and on a CPU without the required instruction sets the code will not run at all. Nonetheless, there are cases when we want to improve code performance for a given architecture - and this is one of the ways to do it.
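One way to soften this problem (at least on GCC and Clang) is to check the CPU’s capabilities at runtime and dispatch to the right implementation. A minimal sketch - the helper name best_sum_variant is mine, and __builtin_cpu_supports is a GCC/Clang builtin that is not available in MSVC:

```cpp
#include <string>

// Reports which of our sum variants we would dispatch to on this CPU.
// best_sum_variant is a hypothetical helper name used for illustration;
// __builtin_cpu_supports is a GCC/Clang builtin (not available in MSVC).
std::string best_sum_variant() {
    if (__builtin_cpu_supports("avx"))  return "sum_avx";
    if (__builtin_cpu_supports("sse2")) return "sum_sse";
    return "sum";
}
```

A real dispatcher would return a function pointer instead of a name, but the shape of the decision is the same.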

Is there anything else we can do? Of course. There are many optimization methods, and vector operations are only one of them. But even without any further knowledge we can experiment with the code from this example. What about unrolling the loops and measuring the performance then? I strongly suggest trying it and checking how many times faster you can get your code to run compared to the original one.


Here you can find the full code from the post.

And that’s it for now! Hope you liked this post and we will meet again in the future.

See you then! 😎