Click here to Skip to main content
15,667,231 members
Articles / Programming Languages / C++
Posted 30 Dec 2017

Tagged as


3 bookmarked

C++ Summing For Loop Benchmark

Rate me:
Please Sign up or sign in to vote.
5.00/5 (2 votes)
30 Dec 2017CPOL2 min read
C++ Summing For Loop Benchmark yield interesting results and assembly code across different compilers.


The initial motivation is to find out the overhead of different for-loop types in C++.


Copy and paste below code into Godbolt Online C++ Compiler to see the generated assembly code. Note: The array or vector are initialized in the benchmark. The simplified code below is for you to copy and paste into the Godbolt Online C++ Compiler so that you only read the relevant assembly code, including other code just adds to noise where you have to wade through to find your assembly code.

Prior to using Godbolt, I was poring the assembly code generated by Visual C++ which was difficult to read because the optimized assembly code for each single C++ line were interleaved with assembly code for other lines. My guess is reason for doing that probably is to take advantage of the CPU pipelining so that code which are not dependent on the result of previous operation, can execute independently. Using Godbolt is the best and most easy way. Trust me.

#include <cstdint>
#include <algorithm>
#include <numeric>
#include <iterator>

const size_t LEN = 1000000;

// Increment For Loop
uint64_t func1()
    uint64_t vec[LEN];

    uint64_t sum = 0;
    for (size_t i = 0; i < LEN; ++i)
        sum += vec[i];
    return sum;

// Range For Loop
uint64_t func2()
    uint64_t vec[LEN];

    uint64_t sum = 0;
    for (auto n : vec)
        sum += n;
    return sum;

// Iterator For Loop
uint64_t func3()
    uint64_t vec[LEN];

    uint64_t sum = 0;
    for (auto it = std::cbegin(vec); it != std::cend(vec); ++it)
        sum += *it;
    return sum;

// Accumulator
uint64_t func4()
    uint64_t vec[LEN];

    uint64_t sum = 0;
    const uint64_t Zero = 0;

    sum = std::accumulate(std::cbegin(vec), std::cend(vec), Zero);
    return sum;

Test Machine: Intel i7 6700 at 3.4 GHz

Visual C++ 2017 (15.4 Update) Result

Please ignore the sum result. I display resultant sum to prevent compiler from optimizing away for loop. Visual C++ vectorized the code with SSE2.

Increment For Loop:  599ms, sum:500000500000
    Range For Loop:  446ms, sum:500000500000
 Iterator For Loop:  558ms, sum:500000500000
       Accumulator:  437ms, sum:500000500000

Investigation shows multiplication by 8 for array index subscripting could be the culprit in the slowdown in the Increment For Loop.

sum += vec[i];
movdqu   xmm0, XMMWORD PTR vec$[rsp+rax*8] <== multiplication by 8

As for the Range For Loop, the address is incremented by 16 (= 8 + 8 because of loop unrolling), multiplication is not used to calculate the address. Accumulator code use the same tactics. Earlier in the decade, C programmers were baffled as to why std::accumulate was faster than for loop. Now we know the reason.

As for the Iterator For Loop poor result, my guess is the iterator overhead.

Cygwin clang++ 3.9.1 Result

clang++ generated the similar code for all 4 loops, hence, similar timing. clang++ vectorized the loops with SSE2. To compile the code with clang++, use the command below.

# clang++ ForLoopBenchmark.cpp -O2 -std=c++14
Increment For Loop:  392ms, sum:500000500000
    Range For Loop:  406ms, sum:500000500000
 Iterator For Loop:  381ms, sum:500000500000
       Accumulator:  391ms, sum:500000500000

Cygwin g++ 5.4 Result

Like clang++, g++ also generated the similar code for all 4 loops, so they had similar timing but sadly, loops are not vectorized in O2. Specifying O3 vectorize all loops and result is on par with clang++'s O2. To compile the code with g++, use the command below.

g++ ForLoopBenchmark.cpp -O2 -std=c++14
Increment For Loop:  558ms, sum:500000500000
    Range For Loop:  552ms, sum:500000500000
 Iterator For Loop:  542ms, sum:500000500000
       Accumulator:  544ms, sum:500000500000

"Is this information even useful?"

There is FIX Protocol (for financial markets) which makes use of summing to calculate simple checksum of its message.

If you find Godbolt useful, do consider becoming a patreon of Matt Godbolt to show your appreciation and help him to defray monthly server costs. Of course, donation is not obligatory. I have been his patreon since December.

Benchmark source code is hosted here. Thanks for reading!


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By
Software Developer (Senior)
Singapore Singapore
Shao Voon is from Singapore. His interest lies primarily in computer graphics, software optimization, concurrency, security, and Agile methodologies. His hobby is writing a free C++/DirectX photo slideshow application which can be viewed here.

Comments and Discussions

QuestionQuestions about your performance measurements Pin
SeattleC++1-Jan-18 10:45
SeattleC++1-Jan-18 10:45 
You didn't describe the compile flags for visual studio. Did you do a 32-bit compile or a 64-bit compile? Presumably you did a release build.

In Visual C++, int is 32 bits, where it is 64 bits on gcc and clang. It might have been interesting to use an int or unsigned for the type of the index variable, instead of a size_t, to see if it affected performance.

You don't say how you timed the loops, but my experience is that timing is only accurate to around 10 milliseconds, so only the top two digits of your timing numbers are significant. Based on a look at your measurements, it seems like Visual C++ must have generated rather different code for the cases, whereas the code generated by clang (cygwin) for all four cases must have been nearly the same, and there were probably two different kinds of code generated for gcc. Did you find this to be true? Did everyone use the SSE2 instructions?

AnswerRe: Questions about your performance measurements Pin
Shao Voon Wong3-Jan-18 1:47
mvaShao Voon Wong3-Jan-18 1:47 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.