Click here to Skip to main content
15,886,919 members
Articles / Programming Languages / C++
Tip/Trick

Two C++ Features Win Over C Equivalents

Rate me:
Please Sign up or sign in to vote.
5.00/5 (15 votes)
18 Feb 2024MIT5 min read 16.9K   251   10   28
Two C++ features win over C equivalents in performance
The tip explores the superior performance of C++ over C in sorting and summing operations, showcasing benchmarks that reveal significant speed advantages of C++ over C.

This short write-up is a tip that showcases two tricks C++ developers use to gain performance over C. As such, it is not eligible to participate in the CodeProject monthly article competition.

Table of Contents

Introduction

When it comes to performance, C is king, but there are two areas in which C++ is supreme over C: sorting and summing. When the benchmark results were revealed, the C programmers were baffled and intrigued, and finally, they were won over to the C++ side. In this tip, we examine the reason they are faster and see if it is still the case today by benchmarking.

C++ sort VS C qsort

In C, qsort is a de facto way to do QuickSort. qsort requires a comparator function to be provided as a function pointer to compare between two elements.

C++
// This comparison function is used in qsort
// to sort in descending order
int comparator(const void* p, const void* q)
{
    // Get the values at given addresses 
    const int64_t l = *(const int64_t*)p;
    const int64_t r = *(const int64_t*)q;

    // sort in descending order
    if (l > r) return -1;
    else if (l < r) return +1;
    else return 0;
}

const int size = 1000000;
int64_t arr[size];
qsort((void*)arr, size, sizeof(arr[0]), comparator);

In C++, std::sort is used to do sorting. A function object must be provided as a comparator function until the advent of C++11 where a lambda is sufficient. Before C++17, std::sort is implemented using QuickSort. In C++17, IntroSort is used instead because though QuickSort is the fastest sorting algorithm, its Achilles heel is the worst time performance of O(N2) when it comes to sorting an almost sorted array.

std::sort on C++ Reference noted

Before LWG713, the complexity requirement allowed sort() to be implemented using only Quicksort, which may need O(N2 ) comparisons in the worst case.

Introsort can handle all cases with O(N·log(N)) comparisons (without incurring additional overhead in the average case), and thus is usually used for implementing sort().

libc++ has not implemented the corrected time complexity requirement until LLVM 14

This is Wikipedia opening introduction to IntroSort:

Introsort or introspective sort is a hybrid sorting algorithm that provides both fast average performance and (asymptotically) optimal worst-case performance. It begins with quicksort, it switches to heapsort when the recursion depth exceeds a level based on (the logarithm of) the number of elements being sorted and it switches to insertion sort when the number of elements is below some threshold. This combines the good parts of the three algorithms, with practical performance comparable to quicksort on typical data sets and worst-case O(n log n) runtime due to the heap sort.

C++
std::vector<int64_t> vec(1000000);

// sort in descending order
std::sort(vec.begin(), vec.end(), [](int64_t a, int64_t b) {
    return a > b;
    });

As you can see from the above, the std::sort version is more concise. The reason std::sort is performant over qsort is due to its comparison function being a function pointer that requires dereferencing and inside it, two castings from void* and dereferencing are required to access the element value for comparison whilst std::sort comparison function is inlined and elements are already the correct type. C++17 parallel std::sort is added to the benchmark. Below are the benchmark results on the Intel i7 8700 (6 physical cores and 12 logical cores) with 16GB RAM. The GCC and Clang benchmarks are run on WSL Ubuntu 2022. It seems like parallel std::sort is not implemented on GCC and Clang as they shared same result as serial std::sort

VC++ 2022
             C qsort Benchmark:13103ms
     C++ stable sort Benchmark: 5702ms
            C++ sort Benchmark: 6300ms
        C++ par sort Benchmark: 1451ms

gcc version 11.3.0

             C qsort Benchmark: 8994ms
     C++ stable sort Benchmark: 6389ms
            C++ sort Benchmark: 5095ms
        C++ par sort Benchmark: 5114ms

clang version 14.0.0

             C qsort Benchmark: 8954ms
     C++ stable sort Benchmark: 6206ms
            C++ sort Benchmark: 5232ms
        C++ par sort Benchmark: 5227ms

The GCC and clang command to build SortBenchmark.cpp is as follows. Remember to copy timer.h to your Ubuntu folder. It makes no benchmark difference whether -std=c++17 or earlier standard is used. Looks like IntroSort is not an affecting factor.

g++ SortBenchmark.cpp -O3 -std=c++17

clang++ SortBenchmark.cpp -O3 -std=c++17 

C++ accumulate VS C Summing For Loop

Below is an example of the summing for loop.

C++
const int size = 1000000;
uint64_t arr[size];

uint64_t sum = 0;
for (int i = 0; i < size; ++i)
{
    sum += arr[i];
} 

arr[i] is equivalent to *(arr+i) in C which is equivalent to the below assembly code using GCC 13.2 O1 optimization. The CPU register, rdx is the sum and rax is the loop counter, i. As you can see, the multiplication of 8 because 8 is the size of uint64_t. You can examine the generated assembly code at Godbolt.

ASM
.L9:
        add     rdx, QWORD PTR [rcx+rax*8]
        add     rax, 1
        cmp     rax, rsi
        jb      .L9 

The C++ example is illustrated below using std::accumulate.

C++
std::vector<uint64_t> vec(1000000);
// ...
constexpr const uint64_t Zero = 0;
uint64_t sum = std::accumulate(vec.cbegin(), vec.cend(), Zero); 

The code std::accumulate produced is roughly below.

C++
auto it = vec.cbegin();
auto end = vec.cend();
for(; it != end; ++it)
{
    sum += *it;
} 

++it and *it is equivalent to the assembly code below, an addition of 8 without the time-consuming multiplication. The reason I use O1 optimization and not O0 because some functions like vec[i], vector::size are not inlined but called, so no assembly is shown at the site for my examination. O1 is the sweet spot for manual analysis. Of course, there are higher optimization levels like O2 and O3 but all my assumptions are off since the optimizer realizes these 3 loops are equivalent code and generates the same optimized assembly for all three.

ASM
.L19:
        add     rdx, QWORD PTR [rax]
        add     rax, 8
        cmp     rax, rcx
        jne     .L19 

C++11 range for loop generates the same code as std::accumulate and iterator code above.

C++
for (auto n : vec)
{
    sum += n;
} 

The benchmark results for different compilers are as follows. Parallel version of std::accumulate() is called std::reduce() in C++17. sum can be ignored: it is used so that the for loop won't be optimized away. For Visual C++ and Clang, accumulate and range for loop is faster. For GCC, all four results hover around 300ms. From the GCC and Clang results, we can clearly see std::reduce() is not parallelized in this Standard Library I am using.

VC++ 2022

 Increment For Loop:  439ms, sum:500000500000
     Range For Loop:  277ms, sum:500000500000
        Accumulator:  239ms, sum:500000500000
    Parallel Reduce:  111ms, sum:500000500000

gcc version 11.3.0

 Increment For Loop:  328ms, sum:500000500000
     Range For Loop:  295ms, sum:500000500000
        Accumulator:  273ms, sum:500000500000
    Parallel Reduce:  313ms, sum:500000500000

clang version 14.0.0

 Increment For Loop:  339ms, sum:500000500000
     Range For Loop:  266ms, sum:500000500000
        Accumulator:  265ms, sum:500000500000
    Parallel Reduce:  265ms, sum:500000500000

The GCC and clang command to build SummingBenchmark.cpp is as follows:

g++ SummingBenchmark.cpp -O3 -std=c++17

clang++ SummingBenchmark.cpp -O3 -std=c++17 

Reference

History

  • 19th February 2024: Added stable_sort to the SortBenchmark because qsort should be rightfully compared to stable_sort since former has equality compare and the latter checks for equality by using this expression !(x<y) && !(y<x).
  • 17th February 2024: Corrected the initialization for SortBenchmark. Added C++17's parallel std::sort to the SortBenchmark. Added C++17 parallel std::accumulate to the SummingBenchmark. Note: The parallel version of std::accumulate is called std::reduce. C++ committee decided to rename it because when using floating point, std::reduce can return slightly different results due to its parallel operations. If order of accumulation matters to you, stick to std::accumulate. Benchmark is run on Intel i7 8700 with 6 physical cores and 12 logical cores.
  • 16th February 2024: Updated messages in the benchmark code download
  • 7th February 2024: Added assembly code from GCC and removed the confusing C++ code which is not valid C++ pointer arithmetic for the summing section
  • 4th February 2024: Corrected the comparison function of qsort according to Fly Gheorghe
  • 31st January 2024: First release

License

This article, along with any associated source code and files, is licensed under The MIT License


Written By
Software Developer (Senior)
Singapore Singapore
Shao Voon is from Singapore. His interest lies primarily in computer graphics, software optimization, concurrency, security, and Agile methodologies.

In recent years, he shifted focus to software safety research. His hobby is writing a free C++ DirectX photo slideshow application which can be viewed here.

Comments and Discussions

 
SuggestionResults in Windows Pin
LaszloHars17-Feb-24 12:45
LaszloHars17-Feb-24 12:45 
GeneralRe: Opposite results in Windows Pin
Shao Voon Wong16-Feb-24 19:35
mvaShao Voon Wong16-Feb-24 19:35 
GeneralRe: Opposite results in Windows Pin
LaszloHars16-Feb-24 21:48
LaszloHars16-Feb-24 21:48 
GeneralRe: Opposite results in Windows Pin
Shao Voon Wong18-Feb-24 0:09
mvaShao Voon Wong18-Feb-24 0:09 
GeneralRe: Opposite results in Windows Pin
Shao Voon Wong16-Feb-24 20:21
mvaShao Voon Wong16-Feb-24 20:21 
GeneralRe: Opposite results in Windows Pin
LaszloHars18-Feb-24 9:53
LaszloHars18-Feb-24 9:53 
GeneralRe: Opposite results in Windows Pin
Shao Voon Wong18-Feb-24 13:05
mvaShao Voon Wong18-Feb-24 13:05 
QuestionWell done! Pin
Stacy Dudovitz16-Feb-24 9:54
professionalStacy Dudovitz16-Feb-24 9:54 
GeneralMy vote of 1 Pin
Phillip Quinn16-Feb-24 7:10
Phillip Quinn16-Feb-24 7:10 
GeneralRe: My vote of 1 Pin
Stacy Dudovitz16-Feb-24 9:52
professionalStacy Dudovitz16-Feb-24 9:52 
GeneralRe: My vote of 1 Pin
Phillip Quinn16-Feb-24 13:00
Phillip Quinn16-Feb-24 13:00 
QuestionStrange article Pin
SMDtheone16-Feb-24 6:10
SMDtheone16-Feb-24 6:10 
AnswerRe: Strange article Pin
Stacy Dudovitz16-Feb-24 14:31
professionalStacy Dudovitz16-Feb-24 14:31 
QuestionMemory footprint comparison Pin
Member 149242829-Feb-24 19:09
Member 149242829-Feb-24 19:09 
AnswerRe: Memory footprint comparison Pin
Shao Voon Wong9-Feb-24 22:23
mvaShao Voon Wong9-Feb-24 22:23 
As far as I know, QuickSort does not require memory allocation, it does its sorting in place by pivoting. Whereas MergeSort has a better worst-time performance but requires memory allocation. I do not know about IntroSort.

EDIT: QuickSort use recursion which may be a problem in embedded as it needs more stack space.

Out of the three naive sorts of BubbleSort, SelectionSort and InsertionSort, the last one has the best performance of O(N) when it comes to almost sorted array. This is why InsertionSort is used to complement IntroSort implementation.

modified 10-Feb-24 4:29am.

QuestionArticle regarding introsort and pdqsort Pin
Richard Chambers9-Feb-24 8:26
Richard Chambers9-Feb-24 8:26 
AnswerRe: Article regarding introsort and pdqsort Pin
Shao Voon Wong9-Feb-24 22:17
mvaShao Voon Wong9-Feb-24 22:17 
SuggestionImprove qsort Pin
Fly Gheorghe1-Feb-24 0:11
Fly Gheorghe1-Feb-24 0:11 
BugComparator not portable Pin
Fly Gheorghe31-Jan-24 23:36
Fly Gheorghe31-Jan-24 23:36 
PraiseRe: Comparator not portable Pin
Shao Voon Wong3-Feb-24 23:20
mvaShao Voon Wong3-Feb-24 23:20 
GeneralRe: Comparator not portable Pin
Fly Gheorghe3-Feb-24 23:30
Fly Gheorghe3-Feb-24 23:30 
GeneralRe: Comparator not portable Pin
Shao Voon Wong4-Feb-24 0:14
mvaShao Voon Wong4-Feb-24 0:14 
QuestionNice article! Pin
_Flaviu31-Jan-24 23:09
_Flaviu31-Jan-24 23:09 
GeneralMy vote of 5 Pin
Ștefan-Mihai MOGA31-Jan-24 3:50
professionalȘtefan-Mihai MOGA31-Jan-24 3:50 
SuggestionGood points Pin
Mircea Neacsu31-Jan-24 2:36
Mircea Neacsu31-Jan-24 2:36 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.