Click here to Skip to main content
15,867,308 members
Articles / General Programming / Performance

So You've Got 10M Particles to Simulate

Rate me:
Please Sign up or sign in to vote.
4.76/5 (21 votes)
8 Sep 2021CPOL3 min read 33.2K   257   27   55
Performance of data structures in C++ vs. C#
I've experienced that raw arrays, structs, and classes have different performance profiles. I wanted to put this experience to the test with programs written in C++ and C#.

Introduction

Let's say you're writing a particle simulator physics program. You need to simulate a lot of particles, say 10 million. To represent each particle, you have position, velocity, and acceleration. At each iteration of your program, you need to update the position with the velocity, the velocity with the acceleration, and for the sake of this article, the acceleration is constant. There would be x, y, and z, but let's just focus on one dimension.

To model your data in a computer program, you could use raw arrays for the position, velocity, and acceleration for each particle, or you could create a struct - C++ or C# - and put the position, velocity, and acceleration in there, or you could create a class like the struct.

It would seem that raw arrays would have the best performance, then C++ structs, then C++ classes, then C# structs, then C# classes. I say this because C++ structs are so thin, and C++ classes are identical to C++ structs in memory, and C# structs are pretty lean, but C# classes incur a whole other level of performance overhead, namely that an array of C# structs have data locality, but C# classes are all over the place.

The Test Programs

Using the Code

The code.zip contains the solution. plusplusms is the C++ test rig, and sharpms is the C# one.

plusplusms has a types.h file with the C++ data structures:

C++
#pragma once

// UPDATE: adding a dummy field for cache alignment (RE: Daniel Pfeffer) did not pay off
struct BareStruct
{
    double p, v, a, dummy;
};

class HeaderClass
{
public:
	HeaderClass()
	{
		set(0, 0, 0);
	}

	void set(double p, double v, double a)
	{
		m_p = p;
		m_v = v;
		m_a = a;
	}

    // UPDATE: Separating out updating position from updating velocity did not pay off
	void updatep()
	{
		m_p += m_v;
	}

	void updatev()
	{
		m_v += m_a;
	}

    // NOTE: Having one update function that calls the other two performed very well,
    //       inlining all around.
	void update()
	{
		updatep();
		updatev();
	}

private:
	double m_p, m_v, m_a, dummy;
};

class SourceClass
{
public:
	SourceClass();

	void set(double p, double v, double a);

    // UPDATE: Separating out...did not pay off
	//void updatep();
	//void updatev();
	void update();

private:
	double m_p, m_v, m_a, dummy;
};

types.cpp has the implementation of the SourceClass, same code as HeaderClass, just separated out. This separation was to test Link Time Code Generation (LTCG) performance, whether it would result in the same performance as the header-only implementation.  UPDATE: LTCG definitely works as advertised.

The C++ main.cpp has macros for the scale of the tests and timing results, and functions that do the data processing.

C++
#include "types.h"

#include <stdio.h>
#include <stdlib.h>

#include <chrono>

#define SAMPLE_SIZE 10 * 1024 * 1024
#define TEST_RUNS 4

double parr[SAMPLE_SIZE];
double varr[SAMPLE_SIZE];
double aarr[SAMPLE_SIZE];

BareStruct structs[SAMPLE_SIZE];

HeaderClass headers[SAMPLE_SIZE]; 
SourceClass sources[SAMPLE_SIZE];

typedef std::chrono::high_resolution_clock timer;
typedef std::chrono::milliseconds ms;
typedef std::chrono::duration<int> fsec;

#define START_RUN(name) { timer t; printf("%s: ", name); auto start = t.now();
#define END_RUN() \
    auto end = t.now(); \
    int elapsedMs = (int)std::chrono::duration_cast<ms>(end - start).count(); \
    printf("%d ms\n", elapsedMs); \
}

void updateArrays1()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        parr[i] += varr[i];
    }

    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        varr[i] += aarr[i];
    }
}

void updateArrays2()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        parr[i] += varr[i];
        varr[i] += aarr[i];
    }
}

void updateStructs1()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        structs[i].p += structs[i].v;
    }

    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        structs[i].v += structs[i].a;
    }
}

void updateStructs2()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        structs[i].p += structs[i].v;
        structs[i].v += structs[i].a;
    }
}

void updateHeaders1()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        headers[i].updatep();
    }

    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        headers[i].updatev();
    }
}

void updateHeaders2()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        headers[i].update();
    }
}

void updateSources1()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        sources[i].updatep();
    }

    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        sources[i].updatev();
    }
}

void updateSources2()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        sources[i].update();
    }
}

int main()
{
    printf("Setting up...\n");
    srand(0);
    for (int i = 0; i < SAMPLE_SIZE; ++i)
    {
        structs[i].p = parr[i] = rand();
        structs[i].v = varr[i] = rand();
        structs[i].a = aarr[i] = rand();

        headers[i].set(structs[i].p, structs[i].v, structs[i].a);
        sources[i].set(structs[i].p, structs[i].v, structs[i].a);
    }

    for (int o = 1; o <= 3; ++o)
    {
        START_RUN("arrays separate");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateArrays1();
        }
        END_RUN();

        START_RUN("arrays together");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateArrays2();
        }
        END_RUN();

        START_RUN("structs separate");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateStructs1();
        }
        END_RUN();

        START_RUN("structs together");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateStructs2();
        }
        END_RUN();

        START_RUN("header classes separate");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateHeaders1();
        }
        END_RUN();

        START_RUN("header classes together");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateHeaders2();
        }
        END_RUN();

        START_RUN("source classes separate");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateSources1();
        }
        END_RUN();

        START_RUN("source classes together");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateSources2();
        }
        END_RUN();

        printf("\n");
    }
}

The C# code is more concise, as to be expected.

C#
using System;
using System.Diagnostics;

namespace sharpms
{
    // UPDATE: Adding the dummy field did not pay off in C# either
    struct Struct
    {
        public double p, v, a, dummy; 
    }

    // UPDATE: marked class sealed (RE: Andre_Prellwitz) did not pay off
    sealed class Class 
    {
        public void updatep()
        {
            p += v;
        }

        public void updatev()
        {
            v += a;
        }

        public void update()
        {
            p += v;
            v += a;
        }

        public double p, v, a, dummy;
    }

    class Program
    {
        const int SAMPLE_SIZE = 10 * 1024 * 1024;
        const int TEST_RUNS = 4;

        static double[] parr = new double[SAMPLE_SIZE];
        static double[] varr = new double[SAMPLE_SIZE];
        static double[] aarr = new double[SAMPLE_SIZE];

        static Struct[] structs = new Struct[SAMPLE_SIZE];
        static Class[] classes = new Class[SAMPLE_SIZE];

        static void updateArrays1()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                parr[i] += varr[i];
            }

            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                varr[i] += aarr[i];
            }
        }

        static void updateArrays2()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                parr[i] += varr[i];
                varr[i] += aarr[i];
            }
        }

        // UPDATE: Resorting to unsafe code (// RE: igrulya) performs well,
        //         but is not necessary as the next solution shows
        static unsafe void updateArrays3() 
        {
            fixed (double* pparr0 = &parr[0])
            fixed (double* pvarr0 = &varr[0])
            fixed (double* paarr0 = &aarr[0])
            {
                double* pparr = pparr0;
                double* pvarr = pvarr0;
                double* paarr = paarr0;
                int i = SAMPLE_SIZE;
                while (i-- > 0)
                {
                    *pparr++ += *pvarr;
                    *pvarr++ += *paarr++;
                }
            }
        }

        // RE: <a href="https://www.codeproject.com/script/Membership/View.aspx?mid=15342246">OracPrime</a>
        static void updateArrays4() 
        {
            var pref = parr;
            var vref = varr;
            var aref = aarr;
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                pref[i] += vref[i];
                vref[i] += aref[i];
            }
        }

        static void updateStructs1()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                structs[i].p += structs[i].v;
            }

            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                structs[i].v += structs[i].a;
            }
        }

        static void updateStructs2()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                structs[i].p += structs[i].v;
                structs[i].v += structs[i].a;
            }
        }

        static void updateClasses1()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                classes[i].updatep();
            }

            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                classes[i].updatev();
            }
        }

        static void updateClasses2()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                classes[i].update();
            }
        }

        static void updateClasses3()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                classes[i].p += classes[i].v;
                classes[i].v += classes[i].a;
            }
        }

        static Stopwatch sw = new Stopwatch();

        static void Start(string phase)
        {
            Console.Write(phase + ": ");
            sw.Restart();
        }

        static void Finish()
        {
            Console.WriteLine(sw.ElapsedMilliseconds + " ms");
        }

        static void Main(string[] args)
        {
            Random rnd = new Random(0);
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                structs[i] = new Struct();
                classes[i] = new Class();

                classes[i].p = structs[i].p = parr[i] = rnd.Next();
                classes[i].v = structs[i].v = varr[i] = rnd.Next();
                classes[i].a = structs[i].a = aarr[i] = rnd.Next();
            }

            for (int run = 1; run <= 3; ++run)
            {
                Start("arrays separate");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateArrays1();
                }
                Finish();

                Start("arrays together");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateArrays2(); // bug fix RE: Kermel
                }
                Finish();

                Start("arrays unsafe");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateArrays3();
                }
                Finish();

                Start("arrays with refs");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateArrays4();
                }
                Finish();

                Start("structs separate");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateStructs1();
                }
                Finish();

                Start("structs together");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateStructs2();
                }
                Finish();

                Start("classes separate");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateClasses1();
                }
                Finish();

                Start("classes together");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateClasses2();
                }
                Finish();

                Start("classes inlined"); // RE: Andre_Prellwitz
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateClasses3();
                }
                Finish();
            }
        }
    }
}

The C++ Results

C++
arrays separate: 75 ms
arrays together: 76 ms

structs separate: 182 ms
structs together: 100 ms

header classes separate: 182 ms
header classes together: 98 ms

source classes separate: 189 ms
source classes together: 100 ms

You can see that using raw arrays is much faster, 2X, than doing the separate technique with structs or classes. Note that LTCG inlines the Source class's code perfectly, and classes are a bit faster than structs. Interesting.

The C# Results

C#
arrays separate: 129 ms
arrays together: 112 ms

arrays unsafe: 91 ms
arrays with refs: 90 ms

structs separate: 196 ms
structs together: 104 ms

classes separate: 384 ms
classes together: 191 ms
classes inlined: 204 ms

Arrays in C# do not have the great performance gain over others as in C++. Note that C# structs have comparable performance with C++ classes. And C# structs are faster than C# arrays. Strange. As expected, C# classes do not perform well, especially with separated updating.

Conclusion and Points of Interest

  • If you want optimum data processing performance, parallel arrays in C/C++ are the way to go. You had to know that going in.
  • If that code pattern is unacceptable, using C++ classes or C# structs are good choices. The code is more manageable than parallel arrays, and the performance is good.
  • C# classes should be avoided. Sealing them and inlining their code does not make them worthwhile.
  • Using separate loops for handling different fields in classes or structs does not pay off at all, and should be avoided due to the negative effect is has on code quality. Data caching seems to trump code caching big time in this use case.
  • In C#, for parallel arrays you can get the same performance as unsafe code by using local variables to "pin" the data in place while you work with it.

History

  • 13th August, 2021: Initial version
  • 26th August, 2021: Took Andre_Prellwitz's comment and applied it to the code and the article

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
United States United States
Michael Balloni is a manager of software development at a cybersecurity software and services provider.

Check out https://www.michaelballoni.com for all the programming fun he's done over the years.

He has been developing software since 1994, back when Mosaic was the web browser of choice. IE 4.0 changed the world, and Michael rode that wave for five years at a .com that was a cloud storage system before the term "cloud" meant anything. He moved on to a medical imaging gig for seven years, working up and down the architecture of a million-lines-code C++ system.

Michael has been at his current cybersecurity gig since then, making his way into management. He still loves to code, so he sneaks in as much as he can at work and at home.

Comments and Discussions

 
AnswerRe: Typo in the C# implementation Pin
Michael Sydney Balloni19-Aug-21 13:50
professionalMichael Sydney Balloni19-Aug-21 13:50 
QuestionWhich compiler did you use? Pin
BugDigger15-Aug-21 21:00
BugDigger15-Aug-21 21:00 
AnswerRe: Which compiler did you use? Pin
Michael Sydney Balloni16-Aug-21 6:18
professionalMichael Sydney Balloni16-Aug-21 6:18 
AnswerRe: Which compiler did you use? Pin
Michael Sydney Balloni19-Aug-21 13:51
professionalMichael Sydney Balloni19-Aug-21 13:51 
QuestionPossible explanation for the C++ results Pin
Daniel Pfeffer15-Aug-21 14:30
professionalDaniel Pfeffer15-Aug-21 14:30 
AnswerRe: Possible explanation for the C++ results Pin
Michael Sydney Balloni19-Aug-21 13:52
professionalMichael Sydney Balloni19-Aug-21 13:52 
GeneralRe: Possible explanation for the C++ results Pin
Daniel Pfeffer19-Aug-21 18:00
professionalDaniel Pfeffer19-Aug-21 18:00 
Answerreproduced with small changes Pin
merano9914-Aug-21 10:05
mvemerano9914-Aug-21 10:05 
The results are somewhat amazing. I also tested it myself.

To compile the code i had to change the Macros, because compiler did not like t.
C++
#define START_RUN(name) { printf("%s: ", name); auto start = timer::now();
#define END_RUN() \
	auto end = timer::now(); \
	int elapsedMs = (int)std::chrono::duration_cast<ms>(end - start).count(); \
	printf("%d ms\n", elapsedMs); \
}

My results for x64 release (only nativ)
In addition, the speed optimization switch (/ Ot) is switched on.

You see two calls in a row; started with the command line.

C++
----
arrays separate: 91 ms
arrays together: 78 ms
structs separate: 177 ms
structs together: 90 ms
header classes separate: 173 ms
header classes together: 91 ms
source classes separate: 172 ms
source classes together: 90 ms

arrays separate: 90 ms
arrays together: 78 ms
structs separate: 172 ms
structs together: 88 ms
header classes separate: 175 ms
header classes together: 90 ms
source classes separate: 171 ms
source classes together: 91 ms
----
arrays separate: 90 ms
arrays together: 78 ms
structs separate: 172 ms
structs together: 87 ms
header classes separate: 171 ms
header classes together: 89 ms
source classes separate: 171 ms
source classes together: 91 ms

arrays separate: 89 ms
arrays together: 78 ms
structs separate: 173 ms
structs together: 89 ms
header classes separate: 172 ms
header classes together: 91 ms
source classes separate: 171 ms
source classes together: 91 ms
----

GeneralRe: reproduced with small changes Pin
Michael Sydney Balloni15-Aug-21 11:52
professionalMichael Sydney Balloni15-Aug-21 11:52 
GeneralNot Surprising Pin
Rick York13-Aug-21 10:19
mveRick York13-Aug-21 10:19 
GeneralRe: Not Surprising Pin
Michael Sydney Balloni15-Aug-21 11:51
professionalMichael Sydney Balloni15-Aug-21 11:51 
GeneralRe: Not Surprising Pin
Rick York15-Aug-21 15:46
mveRick York15-Aug-21 15:46 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.