Zero abstraction overhead with modern compilers

Tobias Wenig

4.84/5 (16 votes)

Jun 24, 2011

GPL3

7 min read

29859

352

This article shows an easy way to implement basic image processing algorithms in a generic STL like way for a pixel iterator and benchmarks the results on different compilers.

Download source code - 11.5 KB

Introduction

Problem: Usually whenever a program is written and abstraction comes into a discussion, one of the first arguments against it is performance. As Donald Knut said: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil". Following that, the complete code is just "written down". No specific "performance tricks" are used to see whether this applies when writing higher abstracted algorithms with modern compilers.

In this article, I will present a basic iterator which could be used for basic image processing to benchmark how costly such an abstraction really is. I would have liked to use many more of the cool C++0x features, but did not want to be too much dependent on just the newer g++ versions. The main idea for that came during the time when I was still writing such algorithms myself and was wondering if there was a better way to deal with that.

When implementing basic image processing algorithms in C++ it usually ends up with a lot of non algorithm related handling:

Different pixel "layouts". E.g., 8 bit gray value, 16 bit gray value, 24 bit RGB etc., and annoying conversions between them.
Handling of image dimensions and ROIs.
Range checking to avoid buffer overrun.
Memory management and clean up.
Performance.

This usually makes the algorithms harder to read and the intention unclear, as this handling somehow hides and mixes with the real algorithmic code. There are several standard solutions to overcome those problems, e.g., different pixel layouts can be addressed using templates, but that still remains non-trivial, in the case of RGB pixels, where each colour channel needs to be processed separately.

The demo project is used to demonstrate and benchmark the efficiency of the PixelIterator class. The source further demonstrates how to convert (if necessary) between different pixel representations. As the project is cross platform compilable, I was using OpenCV as a reference for execution speed and for image visualisation, and OpenMP to parallelize my own algorithms. The project itself was just for me to play with some stuff I have always wanted to have a look at: C++0x, OpenMP, and OpenCV, and actually does not implement anything particularly useful :)

original me

Sobel

Background

It might be useful to have a look at the basic image processing algorithms and the iterator pattern. The code in the article somehow tries to mimic an STL like iterator behaviour.

This article describes the concept of a generic PixelIterator that handles all the above mentioned points and removes the need for handling them in your algorithm code. We will further investigate the performance penalty we will get for the higher abstraction. Actually, the performance aspect was the main motivation for me for this article, as I did not even use the iterator in any of my projects.

Let's start with an easy example in which we just adjust the overall brightness of an image. That’s somewhat how it would look like in plain C++ (actually without the ++):

struct sRGB
{
    char cBlue;
    char cGreen;
    char cRed;
};
sRGB* pInData = (sRGB*)inImage->imageData;
sRGB* pOutData = (sRGB*)out->imageData;
sRGB* pEndIn = pInData + inImage->width * inImage->height;

while( pInData != pEndIn )
{        
    pOutData->cBlue = pInData->cBlue + 50;
    pOutData->cRed = pInData->cRed + 50;
    pOutData->cGreen = pInData->cGreen + 50;
    ++pInData;
    ++pOutData;
}

I avoided the operator+ definition for the sRGB struct for simplicity, as most superstitions about C++ usually point out the overhead of operator overloading and I wanted to avoid that for later use in benchmarked tests. Now let's have a look at how this would look like using the pixel iterator:

Image<RGBPixel> imgIn( in->imageData, in->width, in->height );
Image<RGBPixel> imgOut( imgIn.createNewImage( out->imageData ) );

Image<RGBPixel>::iterator itStart( imgIn.begin() );
Image<RGBPixel>::iterator itEnd( imgIn.end() );
Image<RGBPixel>::iterator itDestStart( imgOut.begin() );

while( itStart != itEnd )
{
    *itDestStart = *itStart + (unsigned char)50;
    ++itDestStart;
    ++itStart;
}

Pretty much the same, not surprising if we assume that iterators should actually have the same behavior as pointers anyway. So what's the benefit? Assume that we would have a much more complex algorithm than we want to implement for 8 bit grayscale, 24 bit color, 32 bit color, and 16 bit grayscale images. It would immediately break our first example as it is "fixed" implemented for RGB and explicitly for that. What would be interesting would be how that influences performance, but we will have a look at that later. What about ROIs? I will not write the "native" example now for comparison, but for the iterator, it's easy like this:

Image<RGBPixel> imgIn( in->imageData, 100, 100, 500, 500, in->width, in->height );

The point is that the "algorithm" does not need to care, it would still remain the same, not even knowing that it's working on an ROI. One other culprit with image operations is that they are usually projecting a 2D problem on a 1D data stream, which is what the iterator already does. For the purpose of testing this, I implemented a simple convolution filter. For that, 2D navigation is needed as a 2D filter is convoluted with a 2D image area (not like the simple 1D brightness example). For that, the iterator implements functions for relative 2D pixel access:

void operator()( size_t iX, size_t iY, TPixelType value );
const TPixelType& operator()( size_t iX, size_t iY )const;
//usage
while( itStart != itEnd )
{
    const PixelType pixel( itStartIn( iXPos ,iYPos ) );
    //some fancy calculations with the pixel
    //...
    //than write the result to the relative result position
    itOutStart(iDestPosX, iDestPosY, result / m_iFilterNorm );
}

How to use

The basic operation is to create an image instance, which technically is just a container of the image related data. The real processing work is done by the PixelIterator class behind the scenes, which is indirectly used through the iterator typedef: (Image<RGBPixel>::iterator). Every other pixel representation could be passed as long as it supports the basic mathematical operations.

Usage with OpenMP

For use with OpenMP, I found the best way is not to parallelize the algorithm itself at all, but to use the iterator to partition the image. Parallelizing the algorithm itself has the drawback of doing specific adaptations for each algorithm. With partitioning, all algorithms can benefit from that. Unfortunately, this "trick" does not work for all image operations, but is OK for most of the trivial ones. I further found that using OpenMP within the algorithm might easily lead to much slower code, as synchronization overhead easily gets a bottleneck which further performance more than it does good.

template<typename TIn, typename TOut, typename TImageOP>
void executeImageOP( PixelIterator<TIn> itStartIn, PixelIterator<TIn> itEndIn, 
     PixelIterator<TOut> itOutStart, TImageOP imageOP )
{
    const int iNumCPUs = omp_get_num_procs();
    std::cout << "omp detected " << iNumCPUs << std::endl;
    int k =0;
    typedef typename PixelIterator<TIn>::ItertorContainer PixelIterCntIn;
    typedef typename PixelIterator<TOut>::ItertorContainer PixelIterCntOut;
    PixelIterCntIn itStartInPart( itStartIn.partition( iNumCPUs ) );
    PixelIterCntOut itStartOutPart( itOutStart.partition( iNumCPUs ) );
    #pragma omp parallel for
    for( k=0;k < iNumCPUs;k++ )
    {
        imageOP( itStartInPart[k], itStartInPart[k].end(), itStartOutPart[k] );
    }
}

Tests and benchmarks

So the question is, how big is the performance penalty for this abstraction? As you can see in the above RGB example, each color channel needs to be processed and for that purpose, I used regular operator overloading for the RGBPixel implementation. So there must be a huge penalty. But let's have a look at the benchmark results:

	VS 10 (Centrino 2 Duo 1.7GHz) Vista	g++ 4.4.4 (Centrino 2 Duo 1.7GHz) Ubuntu 8.04	G++4.5.2 (Core 2 Duo 2.1GHz) Ubuntu 11.04	g++ 4.4.5 (Core 2 Duo 2.1GHz) Ubuntu 10.04	g++ 4.4.5 (Celeron 550MHz) Ubuntu 10.04	g++ 4.4.5 (Core 2 Duo 2.1GHz) Debug Ubuntu 10.04
OpenCV Canny	-	740	646	683	3185	-
C-like naive Sobel	7232	3219	3163	4579	18384	8342
IteratorConvolution	4242	1911	2010	2092	9619	9537
IteratorConvolutionGreyScale	1829	1126	980	1077	7789	6704
IteratorConvolutionGreyScaleParallel	1087	590	562	457	7591	6267
IteratorConvolutionParallel	2466	1030	915	849	9548	9316
BrightnesAdjustIterator	181	48	44	35	311	320
BrightnesAdjust	67	43	36	28	255	73

graph

The canny operator from OpenCV is just here to have some image processing reference for some edge detection algorithm. I am sorry to not provide the canny OpenCV result for VS10, but for some reason, the code did not work there and I was too lazy to fix that (I usually use only Linux at home).

So what about those results? For the different processor types, there was no surprise. What was strange was the slower resulting code with VS10, though I just took the default release settings. Overall, the price to pay for the iterator seems to be small. It even outperforms a naive C implementation I did as a comparison. For the OpenMP parallelised algorithms, we really have good scalability. In my results, we have a performance gain of approximately 100% for two cores. As I also found out, this ratio gets a bit worse for smaller images as the synchronization overhead does not overweigh the gain.

graph

It's also good to see that the code written to be parallelized with OpenMP does not slow significantly down, if executed on single processor systems (OpenMP is disabled in Debug mode). The compilers obviously did a great job optimizing my abstraction overhead away. In debug, it's pretty much visible. There the iterator always loses. So at least in Debug mode, some superstitions about abstraction overhead performance becomes true (a little bit :)).

lessons learned

Writing minimal C-code does not guarantee faster code.
Using OpenMP for writing parallel and scalable algorithms can significantly increase performance with no penalty for single core systems. If possible, use data partitioning which can even be accomplished with no impact to the algorithmic code itself.
When using OpenMP, the fastest thread synchronization is no synchronization. Partitioning data is sometimes the smarter choice.
Using modern C++ compilers, there is almost no excuse to sacrifice abstraction.
If you really want to optimize the last bit, it is often compiler specific, which might backfire once you switch your compiler.