Optimized Image Inversion Using SSE2

Tolga Birdal

4.11/5 (10 votes)

May 2, 2009

CPOL

6 min read

47898

1019

Fast image inversion forms a good basis for optimizing pixel wise operations. We will discuss the ways to achieve the best speed on this inversion operator.

Introduction

Image inversion is the altering of intensities or colors of an image from normal form to an opposite representation.

Background

An inverted image could be interpreted as a digital version of image negatives. After inversion, every color takes the exact opposite one (I know this terminology is not that scientific, but it’s useful as a conceptual information). Let’s put this in more scientific terms. A positive image should be defined as a normal, original RGB or gray image. A negative image denotes a tonal inversion of a positive image, in which light areas appear dark and dark areas appear light. In negative images, a color reversing is also achieved, such that the red areas appear cyan, greens appear magenta, and blues appear yellow. In simpler sense, for the grayscale case, a black and white image, using 0 for black and 255 for white, a near-black pixel value of 5 will be converted to 250, or near-white.

Image inversion is one of the easiest techniques in image processing. Therefore, it’s very applicable to demonstrations of performance, acceleration, and optimization. Many of the state of the art image processing libraries such as OpenCV, Gandalf, VXL etc., perform this operation as fast as possible, even though some more accelerations using parallel hardware are possible.

In this article, I will demonstrate three implementations of image inversion: a basic implementation, an optimized one (in C level), and an optimized one using enhanced instructions of modern CPUs (for our case, SSE2).

The code uses OpenCV, not for the implementation of the algorithms, but only for reading the image and displaying it. So, if you don’t know how to use or configure it, please refer to the OpenCV documentation, or simply: http://opencv.willowgarage.com/wiki/VisualC%2B%2B.

Algorithm

For every pixel I(x,y), the inversion mapping is defined by Y(x,y)=255-I(x,y). For color images, the exact same formula is applied for all three channels (R,G,B). In computer language, this is Y(x,y)=~I(x,y). Here is how the basic algorithm looks like:

int i=0;
for (;i!=src->imageSize; i++)
  dst->imageData[i]=~((uchar)src->imageData[i]);

Optimizations

Bit-wise optimizations

Image inversion is the same as inverting every bit of the pixel value. 1->0 and 0->1. In short, it’s not an operation. Many modern processors have registers composed of 32 (we don’t care about 64 bit machines right now). Because bit inversion is independent of data size, we should acquire data as big as our register size, to fully utilize the registers. That’s why we represent a byte image as an integer pointer, which has four times smaller size. In other words:

Sizeof(uchar) * imagesize = sizeof(int) * imagesize/4

This way, we could gain up to 3x speed.

Loop unrolling

The goal of loop unwinding is to increase the program's speed by reducing (or eliminating) the "end of loop" test on each iteration. Loops can be re-written as a sequence of independent statements which eliminates the loop controller overhead. Significant gains can be realized if the speed gained (by eliminating the "end of loop" test) offsets the performance reduction caused by the increased size of the program. Furthermore, if the statements in the loop are independent of each other, statements that occur earlier in the loop do not affect statements that follow them, and the statements can be executed in parallel. (Wikipedia)

Here is a simple loop unrolling example:

A procedure in a computer program is to delete 100 items from a collection. This is accomplished by means of a for-loop which calls the function delete(item_number):

for (int x = 0; x < 100;  x++)
{
    delete(x);
}

If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared to that for delete(x), loop unwinding can be used to speed it up. This will result in an optimized code fragment like:

for (int x = 0; x < 100;  x += 5)
{
   delete(x);
   delete(x+1);
   delete(x+2);
   delete(x+3);
   delete(x+4);
}

As a result of this modification, the new program has to make only twenty loops, instead of a hundred. There are now a fifth as many jumps and conditional branches that need to be taken, which over many iterations would be a great improvement in the loop administration time. The final algorithm for processing one row looks like:

for( ; i <= step1 - 16; i += 16 )
{
  const int* src1i=(const int*)(src1+i);
  int* dst1i=(int*)(dst1+i);
  int t0 = ~(src1i)[0];
  int t1 = ~(src1i)[1];
  int t2 = ~(src1i)[2];
  int t3 = ~(src1i)[3];

  dst1i[0]=t0;
  dst1i[1]=t1;
  dst1i[2]=t2;
  dst1i[3]=t3;
 }

SSE2 Integer Instructions

Because we process the data using an int pointer, we could make use of SSE2 integer arithmetic. This way, we could process four int values at the same time. Discussion of the SSE2 entire architecture is not the aim of this article. If you are not sure how to use them, please refer to web sites like:

(I am sorry; I don’t know a great SSE tutorial on the web.)

To implement inversion using SSE2, we should take the optimized inversion code and convert it into SSE2. However, as far as I know, SSE2 doesn’t have a direct NOT instruction. Instead, we make use of the pandn (packed and not) instruction. Here, we convert the problem into a more mathematical statement, which is Y(x,y)=~(0xff & I(x,y)). Here, we bitwise-AND the pixel with 255 and apply a bitwise-NOT. There is one more additional computation, but it may still be worth it (because the instruction is directly executed on the hardware). One could simply loop through all the pixels using a loop instruction and apply the same operator in SSE2 to optimize.

Here is the SSE2 instructions I use for this application:

movdqa: Moves 128 bit data from memory or pointer to an XMM register or vice versa.
pandn: "and & not" an XMM register.
loop: continue looping.

Memory alignment has always been an important issue. To prepare the data to be used by SSE processing, we should align our pointers to at least 16 byte boundaries (optimally). In this article, I use align(32) to be on the safe side. Alignment is always good, because the CPU could accomplish faster memory accesses when reading aligned data.

Finally, the entire processing of the image can be written as:

//Traverse:
// unroll the loop 4 times. 4th unroll is a half one.
// (as much as XMM's can hold)

// load 3 pixels
movdqa xmm0,[esi]
movdqa xmm1,[esi+16]
movdqa xmm2,[esi+32]
movdqa xmm3,[esi+48]

// 255- pixel for 3 pixels
pandn  xmm0, xmm7
pandn  xmm1, xmm7
pandn  xmm2, xmm7
pandn  xmm3, xmm7

// output the computed content
movdqa [edi], xmm0
movdqa [edi+16], xmm1
movdqa [edi+32], xmm2
movdqa [edi+48], xmm3

// traverse array
add esi,64
add edi, 64

// move on
loop Traverse;

Conclusion

Please notice that C level optimizations are pretty promising for applications that need performance. Only dig into the assembly code if you really strive to make it better. For some problems, you may not even gain performance!

Issues

Images with widths that are not divisible by 4 will have some left-over pixels at the end of each row. I left out inversion of these leftover pixels on each row. That’s why it’s best if you add this functionality or simply don’t process images with sizes not divisible by 4. This code assumes that you have the image "C:\\Data\\Waterfall.jpg" in your hard drive. If you don't have this image, please modify the code and change the first line inside main to read an image you like.