You're sharing the single 'pos' variable among all of your threads.
You'll get random values, because every thread reads and writes that same location. Who knows what value it will hold by the time a thread gets to the next line? It can even change in the middle of executing that line.
Try this: declare pos inside the loop body, so each thread gets its own copy.
    int pos=(j*stride)+(i*3);
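Here is a minimal sketch of that fix as a complete function. The name to_gray_per_pixel, the parameters src and out, and the flat output layout are assumptions made up for illustration; your own code uses ptr and mat. It assumes 3 bytes per pixel and 'stride' bytes per row.

```c
#include <stddef.h>

/* Hypothetical helper: averages the 3 channel bytes of each pixel.
   src: 3 bytes per pixel, 'stride' bytes per row.
   out: width*height floats, stored as out[i*height + j] to mirror mat[i][j]. */
void to_gray_per_pixel(const unsigned char *src, int width, int height,
                       int stride, float *out)
{
    #pragma omp parallel for
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < height; j++) {
            /* Declared inside the loop body, so every thread has its own copy. */
            int pos = (j * stride) + (i * 3);
            out[(size_t)i * height + j] =
                (float)(src[pos] + src[pos + 1] + src[pos + 2]) / 3;
        }
    }
}
```

Variables declared inside the parallel loop body are private to each thread automatically, so no private() clause is needed for pos, i or j here.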
You might get some speed improvement trying it like this.
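    #pragma omp parallel for
    for (int j=0;j<height;j++)
    {
        // Start of line j, computed once per line. Declared inside the loop, so each thread has its own copy.
        unsigned char * LinePtr = ptr + (j * stride);
        for(int i=0;i<width;i++)
        {
            // Read the three channel bytes of this pixel, then step to the next pixel.
            mat[i][j]=(float)((int) LinePtr[0] + LinePtr[1] + LinePtr[2])/3;
            LinePtr += 3;
        }
    }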
There is some overhead in starting the threads and handing out work. So rather than splitting the work per pixel, split it per line. Each thread then gets more calculations done between startup and teardown.
By precalculating the start of the line, the j * stride multiplication runs once per line instead of once per pixel.
Walking a pointer along the line is also usually cheaper than recomputing the full index for every pixel. And with the j * stride calculation gone from the inner loop, the compiler has more room to optimize.
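Also notice that I cast the pixels to an int before adding. Strictly speaking, C already promotes unsigned char operands to int before the addition, so the sum itself won't wrap. The cast just makes that explicit. The wrap at the 255 boundary only happens if the sum is stored back into an unsigned char.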
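To see where the wrap actually happens, here's a small sketch. The function names sum_as_int and sum_as_byte are made up for illustration, and it assumes the usual 8-bit unsigned char. The sum is safe while it stays in an int and only wraps once it's narrowed back to a byte.

```c
/* Sum of three channel bytes kept in an int: cannot wrap (maximum is 765). */
int sum_as_int(unsigned char r, unsigned char g, unsigned char b)
{
    return r + g + b;   /* operands are promoted to int before adding */
}

/* Same sum narrowed back to unsigned char: reduced modulo 256. */
unsigned char sum_as_byte(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)(r + g + b);
}
```

For three bytes of 200 each, sum_as_int gives 600, while sum_as_byte gives 88 (600 modulo 256).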