I tried the code posted below.
My idea was to store partial sums of the input data in the array rms, then place barriers (GLOBAL and LOCAL) to wait until every rms[k] is filled, and finally add them all up to obtain the mean value.
I added some printf calls to warn me about errors in the calculation.
I get errors at the Warning-2 printf but not at Warning-1 or Warning-3, because when work-item 0 adds up all the data, some of the cores have not yet finished calculating their partial sums.

I did not use local memory because the maximum local size is 256, which is much smaller than height=10000.

How can I make the GPU wait until all the partial sums have been calculated?

What I have tried:

I have the following code:
__kernel void hallaRMS2(
	__global float*  data,  //size=WIDTH*HEIGHT
	int WIDTH,
	int HEIGHT,
	__global double *rms //size=HEIGHT
)
{
        int k = get_global_id(0); //0..HEIGHT
		__global float *x=data+k*WIDTH;


		double sum=0.0;

		for (int j=0;j<WIDTH;j++)
		{
			sum+=x[j];
		}

		rms[k]=sum;//to be used to calculate the mean
		if ((rms[k]<100*WIDTH)||(rms[k]>101*WIDTH)) printf("Warning-1: rms[%i]=%lg\n",k,rms[k]);

		barrier(CLK_GLOBAL_MEM_FENCE ); //intended to wait until all rms[k] are filled
		barrier(CLK_LOCAL_MEM_FENCE ); 
		

		if (k==0)
		{
			sum=0.0;
			for (int j=0;j<HEIGHT;j++)
			{
				if ((rms[j]<100*WIDTH)||(rms[j]>101*WIDTH)) printf("Warning-2: rms[%i]=%lg\n",j,rms[j]); 
				sum+=rms[j];
			}
			rms[0]=sum/(double) WIDTH/(double) HEIGHT;
			printf("GPU sum=%lg\n",sum);
			printf("GPU media=%lg\n",rms[0]);
		}
		else
			if ((rms[k]<100*WIDTH)||(rms[k]>101*WIDTH)) printf("Warning-3: rms[%i]=%lg\n",k,rms[k]);
...
Updated 31-Aug-17 1:03am

1 solution

I am not very happy with this solution:
OpenCL 1.2 does not allow synchronization across all work-groups, as I stated; barrier() only synchronizes the work-items within a single work-group. So the kernel has to end, and a new kernel has to be launched, before data from all work-items can be used.
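
For anyone who wants to see the two-kernel approach spelled out, here is a minimal sketch. The kernel names rowSums and finalMean, and the helper emulate_mean, are my own illustrative names, not part of the original code. The `#ifndef __OPENCL_VERSION__` sections are only a host-side shim (an assumption of this sketch) so the same logic can be checked with a plain C compiler; on a real device you would enqueue rowSums with a global size of HEIGHT and then finalMean with a global size of 1. On a default (in-order) command queue the second kernel should only start after the first has finished, which gives the device-wide wait that barrier() cannot.

```c
/* Two-pass mean: a sketch under the assumptions stated above. */
#ifdef __OPENCL_VERSION__
#pragma OPENCL EXTENSION cl_khr_fp64 : enable  /* double needs fp64 support */
#else
/* Host-only shim (illustrative): lets a plain C compiler build the kernels. */
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
static size_t emu_gid;                         /* emulated work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }
#endif

/* Pass 1: one work-item per row; writes the row sum to rowSum[k]. */
__kernel void rowSums(__global const float *data, int WIDTH, int HEIGHT,
                      __global double *rowSum)
{
    int k = (int)get_global_id(0);
    if (k >= HEIGHT) return;                   /* guard padded global sizes */
    __global const float *x = data + (size_t)k * WIDTH;
    double sum = 0.0;
    for (int j = 0; j < WIDTH; j++)
        sum += x[j];
    rowSum[k] = sum;
}

/* Pass 2: run after pass 1 has completed; only work-item 0 does the work. */
__kernel void finalMean(__global const double *rowSum, int WIDTH, int HEIGHT,
                        __global double *mean)
{
    if (get_global_id(0) != 0) return;
    double sum = 0.0;
    for (int j = 0; j < HEIGHT; j++)
        sum += rowSum[j];
    mean[0] = sum / (double)WIDTH / (double)HEIGHT;
}

#ifndef __OPENCL_VERSION__
/* Emulates the two launches one after the other on the host. */
static double emulate_mean(const float *data, int width, int height,
                           double *rowSum)
{
    assert(width > 0 && height > 0);
    double mean = 0.0;
    for (emu_gid = 0; emu_gid < (size_t)height; emu_gid++)
        rowSums(data, width, height, rowSum);  /* "launch" of pass 1 */
    emu_gid = 0;
    finalMean(rowSum, width, height, &mean);   /* "launch" of pass 2 */
    return mean;
}
#endif
```

The second pass is serial here to stay close to the original loop over rms[j]; with HEIGHT=10000 that is probably acceptable, but a work-group reduction could replace it if it ever becomes the bottleneck.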

If somebody knows how to do it in the newer OpenCL 2.x standard, I would appreciate it.

Fortunately for CUDA users, CUDA allows synchronization across the whole device without leaving the kernel. This must be taken into account by anyone translating code from CUDA to OpenCL!
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


