I program GPUs using OpenCL, but I would be happier with an easier system for parallelizing programs.

Which of the alternatives requires the least code to be changed to move a program onto the GPU? Do C++ AMP and Thrust allow running several functions sequentially inside the GPU before returning results?

What I have tried:

I wrote code using OpenCL, but a lot of work has to be done besides parallelizing the algorithm itself: system preparation, buffer creation, data transfers, ...

1 solution

but a lot of work has to be done besides parallelizing the algorithm itself: system preparation, buffer creation, data transfers


If you are re-writing the whole set of OpenCL stages for every new piece of work, then you should shorten the writing part with some OOP: create objects that each do a specialized part of OpenCL for you, following RAII-style rules. That way, one line you write becomes the equal of tens of lines of OpenCL API calls, and you no longer have to remember the order in which "resource release" operations must be done.
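Here is a minimal sketch of that RAII idea, assuming a single GPU device; the GpuContext class name and its layout are hypothetical, not from any library:

C++
#include <CL/cl.h>
#include <stdexcept>

// Hypothetical RAII wrapper: the constructor does the platform/device/
// context/queue boilerplate once, the destructor releases everything
// in the correct order automatically.
class GpuContext
{
public:
  GpuContext()
  {
    cl_uint n = 0;
    if (clGetPlatformIDs(1, &platform_, &n) != CL_SUCCESS || n == 0)
      throw std::runtime_error("no OpenCL platform found");
    if (clGetDeviceIDs(platform_, CL_DEVICE_TYPE_GPU, 1, &device_, &n) != CL_SUCCESS)
      throw std::runtime_error("no GPU device found");
    cl_int err = CL_SUCCESS;
    context_ = clCreateContext(nullptr, 1, &device_, nullptr, nullptr, &err);
    if (err != CL_SUCCESS) throw std::runtime_error("clCreateContext failed");
    queue_ = clCreateCommandQueue(context_, device_, 0, &err);
    if (err != CL_SUCCESS) throw std::runtime_error("clCreateCommandQueue failed");
  }
  ~GpuContext()
  {
    // Release order is decided once, here, and never thought about again
    if (queue_)   clReleaseCommandQueue(queue_);
    if (context_) clReleaseContext(context_);
  }
  GpuContext(const GpuContext &) = delete;
  GpuContext &operator=(const GpuContext &) = delete;
  cl_context context() const { return context_; }
  cl_command_queue queue() const { return queue_; }
private:
  cl_platform_id   platform_ = nullptr;
  cl_device_id     device_   = nullptr;
  cl_context       context_  = nullptr;
  cl_command_queue queue_    = nullptr;
};

int main()
{
  GpuContext gpu; // one line instead of tens of lines of setup code
  // ... create buffers/kernels with gpu.context(), enqueue on gpu.queue() ...
} // queue and context released here, in the right order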

If you don't want to re-invent your own wheel, and you are OK with the latency of someone else's bug-fixing and feature-integration process, then there are other wheels:

SYCL:

C++
#include <CL/sycl.hpp>
#include <cstdlib>
#include <vector>
using namespace cl::sycl;

constexpr size_t LENGTH = 1024;

std::vector<float> h_a(LENGTH);             // a vector
std::vector<float> h_b(LENGTH);             // b vector
std::vector<float> h_c(LENGTH);             // c vector
std::vector<float> h_r(LENGTH, 0xdeadbeef); // result vector, pre-filled with a sentinel

// Fill vectors a, b and c with random float values
int count = LENGTH;
for (int i = 0; i < count; i++) {
  h_a[i] = rand() / (float)RAND_MAX;
  h_b[i] = rand() / (float)RAND_MAX;
  h_c[i] = rand() / (float)RAND_MAX;
}
{
  // Device buffers over the host vectors
  buffer<float, 1> d_a(h_a.data(), range<1>(count));
  buffer<float, 1> d_b(h_b.data(), range<1>(count));
  buffer<float, 1> d_c(h_c.data(), range<1>(count));
  buffer<float, 1> d_r(h_r.data(), range<1>(count));
  queue myQueue;
  myQueue.submit([&](handler &cgh)
  {
    // Data accessors
    auto a = d_a.get_access<access::mode::read>(cgh);
    auto b = d_b.get_access<access::mode::read>(cgh);
    auto c = d_c.get_access<access::mode::read>(cgh);
    auto r = d_r.get_access<access::mode::write>(cgh);
    // Kernel: element-wise addition r = a + b + c
    cgh.parallel_for<class vadd>(range<1>(count), [=](id<1> item) {
      size_t i = item[0];
      r[i] = a[i] + b[i] + c[i];
    });
  });
}
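Note how the extra brace scope around the buffers does the transfers for you: when a SYCL buffer constructed over host memory is destroyed at the end of that block, it waits for the submitted kernels to finish and writes the results back into h_r. There is no explicit transfer or release code at all.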


ArrayFire:

C++
#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>
using namespace af;

// Select a device and display ArrayFire info
int device = argc > 1 ? atoi(argv[1]) : 0;
af::setDevice(device);
af::info();
printf("Create a 5-by-3 matrix of random floats on the GPU\n");
array A = randu(5,3, f32);
af_print(A);
printf("Element-wise arithmetic\n");
array B = sin(A) + 1.5;
af_print(B);
printf("Negate the first three elements of second column\n");
B(seq(0, 2), 1) = B(seq(0, 2), 1) * -1;
af_print(B);
printf("Fourier transform the result\n");
array C = fft(B);
af_print(C);
printf("Grab last row\n");
array c = C.row(end);
af_print(c);
printf("Scan Test\n");
dim4 dims(16, 4, 1, 1);
array r = constant(2, dims);
af_print(r);
printf("Scan\n");
array S = af::scan(r, 0, AF_BINARY_MUL);
af_print(S);
printf("Create 2-by-3 matrix from host data\n");
float d[] = { 1, 2, 3, 4, 5, 6 };
array D(2, 3, d, afHost);
af_print(D);
printf("Copy last column onto first\n");
D.col(0) = D.col(end);
af_print(D);
// Sort A
printf("Sort A and print sorted array and corresponding indices\n");
array vals, inds;
sort(vals, inds, A);
af_print(vals);
af_print(inds);
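One thing worth knowing about the ArrayFire style: element-wise expressions such as sin(A) + 1.5 are evaluated lazily and JIT-compiled into fused kernels, so chaining many element-wise lines does not cost one kernel launch per line.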


CUDA has an easier entry point than OpenCL thanks to its runtime API. If you are OK with being restricted to Nvidia hardware, then have a look at this:

C++
#include <stdio.h>
#include <math.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
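Assuming the file is saved as saxpy.cu, it compiles and runs with the toolkit's own compiler: nvcc saxpy.cu -o saxpy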


But if you prefer CUDA's driver API, the learning curve will be as steep as OpenCL's. Of course, with enough encapsulation you can build your own one-liner algorithms to sort, map and compute things; CUDA's Thrust library (which you asked about) already works in that style, as sketched below. CUDA also has many tools to help developers with debugging, optimizing and computing, so you may not need to invent an FFT for yourself; you can use the relevant library in the CUDA platform (cuFFT) directly.
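As a taste of that one-liner style, here is a minimal Thrust sketch; the saxpy_functor name is just illustrative, but the containers and algorithms are standard Thrust (it ships with the CUDA toolkit):

C++
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <cstdlib>

// Illustrative functor: computes a*x + y for each element pair
struct saxpy_functor
{
  float a;
  saxpy_functor(float a_) : a(a_) {}
  __host__ __device__ float operator()(float x, float y) const
  {
    return a * x + y;
  }
};

int main()
{
  thrust::host_vector<float> h(1 << 20);
  for (size_t i = 0; i < h.size(); i++)
    h[i] = rand() / (float)RAND_MAX;

  thrust::device_vector<float> x = h;              // host-to-device copy
  thrust::device_vector<float> y(x.size(), 2.0f);

  thrust::sort(x.begin(), x.end());                // one-liner GPU sort

  // One-liner SAXPY: y = 2*x + y
  thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                    saxpy_functor(2.0f));

  thrust::host_vector<float> result = y;           // device-to-host copy
  return 0;
}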
 