5 Minutes to Your First oneAPI App on DevCloud

Jeremy C. Ong

5.00/5 (3 votes)

Nov 10, 2020

CPOL

8 min read

4483

A quick demonstration of writing an application using DPC++ and oneAPI and deploying to Intel DevCloud

Introduction

Intel's oneAPI is an open, standards-based programming model, empowering C++ developers to target different accelerator architectures in a uniform manner. This is in contrast to specialized APIs such as Vulkan, CUDA, or D3D12 which target GPUs in a bespoke manner, or Verilog and VHDL which target FPGAs. When programming with oneAPI, the same code can be used to target CPUs, GPUs, and FPGAs alike. The only thing you need to get started is a working installation of the oneAPI toolkit which ships with DPC++, Intel's extension to the Clang compiler, needed to compile C++ code to target the various supported accelerator types.

An additional benefit of using oneAPI is the ability to deploy applications to Intel's DevCloud, a sandbox environment providing access to compute clusters provisioned with powerful CPUs and FPGAs. Practitioners familiar with the complexity of managing FPGA toolchains will likely be pleased to deploy software in a nearly turnkey environment, targeting FPGAs such as the Arria 10 without needing an immediate investment in hardware. In this article, we'll be writing a small program using oneAPI and DPC++ from scratch and subsequently deploy and run our program on hardware in the DevCloud.

A Sobel Convolution

The Sobel operator is a simple edge-detection filter that, when convolved with an image, produces a corresponding image with the edges emphasized. It is used primarily in computer vision or rendering applications as a precursor to downstream pipeline stages that perform inference or render various visual effects. To perform a Sobel filter, we need to convolve an image two times with the following filters:

|1  0 -1|
|2  0 -2|
|1  0 -1|

and:

| 1  2  1|
| 0  0  0|
|-1 -2 -1|

Intuitively, the first kernel here will "detect" edges in the x-direction, and the second kernel detects edges in the y-direction. The application of the kernel at a single pixel is, in reality, a directional gradient computed using finite differences. The gradients produced in this way can then be combined to return the norm of the directional derivative at each pixel. Larger norms correspond to edges as seen in the following images:

For a 3x3 kernel, application to a single pixel requires 9 multiplies, and 8 additions to accumulate the result, so 17 operations in total (not accounting for SIMD or MADD or any other such operation fusing, this is just a rough cost estimation). If an image has dimension w by h, then we expect the two convolutions to require on the order of 34wh operations. There is a trick however, which leverages the fact that the convolution operator is separable whenever the convolution matrix is rank 1. In the case of the x-derivative finite difference kernel, we can factor it out as the following outer product:

|1  0 -1|   |1|
|2  0 -2| = |2| * [1  0 -1]
|1  0 -1|   |1|

Subsequently, the image can be convolved in two steps, first with the 3x1 horizontal kernel, then by the 1x3 vertical kernel. These two kernels require 10 operations, meaning that the entire Sobel convolution can be done on the order of 20wh operations instead of 34wh (at the cost of requiring some additional intermediate memory).

The DPC++ we will write now will perform the following steps:

Load an image
Initialize a device queue
Convert the image to greyscale
In parallel
- Perform the horizontal gradient computation in two sequential convolutions
- Perform the vertical gradient computation in two sequential convolutions
Combine the two results from step two into the output image
Write the image to disk

Loading the Image

To read and write images, we'll be using the single-file libraries stb_image and stbimagewrite respectively. After putting these files in your source tree, create a main.cpp file with the following included headers at the top.

#include <CL/sycl.hpp>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"

Then, we can load an image like so:

int channels;
int width; 
int height; 
uint8_t* image = stbi_load(path, &width, &height, &channels, 3);
cl::sycl::buffer<uint8_t, 1> image_buffer{image, width * height * channels};

The memory is already available on the host after the stbi_load function executes. However, we also construct a buffer type to manage read and write barriers to the image later.

Initializing a Device Queue

In order to submit hardware-accelerated commands, we need to construct a cl::sycl::queue. Queues are needed to marshall commands and memory to a device.

cl::sycl::queue queue{cl::sycl::default_selector};

In addition to the default selector (which attempts to pick the "best" one according to a set of heuristics), you can specify a gpu_selector, fpga_selector, host_selector or even a custom selector you define. Without the cl::sycl::queue abstraction, we would need to implement custom logic to interact with different device drivers for each accelerator type. The chief benefit of writing SYCL code is the ability to target all the above with a single unified interface.

Converting the Image to Greyscale

To evaluate the edge gradients, we need to first convert the 3-channel color image to greyscale. On an individual pixel, we perform this mapping with the following function:

float luminance(uint8_t r, uint8_t g, uint8_t b)
{
    float r_lin = static_cast<float>(r) / 255;
    float g_lin = static_cast<float>(g) / 255;
    float b_lin = static_cast<float>(b) / 255;

    // Perceptual luminance (CIE 1931)
    return 0.2126f * r_lin + 0.7152 * g_lin + 0.0722 * b_lin;
}

If you aren't familiar with color-theory, just remember that green is perceptually brighter than red and blue, with blue being the least perceptually bright color component of the three.

Then, we can simply compute the luminance of each pixel in parallel like so:

// This constructor indicates that the memory should be allocated by the runtime
cl::sycl::buffer<float, 1> greyscale_buffer{width * height};

queue.submit([&greyscale_buffer, &image_buffer, image, width, height](
cl::sycl::handler& h) {
    // A discard_write is a write access that doesn't need to preserve existing
    // memory contents
    auto data = greyscale_buffer.get_access<cl::sycl::access::mode::discard_write>(h);
    auto image_data = image_buffer.get_access<cl::sycl::access::mode::read>(h);

    h.parallel_for(cl::sycl::range<1>(width * height),
                   [image_data, data](cl::sycl::id<1> idx) {
                       int offset   = 3 * idx[0];
                       data[idx[0]] = luminance(image_data[offset],
                       image_data[offset + 1],
                       image_data[offset + 2]);
                   });
});

The get_access<M> method of the SYCL buffer allows us to advertise that code submitted to the queue will access memory in a particular way. The SYCL runtime, in turn, will sequence queue submissions along with any need memory synchronization. Furthermore, note that while we wrote the luminance function with vanilla C++, the compiler is able to compile it to executable code suitable for the device we target.

Horizontal and Vertical Convolutions

Next, we need to perform the convolutions needed to compute the edge gradients which we'll store in two buffers:

cl::sycl::buffer<float, 1> dx{width * height};
cl::sycl::buffer<float, 1> dy{width * height};

Now, the horizontal convolution:

// Open a new scope so that dx_tmp is deallocated once no longer needed
{
    cl::sycl::buffer<float, 1> dx_tmp{width * height};

    // Extract a 3x1 window around (x, y) and compute the dot product
    // between the window and the kernel [1, 0, -1]
    queue.submit([&greyscale_buffer, &dx_tmp, width, height](cl::sycl::handler& h) {
        auto data = greyscale_buffer.get_access<cl::sycl::access::mode::read>(h);
        auto out = dx_tmp.get_access<cl::sycl::access::mode::discard_write>(h);

        h.parallel_for(cl::sycl::range<2>(width, height),
                       [data, width, out](cl::sycl::id<2> idx) {
                           int offset = idx[1] * width + idx[0];
                           float left = idx[0] == 0 ? 0 : data[offset - 1];
                           float right = idx[0] == width - 1 ? 0 : data[offset + 1];
                           out[offset] = left - right;
                        });
    });

    // Extract a 1x3 window around (x, y) and compute the dot product
    // between the window and the kernel [1, 2, 1]
    queue.submit([&dx, &dx_tmp, width, height](cl::sycl::handler& h) {
        auto data = dx_tmp.get_access<cl::sycl::access::mode::read>(h);
        auto out  = dx.get_access<cl::sycl::access::mode::discard_write>(h);
        h.parallel_for(
              cl::sycl::range<2>(width, height),
              [data, width, height, out](cl::sycl::id<2> idx) {
                  // Convolve vertically
                  int offset = idx[1] * width + idx[0];
                  float up   = idx[1] == 0 ? 0 : data[offset - width];
                  float down = idx[1] == height - 1 ? 0 : data[offset + width];
                  float center = data[offset];
                  out[offset]  = up + 2 * center + down;
              });
    });
}

The main important thing to note is that the dependency graph of operations is implicitly defined by the memory access barriers we include in each queue submission. For example, while we have no explicit synchronization between the first 3x1 convolution and the greyscale conversion, SYCL guarantees that there is a happens-after relationship between these two steps because the greyscale memory produces data that is read by the convolution.

The vertical convolution is then performed in the same way, except with different kernels:

{
    cl::sycl::buffer<float, 1> dy_tmp{width * height};

    queue.submit([&greyscale_buffer, &dy_tmp, width, height](
                 cl::sycl::handler& h) {
    auto data = greyscale_buffer.get_access<cl::sycl::access::mode::read>(h);
    auto out  = dy_tmp.get_access<cl::sycl::access::mode::discard_write>(h);

    // Create a scratch buffer for the intermediate computation
    h.parallel_for(cl::sycl::range<2>(width, height),
                   [data, width, out](cl::sycl::id<2> idx) {
                       // Convolve horizontally
                       int offset = idx[1] * width + idx[0];
                       float left = idx[0] == 0 ? 0 : data[offset - 1];
                       float right = idx[0] == width - 1 ? 0 : data[offset + 1];
                       float center = data[offset];
                       out[offset]  = left + 2 * center + right;
                    });
    });

    queue.submit([&dy, &dy_tmp, width, height](cl::sycl::handler& h) {
        auto data = dy_tmp.get_access<cl::sycl::access::mode::read>(h);
        auto out  = dy.get_access<cl::sycl::access::mode::discard_write>(h);
        h.parallel_for(
            cl::sycl::range<2>(width, height),
            [data, width, height, out](cl::sycl::id<2> idx) {
                // Convolve vertically
                int offset = idx[1] * width + idx[0];
                float up   = idx[1] == 0 ? 0 : data[offset - width];
                float down = idx[1] == height - 1 ? 0 : data[offset + width];
                out[offset] = up - down;
            });
    });
}

Notice that the vertical and horizontal gradients have no dependence on one another, so SYCL may execute them in parallel.

Combining the Gradients

For each pixel, we can have the gradient projected on the x and y axes, so it's a simple matter to compute the magnitude of the gradient.

// Allocate a memory region shared between the host and device queue
uint8_t* out = reinterpret_cast<uint8_t*>(
    cl::sycl::malloc_shared(width * height, queue));

queue.submit([&dx, &dy, width, height, out](cl::sycl::handler& h) {
    auto dx_data = dx.get_access<cl::sycl::access::mode::read>(h);
    auto dy_data = dy.get_access<cl::sycl::access::mode::read>(h);

    h.parallel_for(cl::sycl::range<1>(width * height),
        [dx_data, dy_data, out](cl::sycl::id<1> idx) {
            float dx_val = dx_data[idx[0]];
            float dy_val = dy_data[idx[0]];
            // NOTE: if deploying to an accelerated device, math
            // functions MUST be used from the sycl namespace
            out[idx[0]] = cl::sycl::sqrt(dx_val * dx_val + dy_val * dy_val) * 255;
    });
});

Here, we write the data back to the unified memory allocation after rescaling to an 8-bit greyscale format. Notice that because we do a buffer read from both dx and dy buffers, this work will be sequenced after both horizontal and vertical edge convolutions finish.

Writing Out the Result

Finally, we're ready to read out the result and write it to disk.

queue.wait();

stbi_write_png("edges.png", weidth, height, 1, out, width);

// Reclaim now unused memory
stbi_image_free(image);
cl::sycl::free(out, queue);

The wait is needed here because unlike before, we don't have a get_access request to create an implicit barrier and we are reading from the back host memory directly.

Deploying to DevCloud

Now that we have a working program, we can deploy the source code and any build scripts to the Intel oneAPI DevCloud. After signing up for the program, you should receive an email with instructions on how to obtain your SSH credentials to login to the DevCloud platform. After following the instructions, you should be able to upload your source file and test image with the following command:

scp main.cpp devcloud:~
scp stb_image.h devcloud:~
scp stb_image_write.h devcloud:~
scp peppers.png devcloud:~

This uploads your source file to the home directory of your assigned DevCloud user account. Then, you can login to deploy your program:

ssh devcloud

Create a script that compiles and runs your program like so:

#!/usr/bin/env bash
# run.sh

# Ensure the correct environment variables are set
source /opt/intel/inteloneapi/setvars.sh

# Compile our program
dpcpp main.cpp -o sobel -std=c++17 -fsycl -lOpenCL

# Invoke the program with our test image
./sobel peppers.png

Then, we can use the qsub (queue submission) command to invoke our script on the various compute nodes available on the DevCloud. To see a list of available hosts, run the pbsnodes command. This will list the nodes by id with additional information such as the type of processor running, and any accelerators available to it.

To submit a job to a host with a GPU for example, we can run the following command:

qsub -l nodes=1:gpu:ppn=2 -d . run.sh

The options, in short, indicate that we want to run our script on a single node with a GPU, we want to occupy the node fully (the ppn=2 option), we want the working directory to be the current directory, and we want the node to invoke the run.sh script.

To see the status of your job, you can run qstat which will produce output similar to the following:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
681510.v-qsvr-1            run.sh           u47956                 0 R batch

The Job ID can be used as an argument to the qdel command in order to cancel a pending job.

Once the job completes, you'll have files like run.sh.o681510 and run.sh.e068150 in your current directory, corresponding to the script stdout and stderr output respectively. If our program ran successfully, you should also have an edges.png image you can check for correctness. Logout using the exit command, and transfer the image back to your host with scp:

scp devcloud:~/edges.png .

Conclusion

In this article, we developed a C++ application using the SYCL runtime with additional extensions provided by the Intel DPC++ compiler. The application demonstrates how a unified programming model can target different architectures, as well as the abstractions the SYCL runtime provides for coordinating access to memory and author parallel code with implicit dependency graphs.

Finally, we showed how to deploy and test our code against different hardware profiles provided by the Intel DevCloud. To learn more about the SYCL runtime, the Intel DevCloud, or the Intel DPC++ compiler, you are encouraged to read more at the Intel DevZone here.

History

9^th November, 2020: Initial version