Introduction

In this article, we'll be accelerating an existing C++ program with GPUs by porting the application to use Intel's oneAPI toolkit and compiler. The application we'll be porting is a C++ program that computes the discrete cosine transform of an image.

This operation is very common in image and video compression codecs (e.g. H.264, WebM, JPEG, WebP, etc.) and also highly amenable to parallelization. First, we'll present the C++ application and briefly describe how it works. For a fuller description of the underlying mathematics of the DCT, you'll want to consult the wiki page which gives not only an accessible description of the calculation, but also a worked computation. Then, we'll demonstrate how we can take advantage of Intel's oneAPI libraries to parallelize the application.

Intel's oneAPI is an open, standards-based programming model, empowering C++ developers to target different accelerator architectures with in a uniform manner. This is in contrast to specialized APIs such as Vulkan, CUDA, or D3D12 which target GPUs in a bespoke manner, or Verilog and VHDL which target FPGAs. When programming with oneAPI, the same code can be used to target CPUs, GPUs, and FPGAs alike. The only thing you need to get started is a working installation of the oneAPI toolkit which ships with DPC++, Intel's extension to the Clang compiler, needed to compile C++ code to target the various supported acclerator types.

An additional benefit of using oneAPI is the ability to deploy applications to Intel's DevCloud, a sandbox environment providing access to compute clusters provisioned with powerful CPUs and FPGAs. Practitioners familiar with the complexity of managing FPGA toolchains will likely be pleased to deploy software in a nearly turnkey environment, targeting FPGAs such as the Arria 10 without needing an immediate investment in hardware. In this article, we'll be writing a small program using oneAPI and DPC++ from scratch and subsequently deploy and run our program on hardware in the DevCloud.

Why Port to a GPU?

It may seem straightforward to think that porting C++ applications to a GPU are a strict win. However, it's important to consider the way in which CPUs and GPUs perform differently and analyze the tradeoffs involved. CPUs are topologically simple processors. A CPU consists of multiple cores (which may be further subdivided into logical cores, e.g., Intel's hyperthreading). Each logical core has its own instruction cache, data cache, and command queue. While a thread is assigned to a core, it has exclusive access to the core's resources and can fetch instructions from memory for evaluation.

In contrast, a GPU's architecture is hierarchical in nature, and designed to maximize throughput. To achieve this, instead of having each core maintain its own caches and command queue, individual processing units are instead grouped into subgroups (also known as warps on Nvidia hardware). Instructions across an entire subgroup are single issue, that is, all threads within a subgroup operate on the same instruction at a given time. Furthermore, while each thread has its own set of vector registers, the entire subgroup has access to fast device memory that is shared. It should be relatively clear why this arrangement of silicon results in higher throughput, fewer resources and instructions are needed to drive more work.

Algorithms that map cleanly to GPUs are those that are amenable to high degrees parallelization. In particular, if a task can be split into subtasks such that the subtasks can work in relative isolation from one another, the GPU is a prime candidate for accelerating the work. The Discrete Cosine Transform is one such algorithm easily accelerated by GPUs which we'll look at in this article.

C++ Discrete Cosine Transform

Without going too much into the mechanics of the Discrete Cosine Transform (henceforth, the DCT), let's just recount a few facts about the DCT:

Like the Fourier Transform, the DCT is a fully invertible transform from the spatial domain to the frequency domain
Unlike the Fourier Transform, the DCT is built on only even functions (cosines)
The DCT is commonly used in image and video compression algorithms; after the DCT is applied, fewer bits are allocated to store the higher frequency components because they are perceptually less significant

When the DCT is actually applied to an image or video frame, it's commonly applied to individual blocks within the raster (8x8 blocks are a typical size). For this demonstration, we'll be performing the following operations:

Load an image and convert it to greyscale
For each 8x8 block within the image, we'll map it to an 8x8 block of frequency amplitudes (applying the DCT)
Immediately, we'll invert the DCT to recover the original image and write it to disk for comparison

This application won't "produce" any result other than an image that should hopefully be identical to the original image. However, the intermediate frequency amplitude values can be quantized and compressed in a typical image or video codec. The encoding and decoding we perform here is largely illustrative of the technique, and the decoded output is a useful visual metric to determine if our code works as intended. While, we won't delve any further into the DCT or its properties, interested readers are encouraged to read the excellent introduction on wikipedia.

Without further ado, let's jump into the code!

Image Loading and Greyscale Conversion

To read and write images, we'll be using the single-file libraries stb_image and stbimagewrite respectively. After putting these files in your source tree, create a main.cpp file with the following included headers at the top.

#include <CL/sycl.hpp>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"

Then, we can load an image like so:

int channels;
int width;
int height;
uint8_t* image = stbi_load(path, &width, &height, &channels, 3);

Each individual RGB pixel can be mapped to greyscale with the following function:

float luminance(uint8_t r, uint8_t g, uint8_t b)
{
float r_lin = static_cast<float>(r) / 255;
float g_lin = static_cast<float>(g) / 255;
float b_lin = static_cast<float>(b) / 255;

// Perceptual luminance (CIE 1931)
return 0.2126f * r_lin + 0.7152 * g_lin + 0.0722 * b_lin;
}

If you aren't familiar with color-theory, just remember that green is perceptually brighter than red and blue, with blue being the least perceptually bright color component of the three.

Then, we can simply map the entire image to greyscale like so:

uint8_t* out = reinterpret_cast<uint8_t*>(std::malloc(width * height));
int8_t* grey = reinterpret_cast<int8_t*>(std::malloc(width * height));

// First, we convert the image to 8-bit greyscale, centered around 0
for (int y = 0; y != height; ++y)
{
    for (int x = 0; x != width; ++x)
    {
        int offset     = y * width + x;
        uint8_t* pixel = &image[3 * offset];
        float lum      = luminance(pixel[0], pixel[1], pixel[2]);
        // The DCT projects the signal to cosines (even functions) so our
        // signal must be centered on zero
        grey[offset] = static_cast<int>(lum * 255) - 128;

       // Write out the direct greyscale conversion for visual comparison
       out[offset] = static_cast<uint8_t>(lum * 255);
    }
}

// Write out the greyscale image
stbi_write_png("grey.png", width, height, 1, out, width);

DCT and Back Again

Next, we'll loop through each 8x8 block within our image, apply the DCT, reverse the DCT, and write the result to a new image.

// Some needed constants
constexpr static int block_size  = 8;
constexpr static float inv_sqrt2 = 0.70710678118f;
constexpr static float pi        = 3.14159265358979323846f;
// scale factor needed for the DCT to be orthonormal
// (1 / sqrt(2 * block_size))
constexpr static float scale = 0.25f;

int block_count = width / block_size;

for (int block_x = 0; block_x != block_count; ++block_x)
{
    for (int block_y = 0; block_y != block_count; ++block_y)
    {
        // Results of the DCT will be stored here
        float dct[block_size * block_size];

        // Apply the discrete cosine transform to each block
        for (int u = 0; u != block_size; ++u)
        {
            float a_u = u == 0 ? inv_sqrt2 : 1.f;

            for (int v = 0; v != block_size; ++v)
            {
                // Compute a single amplitude per iteration
                float& out = dct[v * block_size + u];

                float a_v = v == 0 ? inv_sqrt2 : 1.f;

                float a = scale * a_u * a_v;

                // Initialize amplitude
                out = 0.f;

                for (int x = 0; x != block_size; ++x)
                {
                    for (int y = 0; y != block_size; ++y)
                    {
                        int8_t value = grey[width * (block_y * block_size + y)
                                     + block_x * block_size + x];
                        float cosu   = std::cosf((2 * x + 1) * u * pi / 2 / block_size);
                        float cosv   = std::cosf((2 * y + 1) * v * pi / 2 / block_size);
                        out += cosu * cosv * value;
                    }
                }

                out *= a;
            }
        }

        // Now, the dct array holds the 8x8 frequency amplitudes

        // Next, invert the transform, writing the result out to the
        // original image buffer
        for (int x = 0; x != block_size; ++x)
        {
            for (int y = 0; y != block_size; ++y)
            {
                uint8_t& value = out[width * (block_y * block_size + y)
                               + block_x * block_size + x];

                // Accumulate the results in a floating point register
                float acc = 0.f;
                for (int u = 0; u != block_size; ++u)
                {
                    float a_u = u == 0 ? inv_sqrt2 : 1.f;
                    for (int v = 0; v != block_size; ++v)
                    {
                        float a_v       = v == 0 ? inv_sqrt2 : 1.f;
                        float a         = scale * a_u * a_v;
                        float amplitude = dct[v * block_size + u];
                        float cos_u     = std::cosf((2 * x + 1) * u * pi / 2 / block_size);
                        float cos_v     = std::cosf((2 * y + 1) * v * pi / 2 / block_size);
                        acc += a * amplitude * cos_u * cos_v;
                     }
                 }

                 // Don't forget to reverse the subtraction where we
                 // centered the grey value on zero.
                 value = static_cast<uint8_t>(acc + 128);
            } 
        }
    }
}

stbi_write_png("grey_dct.png", width, height, 1, out, width);

This code snippet might seem like a mouthful at first blush, but the general structure is that we have two sets of loops. The outermost duet of loops iterates through each 8x8 block. In inner loops iterate within the sixty-four entires of an individual 8x8 block.

With any luck, this code should produce a pair of images like the following (they are supposed to look identical):

The second image here has undergone a DCT and back, and the resemblance should be striking (any error you encounter is likely related to floating point error).

Running on a GPU with Intel's DPC++

Those loops should look extremely parallelizable. After all, each block can be evaluated without any knowledge of what's happening to any other block. Furthermore, within a block, each frequency amplitude can be evaluated independently of the others. After all frequency amplitudes are available, the inverse transform can once again proceed in parallel.

Our strategy then, is to dispatch each 8x8 block as an independent workgroup. Within the workgroup, all frequencies can be evaluated in parallel. Then, we want each thread in the workgroup to block until the forward DCT is complete. After this barrier, all threads will once again be able to resume parallel execution, this time, performing the inverse transform and writing out the data needed.

Initializing a SYCL Device and Queue

In order to submit hardware-accelerated commands, we need to construct a cl::sycl::queue. Queues are needed to marshall commands and memory to a device.

// Use this device selector to select the Gen9 Intel GPU
class gen9_gpu_selector : public cl::sycl::device_selector
{
public:
    gen9_gpu_selector()
      : cl::sycl::device_selector()
    {}

    int operator()(cl::sycl::device const& device) const override
    {
        if (device.get_info<cl::sycl::info::device::name>()
            == "Intel(R) Gen9 HD Graphics NEO")
        {
            return 100;
        }
        return 0;
    }
};

cl::sycl::queue queue{gen9_gpu_selector{}};

Here, we make a custom selector to prioritize the Gen9 integrated GPU on recent Intel CPUs. You can also choose built-in selectors such as the cl::sycl::gpu_selector, cl::sycl::fpga_selector, cl::sycl::host_selector, or just the cl::sycl::default_selector.

Greyscale Conversion

To perform the greyscale conversion, we first need to allocate memory writable by the device but readable by the host. To do this, we'll use DPC++'s unified memory abstractions. Note that this abstraction is not provided by the SYCL standard, but is an extension provided by Intel. We'll also need a buffer to house device memory for the intermediate 0-centered greyscale data.

uint8_t* out = reinterpret_cast<uint8_t*>(
    cl::sycl::malloc_shared(width * height, queue));

cl::sycl::buffer<int8_t, 1> greyscale_buffer{width * height};

The greyscale conversion routine itself is fairly similar to our C++ code. Instead of directly executing on the host, however, the code is submitted to the device queue created earlier.

queue.submit([&greyscale_buffer, &image_buffer, out, width, height](
                 cl::sycl::handler& h) {
    // Request write access to the greyscale buffer
    auto data = greyscale_buffer.get_access<cl::sycl::access::mode::discard_write>(h);
    // Request read access to our image data
    auto image = image_buffer.get_access<cl::sycl::access::mode::read>(h);

    // Parallel-for over each width*height pixel
    h.parallel_for(cl::sycl::range<1>(width * height),
        [image, data, out](cl::sycl::id<1> idx) {
            int offset = 3 * idx[0];
            float lum  = luminance(image[offset],
            image[offset + 1],
            image[offset + 2]);

            // The DCT projects the signal to cosines (even
            // functions) so our signal must be centered on
            // zero
            data[idx[0]] = static_cast<int>(lum * 255) - 128;

            // Write out the direct greyscale conversion for
            // visual comparison
            out[idx[0]] = static_cast<uint8_t>(lum * 255);
        });
});

The get_access<M> method of the SYCL buffer allows us to advertise that code submitted to the queue will access memory in a particular way. The SYCL runtime, in turn, will sequence queue submissions along with any need memory synchronization. Furthermore, note that while we wrote the luminance function with vanilla C++, the compiler is able to compile it to executable code suitable for the device we target.

Another distinction between the DPC++ code an our C++ code is that loops are expressed in terms of range abstractions. For example, cl::sycl::range<1>(20) represents a 1-dimensional range consisting of 20 items. In the kernel function itself, a cl::sycl::id<1> index object is passed to communicate to the thread which item in the range the thread is assigned to. We'll see another important type of range in the next section.

Forward and Inverse DCT Workgroup Submission

As we mentioned before, we want to submit 8x8 blocks in parallel, with another layer of parallelism within each block. The way in which this is expressed in SYCL is with a different type of range, namely, the cl::sycl::nd_range. We'll also need a way to request memory that is transient but shared across the workgroup. Let's show what the structure of our forward and inverse DCT submission will look like:

// Now, we do another parallel-for, but use hierarchical parallelism to
// dispatch 8x8 blocks in parallel
int block_count = width / block_size;

queue.submit([&greyscale_buffer, out, width, height, block_count](
             cl::sycl::handler& h) {
    auto data = greyscale_buffer.get_access<cl::sycl::access::mode::read>(h);

    // Local memory storage
    cl::sycl::range<1> local_memory_range(block_size * block_size);
    cl::sycl::accessor<float,
                       1,
                       cl::sycl::access::mode::read_write,
                       cl::sycl::access::target::local>
        dct(local_memory_range, h);

    cl::sycl::range<2> workgroup_count(block_count, block_count);
    // An individual workgroup is 8x8
    cl::sycl::range<2> workgroup_size(block_size, block_size);
    // The global range spans with width and height of our image
    auto global = workgroup_count * workgroup_size;

    h.parallel_for(
        cl::sycl::nd_range<2>(global, workgroup_size),
        [data, out, width, dct](cl::sycl::nd_item<2> item) {
            // Ranges from 0 to the block_count - 1
            int block_x = item.get_group(0);
            // Ranges from 0 to the block_count - 1
            int block_y = item.get_group(1);
            // Ranges from 0-7
            int u       = item.get_local_id(0);
            // Ranges from 0-7
            int v       = item.get_local_id(1);

            // Perform forward and inverse DCT here
            // ...
        });
});

Here, we use an nd_range<2> object to denote a two-tier hierarchical range. The top-level hierarchy spans across the entire image, but we supply a second range (called workgroup_size here) to specify the dimensions of the inner-hierarchy. In addition, we create a local memory accessor to create space for 8x8 floats needed to store the temporary DCT data. The cl::sycl::nd_item object received by the lambda passed to the parallel_for has a number of methods needed to retrieve the invocation's position within a workgroup, and workgroup within the global dispatch context. The ranges the group and local ids correspond to can be derived from the workgroup size and counts as illustrated in the comments above.

With this in place, the interior of the parallel-for is relatively straightforward to write:

// Within the parallel-for body:

// Scale factors
float a_u = u == 0 ? inv_sqrt2 : 1.f;
float a_v = v == 0 ? inv_sqrt2 : 1.f;
float a   = scale * a_u * a_v;

float amp = 0.f;
for (int x = 0; x != block_size; ++x)
{
    for (int y = 0; y != block_size; ++y)
    {
        int8_t value = data[width * (block_y * block_size + y)
                     + block_x * block_size + x];
        float cosu   = cl::sycl::cos((2 * x + 1) * u * pi / 2 / block_size);
        float cosv   = cl::sycl::cos((2 * y + 1) * v * pi / 2 / block_size);
        amp += cosu * cosv * value;
    }
}
amp *= a;

// Write amplitude to local storage buffer
dct[v * block_size + u] = amp;

// Invoke a local memory barrier so all writes to dct are
// visible across the workgroup
item.mem_fence<cl::sycl::access::mode::write>(
    cl::sycl::access::fence_space::local_space);
item.barrier(cl::sycl::access::fence_space::local_space);

// Next, invert the DCT
int x          = item.get_local_id(0);
int y          = item.get_local_id(1);
uint8_t& value = out[width * (block_y * block_size + y)
               + block_x * block_size + x];

// Accumulate the results in a floating point register
float acc = 0.f;
for (u = 0; u != 8; ++u)
{
    a_u = u == 0 ? inv_sqrt2 : 1.f;
    for (v = 0; v != 8; ++v)
    {
        a_v         = v == 0 ? inv_sqrt2 : 1.f;
        a           = scale * a_u * a_v;
        amp         = dct[v * block_size + u];
        float cos_u = cl::sycl::cos((2 * x + 1) * u * pi / 2 / block_size);
        float cos_v = cl::sycl::cos((2 * y + 1) * v * pi / 2 / block_size);
        acc += a * amp * cos_u * cos_v;
    }
}

value = static_cast<uint8_t>(acc + 128);

There are a few important callouts to make here. While the bulk of the code is similar to our C++ program, we also needed a mem_fence and barrier. The memory fence as written above ensures that all writes to local shared memory are visible to all invocations within the workgroup, and the barrier ensures that all threads wait until every other thread has encountered the memory fence.

Without this synchronization, invocations in the second half of the kernel may attempt to read memory before the write has occurred or been made visible. Another callout is that our calls to std::cos before are now replaced by calls to cl::sycl::cos. The kernels are compiled just-in-time during program execution, and math functions such as transcental functions here must be sourced from the SYCL runtime.

Deploying to DevCloud

Now that we have a working program, we can deploy the source code and any build scripts to the Intel oneAPI DevCloud. After signing up for the program, you should receive an email with instructions on how to obtain your SSH credentials to login to the DevCloud platform. After following the instructions, you should be able to upload your source file and test image with the following command:

scp main.cpp devcloud:~
scp stb_image.h devcloud:~
scp stb_image_write.h devcloud:~
scp peppers.png devcloud:~

This uploads your source file to the home directory of your assigned DevCloud user account. Then, you can login to deploy your program:

ssh devcloud

Create a script that compiles your program and runs it like so:

#!/usr/bin/env bash
# run.sh

# Ensure the correct environment variables are set
source /opt/intel/inteloneapi/setvars.sh

# Compile our program
dpcpp main.cpp -o dct -std=c++17 -fsycl -lOpenCL

# Invoke the program with our test image
./dct peppers.png

Then, we can use the qsub (queue submission) command to invoke our script on the various compute nodes available on the DevCloud. To see a list of available hosts, run the pbsnodes command. This will list the nodes by id with additional information such as the type of processor running, and any accelerators available to it.

To submit a job to a host with a GPU for example, we can run the following command:

qsub -l nodes=1:gpu:ppn=2 -d . run.sh

The options in short, indicate that we want to run our script on a single node with a GPU, we want to occupy the node fully (the ppn=2 option), we want the working directory to be the current directory, and we want the node to run the run.sh script.

To see the status of your job, you can run qstat which will produce output like the following:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
681510.v-qsvr-1            run.sh           u47956                 0 R batch

The Job ID can be used as an argument to the qdel command in order to cancel a pending job.

Once the job completes, you'll have files like run.sh.o681510 and run.sh.e068150 in your current directory, corresponding to the script stdout and stderr output respectively. If our program ran successfully, you should also have the grey.png and grey_dct.png images you can compare for correctness. Logout using the exit command, and transfer the image back to your host with scp:

scp devcloud:~/grey.png .
scp devcloud:~/grey_dct.png .

Conclusion

In this article, we ported a C++ application to the SYCL runtime with additional extensions provided by the Intel DPC++ compiler. The application demonstrates how a unified programming model can target different architectures, as well as the abstractions the SYCL runtime provides for coordinating access to memory and author parallel code with implicit dependency graphs. Furthermore, we showcase how to express code that maps in a hierarchical fashion specifically in the manner that corresponds to a GPU's architecture for improved performance.

Finally, we showed how to deploy and test our code against different hardware profiles provided by the Intel DevCloud. To learn more about the SYCL runtime, the Intel DevCloud, or the Intel DPC++ compiler, you are encouraged to read more at the Intel DevZone here.

History

9^th November, 2020: Initial version