SYCL and OpenCL

Differences and similarities between SYCL and OpenCL
In this article, we compare SYCL and OpenCL, present some specific historical use cases, and demonstrate the differences between the two with a simple code example that performs a vector addition operation.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.

Do you want to write efficient, scalable software that can take advantage of specialized hardware? If so, you should consider Open Computing Language (OpenCL) and SYCL for cross-architecture programming and heterogeneous computing.

Cross-architecture programming lets your code run on different types of hardware, like central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). You can increase application performance by exploiting the unique benefits of these hardware architectures.

Heterogeneous computing is closely related to cross-architecture programming and refers to using a variety of specialized hardware in a single application. Both let developers build efficient, scalable workloads, and both are accessible through the OpenCL programming framework. SYCL (pronounced "sickle") is a C++ abstraction layer that makes offload queues and cross-architecture parallelism more accessible to C++ application developers. SYCL can leverage OpenCL to that end, but it can also target other backends such as oneAPI Level Zero.

With the OpenCL framework, you can write programs that run on various hardware available in a heterogeneous system. For instance, a machine learning algorithm can use the CPU for data processing and the GPU for complex matrix computation. Offloading some of your application workloads to multiple hardware types makes the computation more efficient.
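
As a concrete starting point, here is a minimal sketch that lists the devices a heterogeneous system exposes through OpenCL. It uses the same OpenCL C++ bindings as the full example later in this article; error handling is omitted for brevity.

C++
#include <iostream>
#include <vector>
#include <CL/cl.hpp>

int main() {
    // Enumerate every OpenCL platform and the devices (CPUs, GPUs, FPGAs, ...) it exposes
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    for (auto& platform : platforms) {
        std::vector<cl::Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (auto& device : devices) {
            std::cout << device.getInfo<CL_DEVICE_NAME>() << std::endl;
        }
    }
    return 0;
}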

To achieve this, OpenCL provides a runtime environment that coordinates the execution and transfer of data between multiple CPUs and GPUs. However, OpenCL is a low-level, non-composable, C-based interface. It has enjoyed significant success and popularity, but it never gained universal vendor support or a widely adopted higher-level C++ interface.

SYCL is a high-level C++ abstraction layer for developers. Critically, it is composable, so you can use it freely in applications and their associated libraries without restrictions. SYCL is supported, with good performance, on CPUs, GPUs, and FPGAs. Thanks to open-source LLVM development, even NVIDIA GPUs are well supported, a gap that OpenCL has never fully closed. With SYCL, writing cross-platform programs is straightforward, intuitive, and well-supported.

In this article, we’ll compare SYCL and OpenCL, present some specific historical use cases, and demonstrate the differences between the two with a simple code example that performs a vector addition operation.

Comparing SYCL and OpenCL

OpenCL was developed to allow applications to exploit the parallel computing capabilities of GPUs. It has become the open standard for parallel programming of heterogeneous systems and provides a low-level programming model in C/C++. With OpenCL, you can write efficient applications to run on CPUs, GPUs, and other specialized processors such as FPGAs.

Apple developed the OpenCL specification. OpenCL 1.0 was released alongside Mac OS X Snow Leopard. In 2009, OpenCL became an open standard maintained by the non-profit technology consortium Khronos Group.

Many companies, including Intel, AMD, and NVIDIA, have adopted OpenCL and released their own OpenCL SDKs. While Apple no longer promotes OpenCL, favoring its Metal interfaces for GPU programming, NVIDIA has committed to supporting the latest OpenCL 3.0 specification.

Although NVIDIA's CUDA (Compute Unified Device Architecture) programming model delivers top performance on its GPUs, it is commendable that the company also provides high-quality OpenCL support. NVIDIA's dedication to both CUDA and OpenCL ensures that developers can access multiple GPU computing programming models. For example, the popular deep learning framework PyTorch ships with a CUDA backend, while OpenCL support is available through community projects.

By supporting OpenCL, NVIDIA makes it easier for developers to port their code to NVIDIA GPUs and allows them to use other hardware platforms if needed.

Over the years, OpenCL has enabled greater computational capabilities in numerous fields:

  • In the medical field, it speeds up the processing of medical imaging.
  • In the automotive industry, it improves the performance of autonomous driving systems, making computer vision algorithms and sensor fusion more efficient.
  • In the gaming industry, it’s used to accelerate rendering pipelines.
  • In the finance industry, some banks use it to accelerate Monte Carlo simulations for stock price predictions.

While OpenCL remains important for low-level hardware access, it has limitations for widespread application use. SYCL addresses these limitations.

SYCL emerged in 2015 as a higher-level abstraction layer over OpenCL. It provides a simpler, more intuitive programming model based on modern C++, making it attractive to developers familiar with C++ and the standard template library (STL). Adopting C++ also makes it easier to integrate new SYCL code into existing C++ projects. Using templates and modern C++, including across the host-device boundary, is a big plus. SYCL also supports higher-level abstractions, such as reductions, as illustrated in the sketch below.
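
As an example of those higher-level abstractions, here is a minimal sketch of a sum reduction using sycl::reduction. It assumes a SYCL 2020 implementation (such as DPC++); the buffer and queue setup mirrors the vector-addition example later in this article, and error handling is omitted.

C++
#include <iostream>
#include <vector>
#include <sycl/sycl.hpp>

int main() {
    const size_t n = 100;
    std::vector<float> data(n, 1.0f);
    float sum = 0.0f;

    sycl::queue q;
    {
        sycl::buffer<float> data_buf{data};
        sycl::buffer<float> sum_buf{&sum, sycl::range<1>{1}};

        q.submit([&](sycl::handler& cgh) {
            auto in = data_buf.get_access<sycl::access::mode::read>(cgh);
            // sycl::reduction handles the per-work-group and global combination steps
            auto sum_reduction = sycl::reduction(sum_buf, cgh, sycl::plus<float>());
            cgh.parallel_for(sycl::range<1>(n), sum_reduction,
                             [=](sycl::id<1> i, auto& acc) { acc += in[i]; });
        });
    } // buffers go out of scope here, so the result is written back to the host

    std::cout << "sum = " << sum << std::endl; // expected: 100
    return 0;
}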

SYCL also offers broader device support, since it can be built on many different backends (not just OpenCL) to reach diverse hardware. For example, Intel's SYCL implementation for CPUs uses an OpenCL backend, while the Intel® GPU compute runtime leverages Level Zero to achieve more fine-grained unified shared memory control.
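
To illustrate the kind of fine-grained memory control these backends enable, here is a minimal sketch using unified shared memory (USM). It assumes a SYCL 2020 implementation and a device that supports shared allocations; device selection and error handling are omitted.

C++
#include <iostream>
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q; // default device; the runtime picks the backend (e.g., OpenCL or Level Zero)
    const size_t n = 100;

    // One allocation visible to both host and device; no buffers or accessors needed
    float* data = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    // The kernel captures the raw pointer directly
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { data[i] *= 2.0f; }).wait();

    std::cout << data[0] << std::endl; // expected: 2
    sycl::free(data, q);
    return 0;
}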

A particularly appealing advantage of SYCL over OpenCL is its single-source programming model, which unifies host and device code in a single source file, making it easier to write, understand, and maintain code. SYCL also provides abstractions, such as accessors and buffers, that simplify the programming of heterogeneous architectures and reduce repetitive code.

Syntax Differences: SYCL and OpenCL

To understand the syntax differences between SYCL and OpenCL, let's look at implementations of a simple vector addition in both programming models. The kernel takes vectors a and b and stores the element-wise sum in vector c. The work is done in parallel on specialized hardware, such as a GPU, so elements a[0] + b[0] and a[1] + b[1] are computed at the same time.

Let’s take a look at the implementation in OpenCL:

C++
#include <iostream>
#include <vector>
#include <CL/cl.hpp>
 
int main() {
    const size_t n = 100;
    std::vector<float> a(n, 1.0f);
    std::vector<float> b(n, 2.0f);
    std::vector<float> c(n, 0.0f);
 
    // Get available platforms
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
 
    // Get available devices
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
 
    // Create a context and command queue
    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0]);
 
    // Create buffers
    cl::Buffer a_buf(context, CL_MEM_READ_ONLY, n * sizeof(float));
    cl::Buffer b_buf(context, CL_MEM_READ_ONLY, n * sizeof(float));
    cl::Buffer c_buf(context, CL_MEM_WRITE_ONLY, n * sizeof(float));
 
    // Copy data to buffers
    queue.enqueueWriteBuffer(a_buf, CL_TRUE, 0, n * sizeof(float), a.data());
    queue.enqueueWriteBuffer(b_buf, CL_TRUE, 0, n * sizeof(float), b.data());
 
    // Create and build program
    std::string kernel_code = R"(
        __kernel void add_vectors(__global const float* a,
                                   __global const float* b,
                                   __global float* c) {
            const int i = get_global_id(0);
            c[i] = a[i] + b[i];
        }
    )";
 
    cl::Program program(context, kernel_code);
    program.build(devices);
 
    // Create kernel
    cl::Kernel kernel(program, "add_vectors");
 
    // Set kernel arguments
    kernel.setArg(0, a_buf);
    kernel.setArg(1, b_buf);
    kernel.setArg(2, c_buf);
 
    // Enqueue kernel
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NullRange);
 
    // Copy results back to host
    queue.enqueueReadBuffer(c_buf, CL_TRUE, 0, n * sizeof(float), c.data());
 
    // Print results
    for (size_t i = 0; i < n; i++) {
        std::cout << c[i] << " ";
    }
    std::cout << std::endl;
 
    return 0;
}

Here’s the SYCL code for the same operation:

C++
#include <iostream>
#include <sycl/sycl.hpp>
 
void add_vectors(sycl::queue& queue, sycl::buffer<float>& a, sycl::buffer<float>& b, sycl::buffer<float>& c) {
   sycl::range n(a.size());

   // Submit a command group; the runtime schedules it on the queue's device
   queue.submit([&](sycl::handler& cgh) {
      // Accessors declare how the kernel uses each buffer (read or write)
      auto in_a = a.get_access<sycl::access::mode::read>(cgh);
      auto in_b = b.get_access<sycl::access::mode::read>(cgh);
      auto out_c = c.get_access<sycl::access::mode::write>(cgh);

      // Launch one work-item per element
      cgh.parallel_for<class add_vectors>(n, [=](sycl::id<1> i) {
               out_c[i] = in_a[i] + in_b[i];
      });
   });
}
 
int main(int, char**) {
   const size_t n = 100;
 
   std::vector<float> a(n, 1.0f);
   std::vector<float> b(n, 2.0f);
   std::vector<float> c(n, 0.0f);
 
   // Buffers wrap the host vectors; the SYCL runtime manages data movement
   sycl::buffer<float> a_buf{a};
   sycl::buffer<float> b_buf{b};
   sycl::buffer<float> c_buf{c};

   // Queue targeting the default device
   sycl::queue q;

   add_vectors(q, a_buf, b_buf, c_buf);

   // Host accessor: waits for the kernel to finish, then reads the results on the host
   auto result = c_buf.get_access<sycl::access::mode::read>();
   for (size_t i = 0; i < n; ++i) {
      std::cout << result[i] << " ";
   }
   std::cout << std::endl;
 
   return 0;
}

Let's compare these two implementations to identify the advantages SYCL has over OpenCL.

In SYCL, a kernel is written as ordinary C++ code, typically a lambda or function object, while in OpenCL a kernel is defined with the __kernel keyword. The __kernel qualifier declares a function that can be executed on an OpenCL device (or multiple devices); the host enqueues it, but it runs only on the device.

In SYCL, there is no need for additional keywords to define a kernel, and the code is cleaner. The same source can be compiled for both the host and the device, and the kernel is submitted to a queue as a command group function object. SYCL's queue tracks the completion of the work targeting a specific device. For arguments, SYCL expects sycl::buffer objects, which are SYCL's representation of memory buffers. In OpenCL, we pass pointers to objects in global memory using the __global qualifier and a float* pointer. OpenCL uses explicit memory management, where the programmer is responsible for allocating and deallocating memory on the device. SYCL provides a memory model in which the runtime handles allocation, deallocation, and data movement.

In SYCL, the buffer objects are stored in global memory, and we use accessors (the sycl::accessor class) to read and write that data from within a kernel. These abstractions provide a higher-level interface for accessing global memory without the additional qualifiers that OpenCL requires.

To execute the kernel, we call parallel_for, which sets the range of elements to process and specifies the kernel function to execute in parallel. SYCL's queue manages the communication between the host and the device. In our example, parallel_for executes the add_vectors kernel over a range of n elements (sycl::range n(a.size())). In OpenCL, we specify the parallel execution of a kernel with clEnqueueNDRangeKernel (enqueueNDRangeKernel in the C++ bindings), which takes the global and local work sizes as arguments. OpenCL exposes a lower-level interface, while SYCL's parallel_for is more abstract and easier to use. When explicit work-group sizes are needed, SYCL provides nd_range, as sketched below.
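
The following fragment is a hedged sketch of how the earlier SYCL function could be rewritten with an explicit nd_range, the closest analogue to the global and local work sizes passed to clEnqueueNDRangeKernel. The function name add_vectors_ndrange is introduced here purely for illustration; it assumes the same includes and buffers as the SYCL example above, and the global size must be divisible by the chosen work-group size.

C++
// Variant of add_vectors with an explicit work-group size
void add_vectors_ndrange(sycl::queue& queue, sycl::buffer<float>& a,
                         sycl::buffer<float>& b, sycl::buffer<float>& c) {
   const size_t global_size = a.size(); // total number of work-items
   const size_t local_size = 20;        // work-group size; must divide global_size

   queue.submit([&](sycl::handler& cgh) {
      auto in_a = a.get_access<sycl::access::mode::read>(cgh);
      auto in_b = b.get_access<sycl::access::mode::read>(cgh);
      auto out_c = c.get_access<sycl::access::mode::write>(cgh);

      cgh.parallel_for<class add_vectors_nd>(
         sycl::nd_range<1>{sycl::range<1>{global_size}, sycl::range<1>{local_size}},
         [=](sycl::nd_item<1> item) {
            const size_t i = item.get_global_id(0);
            out_c[i] = in_a[i] + in_b[i];
         });
   });
}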

Compared to OpenCL, SYCL has had less time to mature and gain widespread adoption. However, it has steadily gained traction in the industry and has been adopted by many companies. In 2018, for example, Intel announced oneAPI, which includes SYCL as a key component. SYCL has several implementations, such as Intel's oneAPI DPC++ compiler, ComputeCpp, hipSYCL, neoSYCL, and triSYCL.

Each implementation has unique features. With Codeplay's added support for oneAPI on NVIDIA and AMD GPU hardware, DPC++ now supports native programming on NVIDIA and AMD GPUs, enabling SYCL to deliver performance comparable to CUDA or HIP while offering a level of portability that CUDA, HIP, and OpenCL cannot.

Conclusion

OpenCL and SYCL have similar goals in supporting diverse collections of accelerators. While SYCL is a modern solution for application-level cross-architecture programming, it is not meant to convince OpenCL users to switch. OpenCL will remain important for some time and will continue to be one of the targets SYCL compilers use to reach hardware.

SYCL has many implementations that provide additional features and enable heterogeneous programming across hardware. For instance, LLVM-based SYCL compilers provide advanced tooling, can deliver excellent native performance on GPUs from several vendors through plugins for NVIDIA and AMD, and continuously incorporate new features from the latest SYCL specification. Open SYCL (formerly known as hipSYCL) is another implementation that tracks the latest specification and is built on top of AMD HIP.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

