Click here to Skip to main content
15,867,308 members
Articles / Programming Languages / SYCL
Article

Why ISO C++ Is Not Enough for Heterogeneous Computing

Rate me:
Please Sign up or sign in to vote.
5.00/5 (1 vote)
9 Dec 2022CPOL6 min read 3.4K   6   1
In this article we look at why C++ with SYCL is the best approach to accelerated computing, and why the C++ standard is a critically important foundation but insufficient for the needs of accelerated computing.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers

Why ISO C++ Is Not Enough for Heterogeneous Computing

Once you work in computational science long enough, you know that programming language and tools come and go. Few of us have the luxury of working in a single programming language anymore. When I joined Intel in 2000, my projects were a mix of Fortran, C, and to my dismay C++. I wasn’t a fan of C++ back then. It took a long time to compile, I didn’t see much point to objects or object-oriented programming, and the constructs that made the language expressive also made it slow. C++ compilers have improved significantly over the last 20 years, so this is no longer true.

Consequently, I wasn’t thrilled about having to relearn C++ when I started using oneAPI. The direct programming approach of oneAPI is Khronos SYCL, which is based on ISO C++17. However, I was surprised at how different C++17 looked compared to C++ circa 2000. It was still cryptic but very expressive, and dare I say: productive, especially when combined with the C++ Standard Template Library (STL) and oneAPI libraries, which brings me to the topic of this discussion: SYCL and its relation to ISO C++.

C++ is an excellent programming language (despite what the Rust advocates say - search C++ vs. Rust for a taste of that debate) that has withstood the test of time. There are a lot of C++ programmers and billions of lines of C++ code out there. However, the “free lunch” that Herb Sutter wrote about ended a long time ago, and he’s now talking about the “hardware jungle.”  We’re well into the era of heterogeneous parallelism, but C++ lacks key features necessary to exploit accelerator devices. The “free lunch” refers to the shift from frequency to core scaling. The “hardware jungle” refers to the proliferation of specialized processors.

First, and most important, C++ has no concept of an accelerator device, so obviously there are no language constructs to discover what devices are available, to offload work to those devices, or to handle exceptions that occur in code running on a device. Exception handling is particularly complicated if code is running asynchronously on the host and various devices. SYCL provides C++ platform, context, and device classes to discover what devices are available and to query the hardware and backend characteristics of those devices. The queue and event classes allow the programmer to submit work to devices and monitor the completion of asynchronous tasks. If an error occurs in code running on a device, asynchronous error handlers can be triggered by the queue::wait_and_throw(), queue::throw_asynchronous(), or event::wait_and_throw() methods, or automatically upon queue or context destruction.

Second, and only slightly less important, C++ has no concept of disjoint memories. Accelerator devices typically have their own memory that is separate from the main memory of the host system. Heterogeneous parallelism requires a way to transfer data from the host memory to the device memory. C++ provides no such mechanism, but SYCL does in the form of buffers/accessors and pointer-based allocation into Unified Shared Memory (USM). Host-device data transfer is done using C++ classes and/or familiar dynamic memory allocation syntax.   

C++ with SYCL

SYCL adds functionality to C++ to support accelerator devices. It just adds a new set of C++ template libraries (Figure 1). This is illustrated in the following side-by-side example, which implements the common SAXPY (Single-precision A times X Plus Y) calculation in C++ (left) and SYCL (right):

C++
#include <vector>

int main()
{
    size_t N{...};
    float A{2.0};


    std::vector<float> X(N);
    std::vector<float> Y(N);
    std::vector<float> Z(N);

    for (int i = 0; i < N; i++)
    {
        // Initialize the X, Y,
        // and Z work arrays.
    }

    for (int i = 0; i < N; i++)
    {
        Z[i] += A * X[i] + Y[i];
    }

}
C++
#include <CL/sycl.hpp>

int main()
{
   size_t N{...};
   float A{2.0};

   sycl::queue Q;
   auto X = sycl::malloc_shared<float>(N, Q);
   auto Y = sycl::malloc_shared<float>(N, Q);
   auto Z = sycl::malloc_shared<float>(N, Q);

   for (int i = 0; i < N; i++)
   {
       // Initialize the X, Y,
       // and Z work arrays.
   }

   Q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i)
   {
       Z[i] += A * X[i] + Y[i];
   });
   Q.wait();
}

The codes are clearly similar. The C++ code is just modified to offload the SAXPY calculation to an accelerator:

  • Line 1: Include the SYCL header.
  • Line 8: Create a SYCL queue for the default accelerator. If no accelerator is available, any work submitted to the queue will simply run on the host.
  • Lines 9-11: Allocate the work arrays in USM. The SYCL runtime will handle data transfer between the host and the accelerator specified in the queue object.
  • Line 19: Submit the SAXPY calculation as a C++ lambda function to be executed in parallel on the accelerator specified in the queue argument.
  • Line 23: Unlike some proprietary approaches to heterogeneous parallel programming, SYCL is an asynchronous, event-based language. Submitting work to a SYCL queue does not block the host (or other accelerators) from continuing to do useful work. The wait() method prevents the host from exiting before the accelerator completes the SAXPY calculation.

Productive Approach

I suggested earlier that oneAPI is a productive approach to acceleration. I offer the next example as evidence. The STL greatly improves the productivity of C++ programmers. The oneAPI Data Parallel C++ Library (oneDPL) is the oneAPI accelerated implementation of the C++ STL. The following side-by-side example performs a maxloc reduction using the C++ STL (left) and oneDPL (right):

C++
#include <iostream>
#include <vector>
#include <algorithm>


int main()
{
  std::vector<int> data{2, 2, 4, 1, 1};
  auto maxloc = max_element(

                        data.cbegin(),
                        data.cend());

  std::cout << "Maximum at element "
    << distance(data.cbegin(),
                maxloc)
    << std::endl;
}
C++
#include <iostream>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/iterator>

int main()
{
  std::vector<int> data{2, 2, 4, 1, 1};
  auto maxloc = oneapi::dpl::max_element(
          oneapi::dpl::execution::dpcpp_default,
          data.cbegin(),
          data.cend());

  std::cout << "Maximum at element "
       << oneapi::dpl::distance(data.cbegin(),
                                maxloc)
       << std::endl;
}

Once again, the codes are similar:

  • Lines 2-4: Include the necessary oneDPL headers.
  • Line 9 and 15: Call the oneDPL implementations of the STL max_element() and distance() functions.
  • Line 10: Provide an execution policy, in this case the policy for the default accelerator. If no accelerator is available, any work submitted to the queue will simply run on the host.

There’s no explicit host-device data transfer in the oneDPL code. It’s handled implicitly by the runtime, which reinforces my contention that oneAPI offers a productive, portable approach to heterogeneous acceleration. It is worth noting that the SYCL programs will work on any system with a SYCL compiler and runtime    (Figure 2).

Image 1

Figure 1: SYCL is just C++ with additional template libraries to support heterogeneous accelerators. (Source: Khronos.org/sycl).

 Image 2

Figure 2: The SYCL compiler ecosystem is growing. (Source: Khronos.org/sycl).

C++ is a language with entrenched programmer expectations and stability requirements, so changes to the standard are made carefully, albeit slowly. Heterogeneous parallelism is just one of many new features that future versions of C++ will have to support. The P2300R5 std::execution proposal could help address current limitations of C++ with respect to heterogeneity by defining a “...framework for managing asynchronous execution on generic execution contexts” and a “...model for asynchrony, based around ... schedulers, senders, and receivers...” This proposal has been sent to the Library Working Group for C++26. In the meantime, notice the comment in the upper right corner of Figure 2. SYCL gives us a flexible and non-proprietary way to express heterogeneous parallelism while the ISO C++ standard evolves.

Learn More

In order to learn more about C++ with SYCL programming, the following sources are good places to start:

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States United States
You may know us for our processors. But we do so much more. Intel invents at the boundaries of technology to make amazing experiences possible for business and society, and for every person on Earth.

Harnessing the capability of the cloud, the ubiquity of the Internet of Things, the latest advances in memory and programmable solutions, and the promise of always-on 5G connectivity, Intel is disrupting industries and solving global challenges. Leading on policy, diversity, inclusion, education and sustainability, we create value for our stockholders, customers and society.
This is a Organisation

42 members

Comments and Discussions

 
GeneralMy vote of 5 Pin
tugrulGtx11-Dec-22 3:47
tugrulGtx11-Dec-22 3:47 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.