
HyCuda, a Hybrid Framework Code-Generator for CUDA


Introduction

This article will describe my latest CUDA-related project, called HyCuda. The name is a pun on PyCUDA, but it has nothing to do with Python. Instead, it generates C++ code that allows you to easily build hybrid algorithms.

When I was working on a CUDA implementation, it occurred to me that I couldn't be the only one who wants to experiment with different device setups. I wanted to see how the performance of my program was affected when I ran one or more of its subroutines on a different device. This of course meant implementing the same algorithm twice (once for the CPU and once for the GPU), but the code also needed several modifications to make sure that all the data ends up in the right place. That's when I started developing a template framework to sort out these tasks. That framework, however, was tailored to my needs and to the specific algorithm I was working on. I then started thinking about how to approach this in a more general way, and came up with HyCuda.

HyCuda works by reading a specification-file, provided by the programmer, that describes the algorithm: which subroutines it consists of, what data they use, and what kind of data this is. From this, HyCuda generates a C++ template-framework in which you instantiate a class-template whose template argument specifies the devices you want to use. This allows you to instantiate multiple objects that perform the same algorithm, but on different devices. The template (metaprogramming) mechanism sorts out, at compile-time obviously, which data has to be moved around and when.
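
To give an idea of the mechanism involved, here is a minimal sketch of compile-time device selection. This is not code that HyCuda emits: the CPU/GPU tags correspond to the ones used in the device-policies further down, but VecScale and Algorithm are made-up names, used purely for illustration.

C++
/* Not HyCuda output: a minimal sketch of the compile-time dispatch idea. */
#include <cstddef>
#include <iostream>

struct CPU {};      // device tag: run a routine on the host
struct GPU {};      // device tag: run a routine on the GPU (stubbed out here)

// One specialization of the routine per device tag.
template <typename Device>
struct VecScale;

template <>
struct VecScale<CPU>
{
    static void run(float *v, std::size_t n, float m)
    {
        for (std::size_t i = 0; i != n; ++i)    // plain host loop
            v[i] *= m;
    }
};

template <>
struct VecScale<GPU>
{
    static void run(float *v, std::size_t n, float m)
    {
        // A real build would move the data to the GPU (if it isn't there
        // already) and launch a CUDA kernel; stubbed here so the sketch
        // compiles without nvcc.
        VecScale<CPU>::run(v, n, m);
    }
};

// The algorithm receives its device choice as a template argument, so
// switching devices is a recompile, not a rewrite of the calling code.
template <typename ScaleDevice>
struct Algorithm
{
    static void process(float *v, std::size_t n, float m)
    {
        VecScale<ScaleDevice>::run(v, n, m);
    }
};

int main()
{
    float v[] = {1, 2, 3};
    Algorithm<CPU>::process(v, 3, 2.0f);    // CPU flavor
    Algorithm<GPU>::process(v, 3, 0.5f);    // "GPU" flavor, same calling code
    std::cout << v[0] << ' ' << v[1] << ' ' << v[2] << '\n';
}

HyCuda's generated framework does considerably more than this (per-routine policies, data administration and transfers), but the compile-time selection principle is the same.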

This article will not go into much detail, because I have already written a manual (PDF), available on SourceForge, that you can download. The same manual is also available as HTML on http://hycuda.sourceforge.net, but the formatting is not great; I just used latex2html to generate it and didn't go to the trouble of making it look nice.

Background

If you're not familiar with C++ and/or CUDA, this article probably isn't for you. If you want to get acquainted with GPU programming (particularly using CUDA), then I would suggest reading up on this first.

Using the code

Specification-File

As mentioned in the introduction, HyCuda generates C++ code based on a specification-file that describes the algorithm. The file contains 4 sections, separated by two consecutive percent signs (%%):

  1. Directives (class-name, namespace etc)
  2. Data (types, size, input/output to the algorithm)
  3. Routines (what's their name, on which data do they depend and how?)
  4. Order (in what order should the routines be executed?)

Below is an example, taken from the Example-section of the manual:

/* filename: spec.hycuda */

// Directives-Section
%namespace: Example
%class-name: ExampleAlgorithm
%parameters: Params {
    float m1;
    float m2;
}

%% // Memory-Section
vec1 (i) : float, vectorSize
vec2 (i) : float, vectorSize
vec3     : float, vectorSize
sum  (o) : float, 1

%% // Routine-Section
multiplyM1  : vec1 (rw), vec2 (rw)
multiplyM2  : vec3 (rw)
addVec1Vec2 : vec1 (r), vec2 (r), vec3 (w)
sumVec3     : vec3 (r), sum (w)

%% // Routine-order
%order : $1, $3, $2, $4

This, rather silly, example describes an algorithm that performs some vector operations. It will generate a class called ExampleAlgorithm, declared in the namespace Example, whose behavior is defined by two float-parameters called m1 and m2. Along the way, 3 different vectors are used, namely vec1, vec2 and vec3, each containing vectorSize floats. The first two vectors serve as input, specified by (i), whereas the third one is an intermediate product and therefore has no input or output specifier associated with it. The output is called sum, and contains just one element (again a float).

There will be 4 subroutines that make up the entire algorithm. (Of course, this is a really stupid way of performing these operations; it is purely for educational purposes that the algorithm is broken up this way.) A plain, single-device C++ rendition of these steps is sketched right after the list:

  1. Multiply both input-vectors by the parameter m1
  2. Add the resulting vectors, and call the result vec3
  3. Multiply vec3 by the other parameter m2
  4. Add every element in vec3 and store the result in sum.
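
For reference, here is what these four steps amount to in plain, single-device C++. This is a hypothetical rendition: the function name and signature are my own and are not generated by HyCuda.

C++
/* Hypothetical plain C++ version of the example algorithm, for reference. */
#include <cstddef>
#include <vector>

void exampleAlgorithm(float const *vec1In, float const *vec2In,
                      std::size_t vectorSize, float m1, float m2, float *sum)
{
    // Working copies, because multiplyM1 modifies vec1 and vec2 in place.
    std::vector<float> vec1(vec1In, vec1In + vectorSize);
    std::vector<float> vec2(vec2In, vec2In + vectorSize);
    std::vector<float> vec3(vectorSize);

    for (std::size_t i = 0; i != vectorSize; ++i)   // 1. multiplyM1
    {
        vec1[i] *= m1;
        vec2[i] *= m1;
    }

    for (std::size_t i = 0; i != vectorSize; ++i)   // 2. addVec1Vec2
        vec3[i] = vec1[i] + vec2[i];

    for (std::size_t i = 0; i != vectorSize; ++i)   // 3. multiplyM2
        vec3[i] *= m2;

    *sum = 0;                                       // 4. sumVec3
    for (std::size_t i = 0; i != vectorSize; ++i)
        *sum += vec3[i];
}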

The routine-section lists the names of every subroutine and tells HyCuda what data each of them uses. It also specifies how the data is used: read (r), write (w) or both (rw)/(wr). This information is used to determine whether or not data has to be moved to the current device.

The Device-Policies

I will skip the part where the functions/kernels are implemented; this is described in much detail in the manual. Instead, I will move to the part where you have everything set up and want to run the algorithm. In order to do so, you need to specify which subroutine will run on which device. This is done by passing a template argument, an instantiation of the DevicePolicies class-template, to the resulting class. For your convenience, HyCuda generates a header-file with a default policy already set up. This file looks like the snippet below:

C++
/* filename: examplealgorithm.h */

#include "examplealgorithm_algorithm.h"
#include "examplealgorithm_hybrid.h"
#include "examplealgorithm_devicepolicies.h"

namespace Example {

// Specify which device to use for each of the routines (CPU/GPU)

typedef DevicePolicies <
	MultiplyM1Device  < CPU >,
	MultiplyM2Device  < CPU >,
	AddVec1Vec2Device < CPU >,
	SumVec3Device     < CPU >

> CustomPolicy;

typedef Hybrid_< CustomPolicy > Hybrid;
typedef ExampleAlgorithm_< Hybrid > ExampleAlgorithm;

} //Example

There's actually quite a bit going on in this little snippet, but the main thing is the typedef of the class-template DevicePolicies. It takes a number of template arguments, which are themselves instantiations of class-templates (e.g. MultiplyM1Device). These are all named after the routines in the spec-file, and their own template argument tells the algorithm which device to use to execute that particular routine. When implemented correctly, every permutation of CPU/GPU-parameters in the policy-list will cause the devices to be used in a different manner, while producing the same result.
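
For example, to try a setup in which the two multiplications stay on the CPU while the addition and the summation run on the GPU, only the policy typedef changes. This particular mixture is my own illustration, not an example taken from the manual:

C++
/* filename: examplealgorithm.h -- alternative policy, illustration only */

#include "examplealgorithm_algorithm.h"
#include "examplealgorithm_hybrid.h"
#include "examplealgorithm_devicepolicies.h"

namespace Example {

// Multiplications on the CPU, addition and summation on the GPU.

typedef DevicePolicies <
	MultiplyM1Device  < CPU >,
	MultiplyM2Device  < CPU >,
	AddVec1Vec2Device < GPU >,
	SumVec3Device     < GPU >

> CustomPolicy;

typedef Hybrid_< CustomPolicy > Hybrid;
typedef ExampleAlgorithm_< Hybrid > ExampleAlgorithm;

} //Example

Rebuilding with this policy leaves main() untouched: as described above, the template mechanism works out at compile-time which data has to be moved to and from the GPU, and when.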

Here's the main-function that uses the algorithm to process some input: 

C++
/* examplealgorithm_main.cc */

#include "examplealgorithm.h"
using namespace Example
using namespace std;

size_t readVectorsFromFile(char const *filename, 
                           float **v1, float **v2);

int main(int argc, char **argv)
{
    // Initialize parameters 
    Params params;
    params.m1 = 2;
    params.m2 = 3;
    
    // Initialize vectors
    float *v1, *v2;
    size_t vectorSize = readVectorsFromFile(argv[1], &v1, &v2);
    
    // Initialize input
    Input in;
    in.vec1 = {v1, vectorSize};
    in.vec2 = {v2, vectorSize};
    
    // Initialize output
    float sum;
    Output out;
    out.sum = {&sum, 1};
    
    // Process
    ExampleAlgorithm alg(params);
    alg.process(in, out);
    
    // Output
    cout << "Sum: " << sum << '\n';
}

The input-vectors are not owned by the algorithm; the user should make sure they are properly allocated and initialized. The same holds for the output-data (sum in this case). The algorithm is then initialized using the parameters, and its member process is called, taking the input and output as its arguments. When it returns, sum holds the appropriate value.

Final Remarks

I realize that the information in this article is absolutely insufficient if you actually want to do anything with the program. However, I hope it will provide you with a taste of what HyCuda is and how it operates. For more information, I refer you to the project page again: http://www.sourceforge.net/projects/hycuda.

Furthermore, I would love to get some feedback. Up to now, I haven't written a proper build script to make your life easy. I figured I would only need to do this when there actually are people intending to use it. So, if you think this program has any potential (or none for that matter), please let me know and I will put some extra effort in it.

Building HyCuda

For now, let me post some instructions on how to get going with HyCuda. It comes with a makefile which you should be able to use in any Unix-like environment (my sincere apologies to Windows-users for the inconvenience). The generator needs skeleton files, the paths to which are contained in the parser/skeletons.h header file. Therefore, before you call make:

  1. Find an appropriate (absolute) directory to contain these files (like /etc/skel/hycuda).
  2. Edit skeletons.h to point to this directory.
  3. Build the program using make.
  4. Copy the contents of the skeletons directory to the directory you chose.
  5. Add a symlink to hycuda somewhere on your path, and you should be ready to go!

History 

  • Dec. 18, 2013: First draft.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

