Introduction

VectorizedKernel.h header-only tool aims to help quick prototyping for basic GPGPU problems, by using only CPU that does not require anything else than a C++ compiler and proper compiler flags set. The learning-curve of GPGPU on many platforms are generally affected by quality & availability of the API (such as CUDA having much shorter hello-world than OpenCL and OpenCL having much more diverse device selection) and this project aims to retain the availability by having dependency to:

C++14 compiler (GCC v10 was used for all benchmarks for this project)
Compiler's autovectorization capability
At least an MMX CPU is required but an AVX512, AVX1024(future?), AVX2048(future?) will work too

With the addition of relative simplicity to start compared to a driver-API of CUDA or pure OpenCL code and higher performance than plain single-thread non-explicitly-vectorized code at the expense of verbosity.

The entrance to GPGPU may not be always available rightaway because of not-supported devices (such as GT1030 for CUDA or RX550 for OpenCL) and problematic drivers that add starting overhead. It can be easily solved by hiring cloud computers and managing operating system images with a cost of few dollars per hour. It is generally more acceptable to test a demo for free and quick.

Background

When Nvidia started developing its own general purpose computing framework, many developers were using OpenGL/DirectX/Glide APIs to harness the graphics pipelines for the "general purpose" work. Lighting, transform and other pipelines of graphics chip was used for matrix operations and graphics buffers were used as input/output buffers. Usage of GPUs like these had an attractive speedup ratio compared to the CPUs their time and there were numerous work involving pushing more general purpose work into a GPU.

After CUDA (of Nvidia) had great success, OpenCL (of Apple) followed it by a modest technological lag and was adopted by hardware vendors from Intel to Xilinx to Amd and many others including Nvidia.

While GPGPU frameworks and higher-level libraries evolved to be more and more useful for different types of tasks (including training a neural-network), they also increased their starting overhead for learning, installing and testing. The hello-world of the GPGPU stayed simplest with CUDA but required Nvidia card. The hello-world of OpenCL on the other hand, is much longer on number of lines of codes. There are multiple parts of a hello-world OpenCL program to initialize just to send a letter to GPU:

query platforms
query devices
query supported API-features or versions
query memory size or other features of graphics card
create contex
create kernel
create gpu buffer
create kernel parameters
compute hello world
get results back to RAM (if its not CPU already)
release all the resources in the opposite order they were allocated

Both CUDA and OpenCL have highly dependent either on the code-reuse or the hardware to begin with. Despite the high preparation time, GPU run-time is very good and it is tolerated most of the time. The point of making a simulator is to let prototyping require a shorter code or no hardware dependency (except SSE/AVX that is found on nearly all of the desktop-CPU market) and especially quicker time to start having results.

Using the Code

VectorizedKernel.h header file defines a simple kernel builder function to create a SIMD kernel from a lambda function and it applies the lambda on all workitems of the kernel. Workitem is only a scalar-looking "lane" of a vector-processing task. If the task is to add 1 to all elements of an array, then a workitem is the computation of a[i] = a[i] + 1.

Kernel

createKernel factory function creates the kernel object and takes simd width and kernel lambda:

template<int SimdWidth, typename F, class...Args>
auto createKernel(F&& kernelPrm, KernelArgs<Args...> const& _prm_)
{
        return Kernel<SimdWidth, F, Args...>(std::forward<F>(kernelPrm));
}

The Args... part lets users define their own kernel parameters while SimdWidth is given explicitly as a template integer parameter that is integer power of 2. F parameter takes the lambda function and converts it to two versions, one for body of computations that uses simd = user given value and one for the tail that uses simd = 1. This makes the work scheduling easy and optimized enough without breaking the rule of integer-power-of-2 for any future algorithm that uses current width value of simd inside the kernel function.

How a kernel is created:

auto kernelHelloWorld = Vectorization::createKernel<simd>(
    [&](auto & factory, auto & idThread, int * prm1, float * prm2, int prm3){
        const int currentSimdWidth = factory.width;
		/* scalar code here */
        
	}, 
    Vectorization::KernelArgs<int*,float*, int>{}
);

Hello world version:

#include "VectorizedKernel.h"
#include <iostream>
int main()
{
	constexpr int simd = 8; // >= SIMD width of CPU (bigger  = more pipelining)

	auto kernelHelloWorld = Vectorization::createKernel<simd>(
        [&](auto & factory, auto & idThread, float * bufferIn, float * bufferOut){
            // this value changes only when total work is not multiple of simd
            // i.e. for the last 3 elements of 23-element array, it is 1 (not 8)
		    const int currentSimdWidth = factory.width;

            // allocation should be done before reaching hot-spot of a loop (if any)
		    auto tmp = factory.template generate<float>();

            // this reads a single chunk that is made of neighboring elements
            // starting address = bufferIn + idThread[0]
		    tmp.readFromContiguous(bufferIn,idThread); 

            // adds 1 to all 8 elements at once
            // writes result back to same variable (last parameter of method)
		    tmp.add(1.0f,tmp);
 
            // same as read version, with opposite direction
		    tmp.writeToContiguous(bufferOut,idThread); 

	}, /* defining kernel parameter types */ Vectorization::KernelArgs<float*,float*>{});

	// size does not have to be multiple of simd
	const int n = 23;

	// better performance with aligned buffers if workload is memory-bandwidth-bound
	float vecIn[n], vecOut[n];
	for(int i=0;i<n;i++)
	{
		std::cin>>vecIn[i];
	}

    // cuda-like kernel launching, with n so-called threads
    // actually it is single-thread but on SIMD units
	kernelHelloWorld.run(n,vecIn,vecOut);

	for(int i=0;i<n;i++)
	{
		std::cout<<vecOut[i]<<" ";
	}
	std::cout<<std::endl;
	return 0;
}

When it is compiled by a modern compiler with -march=native and -mavx flags, the produced CPU instructions look like this:

.L2:
mov rsi, rbx
mov edi, OFFSET FLAT:_ZSt3cin
add rbx, 4
cmp rbx, rbp
jne .L2
vmovaps xmm0, XMMWORD PTR .LC0[rip]
lea rbx, [rsp+96]
lea rbp, [rsp+188]
vaddps xmm1, xmm0, XMMWORD PTR [rsp]   <-- 4x add
vmovaps XMMWORD PTR [rsp+96], xmm1
vaddps xmm1, xmm0, XMMWORD PTR [rsp+16] <-- 4x add
vmovaps XMMWORD PTR [rsp+112], xmm1
vaddps xmm1, xmm0, XMMWORD PTR [rsp+32] <-- 4x add
vmovaps XMMWORD PTR [rsp+128], xmm1
vaddps xmm1, xmm0, XMMWORD PTR [rsp+48] <-- 4x add
vaddps xmm0, xmm0, XMMWORD PTR [rsp+64] <-- 4x add
vmovaps XMMWORD PTR [rsp+144], xmm1
vmovq xmm1, QWORD PTR .LC1[rip]
vmovaps XMMWORD PTR [rsp+160], xmm0
vmovq xmm0, QWORD PTR [rsp+80]
vaddps xmm0, xmm0, xmm1
vmovlps QWORD PTR [rsp+176], xmm0
vmovss xmm0, DWORD PTR .LC2[rip]
vaddss xmm0, xmm0, DWORD PTR [rsp+88]
vmovss DWORD PTR [rsp+184], xmm0

The vaddps instruction is computation of 4 values at once (addition for 32-bit floating-point). First five of these instructions compute the main body (20) of 23 workitems. The last vaddps and the vaddss computes the tail part (last 3 elements).

When an extra multiplication operation is added to kernel:

// This value changes only when total work is not multiple of simd
// i.e., for the last 3 elements of 23-element array, it is 1 (not 8)  
const int currentSimdWidth = factory.width;

// allocation should be done before reaching hot-spot of a loop (if any)  
auto tmp = factory.template generate<float>();

// This reads a single chunk that is made of neighboring elements
// starting address = bufferIn + idThread[0] 
tmp.readFromContiguous(bufferIn,idThread);

// adds 1 to all 8 elements at once
// writes result back to same variable (last parameter of method) 
tmp.add(1.0f,tmp);

tmp.mul(3.0f, tmp); <------ an extra multiplication!

// same as read version, with opposite direction 
tmp.writeToContiguous(bufferOut,idThread);

The instruction output looks like this:

vaddps  xmm2, xmm0, XMMWORD PTR [rsp]
vmulps  xmm2, xmm2, xmm1
vmovaps XMMWORD PTR [rsp+96], xmm2
vaddps  xmm2, xmm0, XMMWORD PTR [rsp+16]
vmulps  xmm2, xmm2, xmm1
vmovaps XMMWORD PTR [rsp+112], xmm2
vaddps  xmm2, xmm0, XMMWORD PTR [rsp+32]  <-- 4x add
vmulps  xmm2, xmm2, xmm1                  <-- 4x mul
vmovaps XMMWORD PTR [rsp+128], xmm2
vaddps  xmm2, xmm0, XMMWORD PTR [rsp+48] <-- 4x add
vaddps  xmm0, xmm0, XMMWORD PTR [rsp+64] <-- 4x add
vmulps  xmm2, xmm2, xmm1                 <-- 4x mul
vmulps  xmm0, xmm0, xmm1                 <-- 4x mul
vmovq   xmm1, QWORD PTR .LC2[rip]
vmovaps XMMWORD PTR [rsp+144], xmm2
vmovaps XMMWORD PTR [rsp+160], xmm0
vmovq   xmm0, QWORD PTR [rsp+80]
vaddps  xmm0, xmm0, xmm1
vmovq   xmm1, QWORD PTR .LC3[rip]
vmulps  xmm0, xmm0, xmm1
vmovlps QWORD PTR [rsp+176], xmm0
vmovss  xmm0, DWORD PTR .LC4[rip]
vaddss  xmm0, xmm0, DWORD PTR [rsp+88]
vmulss  xmm0, xmm0, DWORD PTR .LC5[rip]
vmovss  DWORD PTR [rsp+184], xmm0

The compiler adds the necessary vmulps instruction after each vaddps, without extra buffer copying (directly through re-using the same cpu vector register). This part is equivalent to explicitly using C++ x86 intrinsics, with the difference of automatically doing it.

On a Cascadelake processor microarchitecture (currently the cloud servers of godbolt.org), with simd=16 setting and with proper compiler flags (-std=c++2a (c++14 required at least!!) -O3 -march=cascadelake -mavx512f -mprefer-vector-width=512 -ftree-vectorize -fno-math-errno -lpthread), the output of computation of main body (16 workitems with AVX512 + 4 with SSE) is shorter:

vbroadcastss    zmm0, DWORD PTR .LC6[rip]
vmovq   xmm1, QWORD PTR .LC4[rip]
vaddps  zmm0, zmm0, ZMMWORD PTR [rbp-304] <--- 512 bit add operation (16 elements)
lea     rbx, [rbp-176]
lea     r12, [rbp-84]
vmulps  zmm0, zmm0, DWORD PTR .LC7[rip]{1to16} <--- 512 bit mul (16 elements)
vmovaps ZMMWORD PTR [rbp-176], zmm0
vbroadcastss    xmm0, DWORD PTR .LC6[rip]
vaddps  xmm0, xmm0, XMMWORD PTR [rbp-240]      <--- 128 bit add (4 elements)
vmulps  xmm0, xmm0, DWORD PTR .LC7[rip]{1to4}  <--- 128 bit mul (4 elements)
vmovaps XMMWORD PTR [rbp-112], xmm0
vmovq   xmm0, QWORD PTR [rbp-224]
vaddps  xmm0, xmm0, xmm1                      <--- tail i=20, i=21
vmovq   xmm1, QWORD PTR .LC5[rip]
vmulps  xmm0, xmm0, xmm1
vmovlps QWORD PTR [rbp-96], xmm0
vmovss  xmm0, DWORD PTR .LC6[rip]
vaddss  xmm0, xmm0, DWORD PTR [rbp-216]       <--- tail i=22
vmulss  xmm0, xmm0, DWORD PTR .LC7[rip]
vmovss  DWORD PTR [rbp-88], xmm0
vzeroupper

With a higher number of work items (i.e., 1000x1000 pixels for mandelbrot set), the main body of computations is much bigger than the non-aligned tail and the overall performance increases greatly.

Kernel Variables

Kernel variables are created with factory parameter of the lambda. factory is a templated object and to create any type of object, it requires explicit selection of variable's template:

// serial-looking, parallel variable
auto aFloat = factory.template generate<float>();  
  
// array with vertical layout in memory (strided element access --> parallelism)
auto arr = factory.template generateArray<int,5>(); 

// serial-looking, parallel integer initialized from current work-items id value
auto anInt = factory.template generate<int>(idThread);

This is a costly operation because every scalar-looking variable has a plain array inside. This causes stack allocation which is not always free. Due to this reason, variable allocations should be left outside of innermost loops that do the most intense math calculations:

auto j = idThread.modulus(matrixWidth);     // slow!
while(anyTrue)
{
    a.fusedMultiplyAdd(b,c,result);         // fast!
}

while(anyTrue)
{
    auto j = idThread.modulus(matrixWidth); // slow!
    a.fusedMultiplyAdd(b,c,result);         // stalled!
}

Once a variable is created, it is better to re-use it for many times to benefit from SIMD performance. Otherwise, the compiler adds many mov/vmov instructions to prepare the allocations and the real part of the algorithm is stalled and becomes even slower than scalar version if it has low compute-to-data ratio.

Another downside of using plain array inside a struct is not being able to evade the deep data copying on all copy/move - constructors. This causes extra work on the compiler's side and the CPU's side. To not use a plain array, project would be made of 100% define macros and have higher performance, but this is not the way of C++. Define macros should be used only when there is no other way to achieve it (like AVX intrinsics).

Work-Item Index

During kernel launch, every work-item is given an integer id value to indicate its position in the overall task. For a simple array processing task, it can be used for indexing the array. For a 2D matrix multiplication, it can be used to compute modulus by N and division by N to find X-Y coordinates of elements (work-items' own data) in the matrix.

Work scheduling in kernel object's run method is very simple. First part computes all work-items until there is one non-aligned chunk left. Second part computes the remaining tail by launching single work-item at a time:

void run(int n, Args... args)
{
            // main body of task (length aligned to SimdWidth)
            const int nLoop = (n/SimdWidth);
            const KernelDataFactory<SimdWidth> factory;
            auto id = factory.template generate<int>();
            for(int i=0;i<nLoop;i++)
            {
                id.idCompute(i*SimdWidth,[](const int prm){ return prm;});
                kernel(factory, id, args...);
            }

            // tail part of task (m<SimdWidth): launch 1 work-item at a time
            if((n/SimdWidth)*SimdWidth != n)
            {
                const KernelDataFactory<1> factoryLast;

                const int m = n%SimdWidth;
                auto id = factoryLast.template generate<int>();
                for(int i=0;i<m;i++)
                {
                    id.idCompute(nLoop*SimdWidth+i,[](const int prm){ return prm;});
                    kernel(factoryLast, id, args...);
                }
            }
}

The id values are computed contiguously between each work-item, starting with zero. When launching 1000 workitems, latest id value is 999. The id value (being another scalar-looking block data) can be used in any integer-based operation in the kernel:

auto kernel = Vectorization::createKernel<simd>(

                               work-item index
                                ^            kernel parameters
                                |             ^
                                |             |
    [&](auto & factory, auto & idThread, ... prms ...){

        // compute column position in matrix
        auto j = factory.template generate<int>();
        idThread.modulus(width, j); // j is result of operation

        // compute row position in matrix
        auto i = factory.template generate<int>();
        idThread.div(width, i); // i is result of operation

        

        // fill matrix with values i * j
        // 0 0 0 ... 0 0 0
        // 0 1 2 ... N-3 N-2 N-1
        // 0 2 4 ... (N-3)*2 (N-2)*2 (N-1)*2
        // ...
        // 0 N-3
        // 0 N-2
        // 0 N-1 ... (N-1) * (N-1)
        auto element = factory.template generate<int>();

        // lock-step operation, all SIMD lanes do same
        i.mul(j,element);

        // lock-step operation, all SIMD lanes do same
        element.writeTo(matrix,idThread);
       ...

    }, ... prm types ...
);

I/O Patterns

There are three groups of methods to read/write data from/to buffers: Contiguous read/write methods, indexed gather/scatter methods and indexed contiguous read/write methods.

Contiguous Read/Write

void writeTo(Type * const __restrict__ ptr);
void readFrom(const Type * const __restrict__ ptr);

These methods start read/write operation from the given ptr address and every next work-item in same work-group (that is single SIMD group launched in run method) uses uniformly increasing indexing on the ptr address. First work item of the group accesses first item of ptr, last work item in the group accesses currentSimdWidth-1 index on the ptr. Current simd width is queried from factory parameter of kernel lambda (first parameter):

int currentSimdWidth = factory.width;

This indicates the width of read/write operations on the contiguous read/write methods. For the body chunk of task, it is equal to simd template parameter given to kernel creator function. For the tail chunk, it is equal to 1.

Read and write operations are vectorized similar to multiplication and addition methods. Since there is RAM latency, algorithms with low compute-to-data ratio tend to lose more performance on I/O.

auto returnValue = factory.template generate<float>();
a.add(b,returnValue);

// work-item 0 writes to img,
// work-item 1 writes to img+1,
// ..., last work-item in simd writes to img+N-1
returnValue.writeTo(img);

Indexed Contiguous Read/Write

void readFromContiguous(Type * const __restrict__ ptr, const KernelData<int,Simd> & id);
void writeToContiguous(Type * const __restrict__ ptr, const KernelData<int,Simd> & id);

As an indexed version of first group, they only pick a different target region on the I/O buffer. First work item in every simd group decides where to read/write. If simd size is given 32, then first 32 work-items read/write starting with ptr + id[0]:

// writes to img + idThread.data[0], ..., img + idThread.data[0] + simd - 1
returnValue.writeToContiguous(img, idThread);

Gather/Scatter

readFrom(Type * const __restrict__ ptr, const KernelData<int,Simd> & id);
writeTo(Type * const __restrict__ ptr, const KernelData<int,Simd> & id);

Gathering and scattering means every work-item chooses its own target element to read/write in the buffer. The following method can do random-access per work-item:

// writes to img + idThread.data[0], ..., img + idThread.data[simd-1]
returnValue.writeTo(img, idThread);

This is implemented for convenience in writing scalar pattern codes and is not as fast as the contiguous versions. Even on AVX512 CPUs that support gather/scatter instructions, they have high latency due to the nature of randomly accessing memory and losing the benefit of cache-line. AVX1 and SSE do not have gather/scatter instructions and algorithms to run on these CPUs should be optimized with contiguous access always.

When an algorithm is highly dependent on RAM latency (due to data-set being larger than caches), simd size should be chosen bigger to hide loading/storing latencies. RAM is capable of serving multiple requests in parallel by pipelining. When there are multiple requests in flight, some of their latency is hidden.

For algorithms with high compute-to-data ratio, simd size should be chosen closer to (1-2 times) the SIMD hardware's width to benefit from physical registers of CPU core. If simd width is chosen too high, then physical registers of CPU are not enough and L1 cache is used instead. L1 cache has higher latency and lower bandwidth than registers.

Loop Structure

SIMD hardware of CPU core does not have independent lanes. Each lane is locked to all other lanes in same SIMD unit and all do same computation(instruction) on different data. Due to this work-flow, they have limited looping capability.

Fixed Number Of Iterations

When all work-items in same simd group are required to iterate same number of times at the same time, simple loops can be used:

// all work-items do same
for(int i=0;i<100;i++)
{
    // compute a
    a.add(1.0f,a);

    // compute b
    b.mul(1.1f,b);

    // compute result
    a.fusedMultiplyAdd(b,result,result); // a*b + result --> result
}
// all work-items in same SIMD group reach here at the same time

This is only for per-work-group. Different groups can iterate different number of times. The first SIMD number of work-items can iterate for 100 times and next group can iterate for 5 times without any problem since all work groups are independent (they are launched one after another).

Variable Number of Iterations

When each work-item is required to follow a different number of iterations, then all work-items in same SIMD groups have to follow same number of iterations, with masked operations. Masking allows a loop-breaked work-item to continue looping without changing its last state such as the result of algorithm:

bool anyWorkRemaining = true;
auto k = factory.template generate<int>(0);
auto mask = factory.template generate<int>();
auto somethingToCompute = factory.template generate<float>();

// max iterations
const int N = 100;
while(anyWorkRemaining)
{
    // ... compute loop exit condition for all work-items ...
    k.lessThan(N,mask);                  // vectorized computation
    anyWorkRemaining = mask.isAnyTrue(); // vectorized computation

    // compute algorithm, but masked
    // .. compute some new value
    // .. but don't use it if mask is off
    // (mask?newVal:something) --> something
    mask.ternary(newValueComputed,somethingToCompute,somethingToCompute);


    // increment the iteration
    k.add(variableIterationAmount,k);
}

bool anyWorkRemaining = true;

Parallel Factorial Sample

Compilers are generally smart and can (horizontally)parallelize even a factorial function like this:

template<typename Type>
Type factorial(Type input)
{
    Type result = input--;
    if(input+1 == 0)
        return 1;
    while(input>0)
    {
        result *= input--;
    }

    return result>0?result:1;
}

Entrance part of the produced instruction is:

        vpbroadcastq    zmm0, rax
        mov     rdi, r8
        vpaddq  zmm0, zmm0, ZMMWORD PTR .LC9[rip]
        vpbroadcastq    zmm1, QWORD PTR .LC2[rip]
        shr     rdi, 3
        xor     edx, edx
.L26:
        vmovdqa64       zmm2, zmm0
        inc     rdx
        vpmullq zmm1, zmm1, zmm2
        vpaddq  zmm0, zmm0, ZMMWORD PTR .LC10[rip]
        cmp     rdx, rdi
        jne     .L26
        vextracti64x4   ymm0, zmm1, 0x1
        vpmullq ymm1, ymm0, ymm1
        vmovdqa xmm0, xmm1
        vextracti64x2   xmm1, ymm1, 0x1
        vpmullq xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        vpmullq xmm0, xmm0, xmm1
        vmovq   rdx, xmm0
        imul    rax, rdx
        mov     rdx, r8
        and     edx, 7
        and     r8d, 7
        je      .L27

After this optimized part, it continues computation on scalar version:

.L25:
        mov     rdi, rdx
        imul    rax, rdx
        dec     rdi
        je      .L27
        imul    rax, rdi
        mov     rdi, rdx
        sub     rdi, 2
        je      .L27
repeats this many times

Scalar factorial applied on an array of elements(of type uint64_t) with value 10+k%3 runs at 24 cycles per element speed and is already partially vectorized on multiplications by the compiler (Cascadelake AVX512) as shown above. 24 cycle operation translates to 8 nanoseconds for the server CPU of godbolt.org. Considering that the RAM latency is included in 24 cycles, it is fairly fast (on GCC v11).

Benchmarking program for scalar version:

#include <algorithm>
#include <cstdint>
#include"VectorizedKernel.h"

#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire 
                      // before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

template<typename Type>
Type factorial(Type input)
{
    Type result = input--;
    if(input+1 <= 1)
        return 1;

    while(input>0)
    {
        result *= input--;
    }

    return result>0?result:1;
}

constexpr int N = 1000000;

int main() {

  std::vector<uint64_t> input(N);
  std::vector<uint64_t> output(N);
  for(int k=0;k<N;k++)
  {
      input[k]=10+k%3
  }

  for(int i = 0;i<100;i++)
  {
    auto t1 = readTSC();
    for(int k=0;k<N;k++)
    {
        output[k] = factorial(input[k]);
    }
    auto t2 = readTSC();
    std::cout << (t2 - t1)/(N) << " cycles per factorial" << std::endl;
  }

  for(int i=0;i<20;i++)
  {
      std::cout<<output[i]<<" ";
  }
  std::cout<<std::endl;
  return 0;
}

Vectorized kernel version:

With vectorization on elements instead of multiplications, AVX512 can complete it quicker and requires the parallel while-loop. Compiler with AVX512 flags enabled for Cascadelake CPU completes same 1 million element process at 12 cycles per element.

constexpr int simd = 32;
using maskType = uint64_t;
using dataType = uint64_t;
auto kernel = Vectorization::createKernel<simd>(
      [&](auto & factory, auto & id, dataType * inputBuffer, dataType * outputBuffer){
            auto result = factory.template generate<dataType>();

            auto input  = factory.template generate<dataType>();
            input.readFromContiguous(inputBuffer,id);

            // result = input--
            result.readFrom(input);
            input.sub(1,input);

            // if(input+1 <= 1) return 1;
            // mask: 1 means work is in progress, 0 means completed
            auto maskEarlyQuit = 
            factory.template generate<maskType>(1); // type: same byte-length
            auto inputPlus1 = factory.template generate<dataType>();
            input.add(1,inputPlus1);
            inputPlus1.greaterThan(1,maskEarlyQuit);

            // return 1
            maskEarlyQuit.ternary(result,(dataType)1,result);
            maskEarlyQuit.ternary(input,(dataType)1,input);

            //while(input>0){    result *= input--;}
            auto maskedResult = factory.template generate<dataType>();
            auto maskedInput = factory.template generate<dataType>();
            bool workInProgress = true;
            auto mask = 
            factory.template generate<maskType>(1); // type: same byte-length
            const dataType one = 1;
            const maskType zero = 1;
            while(workInProgress)
            {
                // result * input ---> masked result
                result.mul(input,maskedResult);

                // input-- ----> masked input
                input.sub(one,maskedInput);

                //(input>0) && (not early quit)
                maskedInput.greaterThan(zero,mask);
                mask.bitwiseAnd(maskEarlyQuit,mask);
                workInProgress = mask.isAnyTrue();

                // mask?maskedResult:result -----> result
                // mask?maskedInput:input   -----> input
                mask.ternary(maskedResult,result,result);
                mask.ternary(maskedInput,input,input);
            }

            //return result>0?result:1;
            auto gt = factory.template generate<maskType>();
            result.greaterThan(0,gt);
            gt.ternary(result,(dataType)1,result);
            result.writeToContiguous(outputBuffer,id);

      }, Vectorization::KernelArgs<dataType *, dataType *>{}
);

Benchmark program for vectorized version:

#include <algorithm>
#include <cstdint>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire 
                      // before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

template<typename Type>
Type factorial(Type input)
{
    Type result = input--;
    if(input+1 <= 1)
        return 1;
    while(input>0)
    {
        result *= input--;
    }

    return result>0?result:1;
}

#include"VectorizedKernel.h"
constexpr int N = 1000000;

int main() {
  constexpr int simd = 4;
  using maskType = uint64_t;
  using dataType = uint64_t;
  auto kernel = Vectorization::createKernel<simd>(
      [&](auto & factory, auto & id, dataType * inputBuffer, dataType * outputBuffer){
            auto result = factory.template generate<dataType>();

            auto input  = factory.template generate<dataType>();
            input.readFromContiguous(inputBuffer,id);

            // result = input--
            result.readFrom(input);
            input.sub(1,input);

            // if(input+1 <= 1) return 1;
            // mask: 1 means work is in progress, 0 means completed
            auto maskEarlyQuit = 
            factory.template generate<maskType>(1); // type: same byte-length
            auto inputPlus1 = factory.template generate<dataType>();
            input.add(1,inputPlus1);
            inputPlus1.greaterThan(1,maskEarlyQuit);

            // return 1
            maskEarlyQuit.ternary(result,(dataType)1,result);
            maskEarlyQuit.ternary(input,(dataType)1,input);

            //while(input>0){    result *= input--;}
            auto maskedResult = factory.template generate<dataType>();
            auto maskedInput = factory.template generate<dataType>();
            bool workInProgress = true;
            auto mask = factory.template generate<maskType>(1); // type: same byte-length
            const dataType one = 1;
            const maskType zero = 1;
            while(workInProgress)
            {
                // result *= input (masked)
                result.mul(input,maskedResult);

                // input-- (masked)
                input.sub(one,maskedInput);

                //(input>0) && (loop not exited)
                maskedInput.greaterThan(zero,mask);
                mask.bitwiseAnd(maskEarlyQuit,mask);
                workInProgress = mask.isAnyTrue();

                // mask?maskedResult:result -----> result
                // mask?maskedInput:input   -----> input
                mask.ternary(maskedResult,result,result);
                mask.ternary(maskedInput,input,input);
            }

            //return result>0?result:1;
            auto gt = factory.template generate<maskType>();
            result.greaterThan(0,gt);
            gt.ternary(result,(dataType)1,result);
            result.writeToContiguous(outputBuffer,id);

      }, Vectorization::KernelArgs<dataType *, dataType *>{});

  std::vector<dataType> input(N);
  std::vector<dataType> output(N);

  for(int k=0;k<N;k++)
  {
      input[k]=k%100;
  }

  for(int i = 0;i<101;i++)
  {
    if(i%2==0)
    {
        auto t1 = readTSC();
        kernel.run(N,input.data(),output.data());
        auto t2 = readTSC();
        std::cout << (t2 - t1)/(N) << " cycles per factorial (vectorized)" << std::endl;
    }
    else
    {
        auto t1 = readTSC();
        for(int k=0;k<N;k++)
        {
            output[k] = factorial(input[k]);

        }
        auto t2 = readTSC();
        std::cout << (t2 - t1)/(N) << " cycles per factorial (scalar)" << std::endl;
    }
  }

  for(int i=0;i<20;i++)
  {
      std::cout<<output[i]<<" ";
  }
  std::cout<<std::endl;
  return 0;
}

Even with the extra masking operations in the while-loop, it surpasses the performance of scalar(but AVX512 optimized) version. While AVX512 CPUs with two AVX512 units per core have enough performance for the naive factorial loop vectorization, AVX2 and earlier architectures lack the necessary SIMD width to compensate for the high-ratio masking operations, hence the naive loop vectorization has same performance on AVX2 and is slower on SSE. There are multiple ways to lose to a smart compiler and this is one of them, unless CPU has something better to offer that compiler does not take into consideration (yet).

Mandelbrot Set Generation

Mandelbrot set generation using complex numbers is something compiler cannot predict as efficient as the factorial function even with the similar simplicity:

int getPoint(int a, int b)
{
    float x = static_cast<float>(b);
    float y = static_cast<float>(a);
    x = (x - width / 2 - width/4)/(width/3);
    y = (height / 2 - y)/(width/3);

    complex<float> c (x,y);

    complex <float> z(0, 0);
    size_t iter = 0;
    while (abs(z) < 2 && iter <= 35)
    {
        z = z * z + c;
        iter++;
    }
    if (iter < 34) return iter*255/33;
    else return 0;
}

Compiler-produced instructions are 90% scalar (except some memory moves) and the computation performance is 1043 cycles per pixel on Cascadelake (AVX512) CPU, because compiler is unable to unroll the unknown pattern of data. To let it know what's happening, some of the standard structs have to be expanded. For example, std::complex and std::abs are not optimized out by compiler by default. Due to this reason, three different versions of naive scalar versions were tested:

simple naive version in code block above
std::complex expanded explicitly into two floats, without using square root for the abs calculation
std::complex expanded again but with square root

Benchmark results show that the naive version can be optimized highly by simple changes in the code.

                           Performance (cycles per pixel)
Compiler      CPU          std::complex  c;     float creal,cimag; + std::sqrt    
gcc   10      Fx-8150      1770                 175                  215
gcc   11      Xeon-8275CL  1075                 91                  111
clang 9       Xeon-8275CL  252(average 500+)    97                  97              
clang 14      Xeon-8275CL  255(average 500+)    77                  79
icc   2021.5  Xeon-8275CL  800+                 66                  70
icx   2022    Xeon-8275CL  800+                 79                  91
msvc  19      Xeon-8275CL  490                  71                  137

The largest speedup was seen when std::complex was expanded into two floats (real component and imaginary component) and explicitly computed in complex multiplication. This helped the compiler see optimizations that weren't visible with std::complex. Then to test if it was the square root in abs function, it was added back to the loop condition computation and it marginally increased timings.

With plain simple naive version, msvc-19 performed best, at 490 cycles per pixel while clang occasionally achieved 252 cycles per cycle. When std::complex was explicitly computed by two floats, icc performed best. When square root calculation was added back explicitly, winner was icc again.

Even though GCC (the default compiler used in benchmarking in this project) was the worst in unoptimized code, it did perform best in the case of auto-vectorization on the KernelData objects that are very small arrays that have just enough elements to be replaced by a single AVX or AVX512 instruction.

It is unfair to compare an unoptimized naive version against fully optimized vectorized version of code.

Real speedup for FX8150 is from 175 cycles per pixel to 80 cycles per pixel (2.1x performance) because the compiler is able to do "horizontal" parallelization that work on pixel one at a time but computing faster in while-loop depth per pixel. Nevertheless, vertical parallelization is better than horizontal parallelization for some algorithms and Mandelbrot-Set generation looks like one of them.

Real speedup for Cascadelake CPU is from 91 cycles per pixel to 11 cycles per pixel (~8x). Since the depth of while-loop is limited (35) and the AVX512 is 2x the width of AVX2 and there are 2x FMA units in the core, there is a substantial speedup against the horizontal vectorization of compiler.

Test code for three naive scalar versions:

#include <vector>
#include <stdint.h>
#include <iostream>
#include <sstream>
#include <fstream>
#include <complex>

#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire 
                      // before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

constexpr int width = 2000;
constexpr int height = 2000;

unsigned char getPoint(int a, int b)
{
    float x = static_cast<float>(b);
    float y = static_cast<float>(a);
    x = (x - width / 2.0f - width/4.0f)/(width/3.0f);
    y = (height / 2.0f - y)/(width/3.0f);

    std::complex<float> c (x,y);

    std::complex <float> z(0, 0);
    size_t iter = 0;
    while (std::abs(z) < 2 && iter <= 35)
    {
        z = z * z + c;
        iter++;
    }
    if (iter < 34) return iter*255/33;
    else return 0;
}

// std::complex replaced by two floats
// sqrt is optimized out
unsigned char getPointExplicit(int a, int b)
{
    float x = static_cast<float>(b);
    float y = static_cast<float>(a);
    x = (x - width / 2.0f - width/4.0f)/(width/3.0f);
    y = (height / 2.0f - y)/(width/3.0f);

    //std::complex<float> c (x,y);
    float creal = x;
    float cimag = y;

    //std::complex <float> z(0, 0);
    float zreal = 0;
    float zimag = 0;

    size_t iter = 0;
    while ((zreal*zreal + zimag*zimag) < 4 && iter <= 35)
    {
        //z = z * z + c;
        float znewreal = zreal*zreal - zimag*zimag + creal;
        float znewimag = 2.0f*zreal*zimag + cimag;
        zreal=znewreal;
        zimag=znewimag;
        iter++;
    }
    if (iter < 34) return iter*255/33;
    else return 0;
}

// std::complex replaced by two floats
unsigned char getPointExplicitWithSqrt(int a, int b)
{
    float x = static_cast<float>(b);
    float y = static_cast<float>(a);
    x = (x - width / 2.0f - width/4.0f)/(width/3.0f);
    y = (height / 2.0f - y)/(width/3.0f);

    //std::complex<float> c (x,y);
    float creal = x;
    float cimag = y;

    //std::complex <float> z(0, 0);
    float zreal = 0;
    float zimag = 0;

    size_t iter = 0;
    while (std::sqrt(zreal*zreal + zimag*zimag) < 2 && iter <= 35)
    {
        //z = z * z + c;
        float znewreal = zreal*zreal - zimag*zimag + creal;
        float znewimag = 2.0f*zreal*zimag + cimag;
        zreal=znewreal;
        zimag=znewimag;
        iter++;
    }
    if (iter < 34) return iter*255/33;
    else return 0;
}

void createImage()
{
    std::vector<unsigned char> image(width*height);

    for(int k=0;k<20;k++)
    {
        auto t1 = readTSC();
        for (size_t i = 0; i < height; i++)
        {
            for (size_t j = 0; j < width; j++)
            {
                image[j+i*width]= getPointExplicit(i, j);
            }
        }
        auto t2 = readTSC();
        std::cout<<(t2-t1)/(width*height)<<" cycles per pixel"<<std::endl;
    }

  // write to file at once
  std::ofstream fout;
  fout.open("mandelbrot.ppm");
  if (fout.is_open()) {
    fout << "P5\n" << width << ' ' << height << " 255\n";
    for(int i=0;i<width*height;i++)
    {
        fout << image[i];
    }
    fout.close();
  } else {
    std::cout << "Could not open the file!\n";
  }
}

int main()
{
    createImage();
    return 0;
}

Following code sample is the vectorized kernel for the same Mandelbrot Set generation (without square root):

vertically parallelized, 16-32 pixels are computed at once
memory is accessed for 16-32 pixels at once
fused-multiply-add method is used for fast a*b+c computation
masking operand types are chosen as int to be same width as float (4 bytes) as this may help on compiler to generate efficient CPU instructions without converting them to/from char values

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include "VectorizedKernel.h"

constexpr int width = 2000;
constexpr int height = 2000;
constexpr int simd = 64;

void Mandelbrot(std::vector<char> & image) {
    // imag: imaginary part of complex number
    // real: real part of complex number
	auto kernel = Vectorization::createKernel<simd>([&]
                  (auto & factory, auto & idThread, char * img){
        const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                    // 1 on non-aligned tail

		auto j = factory.template generate<int>();
		idThread.modulus(width, j);

		auto i = factory.template generate<int>();
		idThread.div(width, i);

		auto x0 = factory.template generate<float>();
		j.template castAndCopyTo<float>(x0);
		auto y0 = factory.template generate<float>();
		i.template castAndCopyTo<float>(y0);

		const auto heightDiv2 = factory.template generate<float>(height/2.0f);

		auto x = factory.template generate<float>();
		x0.sub(width/2.0f, x0);

		x0.sub(width/4.0f, x0);
		x0.div(width/3.0f,x);

		auto y =  factory.template generate<float>();
		heightDiv2.sub(y0,y0);
		y0.div(width/3.0f, y);

        // c constant variable's components
		const auto imagc = factory.template generate<float>(y);
		const auto realc = factory.template generate<float>(x);

        // z variable's imaginary and real components
		auto imagz = factory.template generate<float>(0);
		auto realz = factory.template generate<float>(0);

		// loop
		bool anyTrue = true;
		auto iteration = factory.template generate<int>(0);

		// allocate all re-used resources for the while-loop
		auto imagzSquared = factory.template generate<float>();
		auto absLessThan2 = factory.template generate<int>();
		auto tmp1 = factory.template generate<float>();
		auto whileLoopCondition = factory.template generate<int>();
		auto tmp2 = factory.template generate<int>();
		auto zzReal = factory.template generate<float>();

		auto zzImag = factory.template generate<float>();
		auto tmp3 = factory.template generate<float>();
		auto tmpAdd1 = factory.template generate<float>();
		auto tmpAdd2 = factory.template generate<float>();
		auto tmp4 = factory.template generate<int>();

        // no allocation within the loop
		while(anyTrue)
		{
			// computing while loop condition start
            imagz.mul(imagz, imagzSquared);
			realz.fusedMultiplyAdd(realz,imagzSquared,tmp1);
			tmp1.lessThan(4.0f, absLessThan2);

			iteration.lessThanOrEquals(35, tmp2);
			absLessThan2.logicalAnd(tmp2, whileLoopCondition);
			anyTrue = whileLoopCondition.isAnyTrue();
			// computing while loop condition end

			// do complex multiplication z = z*z + c
			realz.fusedMultiplySub(realz,imagzSquared, zzReal);
			imagz.mul(realz, tmp3);
			realz.fusedMultiplyAdd(imagz,tmp3, zzImag);

			// if a lane has completed work, do not modify it
			zzReal.add(realc, tmpAdd1);
			zzImag.add(imagc, tmpAdd2);
			whileLoopCondition.ternary(tmpAdd1, realz, realz);
			whileLoopCondition.ternary(tmpAdd2, imagz, imagz);

			// increment iteration
			whileLoopCondition.ternary(1,0, tmp4);
			iteration.add(tmp4, iteration);
		}

        // compute output
		const auto thirtyFour = factory.template generate<int>(34);
		auto ifLessThanThirtyFour = factory.template generate<int>();
		iteration.lessThan(thirtyFour, ifLessThanThirtyFour);

		auto conditionalValue1 = factory.template generate<int>();
		iteration.mul(255, conditionalValue1);
		conditionalValue1.div(33, conditionalValue1);

		auto conditionalValue2 = factory.template generate<int>(0);
		auto returnValue = factory.template generate<int>(0);
		ifLessThanThirtyFour.ternary
              (conditionalValue1, conditionalValue2, returnValue);
		auto tmp5 = factory.template generate<int>();
		i.mul(width, tmp5);
		auto writeAddr = factory.template generate<int>(0);
		j.add(tmp5, writeAddr);
		auto returnValueChar =  factory.template generate<char>(0);
		returnValue.template castAndCopyTo<char>(returnValueChar);
		returnValueChar.writeTo(img,writeAddr);

	},Vectorization::KernelArgs<char*>{});
    if(image.size()<width*height)
	    image.resize(width*height);
	kernel.run(width*height,image.data());
}

#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire 
                      // before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

int main() {
  std::vector<char> image;
  for(int i = 0;i<100;i++)
  {
    auto t1 = readTSC();
    Mandelbrot(image);
    auto t2 = readTSC();
    std::cout << (t2 - t1)/(width*height) << " cycles per pixel" << std::endl;
  }
  // write to file at once
  std::ofstream fout;
  fout.open("mandelbrot.ppm");
  if (fout.is_open()) {
    fout << "P5\n" << width << ' ' << height << " 255\n";
    for(int i=0;i<width*height;i++)
    {
        fout << image[i];
    }
    fout.close();
  } else {
    std::cout << "Could not open the file!\n";
  }

  return 0;
}

This time, there are more computations than masking operations and the vectorization more efficient than the factorial sample. On FX8150 3.6GHz CPU and GCC v10, it has performance of 80 cycles per pixel. On Cascadelake and GCC v10, it has 11 cycles per pixel. There is nearly 100x speedup on the Cascadelake against untouched naive code performance and about 8x against the optimized version of it.

Part of the performance comes from parallel memory I/O. All 64 work-items are computed in parallel, then all results are written to output buffer at once. Doing the writing one by one causes extra RAM or cache latency per operation. When all are written at once, some of the total latency is hidden. Then there is the SIMD-based computation speedup that is up to 16x. Cascadelake CPU of godbolt.org has dual AVX512 fma units per core. This makes 4 times speedup compared to single AVX unit of FX8150 and architectural differences (such as L1 bandwidth, instruction latency, total physical registers, ...) make the difference 8x even with lower CPU frequency.

Thanks to the GCC v10's efficient auto-vectorization, VectorizedKernel made 100% speedup even against std::valarray version that is on equally optimized code path.

The program generates this image:

The image pattern mainly consists of uniform color distribution and this is good for scaling performance with SIMD usage. When image is zoomed, it may not scale as good as this. It depends on the brancing between neighboring pixels. The test program used scan-line computing. If it was tiled-computing, then it could have better scaling due to better packing of neighboring computations. When benchmarking the mandelbrot-set generation performance, the resulting image and the precision of calculations have to be same to have a fair comparison of different CPUs. There are also other factors such as AVX512 throttling issues with an unknown server-load. When another client is computing AVX512, current client's scalar code is throttled too. This is generally a bad side-effect for web servers but not for the high performance computing nor workstations. Xeon AVX512 core is similar to a mini GPU core (that has 192 pipelines at low frequency) that has free access to RAM.

Casting Variables To Different Types

Type conversion is supported by castAndCopyTo templated method. In the mandelbrot kernel, there are cast methods:

auto x0 = factory.template generate<float>(); // if j was 3
j.template castAndCopyTo<float>(x0);          // then x0 becomes 3.0f

This converts all SIMD lane values to the explicitly called template type. When cast between integral variables and floating-point variables, this changes the bitwise representation of data. To keep bitwise representation same while changing the value, castBitwiseAndCopyTo method is used.

// j=100
j.broadcast(100);                            // 4-byte integer
auto x = factory.template generate<float>(); // 4-byte float (important! 4==4)

// x becomes 1.4013e-43
j.template castBitwiseAndCopyTo<float>(x);

Communication Between Lanes Of SIMD (So-called)Thread Group

There are also methods to broadcast value of a lane of a variable to all lanes of itself or to all lanes of another variable.

broadcast initializes all lanes to the same constant literal value.

auto x = factory.template generate<int>(0); // 0-initializes all lanes
x.broadcast(5); // all lanes have the value 5 now

broadcastFromLaneToVector copies a lane value to all lanes of a target variable:

#include <algorithm>
#include <iostream>
#include"VectorizedKernel.h"

constexpr int simd = 4;

int main() {
  std::vector<int> data(500);

  // initialize data vector to 0,0,0,0,4,4,4,4,8,8,8,8,12,12,12,12,16,16,16,16
  auto kernel = Vectorization::createKernel<simd>([&]
                (auto & factory, auto & idThread, int * data){

            auto value = factory.template generate<int>();

            // take lane 0's value and broadcast to all lanes of value
            idThread.broadcastFromLaneToVector(0,value);

            value.writeTo(data,idThread);
        },
        Vectorization::KernelArgs<int*>{}
  );

  kernel.run(500,data.data());

  for(int i=0;i<30;i++)
  {
      std::cout<<data[i]<<" ";
  }
  std::cout<<std::endl;
  return 0;
}

The output is as follows:

0 0 0 0 4 4 4 4 8 8 8 8 12 12 12 12 16 16 16 16 20 20 20 20 24 24 24 24 28 28

broadcastFromLane method copies a lane's value to all other lanes in same variable.

auto kernel = Vectorization::createKernel<simd>([&]
              (auto & factory, auto & idThread, int * data){
      const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                  // 1 on non-aligned tail

            auto value = factory.template generate<int>();
            value.readFrom(idThread);

            // take lane 0's value and broadcast to all lanes
            value.broadcastFromLane(0);

            value.writeTo(data,idThread);
        },
        Vectorization::KernelArgs<int*>{}
);

gatherFromLane method does independent lane selection for each lane and copies values to result variable.

Left-shifting sample (works only for non-tail SIMD thread groups):

auto kernel = Vectorization::createKernel<simd>([&]
              (auto & factory, auto & idThread, int * data){
            // simd on aligned-body, 1 on non-aligned tail
            const int currentSimdWidth = factory.width; 

            auto value = factory.template generate<int>();
            auto laneId = factory.template generate<int>();

            // laneId = idThread % currentSimdWidth
            idThread.modulus(currentSimdWidth, laneId);

            // laneId++; ----> will gather from right-neighbor
            laneId.add(1,laneId);

            // laneId = laneId % currentSimdWidth ---> wrap around to 0
            laneId.modulus(currentSimdWidth,laneId);

            // { 0,1,2,3 } becomes { 1,2,3,0 }
            idThread.gatherFromLane(laneId,value);

            value.writeTo(data,idThread);
        },
        Vectorization::KernelArgs<int*>{}
);

The output is as follows:

1 2 3 0 5 6 7 4 9 10 11 8 13 14 15 12 17 18 19 16 21 22 23 20 25 26 27 24 29 30

On Cascadelake CPU and simd=32 setting, each element is gathered from another lane and written to output at a speed of 3 cycles per element. On top of kernel launch latency, there are RAM access latency(partially hidden), two allocations for value and laneId variables, 2 modulus, 1 add, 1 gather and write operations, all completed in 3 cycles per element on average or 1 nanosecond (4GB/s, 200kB buffer).

Left-shifting lanes without gatherFromLane method can be made using lanesLeftShift:

auto kernel = Vectorization::createKernel<simd>([&]
              (auto & factory, auto & idThread, int * data){
            const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                        // 1 on non-aligned tail

            auto value = factory.template generate<int>(idThread);
            auto value2 = factory.template generate<int>();

            // quicker to compute than add+modulus+gather
            value.lanesLeftShift(10,value2);

            value2.writeTo(data,idThread);
        },
        Vectorization::KernelArgs<int*>{}
); // same output

Gather values from neighboring lanes (in same SIMD unit) makes it easier to calculate a reduction.

Reduction Sample

Naive and readable reduction algorithm is given as:

---64-element sum-reduction sample using 32 SIMD lanes---

Type data[64];

for(i = 32; i>0; i/=2)
{
    if(current lane index < i)
    {
         data[current lane index] += data[current lane index + i];
    }
}
data[0]---> has sum of all elements from 0 to 63

When there is an if, another mask is required for the instructions inside of its block. The += operator requires the masking:

Reduction In SIMD

Reduction within SIMD lanes uses only half of the lanes to add both halves then continues recursively by halving each new left half until first element is the only half part left.

constexpr int simd=64;
auto kernel = Vectorization::createKernel<simd>([&]
              (auto & factory, auto & idThread, int * data){
        const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                    // 1 on non-aligned tail

        auto value = factory.template generate<int>();
        auto value2 = factory.template generate<int>();
        auto value3 = factory.template generate<int>();

        value.readFrom(data,idThread);

        auto laneIndex = factory.template generate<int>();
        auto laneIndexIncremented = factory.template generate<int>();
        idThread.modulus(currentSimdWidth,laneIndex);

        auto mask = factory.template generate<int>();

        auto gatheredValue = factory.template generate<int>();

        // reduction
        for(int i = 32; i>0; i/=2)
        {
            // lane index < i ?
            laneIndex.lessThan(i,mask);

            // lane index + i
            laneIndex.add(i,laneIndexIncremented);

            // do not increment right-half of lanes to not overflow SIMD lanes
            mask.ternary(laneIndexIncremented,laneIndex,laneIndexIncremented);

            // value2 = gathered from [lane index + i]
            value.gatherFromLane(laneIndexIncremented,value2);

            // value3 = value + value2
            value.add(value2,value3);

            // if(lane index < i) then value3 ---> value else value ----> value
            mask.ternary(value3,value,value);
        }

        // sum from 0 to 63 = 2016
        // only first lane writes the result: { 2016, 1, 2, 3, ...}
        value.writeToMasked(data, idThread, mask); 
      },
      Vectorization::KernelArgs<int*>{}
);

Reduction can also be run on a buffer to reduce twice number of elements:

Reduction On Buffer

constexpr int simd=64;
auto kernel = Vectorization::createKernel<simd>([&]
              (auto & factory, auto & idThread, int * data){
        const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                    // 1 on non-aligned tail

        auto value = factory.template generate<int>();
        auto value2 = factory.template generate<int>();
        auto value3 = factory.template generate<int>();

        value.readFrom(data,idThread);

        auto laneIndex = factory.template generate<int>();
        auto laneIndexIncremented = factory.template generate<int>();
        idThread.modulus(currentSimdWidth,laneIndex);

        auto mask = factory.template generate<int>();

        auto gatheredValue = factory.template generate<int>();

        // reduction
        for(int i = 64; i>0; i/=2)
        {
            // lane index < i ?
            laneIndex.lessThan(i,mask);

            // lane index + i
            laneIndex.add(i,laneIndexIncremented);

            // value = gathered from [lane]
            value.readFromMasked(data, laneIndex, mask);

            // value2 = gathered from [lane index + i]
            value2.readFromMasked(data, laneIndexIncremented, mask);

            // value3 = value + value2
            value.add(value2,value3);

            // if(lane index < i) then value3 ---> value else value ----> value
            mask.ternary(value3,value,value);

            value.writeToMasked(data,idThread,mask);
        }
      },
      Vectorization::KernelArgs<int*>{}
);

The result has 8128 at position zero (sum of values 0,1,2,...127).

Since there is no overflowing in the buffer and the I/O is always contiguous, the masked read/write methods can be replaced with non-masked contiguous versions:

// reduction
for(int i = 64; i>0; i/=2)
{
      // lane index < i ?
      laneIndex.lessThan(i,mask);

      // lane index + i
      laneIndex.add(i,laneIndexIncremented);

      // value = gathered from [lane]
      value.readFromContiguous(data, laneIndex);

      // value2 = gathered from [lane index + i]
      value2.readFromContiguous(data, laneIndexIncremented);

      // value3 = value + value2
      value.add(value2,value3);

      // if(lane index < i) then value3 ---> value else value ----> value
      mask.ternary(value3,value,value);

      value.writeToContiguous(data,idThread);
}

This is faster to run and completes every 128-element chunk in 1248 cycles on average (~3 nanoseconds per element), for Cascadelake CPU. This is the cost of 7 variable allocations, 6 loop iterations and 7 operations per iteration. This sample has low compute-to-data ratio (especially with allocations).

When only the reduction loop is measured:

#include <algorithm>

#include <iostream>
#include"VectorizedKernel.h"

constexpr int simd = 64;

#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire 
                      // before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

int main() {
    std::vector<int> data(512);

    auto kernel = Vectorization::createKernel<simd>([&]
                  (auto & factory, auto & idThread, int * data){
        const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                    // 1 on non-aligned tail

        auto value = factory.template generate<int>();
        auto value2 = factory.template generate<int>();
        auto value3 = factory.template generate<int>();

        value.readFrom(data,idThread);

        auto laneIndex = factory.template generate<int>();
        auto laneIndexIncremented = factory.template generate<int>();
        idThread.modulus(currentSimdWidth,laneIndex);

        auto mask = factory.template generate<int>();

        auto gatheredValue = factory.template generate<int>();

        auto t1 = readTSC();

        // reduction
        for(int i = 64; i>0; i/=2)
        {
            // lane index < i ?
            laneIndex.lessThan(i,mask);

            // lane index + i
            laneIndex.add(i,laneIndexIncremented);

            // value = gathered from [lane]
            value.readFromContiguous(data, laneIndex);

            // value2 = gathered from [lane index + i]
            value2.readFromContiguous(data, laneIndexIncremented);

            // value3 = value + value2
            value.add(value2,value3);

            // if(lane index < i) then value3 ---> value else value ----> value
            mask.ternary(value3,value,value);

            value.writeToContiguous(data,idThread);
        }

        auto t2 = readTSC();
        std::cout<<t2-t1<<" cycles per 128-element reduction"<<std::endl;
      },
      Vectorization::KernelArgs<int*>{}
    );

    for(int i=0;i<512;i++)
    {
      data[i]=i;
    }
    kernel.run(500,data.data());

    for(int i=0;i<35;i++)
    {
      std::cout<<data[i]<<" ";
    }
    std::cout<<std::endl;
    return 0;
}

it is 198 cycles per reduction on FX8150 core and 28 cycles for Cascadelake (Godbolt.org server). Actual cycles are lower because readTSC functions (that are call twice) have their own latency too.

28 cycles for 6 loop iterations is equivalent to 0.67 cycles per method call. Since Cascadelake server is generally running 2.5GHz or 3.0Ghz, it is ~0.2 nanoseconds per method call on average (each method computes simd-number-of elements and roughly compares to 300 elements per nanosecond / 1.2 TB/s). The only issue with the sample is the allocation overhead of variables and its readability compared to real CUDA/OpenCL kernel code.

28 cycles for 128 value-reduction means less than 0.1 nanosecond per element and this is a naive reduction algorithm that is not work-efficient but is cache-friendly (6 steps for 128 elements work in L1 cache or registers depending on size&number of CPU physical registers).

8x8 Matrix Transposition Sample

Naive transposition algorithm applied on 100 matrices of sizes 8x8 on FX8150 CPU:

#include <algorithm>

#include <iostream>
#include"VectorizedKernel.h"

constexpr int simd = 8;
int main() {

    constexpr int N = 100;
    // N matrices of sizes simd*simd
    std::vector<int> data(simd*simd*N);
    std::vector<int> dataOut(simd*simd*N);

    auto kernel = Vectorization::createKernel<simd>([&]
                  (auto & factory, auto & idThread, int * data, int * dataOut){
        const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                    // 1 on non-aligned tail
        auto value = factory.template generate<int>();
        auto indexRead = factory.template generate<int>(idThread);
        auto indexWrite = factory.template generate<int>();
        auto column = factory.template generate<int>();
        auto row = factory.template generate<int>();

        // transpose 100 matrices of size simd*simd
        Vectorization::Bench bench(0);
        for(int i=0;i<N;i++)
        {
            for(int j=0;j<currentSimdWidth;j++)
            {
                // read
                idThread.add(j*currentSimdWidth,indexRead);

                value.readFrom(data+i*currentSimdWidth*currentSimdWidth,indexRead);

                // compute row
                indexRead.div(currentSimdWidth,row);

                // compute column
                indexRead.modulus(currentSimdWidth,column);

                // compute write address for transposed output 
                // (row becomes column, column becomes row)
                column.mul(currentSimdWidth,indexWrite);
                indexWrite.add(row,indexWrite);

                // write transposed
                value.writeTo(dataOut+i*currentSimdWidth*currentSimdWidth,indexWrite);
            }
        }
      },
      Vectorization::KernelArgs<int*,int*>{}
    );

    for(int i=0;i<simd*simd*N;i++)
    {
      data[i]=i;
    }

    kernel.run(simd,data.data(),dataOut.data());
    kernel.run(simd,data.data(),dataOut.data());
    kernel.run(simd,data.data(),dataOut.data());

    for(int i=0;i<100;i++)
    {
      std::cout<<dataOut[i]<<" ";
    }
    std::cout<<std::endl;
    return 0;
}

The output is as follows:

1.569e-05 seconds (for 100 matrices)
1.6099e-05 seconds (for 100 matrices)
1.6281e-05 seconds (for 100 matrices)
0 8 16 24 32 40 48 56 1 9 17 25 33 41 49 57 2 10 18 26 34 42 50 58 3 11 19 
27 35 43 51 59 4 12 20 28 36 44 52 60 5 13 21 29 37 45 53 61 6 14 22 30 38 
46 54 62 7 15 23 31 39 47 55 63 64 72 80 88 96 104 112 120 65 73 81 89 97 
105 113 121 66 74 82 90 98 106 114 122 67 75 83 91 99 107 115 123 68 76 84 92

Each element of matrices moved for transposition at 2.5 nanoseconds on average (~1.6 GB/s). Naive solution for a linear algebra problem is not always fast. For Cascadelake CPU with simd=32 (to transpose 32x32 matrices), it is 1.1 nanoseconds (~3.9 GB/s).

Optimized codes should make the compiler produce SIMD shuffle instructions. gatherFromLane is a way of generating shuffle commands.

Slightly optimized version that does not use gatherFromLane but stores all matrix elements in single simd variable:

#include <algorithm>

#include <iostream>
#include"VectorizedKernel.h"

// each simd group will now hold a whole 8x8 matrix
constexpr int matrixWidth = 2;
constexpr int simd = matrixWidth*matrixWidth;
int main() {

    constexpr int N = 1000;
    // N matrices of sizes simd*simd
    std::vector<int> data(simd*N);
    std::vector<int> dataOut(simd*N);

    auto kernel = Vectorization::createKernel<simd>([&]
                  (auto & factory, auto & idThread, int * data, int * dataOut){
        const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                    // 1 on non-aligned tail
        auto value = factory.template generate<int>();
        auto indexRead = factory.template generate<int>(idThread);
        auto indexWrite = factory.template generate<int>();
        auto column = factory.template generate<int>();
        auto row = factory.template generate<int>();

        // compute row
        indexRead.div(matrixWidth,row);

        // compute column
        indexRead.modulus(matrixWidth,column);

        // compute write address for transposed output 
        // (row becomes column, column becomes row)
        column.mul(matrixWidth,indexWrite);
        indexWrite.add(row,indexWrite);

        // transpose 1000 matrices of size simd*simd
        Vectorization::Bench bench(0);
        for(int i=0;i<N;i++)
        {
                // read whole matrix
                value.readFrom(data+i*currentSimdWidth,indexRead);

                // write transposed
                value.writeTo(dataOut+i*currentSimdWidth,indexWrite);
        }
      },
      Vectorization::KernelArgs<int*,int*>{}
    );

    for(int i=0;i<simd*N;i++)
    {
      data[i]=i;
    }

    kernel.run(simd,data.data(),dataOut.data());
    kernel.run(simd,data.data(),dataOut.data());
    kernel.run(simd,data.data(),dataOut.data());

    for(int i=0;i<100;i++)
    {
      std::cout<<dataOut[i]<<" ";
    }
    std::cout<<std::endl;
    return 0;
}

The output on FX8150 is as below:

3.155e-06 seconds
3.002e-06 seconds
2.826e-06 seconds
0 2 1 3 4 6 5 7 8 10 9 11 12 14 13 15 16 18 17 19 20 22 21 23 24 26 25 27 
28 30 29 31 32 34 33 35 36 38 37 39 40 42 41 43 44 46 45 47 48 50 49 51 52 
54 53 55 56 58 57 59 60 62 61 63 64 66 65 67 68 70 69 71 72 74 73 75 76 78 
77 79 80 82 81 83 84 86 85 87 88 90 89 91 92 94 93 95 96 98 97 99

It is 0.7 nanoseconds per matrix element (5.7 GB/s). On Cascadelake CPU with simd=16 (matrixWidth=4), it is 0.11 nanoseconds per matrix element (36GB/s).

Second optimization with gatherFromLane:

#include <algorithm>

#include <iostream>
#include"VectorizedKernel.h"

// each simd group will now hold a whole 8x8 matrix
constexpr int matrixWidth = 2;
constexpr int simd = matrixWidth*matrixWidth;
int main() {

    constexpr int N = 1000;
    // N matrices of sizes simd*simd
    std::vector<int> data(simd*N+64);
    std::vector<int> dataOut(simd*N+64);
    int * alignedInput = data.data();
    int * alignedOutput = dataOut.data();
    while(((size_t) alignedInput) % 64 !=0)
        alignedInput++;

    while(((size_t) alignedOutput) % 64 !=0)
        alignedOutput++;

    auto kernel = Vectorization::createKernel<simd>([&]
                  (auto & factory, auto & idThread, int * data, int * dataOut){
        const int currentSimdWidth = factory.width; // simd on aligned-body, 
                                                    // 1 on non-aligned tail
        auto valueIn = factory.template generate<int>();
        auto valueOut = factory.template generate<int>();

        auto indexRead = factory.template generate<int>(idThread);
        auto indexWrite = factory.template generate<int>();
        auto column = factory.template generate<int>();
        auto row = factory.template generate<int>();

        // compute row
        indexRead.div(matrixWidth,row);

        // compute column
        indexRead.modulus(matrixWidth,column);

        // compute write address for transposed output 
        // (row becomes column, column becomes row)
        column.mul(matrixWidth,indexWrite);
        indexWrite.add(row,indexWrite);

        // transpose 1000 matrices of size simd*simd
        {
            Vectorization::Bench bench(0);
            for(int i=0;i<N;i++)
            {
                    // read whole matrix
                    valueIn.readFrom(data+i*simd);

                    // gather operation
                    valueIn.gatherFromLane(indexWrite,valueOut);
                    //valueIn.transposeLanes(matrixWidth,valueOut); // same result

                    // write transposed
                    valueOut.writeTo(dataOut+i*simd);
            }
        }
      },
      Vectorization::KernelArgs<int*,int*>{}
    );

    for(int i=0;i<simd*N;i++)
    {
      alignedInput[i]=i;
    }
    for(int i=0;i<100000;i++)
        kernel.run(simd,alignedInput,alignedOutput);

    for(int i=0;i<100;i++)
    {
      std::cout<<alignedOutput[i]<<" ";
    }
    std::cout<<std::endl;
    return 0;
}

FX8150 (simd=4): 0.28 nanoseconds per element (14 GB/s)
Cascadelake (simd=4): 0.098 nanoseconds per element (40 GB/s)
Cascadelake (simd=16): 0.094 nanoseconds per element (42 GB/s)

Matrix transposition is memory bandwidth bottlenecked and simple C++ scalar version can easily reach the RAM bandwidth bottleneck. Cascadelake L2-to-L1 cache bandwidth is 64 bytes per cycle. At 3GHz, it means 192 GB/s. The gatherFromLane method call is reducing the 192GB/s bandwidth to 42 GB/s. gatherFromLane produces "vpermd" instruction for Cascadelake CPU and it has 3 cycles latency (taken from instruction tables). Gathering 16 lane values (permuting for the instruction) taking 3 cycles is equivalent to ~5 integers/floats per cycle. This is 15 giga integers/floats at 3GHz or 45 GB/s bandwidth and is close to the measured 42 GB/s. transposeLanes method is slightly more efficient for the same task because the pattern of transposition is known in compile-time and produces one less mov command in the loop. Apparently, having two AVX512 FMA units does not help memory-only tasks. To be able to fully use FMA units during a matrix-matrix multiplication, multiple sub-matrices must be stored in AVX512 registers and selected by permutation or shuffling the register then computed using fusedMultiplyAdd or fusedMultiplySub methods.

Matrix Multiplication Sample

Naive version of matrix-matrix multiplication is straightforward to implement.

#include "VectorizedKernel.h"
#include<vector>
int main()
{
    // C = A x A kernel
    constexpr int simd = 8;
    constexpr int matrixSize = 512;
    auto kernelMatrixMultiplication = Vectorization::createKernel<simd>
    ([&](auto & factory, auto & idThread, float * bufferIn, float * bufferOut){
        const int currentSimdWidth = factory.width;

        auto ROW = factory.template generate<int>();
        idThread.div(matrixSize, ROW);
        auto COL = factory.template generate<int>();
        idThread.modulus(matrixSize, COL);

        auto tmpSum = factory.template generate<float>(0.0f);

        auto A = factory.template generate<float>(0.0f);
        auto B = factory.template generate<float>(0.0f);

        auto AmulB = factory.template generate<float>(0.0f);

        auto indexA = factory.template generate<int>(0);

        auto indexB = factory.template generate<int>(0);

        for(int i=0;i<matrixSize;i++)
        {
            // compute index of A element
            ROW.mul(matrixSize, indexA);
            indexA.add(i,indexA);

            // compute index of B element
            COL.add(i*matrixSize, indexB);

            // read A and B
            A.readFrom(bufferIn, indexA); // A[ROW * N + i] --> gather operation
            B.readFrom(bufferIn, indexB); // A[i * N + COL] --> contiguous

            // multiply
            A.mul(B,AmulB);

            // add
            tmpSum.add(AmulB, tmpSum);
        }
        // write C index
        auto writeC = factory.template generate<int>();
        ROW.mul(matrixSize, writeC);
        writeC.add(COL, writeC);
        tmpSum.writeTo(bufferOut,writeC);

    },Vectorization::KernelArgs<float*,float*>{});

    for(int j=0;j<25;j++)
    {
        std::vector<float> test1(matrixSize*matrixSize),test2(matrixSize*matrixSize);
        for(int i=0;i<matrixSize*matrixSize;i++)
        {
            test1[i]=2.0f;
        }

        std::cout<< "matrix multiplication 
        ("<<matrixSize*(size_t)matrixSize*matrixSize*2<<" operations total) ";
        {
            Vectorization::Bench bench(0);
            kernelMatrixMultiplication.run
            (matrixSize * matrixSize,test1.data(),test2.data());
        }

        for(int i=0;i<3;i++)
        {
            std::cout<<test2[i]<<" ";
        }
        std::cout<<std::endl;
    }

    return 0;
}

On Cascadelake CPU (simd=32), it takes 0.097 seconds (0.133 for FX8150 simd=8) to complete 268 million flops. 2.7 GFLOPs shows efficiency of strided reads (with stride = multiple rows of big matrix) and the lack of caching. Since there is no sub-matrix optimization, it doesn't even use L1 cache efficiently. To be able to use cache first, it can be made to do "tiled"/"sub matrix" computing.

Assuming that the second matrix is another matrix buffer and is already transposed, the data gathering part can be made fully contiguous and the mul&add methods can be joined together as a fused multiply add operation:

#include "VectorizedKernel.h"
#include<vector>
int main()
{
    // C = A x A kernel
    constexpr int simd = 4;
    constexpr int matrixSize = 512;
    auto kernelMatrixMultiplication = Vectorization::createKernel<simd>
    ([&](auto & factory, auto & idThread, float * bufferIn, float * bufferOut){
        const int currentSimdWidth = factory.width;

        auto ROW = factory.template generate<int>();
        idThread.div(matrixSize, ROW);
        auto COL = factory.template generate<int>();
        idThread.modulus(matrixSize, COL);

        auto tmpSum = factory.template generate<float>(0.0f);

        auto A = factory.template generate<float>();
        auto B = factory.template generate<float>();

        auto AmulB = factory.template generate<float>();

        auto indexA = factory.template generate<int>();
        auto indexAmul = factory.template generate<int>();

        auto indexB = factory.template generate<int>();
        auto indexBmul = factory.template generate<int>();
        ROW.mul(matrixSize, indexAmul);
        COL.mul(matrixSize, indexBmul);
        for(int i=0;i<matrixSize;i++)
        {
            // compute index of A element
            indexAmul.add(i,indexA);

            // compute index of B element
            indexBmul.add(i,indexB);

            // read A and B
            A.readFrom(bufferIn, indexA); // A[ROW * N + i]
            B.readFrom(bufferIn, indexB); // A[COL * N + i]

            // A * B + tmpSum --> tmpSum
            A.fusedMultiplyAdd(B,tmpSum,tmpSum);
        }
        // write C index
        auto writeC = factory.template generate<int>();
        ROW.mul(matrixSize, writeC);
        writeC.add(COL, writeC);
        tmpSum.writeTo(bufferOut,writeC);

    },Vectorization::KernelArgs<float*,float*>{});

    for(int j=0;j<25;j++)
    {
        std::vector<float> test1(matrixSize*matrixSize),test2(matrixSize*matrixSize);
        for(int i=0;i<matrixSize*matrixSize;i++)
        {
            test1[i]=2.0f;
        }

        std::cout<< "matrix multiplication 
        ("<<matrixSize*(size_t)matrixSize*matrixSize*2<<" operations total) ";
        {
            Vectorization::Bench bench(0);
            kernelMatrixMultiplication.run
            (matrixSize * matrixSize,test1.data(),test2.data());
        }

        for(int i=0;i<3;i++)
        {
            std::cout<<test2[i]<<" ";
        }
        std::cout<<std::endl;
    }

    return 0;
}

Better memory access pattern leads to increased performances. FX8150 completes 512x512 multiplication in 0.053 seconds which is 5.1 GFLOPs. For every flops, 8 bytes are fetched from cache and 40GB/s bandwidth is achieved, without any explicit data re-use.

To experiment data re-use optimization, the following sample focuses on register-tiling using generateArray factory method to create arrays with vertical-memory-layout to achieve parallelism in simple array operations by vertical SIMD CPU instructions that are usually more efficient than horizontal versions (such as dot products):

#include "VectorizedKernel.h"
#include<vector>
int main()
{
    // chunk width for lock-step computation
    constexpr int simd = 4;

    // size of A, B, C tile matrices to fit in registers
    // per-work-item required register size is 3 x (matrixWidth^3) x sizeof(float)
    // 288 bytes for matrixWidth=8 (1 avx register is 256bits or 32bytes)
    constexpr int matrixWidth = 8;  

    // benchmark length
    constexpr int totalMatMatMultiplications = 10000; 

    constexpr int numMatrices = totalMatMatMultiplications*2;
    constexpr int numElements = matrixWidth*matrixWidth*numMatrices;
    size_t totalTime=0;
    auto kernelMatrixMultiplication = Vectorization::createKernel<simd>
    ([&](auto & factory, auto & idThread, float * bufferIn, float * bufferOut){

        // private array has vertical memory layout => 
        // vertical matrix multiplication in SIMD units
        auto privateMatrix1 = factory.template 
             generateArray<float,matrixWidth*matrixWidth>();
        auto privateMatrix2 = factory.template 
             generateArray<float,matrixWidth*matrixWidth>();
        auto privateMatrix3 = factory.template 
             generateArray<float,matrixWidth*matrixWidth>();

        auto readIndexStart = factory.template generate<int>();
        auto writeIndexStart = factory.template generate<int>();
        auto readIndex = factory.template generate<int>();
        auto writeIndex = factory.template generate<int>();

        // compute matrix start (1 work item = 2 matrices)
        idThread.mul(matrixWidth*matrixWidth*2,readIndexStart);
        // compute matrix output (1 work item = 1 matrix C=A*B)
        idThread.mul(matrixWidth*matrixWidth,writeIndexStart);

        // load matrix 1&2
        for(int j=0;j<matrixWidth;j++)
        {
            for(int i=0;i<matrixWidth;i++)
            {
                readIndexStart.add(j*matrixWidth + i,readIndex);
                privateMatrix1[i+j*matrixWidth].readFrom(bufferIn,readIndex);
                privateMatrix2[i+j*matrixWidth].readFrom
                              (bufferIn+(matrixWidth*matrixWidth),readIndex);
            }
        }

        // this program measures only the in-register multiplication part
        size_t tmpTime;
        {
            Vectorization::Bench benchmark(&tmpTime);
            // multiply matrices fully in registers 
            // "simd" number of sgemm multiplications at a time, vertically
            for (int i = 0; i < matrixWidth; i++)
            {
                for (int j = 0; j < matrixWidth; j++)
                {
                    //C[i][j] = 0;
                    privateMatrix1[i*matrixWidth].mul(privateMatrix2[j],
                                   privateMatrix3[i*matrixWidth+j]);
                    for (int k = 1; k < matrixWidth; k++)
                    {
                        // C[i][j] += A[i][k]*B[k][j];
                        privateMatrix1[i*matrixWidth+k].fusedMultiplyAdd
                        (privateMatrix2[k*matrixWidth+j],
                        privateMatrix3[i*matrixWidth+j],privateMatrix3[i*matrixWidth+j]);
                    }
                }
            }
        }
        totalTime += tmpTime;

        // store the result matrix
        for(int j=0;j<matrixWidth;j++)
        {
            for(int i=0;i<matrixWidth;i++)
            {
                writeIndexStart.add(j*matrixWidth + i,writeIndex);
                privateMatrix3[i+j*matrixWidth].writeTo(bufferOut,writeIndex);
            }
        }

    },Vectorization::KernelArgs<float*,float*>{});

    std::vector<float> test1(numElements),test2(numElements);

    for(int i=0;i<numElements;i++)
        test1[i]=i;

    for(int i=0;i<100;i++)
    {

        totalTime=0;
        kernelMatrixMultiplication.run(totalMatMatMultiplications,
                                       test1.data(),test2.data());
        std::cout<<"total time to do "<<totalMatMatMultiplications<<
         " multiplications is "<<totalTime<<
         " nanoseconds ("<<totalTime/((float)totalMatMatMultiplications)<<
         " per multiplication)"<<
         " ("<<1.0f/((totalTime/((float)totalMatMatMultiplications))/
         (matrixWidth*(float)matrixWidth*matrixWidth*2.0))<<
         " floating-point operations per nanosecond)"<<std::endl;
    }
    for(int i=0;i<15;i++)
        std::cout<<test2[i]<<std::endl;

    return 0;
}

Benchmarking was repeated for multiple simd - matrixWidth combinations:

Setting                |    32-bit floating-point operations per nanosecond
simd    matrixWidth    |    FX8150 Bulldozer           Xeon 8275CL Cascadelake
1       1              |    0.04                       0.10
1       2              |    0.32                       0.76
1       4              |    2.03                       3.28
1       8              |    3.47                       16.25
2       1              |    0.08                       0.18
2       2              |    0.61                       1.45
2       4              |    3.12                       4.41
2       8              |    3.98                       14.65     
4       1              |    0.17                       0.4
4       2              |    1.14                       2.67
4       4              |    6.73                       9.14
4       8              |    14.42                      30.12
8       1              |    0.33                       1.0
8       2              |    2.67                       5.33
8       4              |    10.67                      24.81
8       5              |    14.40                      11.33
8       6              |    16.84                      58.31
8       7              |    17.59 !!                   18.15
8       8              |    12.34                      22.75
16      1              |    0.67                       1.53
16      2              |    5.33                       9.20
16      4              |    14.22                      33.83 
16      5              |    14.12                      31.26
16      6              |    4.03                       55.13   
16      7              |                               64.28  !!      
16      8              |    4.12                       35.31
1       16             |    2.87                       21.06
2       16             |    5.51                       8.50
4       16             |    7.55                       14.76
8       16             |    5.33                       16.31
16      16             |    3.63                       31.50
32      2              |    8.00                       12.65
2       32             |    2.18                       5.06
32      32             |    3.75                       19.85
32      4              |                               23.22
32      5              |                               47.65
32      6              |                               42.45
32      7              |                               39.02 
32      8              |                               25.24
64      4              |                               35.04
64      5              |                               28.11
64      6              |                               22.18

Since Bulldozer architecture has only AVX-v1 and work like two SSE units (for single thread) joined together, best performance was achieved with simd=8 and matrixWidth=7. At 17.59 GFLOPs, it is 3x denser than naive loop (while only multiplying streaming matrices of sizes 7x7) and is 30% of theoretical peak performance (theoretical peak is 57.6 GFLOPS by using only FMA operations for equal number of additions and multiplications and no other operations to affect the pipeline).

Cascadelake architecture supports AVX512. This means, it has access to not only 32 architectural 512bit(64byte) registers but also 168 vector registers in renaming (as physical registers) and this lets it store more data before reaching L1 for re-using same data. Best performance was achieved for simd=16 and matrixWidth=7 setting. At 64.28 GFLOPs, it has 30% to 50% of the peak performance (depending on godbolt.org server's current load) of Cascadelake server-grade CPU that is shared between multiple clients. When an algorithm uses less register space than CPU supports, it can benefit more from pipelining and simd=32-64 settings depending on the pipelining efficiency.

In the last sample, factory.template generateArray<Type,NumElements>(); was used for simulating parallelized arrays. This has a vertical layout in memory where each element of a work-item has Simd stride between them. This lets the compiler parallelize the in-kernel array operations. Timing measurements of multiplications were made only around themselves and all I/O were excluded. To be able to use this performance, one needs to implement cache-tiling to feed the registers fast enough with new sub-matrices of a big matrix to be multiplied. Fastest matrix multiplications are only possible with register tiling, cache tiling and explicitly selecting efficient versions of CPU instructions by intrinsic functions. The implementation is not being developed with any intrinsics nor gnu language extensions. Maximum performance nearly always requires intrinsics (and experience on instruction tables) to be used. When they are not used, library is portable to any platform (with no define macro) as long as it has a C++ compiler that can auto-vectorize plain plain loops & plain arrays. Lastly, it will not require any code-rewrite other than just replacing simd=16 with simd=128 when an AVX-2048 is CPU is launched. The compiler will take care of all the plain loops iterating on plain arrays in each method of KernelData object.

How Fast are Sqrt and Other Math Functions?

Currently, the project only uses standard functions in auto-parallelized loops. Sqrt is one of the successfully auto-parallelized math functions, sinFast, cosFast, expFast are also added for less-accurate versions of sin, cos and exp functions with high speedup (see the last part in the article). There are also non-parallelized versions due to the standard implementation revealing no more than a function jump instruction. In future versions, Pow and Log methods will be given polynomial approximation versions with optimized coefficients (magic numbers found by genetic algorithm).

The best way to know performance is to benchmark. In below sample, Quake-3 (John Carmack) Fast Inverse Square Root is tested against plain 1.0f/std::sqrt(x) loop (which is effectively parallelized by compiler), VectorizedKernel version of inverse square root (computed by first division then square root, then a second version by a direct call to rsqrt) and VectorizedKernel version of Quake-3 Fast Inverse Square Root:

GCC v11, simd=8 for AVX, simd=16 for AVX512, n=20million

              -------- Simple Loop ----------   ------- Kernel Version --------
              (1/std::sqrt)           Q_rsqrt   Q_rsqrt    (1/sqrt)     rsqrt 
FX8150        0.035 s                 0.092 s   0.037 s    0.036 s      0.036 s
8275CL        0.012 s                 0.014 s   0.013 s    0.012 s      0.012 s

CLANG v14(simd=16)
8275CL        0.016 s                 0.014s    0.071 s    0.068 s      0.064 s

ICC 2021(simd=16)
8275CL        0.019 s                 0.026 s   0.016 s    0.013 s      0.014 s

Since the simple 1/sqrt loop is auto-vectorized efficiently by compiler, it has equal performance to kernel version which does similar work-flow inside with the addition of kernel launch latency. Since ICC could not vectorize serial version efficiently, the kernel versions had a speedup.

Benchmark codes:

#include "VectorizedKernel.h"

constexpr int n = 20000000;

float Q_rsqrt( float number )
{
    int i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( int * ) &y;                      
    i  = 0x5f3759df - ( i >> 1 );             
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) ); // newton-raphson iteration
//    y  = y * ( threehalfs - ( x2 * y * y ) );

    return y;
}

void Q_rsqrt_simd(std::vector<float> & input, std::vector<float> & output)
{
    constexpr int simd = 16;
    auto kernel = Vectorization::createKernel<simd>([&]
         (const auto & factory, auto & idThread, float * bufIn, float * bufOut)
    {
        auto i = factory.template generate<unsigned int>();
        auto x2= factory.template generate<float>();
        auto y= factory.template generate<float>();
        const auto threehalfs = factory.template generate<float>(1.5f);
        auto threehalfsMinusX2 = factory.template generate<float>();
        auto number = factory.template generate<float>();
        auto iRightShifted = factory.template generate<unsigned int>();
        auto iRightShiftedAndSubtractedFromMagicNumber = 
                             factory.template generate<unsigned int>();
        auto magicNumber = factory.template generate<unsigned int>(0x5f3759df);

        for(int k = 0; k<n/simd; k++)
        {
            number.readFrom(bufIn,idThread);
            number.mul(0.5f,x2);

            y=number; // expensive
            y.readFrom(number);

            y.template castBitwiseAndCopyTo<unsigned int>(i);
            i.rightShift(1,iRightShifted);
            magicNumber.sub(iRightShifted,i);
            i.template castBitwiseAndCopyTo<float>(y);
            x2.mul(y,x2);
            x2.mul(y,x2);
            threehalfs.sub(x2,threehalfsMinusX2);
            y.mul(threehalfsMinusX2,y);

            y.writeTo(bufOut,idThread);
            idThread.add(simd,idThread);
        }
    },Vectorization::KernelArgs<float*,float*>{});
    kernel.run(simd,input.data(),output.data());
}

void rsqrt_simd(std::vector<float> & input, std::vector<float> & output)
{
    constexpr int simd = 16;
    auto kernel = Vectorization::createKernel<simd>([&]
         (const auto & factory, auto & idThread, float * bufIn, float * bufOut)
    {

        auto y= factory.template generate<float>();

        const auto one= factory.template generate<float>(1.0f);

         for(int k = 0; k<n; k+=simd)
         {
            y.readFrom(bufIn+k);
            y.sqrt(y);
            one.div(y,y);
            y.writeTo(bufOut+k);
         }

    },Vectorization::KernelArgs<float*,float*>{});
    kernel.run(simd,input.data(),output.data());
}

void rsqrt_simd2(std::vector<float> & input, std::vector<float> & output)
{
    constexpr int simd = 16;
    auto kernel = Vectorization::createKernel<simd>([&]
         (const auto & factory, auto & idThread, float * bufIn, float * bufOut)
    {
        auto y= factory.template generate<float>();

        for(int k = 0; k<n; k+=simd)
        {
            y.readFrom(bufIn+k);
            y.rsqrt(y);
            y.writeTo(bufOut+k);
        }
    },Vectorization::KernelArgs<float*,float*>{});
    kernel.run(simd,input.data(),output.data());
}

int main() {
  std::vector<float> input(n),output(n);
  for(int i = 0; i<n; i++)
      input[i]=1+i;
  std::cout<<" ---------------------- 1.0f/std::sqrt ----------------------------- "
           <<std::endl;
  for(int k=0;k<20;k++)
  {
      {
          Vectorization::Bench bench(0);
          for(int i = 0; i<n; i++)
          {
              output[i]=1.0f/std::sqrt(input[i]);
          }
      }
      std::cout<<output[0]<<" "<<output[1]<<" "<<output[2]<<" "<<output[3]<<" 
               "<<output[5]<<" "<<std::endl;
  }
  std::cout<<" ---------------------- Q_rsqrt ----------------------------- "
           <<std::endl;
  for(int k=0;k<10;k++)
  {
      {
          Vectorization::Bench bench(0);
          for(int i = 0; i<n; i++)
          {
              output[i]=Q_rsqrt(input[i]);
          }
      }
      std::cout<<output[0]<<" "<<output[1]<<" "<<output[2]<<" "<<output[3]<<" "
               <<output[5]<<" "<<std::endl;
  }
  std::cout<<" ---------------------- Q_rsqrt simd ----------------------------- "
           <<std::endl;
  for(int k=0;k<10;k++)
  {
      {
        Vectorization::Bench bench(0);
        Q_rsqrt_simd(input,output);
      }
      std::cout<<output[0]<<" "<<output[1]<<" "<<output[2]<<" "<<output[3]<<" "
               <<output[5]<<" "<<std::endl;
  }
  std::cout<<" ----------------------rsqrt simd ----------------------------- "
           <<std::endl;
  for(int k=0;k<10;k++)
  {
      {
        Vectorization::Bench bench(0);
        rsqrt_simd(input,output);
      }
      std::cout<<output[0]<<" "<<output[1]<<" "<<output[2]<<" "<<output[3]<<" "
               <<output[5]<<" "<<std::endl;
  }
  std::cout<<" ----------------------rsqrt simd2 ----------------------------- "
           <<std::endl;
  for(int k=0;k<10;k++)
  {
      {
        Vectorization::Bench bench(0);
        rsqrt_simd2(input,output);
      }
      std::cout<<output[0]<<" "<<output[1]<<" "<<output[2]<<" "<<output[3]<<" "
               <<output[5]<<" "<<std::endl;
  }
  return 0;
}

At the time of development of Quake-3 inverse square root, CPUs were not advanced enough to compute square root and inverse square root fast. It was faster to do custom computations like this. Today (2022), it is slower or close performance with reduced precision and it is not worth reducing the precision for only 1% possible speedup.

Multithreading

There is an optional multi-threaded version of kernel-launch method. runMultithreaded<numThreads>(n,Args...) runs same as run(n,Args...) method, except with extra threads and optionally using OpenMP (if -fopenmp is enabled from compiler flags).

Mandelbrot-Set Generator Revisited

By changing only single line of code, a simple equal-work-sharing sceduler is activated:

//kernel.run(width*height,image.data());
kernel.runMultithreaded<8>(width*height,image.data());

The same benchmark program generates image at 25 cycles per pixel (22 cycles with OpenMP) speed instead of 80 cycles/pixel on the FX8150 3.6 GHz CPU (turbo disabled). There are 4 FPU modules in FX8150 and each are shared between two cores. This limits scaling at 4x maximum and 3.5x in this sample.

There is also a second version with atomic load-balancing between threads:

kernel.runMultithreadedLoadBalanced<8,resolution=100>(width*height,image.data());

This completes same task (2000x2000 pixels, 35 max iterations per pixel) in 18 milliseconds. Non-load-balanced version completes in more than 27 milliseconds. With 32 threads used (for giving it better work distribution), it completes in 21 milliseconds which is still slower than atomic work distribution. The algorithm is simple:

chunk size = 1 + ( (n / simd) / number of threads) / load_balance_resolution
while work remaining
    atomically increment chunk counter
    if fetched chunk index is out of bounds, 
        work remaining = false
    if fetched chunk index is in bound
        then compute chunk

For 2000x2000 pixels, there are 125000 kernel launches. For 125000 kernel launches with 8 threads and 100 (default) load balance resolution, the chunk size is 157. When there are only few kernel launches like 100x100, the resolution should be lowered. For this Mandelbrot sample, 157 kernel launches per atomic access hides the atomic access latency easily.

Xeon 8275CL, 1 thread computes the same task in 20 milliseconds. With 2 threads, 13 milliseconds. Godbolt.org serves only two threads to each client and both threads are possibly using same core by simultaneous multi-threading. This gives each client 50% more compute power than a whole FX8150 CPU.

If only two threads can produce same Mandelbrot image in 13 milliseconds, then a dedicated server with all 24 threads should produce one image in ~1.1 millisecond while not doing any explicit hardware-related optimizations but doing a GPGPU kernel-approach using only plain arrays and simple loops in encapsulation. If encapsulation had no overhead, it would reach theoretical peak gflops values.

Accelerating (Approximating) Math Functions

There are many methods to approximate important (trigonometric/transcendental) functions. A popular but slower way is to train a neural network and use the inference in production. They can map well to nearly all problems. But SIMD computations require every bit of performance to have a responsible speedup over standard versions of functions such as std::sin. Directly using a neural-network API adds an overhead for data loading, computations and result gathering. On the other hand, using the weight matrix of a neural network directly in hand-written computations can be comparable to some other methods.

Another successful method is to approximate a function using truncated polynomials, specifically the Chebyshev Polynomials. These type of polynomials have better error distribution over the sampled range of inputs during the foundation of "magic" numbers (coefficients). Some other similar methods like Taylor series do not converge efficiently towards a solution and cause worse error distribution than Chebyshev Polynomials.

When use-cases are narrowed enough, a function can be practically made O(1) operation. For example, when sine-computation is required near zero, one can say that the below function is OK.

float sinUltraFast(float inputTiny)
{
    return inputTiny; // because sin(x)=~x when x goes 0
}

To have a good enough speed-up and a good-enough accuracy over a bigger range than the function above, Chebyshev Polynomials are used in VectorizedKernel and optimized for range of [-1,1].

Finding Chebyshev Polynomials' coefficients can be easy if there are enough computational power in a desktop PC. The development PC has three GPUs: a GT1030 and two K420. These three combined can support more than 1 TFLOPS compute performance in real-world algorithms.

Finding coefficients of a polynomial to approximate a function is a good practice of global-optimization. Genetic algorithm is one of the useful heuristicts found in the list of global-optimization tools and the version used for this task is accelerated by CUDA.

Before using cuda-accelerated genetic-algorithm, the skeleton of Chebyshev Polynomial has to be selected. For sin(x) function, it has odd-powers of x near all coefficients. For cos(x) function, it has even-powers of x near all coefficients. For exp(x) function, it has both even and odd powers of coefficients (including zero which makes use of a constant literal).

Starting with Sin(x) function, there is the following global optimization problem:

[x^k means computing pow(x,k=integer), not XOR]

c1 * x^9 + c2 * x^7 + c3 * x^5 + c4 * x^3 + c5 * x = sin(x)

The only thing the genetic algorithm has to do is to compute the difference of two sides:

diff =  sin(x) - ( c1 * x^9 + c2 * x^7 + c3 * x^5 + c4 * x^3 + c5 * x )

then add its squared value to an accumulator of errors:

error += diff * diff

lastly, return the error as the current fitness value of selected DNA in genetic algorithm:

return error; // --> DNAs with less error will be populated more, to survive more

Rest is handled by the genetic algorithm and it approaches into a not-very-bad error average at final generation of DNAs. (the parallelization scheme is: each cuda thread block computes 1 DNA, scans 10000 data points in range of [-1,1] and sums all errors, in parallel)

After few minutes, the genetic algorithm returns magic numbers:

c1 = -0.0005664825439453125
c2 = 0.001037120819091796875
c3 = 0.007439136505126953125
c4 = -0.166426181793212890625
c5 = 0.999982357025146484375

When applied onto the Polynomial, they compute sin(x) faster than the standard version but only for a limited range and limited accuracy.

Implementation of the Chebyshev Polynomial in the sinFast function reflects the same pattern of computation found in other methods in VectorizedKernel:

compute in waves (Simd elements at once, in simple loop) as if its CUDA/OpenCL
Make each wave compute only one or two basic operations like add or mul
Use aligned buffers (64-alignment for AVX512 support)
inline function

// only optimized for [-1,1] input range!!
VECTORIZED_KERNEL_METHOD
void sinFast(KernelData<Type,Simd> & result) const noexcept
{
            alignas(64)
            Type xSqr[Simd];

            alignas(64)
            Type xSqrSqr[Simd];

            alignas(64)
            Type xSqrSqr5[Simd];

            alignas(64)
            Type xSqrSqr8[Simd];

            alignas(64)
            Type tmp[Simd];

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                xSqr[i] =   data[i]*data[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                xSqrSqr[i] =    xSqr[i]*xSqr[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                xSqrSqr5[i] =   xSqrSqr[i]*data[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                xSqrSqr8[i] =   xSqrSqr[i]*xSqrSqr[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                xSqrSqr8[i] =   xSqrSqr8[i]*data[i] ;
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                tmp[i] =   xSqrSqr5[i]*xSqr[i] ;
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                xSqr[i] =   xSqr[i]*data[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                result.data[i]  =   Type(-0.0005664825439453125)*xSqrSqr8[i] +
                                    Type(0.001037120819091796875)*tmp[i] +
                                    Type(0.007439136505126953125)*xSqrSqr5[i] +
                                    Type(-0.166426181793212890625)*xSqr[i] +
                                    Type(0.999982357025146484375)*data[i];
            }
}

This type of computation generally lends itself well to auto-vectorization. When loops are simple and the data are plain old data (POD), a compiler has higher chances of vectorizing the algorithm. Even in the final loop that contains chained multiplications and additions, GCC efficiently used FMA AVX512 instructions.

Even with only 10000 data points used between -1 and 1 in the global-optimization (genetic algorithm), the generalized solution made it possible to have an average error of 5.2e-07 and maximum error of 5.06e-06 for 100 million points between -1 and 1 during the error testing.

With less accuracy, there should be a performance gained. On FX8150 3.6 GHz single core execution and blow benchmark program,

#include "VectorizedKernel.h"
#include <vector>
int main()
{
    constexpr int n = 100000000;
    std::vector<float> a(n),b(n);
    auto kernel = Vectorization::createKernel<8>(
        [&](auto & factory, auto & idThread,float * test, float * test2)
        {
            auto val = factory.template generate<float>();
            val.readFrom(test,idThread);
            val.sinFast(val);
            val.writeTo(test2,idThread);
        },
        Vectorization::KernelArgs<float *,float *>{});

    auto kernelSlow = Vectorization::createKernel<8>(
        [&](auto & factory, auto & idThread,float * test, float * test2)
        {
            auto val = factory.template generate<float>();
            val.readFrom(test,idThread);
            val.sin(val);
            val.writeTo(test2,idThread);
        },
        Vectorization::KernelArgs<float *,float *>{});

    for(int i=0;i<n;i++)
    {
        a[i]=i/(float)n;
    }

    size_t t;
    for(int k=0;k<10;k++)
    {
        {
            Vectorization::Bench bench(&t);
            kernel.run(n,a.data(),b.data());
            
        }
        std::cout<<t<<" nanoseconds kernel"<<std::endl;
    }
    float maxError = 0;
    float minError = 100000000;
    float avgError = 0;
    for(int i=0;i<n;i++)
    {
        auto err = std::abs(std::sin(a[i]) - b[i]);
        if(maxError<err)
            maxError=err;
        if(minError>err)
            minError=err;
        avgError += err;
        if(i<10)
        {
            std::cout<<std::sin(a[i]) << "   "<<b[i]<<" "<<std::sin(a[i]) - 
                       b[i]<<" "<<(std::sin(a[i]) - b[i])/std::sin(a[i])<<std::endl;
        }
    }
    avgError/=n;
    std::cout<<"Approximation:"<<std::endl;
    std::cout<<"average error: "<<avgError<<std::endl;
    std::cout<<"max error: "<<maxError<<std::endl;
    std::cout<<"min error: "<<minError<<std::endl;

    for(int k=0;k<10;k++)
    {
        {
            Vectorization::Bench bench(&t);
            kernelSlow.run(n,a.data(),b.data());
            
        }
        std::cout<<t<<" nanoseconds kernelSlow"<<std::endl;
    }
    maxError = 0;
    minError = 100000000;
    avgError = 0;
    for(int i=0;i<n;i++)
    {
        auto err = std::abs(std::sin(a[i]) - b[i]);
        if(maxError<err)
            maxError=err;
        if(minError>err)
            minError=err;
        avgError += err;
    }
    avgError/=n;
    std::cout<<"std::sin:"<<std::endl;
    std::cout<<"average error: "<<avgError<<std::endl;
    std::cout<<"max error: "<<maxError<<std::endl;
    std::cout<<"min error: "<<minError<<std::endl;
    return 0;
}

The output is as below:

210320327 nanoseconds kernel
207193156 nanoseconds kernel
209724002 nanoseconds kernel
208031048 nanoseconds kernel
208375082 nanoseconds kernel
212297393 nanoseconds kernel
269920925 nanoseconds kernel
247177762 nanoseconds kernel
226875523 nanoseconds kernel
215999626 nanoseconds kernel
0   0 0 -nan
1e-08   9.99982e-09 1.76748e-13 1.76748e-05
2e-08   1.99996e-08 3.53495e-13 1.76748e-05
3e-08   2.99995e-08 5.29354e-13 1.76451e-05
4e-08   3.99993e-08 7.0699e-13 1.76748e-05
5e-08   4.99991e-08 8.81073e-13 1.76215e-05
6e-08   5.99989e-08 1.05871e-12 1.76451e-05
7e-08   6.99988e-08 1.23634e-12 1.76621e-05
8e-08   7.99986e-08 1.41398e-12 1.76748e-05
9e-08   8.99984e-08 1.58451e-12 1.76057e-05
Approximation:
average error: 5.19494e-07
max error: 5.06639e-06
min error: 0
1514166472 nanoseconds kernelSlow
1572569295 nanoseconds kernelSlow
1543835922 nanoseconds kernelSlow
1549091614 nanoseconds kernelSlow
1487053516 nanoseconds kernelSlow
1490235113 nanoseconds kernelSlow
1500433847 nanoseconds kernelSlow
1495819895 nanoseconds kernelSlow
1486465391 nanoseconds kernelSlow
1492841472 nanoseconds kernelSlow
std::sin:
average error: 0
max error: 0
min error: 0

std::sin function ran for 100-million times and took 1.5 seconds on average. sinFast function with the same number of calculations took only 210 milliseconds on average. This is nearly 7x speedup. Many programs that can work with narrowed input ranges and lowered accuracy can benefit from such optimizations.

On AMD EPYC 7R32 and AVX2 (simd=8), the same program with 10million iterations (due to godbolt.org server limits) yields these results:

4684267 nanoseconds kernel
4724790 nanoseconds kernel
4796075 nanoseconds kernel
4798874 nanoseconds kernel
4284829 nanoseconds kernel
4936115 nanoseconds kernel
4699258 nanoseconds kernel
4255476 nanoseconds kernel
4283909 nanoseconds kernel
4253615 nanoseconds kernel
0   0 0 -nan
1e-07   9.99982e-08 1.76215e-12 1.76215e-05
2e-07   1.99996e-07 3.52429e-12 1.76215e-05
3e-07   2.99995e-07 5.28644e-12 1.76215e-05
4e-07   3.99993e-07 7.04858e-12 1.76215e-05
5e-07   4.99991e-07 8.81073e-12 1.76215e-05
6e-07   5.99989e-07 1.05729e-11 1.76215e-05
7e-07   6.99988e-07 1.2335e-11 1.76215e-05
8e-07   7.99986e-07 1.40972e-11 1.76215e-05
9e-07   8.99984e-07 1.58593e-11 1.76215e-05
Approximation:
average error: 1.25595e-06
max error: 5.06639e-06
min error: 0
29602815 nanoseconds kernelSlow
29983970 nanoseconds kernelSlow
31948362 nanoseconds kernelSlow
29545788 nanoseconds kernelSlow
31207689 nanoseconds kernelSlow
29330923 nanoseconds kernelSlow
29236516 nanoseconds kernelSlow
29234575 nanoseconds kernelSlow
39186776 nanoseconds kernelSlow
54037120 nanoseconds kernelSlow
std::sin:
average error: 0
max error: 0
min error: 0

4 milliseconds for 10 million iterations with sinFast function and 29 milliseconds for 10 million iterations with std::sin. This is 7.5x speedup.

The AVX512 instances of same server could have higher speedup but the only work done in kernel is computation of a single sin(x) function and this is bottlenecked by factory.generate allocations and kernel launch latency.

To have a better benchmarking for faster architectures like AVX512, each work-item is given 3000 sin(x) computations:

#include <cpuid.h>
#include <vector>
int main()
{
    constexpr int simd=16;
    constexpr int n = 1000*simd;

    std::vector<float> a(n),b(n);
    auto kernel = Vectorization::createKernel<simd>(
        [&](auto & factory, auto & idThread,float * test, float * test2)
        {
            auto val = factory.template generate<float>();
            
            for(int i = 0; i<n; i+=simd)
            {
                val.readFrom(test+i,idThread);
                val.sinFast(val);
                val.sinFast(val);
                val.sinFast(val);
                val.writeTo(test2+i,idThread);
            }               
        },
        Vectorization::KernelArgs<float *,float *>{});

    auto kernelSlow = Vectorization::createKernel<simd>(
        [&](auto & factory, auto & idThread,float * test, float * test2)
        {
            auto val = factory.template generate<float>();
            
            for(int i = 0; i<n; i+=simd)
            {
                val.readFrom(test+i,idThread);
                val.sin(val);
                val.sin(val);
                val.sin(val);
                val.writeTo(test2+i,idThread);
            }
        },
        Vectorization::KernelArgs<float *,float *>{});

    for(int i=0;i<n;i++)
    {
        a[i]=i/(float)n;
    }

    size_t t;
    for(int k=0;k<10;k++)
    {
        {
            Vectorization::Bench bench(&t);
            kernel.run(simd,a.data(),b.data());
        }
        std::cout<<t<<" nanoseconds kernel"<<std::endl;
    }
    float maxError = 0;
    float minError = 100000000;
    float avgError = 0;
    for(int i=0;i<n;i++)
    {
        auto err = std::abs(std::sin(std::sin(std::sin(a[i]))) - b[i]);
        if(maxError<err)
            maxError=err;
        if(minError>err)
            minError=err;
        avgError += err;
        if(i<10)
        {
            std::cout<<std::sin(std::sin(std::sin(a[i]))) << "   "<<b[i]<<" 
            "<<std::sin(a[i]) - b[i]<<" "<<(std::sin(std::sin(std::sin(a[i]))) - 
             b[i])/std::sin(std::sin(std::sin(a[i])))<<std::endl;
        }
    }
    avgError/=n;
    std::cout<<"Approximation:"<<std::endl;
    std::cout<<"average error: "<<avgError<<std::endl;
    std::cout<<"max error: "<<maxError<<std::endl;
    std::cout<<"min error: "<<minError<<std::endl;

    for(int k=0;k<10;k++)
    {
        {
            Vectorization::Bench bench(&t);
            kernelSlow.run(simd,a.data(),b.data());

        }
        std::cout<<t<<" nanoseconds kernelSlow"<<std::endl;
    }
    maxError = 0;
    minError = 100000000;
    avgError = 0;
    for(int i=0;i<n;i++)
    {
        auto err = std::abs(std::sin(std::sin(std::sin(a[i]))) - b[i]);
        if(maxError<err)
            maxError=err;
        if(minError>err)
            minError=err;
        avgError += err;

    }
    avgError/=n;
    std::cout<<"std::sin:"<<std::endl;
    std::cout<<"average error: "<<avgError<<std::endl;
    std::cout<<"max error: "<<maxError<<std::endl;
    std::cout<<"min error: "<<minError<<std::endl;

char CPUBrandString[0x40];
unsigned int CPUInfo[4] = {0,0,0,0};

__cpuid(0x80000000, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
unsigned int nExIds = CPUInfo[0];

memset(CPUBrandString, 0, sizeof(CPUBrandString));

for (unsigned int i = 0x80000000; i <= nExIds; ++i)
{
    __cpuid(i, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);

    if (i == 0x80000002)
        memcpy(CPUBrandString, CPUInfo, sizeof(CPUInfo));
    else if (i == 0x80000003)
        memcpy(CPUBrandString + 16, CPUInfo, sizeof(CPUInfo));
    else if (i == 0x80000004)
        memcpy(CPUBrandString + 32, CPUInfo, sizeof(CPUInfo));
}

std::cout << "CPU Type: " << CPUBrandString << std::endl;
    return 0;
}

Output on godbolt.org server with Xeon(R) Platinum 8375C CPU @ 2.90GHz instance:

24426 nanoseconds kernel
26922 nanoseconds kernel
26189 nanoseconds kernel
26711 nanoseconds kernel
21866 nanoseconds kernel
12891 nanoseconds kernel
12987 nanoseconds kernel
13017 nanoseconds kernel
13011 nanoseconds kernel
12753 nanoseconds kernel
0   0 0 -nan
6.25e-05   6.24967e-05 3.31784e-09 5.30854e-05
0.000125   0.000124993 6.63567e-09 5.30854e-05
0.0001875   0.00018749 9.90985e-09 5.28526e-05
0.00025   0.000249987 1.32713e-08 5.30854e-05
0.0003125   0.000312483 1.65892e-08 5.30854e-05
0.000375   0.00037498 1.9907e-08 5.30854e-05
0.0004375   0.000437477 2.32249e-08 5.30854e-05
0.0005   0.000499973 2.65427e-08 5.30854e-05
0.0005625   0.00056247 2.98023e-08 5.2775e-05
Approximation:
average error: 2.92283e-06
max error: 5.76675e-06
min error: 0
157713 nanoseconds kernelSlow
153495 nanoseconds kernelSlow
152426 nanoseconds kernelSlow
149269 nanoseconds kernelSlow
125595 nanoseconds kernelSlow
126358 nanoseconds kernelSlow
126901 nanoseconds kernelSlow
128745 nanoseconds kernelSlow
125790 nanoseconds kernelSlow
126077 nanoseconds kernelSlow
std::sin:
average error: 0
max error: 0
min error: 0
CPU Type: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

Another instance with Xeon 8275CL:

34331 nanoseconds kernel
19040 nanoseconds kernel
19582 nanoseconds kernel
23158 nanoseconds kernel
19634 nanoseconds kernel
19505 nanoseconds kernel
18382 nanoseconds kernel
19700 nanoseconds kernel
19435 nanoseconds kernel
19831 nanoseconds kernel
0   0 0 -nan
6.25e-05   6.24967e-05 3.31784e-09 5.30854e-05
0.000125   0.000124993 6.63567e-09 5.30854e-05
0.0001875   0.00018749 9.90985e-09 5.28526e-05
0.00025   0.000249987 1.32713e-08 5.30854e-05
0.0003125   0.000312483 1.65892e-08 5.30854e-05
0.000375   0.00037498 1.9907e-08 5.30854e-05
0.0004375   0.000437477 2.32249e-08 5.30854e-05
0.0005   0.000499973 2.65427e-08 5.30854e-05
0.0005625   0.00056247 2.98023e-08 5.2775e-05
Approximation:
average error: 2.92283e-06
max error: 5.76675e-06
min error: 0
256440 nanoseconds kernelSlow
202568 nanoseconds kernelSlow
209147 nanoseconds kernelSlow
203952 nanoseconds kernelSlow
208616 nanoseconds kernelSlow
247332 nanoseconds kernelSlow
210957 nanoseconds kernelSlow
200276 nanoseconds kernelSlow
206405 nanoseconds kernelSlow
201319 nanoseconds kernelSlow
std::sin:
average error: 0
max error: 0
min error: 0
CPU Type: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz

Also an EPYC with AVX2 enabled:

22160 nanoseconds kernel
22440 nanoseconds kernel
41861 nanoseconds kernel
24450 nanoseconds kernel
22031 nanoseconds kernel
21640 nanoseconds kernel
21621 nanoseconds kernel
21720 nanoseconds kernel
21700 nanoseconds kernel
21701 nanoseconds kernel
0   0 0 -nan
6.25e-05   6.24967e-05 3.31784e-09 5.30854e-05
0.000125   0.000124993 6.63567e-09 5.30854e-05
0.0001875   0.00018749 9.90985e-09 5.28526e-05
0.00025   0.000249987 1.32713e-08 5.30854e-05
0.0003125   0.000312483 1.65892e-08 5.30854e-05
0.000375   0.00037498 1.9907e-08 5.30854e-05
0.0004375   0.000437477 2.32249e-08 5.30854e-05
0.0005   0.000499973 2.65427e-08 5.30854e-05
0.0005625   0.00056247 2.98023e-08 5.2775e-05
Approximation:
average error: 2.92283e-06
max error: 5.76675e-06
min error: 0
194784 nanoseconds kernelSlow
216814 nanoseconds kernelSlow
200474 nanoseconds kernelSlow
197884 nanoseconds kernelSlow
202814 nanoseconds kernelSlow
191923 nanoseconds kernelSlow
199453 nanoseconds kernelSlow
207143 nanoseconds kernelSlow
187853 nanoseconds kernelSlow
185784 nanoseconds kernelSlow
std::sin:
average error: 0
max error: 0
min error: 0
CPU Type: AMD EPYC 7R32

Depending on server load, AVX512 instances reached speedups between 10x to 15x and AVX2 instances more consistently reached 8x-9x speedup. (Please observe that the relative error is still on the order of 5e-05 despite using sin(sin(sin))) chain computations)

Cos and Exp implementations are similar, only with difference on the number of FP32 multiplications required.

In case of a greater range is required, here is [-any,any] optimized version of sin(x) that is called as x.sinFastFullRange(result):

        // Chebyshev Polynomial coefficients found by genetic algorithm 
        // running on 768 GPU pipelines for 5 minutes
        VECTORIZED_KERNEL_METHOD
        void sinFastFullRange(KernelData<Type,Simd> & result) const noexcept
        {
            // reduce range to [-pi,+pi] by modf(input, 2pi) - pi { at high precision }
            // divide by 5 (multiply 0.2)
            // compute on [-1,+1] range
            // compute T5(cos(x)) chebyshev (   16sin(x)^5 - 20sin(x)^3 + 5sin(x)   )
            // return

            alignas(64)
            double wrapAroundHighPrecision[Simd];

            alignas(64)
            double wrapAroundHighPrecisionTmp[Simd];

            alignas(64)
            double reducedData[Simd];

            alignas(64)
            double reducedDataTmp[Simd];

            alignas(64)
            Type xSqr[Simd];

            alignas(64)
            Type x[Simd];

            alignas(64)
            Type resultData[Simd];

            // these have to be as high precision as possible to 
            // let wide-range of inputs be used
            constexpr double pi =  /*Type(std::acos(-1));*/ 
            double(3.1415926535897932384626433832795028841971693993751058209749445923);

            constexpr double twoPi = double(2.0 * pi);

            constexpr double twoPiInv = double(1.0/twoPi);

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                wrapAroundHighPrecision[i] = data[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                wrapAroundHighPrecisionTmp[i] = wrapAroundHighPrecision[i] * twoPiInv;
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                wrapAroundHighPrecisionTmp[i] = std::floor(wrapAroundHighPrecisionTmp[i]);
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                wrapAroundHighPrecisionTmp[i] = twoPi*wrapAroundHighPrecisionTmp[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                reducedData[i] = wrapAroundHighPrecision[i] - 
                                 wrapAroundHighPrecisionTmp[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                reducedDataTmp[i] = reducedData[i]-twoPi;
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                reducedData[i]=reducedData[i]<double(0.0)?
                               reducedDataTmp[i]:reducedData[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                reducedData[i] = reducedData[i] - pi;
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                reducedData[i] = reducedData[i]*double(0.2);
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                x[i] =     reducedData[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                xSqr[i] =     x[i]*x[i];
            }


            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                resultData[i] =    Type(2.543240487540288086165674e-06)*xSqr[i] + 
                                   Type(-0.0001980781510937390521576162);
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                resultData[i] =    resultData[i]*xSqr[i] + 
                                   Type(0.008333159571549231259268709);
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                resultData[i] =    resultData[i]*xSqr[i] + 
                                   Type(-0.1666666483147380972695828);
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                resultData[i] =    resultData[i]*xSqr[i] + 
                                   Type(0.9999999963401755564973428);
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                resultData[i] =     resultData[i]*x[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                x[i] =     resultData[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                xSqr[i] =     resultData[i]*resultData[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                resultData[i] =     Type(16.0)*xSqr[i] - Type(20.0);
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                resultData[i] =     resultData[i]*xSqr[i] + Type(5.0);
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                resultData[i] =     resultData[i]*x[i];
            }

            VECTORIZED_KERNEL_LOOP
            for(int i=0;i<Simd;i++)
            {
                result.data[i] =     -resultData[i];
            }
        }

What it does is simple in pseudo-code:

Reduce input range from [-any,any] to [-pi,pi] at higher precision than 32bits (only 64 bits here)
Divide new range by 5 (this number is also the Chebyshev Polynomial index to be used later)
Do the approximation using [-1,1] input
De-reduce range back to [-pi,pi] by applying Chebyshev Polynomial on the result (L₅(approx))

The approximation part is optimized by genetic algorithm to give only 1 ULPS max error (0.19 ulps average) for sine and 1 ULPS max for cosine (0.08 ulps average). This error increases during range reduction and range de-reduction due to extra floating-point operations. Normally a scientific program uses more than 1000 bits of PI to compute reduction but this program is optimized for speed so there is a tradeoff. The maximum error is on the digit 0.0000001 for both sinFastFulRange and cosFastFullRange and this makes ULP difference big when result is close to zero. When exact numbers are required, x.sin(result) and x.cos(result) can be used at the cost of performance.

In the final version of sinFast and cosFast, the computations are optimized by Horner Scheme. Horner Scheme increases accuracy and performance of polynomial computations. An unoptimized polynomial computation looks like this:

y = c1 * x*x*x*x + c2 * x*x*x + c3 * x*x + c4*x + c5;

but this has too many unnecessary operations for powers of x and every unnecessary floating-point operations adds to the rounding error which makes avalanche effect on the result after enough iterations. Due to this reason, using too many polynomial terms makes it too slow and low accuracy.

Horner Scheme optimizes the operation like this:

y = ((((c1*x + c2) * x + c3 ) * x + c4 ) * x + c5);

or in other words:

y = c1*x + c2;
y = y *x + c3;
y = y *x + c4;
y = y *x + c5;

this has not only less multiplications but also equal number of multiplications and additions. This is a good candidate for compiler to apply FMA instruction of CPU to compute two things (multiply + add) at once like this:

y = fma(c1,x,c2)
y = fma(y, x,c3)
y = fma(y, x,c4)
y = fma(y, x,c5)

only 4 instructions (with CPU's own architectural "extra" temporary precision) for less run-time and high accuracy. Some CPUs have higher precision than others during an FMA operation and the development computer's CPU (FX8150) has lower precision than many up-to-date CPUs so that the 1 ULPS difference is maintained for a big range of CPUs and average ULPS difference is lower on newer CPUs, especially on Xeons and EPYCs.

To maintain simplicity and performance high enough, the range-reduction (from any to [-pi,pi]) the operations are kept limited at 64 bits which is highest supported by SIMD units of AVX2/AVX512 CPUs on floating-point operations.

After Horner Scheme is applied, cosFast performance increases to 0.54 cycles per cos on AVX512. This is close to 50GFLOPS of FMA operations (each FMA counts as 2 flop) on single core and is bottlenecked by memory access latency / bandwidth. 0.54 cycles for 3GHz CPU means 0.18 nanoseconds per cosine computation or 5.5 billion cosines per second. Since (cos/Sin)FastFullRange has extra range reduction and de-reduction steps at higher accuracy, their speedup is only limited at 3-4x compared to std::cos and std::sin. When input range is inherently [-1,1], then using only cosFast and sinFast is at least 10x faster than using std::cos and std::sin because they are more general-purpose and handle all corner-cases.

To make auto-vectorization easier for compiler, every elementary operation is computed on its own Simd-length loop, same as every other method in KernelData. This reduces readability and maintainability but makes hardware-portable speedup through vector-hardware of CPU.

Chebyshev Polynomial

On the de-reduction of range part, Chebyshev Polynomial was used because it is only another polynomial to reduce cos(n * x) to F(cost(x)) and is simple to implement:

cos(4 * x) = L4(cos(x)) = 8 * cos(x)^4 - 8 * cos(x)^2 + 1 {^ is power, not xor}

this is as simple as it looks. For sine computation, L5 is used and input is reduced by division to 5:

sin(5 * x) = L5(sin(x)) = 16 * sin(x)^5 - 20 * sin(x)^3 + 5 * cos(x)

Then user-input is interpreted as:

sin(Y) = sin(5 * x) => solve on [-1,1] instead of [-pi,pi]

Despite having two polynomials and a range reduction computation, the error is on the digit 0.0000001 so it's not super-wrong on bigger results like 0.01 but certainly non-scientific when results include 0.00001. When a program interprets anything smaller than 0.0001 as zero, then it is relatively more acceptable to do use an approximation algorithm. Similarly, when results greater than 0.99999 can be interpreted as 1, the error is not too big and again an approximation works.

If modern x86 CPUs did not have good square-root support, sqrt could be optimized this way or John Carmack's Q3_inv_sqrt way.

When a CPU does not have good floating-point capability, CORDIC methods are used. CORDIC computes the opposite to arrive at input parameter and on every failure it adjusts the initial guess and repeats until convergence. This can be made using only bit-shifts and additions. But on modern x86 CPUs, floating-point math is fast. So simply using fp mul/add is easier. This tool does not implement CORDIC and is meant to be used on AVX/AVX512/../AVX8192 architectures that have good floating-point performance.

Points of Interest

Vertically parallelized operations running on SIMD units are fast even when algorithms are naively applied and even when there is not enough register-renaming space in CPU to hold all of the dataset. But working without define macros has its toll on the compiler-generated instruction efficiency. Without any define macro, the variable-variable interaction in the kernel actually costs deep data copying and it is slower than something fully works as define macros.

Using plain-array wrappers (such as simple KernelData implementation in https://github.com/tugrul512bit/VectorizedKernel) as if they are OpenCL kernel variables and simulating GPU-like workflow is fun to work with, quick to apply and similarly efficient as a naive GPU-targeted kernel (that does not achieve 100% of the marketed peak GFLOPs values), at the cost of great verbosity. The coding part is clunkier than std::valarray but is faster than an equal translation into it. Quick to start typing, slower to learn, similar to CUDA, without requirement of any extra hardware, similar to OpenCL, fully independent of any platform.

Simple loops alone with simple arrays are effectively vectorized by and compiler but once the algorithm gets too complex, the encapsulation (this is the C++ way rather than 100% define macros with bare vectors on stack) shows its toll on the efficiency of compiler-generated cpu instructions. These are generally unnecessary mov-instructions and might get better in the future that this project relies upon. If one day AVX-8192 (or some unbounded vector processor) gets released for desktop computers, then this project won't require any other change than this:

int simd = a higher power-of-2 value;

History

Sunday, 24^th April, 2022: Added samples, benchmarks for several basic algorithms running on SIMD
Monday, 25^th April, 2022: Added basic multithreading (and fixed some typos)
Friday, 29^th April, 2022: Added load-balanced multithreading, Quake-3 inverse square root benchmarking and comparison of different compilers
Monday, 9^th May, 2022: Added cosFast, costFastFullRange, sinFast, sinFastFullRange and explanation for Chebyshev Polynomials and Horner Scheme