
Intel® Computer Vision SDK Developer Guide

24 Oct 2017

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

The Intel® Computer Vision SDK is a new software development package for development and optimization of computer vision and image processing pipelines for Intel System-on-Chips (SoCs).

The product includes six major components:

  1. Intel-optimized implementation of the Khronos* OpenVX* 1.1 API, extended with both additional APIs (see the Intel's Extensions to the OpenVX* API: OpenCL™ Custom Kernels and Intel's Extensions to the OpenVX* API: Advanced Tiling chapters) and additional kernels, including support for Convolutional Neural Networks (CNN) (see the Intel's Extensions to the OpenVX* Primitives chapter).
  2. Pre-built and fully-validated community OpenCV* 3.3 binaries with additional fast and accurate Face capabilities, described in the chapter Advanced Face Capabilities in Intel's OpenCV.
  3. Vision Algorithm Designer (VAD), a cross-platform IDE tool for visual development and analysis of OpenVX* graphs. For details, refer to the VAD Developer Guide.
  4. Deep Learning Model Optimizer tool to help you with deploying Convolutional Neural Networks. For details, refer to the Deep Learning Model Optimizer Developer Guide.
  5. Deep Learning Inference Engine that features unified (OpenVX-agnostic) API to integrate the inference with application logic. For details, refer to the Deep Learning Inference Engine Developer Guide.
  6. Sample applications (available as a separate package). For details, see the Intel(R) Computer Vision SDK Sample Applications topic.

The product is compatible with Intel® VTune™ (see the Intel(R) VTune(TM) Tools Support section).

The general flow of development with the SDK is combining low-level primitives to build highly optimized computer vision pipelines which can be used standalone or integrated with other code.

Supported Platforms

The Intel® Computer Vision SDK is officially supported and validated on the following target and development environments:

  • Development and Target: Ubuntu* 16.04, CentOS* 7.2, or Microsoft Windows* 10 on 6th Generation Intel® Core™ Processors (64-bit)
  • Target only: Yocto* MR3 on Next-Generation Intel® Atom™ Processors (formerly Apollo Lake) (64-bit)

Intel's OpenVX* Implementation: Key Features

Performance:

  • Intel's OpenVX* implementation offers CPU kernels that are multi-threaded (with Intel® Threading Building Blocks) and vectorized (with Intel® Integrated Performance Primitives). GPU support is backed by an optimized OpenCL™ implementation.
  • To perform most read/write operations on local values, the implementation supports automatic data tiling for input, intermediate, and output data.

Extensibility:

  • User kernels, custom OpenCL™ kernels, and the Advanced Tiling extension allow adding your own functionality to OpenVX* graphs (see the Wrapping User Code as OpenVX* User Kernels section and the Intel's Extensions to the OpenVX* API chapters).

Heterogeneity:

  • Support for both task and data parallelism to maximize utilization of the compute resources such as CPU, GPU, and IPU.
  • General system-level device affinities, as well as a fine-grained API for orchestrating individual nodes via the notion of targets. Refer to the Heterogeneous Computing with Intel® Computer Vision SDK chapter for details.

NOTE: IPU support is at an experimental stage. Refer to the General Note on the IPU Support section for details.

Intel's implementation of OpenVX uses a modular approach with a common runtime and individual device-specific plugins.

Intel® CV SDK OpenVX* Architecture

What's New in this Release?

  • Implementation of the CNN kernels for the CPU is now powered with Intel® Math Kernel Library (Intel® MKL) (for the 32-bit floating-point path). For more details, see the section System-Level Device Affinities.
  • OpenCV features, such as the new profiling mechanism and the new "dnn" module, are described in the OpenCV release log.
  • There are also significant changes with respect to the tools, including numerous improvements of Vision Algorithm Designer (VAD). For more details, refer to the VAD Developer Guide.
  • The Model Optimizer tool further simplifies deployment of Convolutional Neural Networks. Model Optimizer introduces Alpha support for the TensorFlow* framework to enable conversion of TensorFlow models and data to OpenVX graphs and formats. It is also possible to register your own implementation for layers that are not yet supported by OpenVX. The Intel's Extensions to the OpenVX* Primitives section contains further references on CNN support.

Getting Started

Refer to the Quick Start Guide for details on installation and environment settings for the Intel® Computer Vision SDK. The Quick Start Guide also provides more details on building applications with OpenVX*, and some tips on OpenCV integration.

OpenVX* Basics

OpenVX* is a new standard from Khronos*, offering a set of optimized primitives for low-level image processing and computer vision. The OpenVX specification provides a way to write code that is portable across multiple vendors and platforms, as well as multiple hardware types.

The OpenVX resource and execution abstractions enable hardware vendors to optimize implementation with a strong focus on a particular platform. At the same time, standardization allows software developers to get started with algorithmic innovations based on high performance vendor implementations without immediately needing to understand all of the details required to optimize for specific hardware.

The graph-based programming interface is intended for creating vision algorithms using graph concepts. The standard defines graph formalities and execution semantics to address the needs of computer vision developers. One of the major advantages of the graph interface is the ability of the underlying implementation to aggressively optimize full vision pipelines instead of specific functions.

Performance portability across diversity of engines like CPUs, GPUs, and fixed function hardware was one of the design goals of the OpenVX specification.

For more details, see the following topics:

OpenVX* Programming Framework

OpenVX* defines a C Application Programming Interface (API) that is used for building, verifying, and coordinating graph execution, as well as for accessing memory objects. The graph abstraction enables OpenVX implementation to optimize the execution for the underlying acceleration architecture.

The graph defines an explicit data flow, using node objects and data objects like images and arrays. After the graph is created, it must be verified. After that, the same graph can be executed multiple times with updated inputs.

OpenVX also defines the vxu utility library that enables you to use each OpenVX predefined function as a directly callable C function, without needing to create a graph first. Applications built using the vxu library do not benefit from the optimizations enabled by graphs; however, the vxu library can be useful as the simplest way to use OpenVX and as the first step in porting existing vision applications.
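
For illustration, below is a minimal sketch of an immediate-mode call, assuming a valid context and two compatible images (input and output) already exist:

C++
//Immediate-mode (vxu) call: no explicit graph construction or verification needed
//requires #include <VX/vxu.h>
vx_status status = vxuGaussian3x3(context, input, output);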

As the computer vision domain is still rapidly evolving, OpenVX provides an extensibility mechanism to enable developer-defined functions to be added to graphs.

OpenVX* Objects

This chapter provides a brief description of some essential OpenVX* objects to understand basic OpenVX* application development. Refer to the OpenVX* 1.1 specification for more details.

OpenVX context is a container for OpenVX* objects. You have to create a context to be able to create objects. A program can have more than one context for various reasons. However, sharing objects across contexts is very limited.

OpenVX kernel is the abstract representation of a computer vision function, such as a "Sobel Gradient" or "Lucas Kanade Feature Tracking". Kernels are referred to by qualified names (e.g. "org.khronos.openvx.sobel_3x3") or enumerations, and the functionality is defined by the specification and conformance tests.

OpenVX node is an instance of a kernel. The node encapsulates the kernel (functionality) and the position of that functionality in the data flow of the graph. The latter is defined by assigning the node parameters, for example, when output of one node becomes an input for another node. Every node that is created has to belong to a graph.

OpenVX graph is the collection of nodes with their connections. The graph must be directed (edges only go one way) and acyclic (it does not loop back). OpenVX* supports two modes of graph execution: synchronous/blocking mode (using vxProcessGraph) and asynchronous mode (via vxScheduleGraph and vxWaitGraph). In both modes, the order of execution of nodes inside a single graph is defined by the data flow and is guaranteed. However, in asynchronous mode there is no defined order between graph executions (each specified via a vxScheduleGraph call). To control the order of execution, you have to insert vxWaitGraph calls accordingly.
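
For example, a hedged sketch of the asynchronous mode is shown below; graphA and graphB are placeholders for two independent graphs:

C++
//Schedule two independent graphs; their relative execution order is not defined
vxScheduleGraph(graphA);
vxScheduleGraph(graphB);
//... the host thread can do other work here ...
//Block until each graph completes
vxWaitGraph(graphA);
vxWaitGraph(graphB);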

Graphs in OpenVX* depend on data objects to link together nodes. OpenVX image is an opaque data object that holds pixels. There are multiple types of images. For example, to enable certain optimizations, the intermediate images can be declared as virtual. Such images cannot be accessed for read or write outside the graph, as there is no associated accessible memory. Basically, virtual images are stubs to define a dependency on OpenVX nodes (just like temporary variables in C/C++), and these can be eliminated by the OpenVX graph compiler.

Creating a Simple OpenVX* Graph Application

This section describes a simple example of graph design and execution. The graph-based application computes the gradient magnitude and gradient phase from a blurred input image using standard OpenVX* kernels that are part of the Intel® Computer Vision SDK. The following figure describes the algorithm in the form of a DAG:

Example OpenVX* Graph

In an OpenVX* based application, a developer needs to create a context and a graph in the beginning of the application:

C++
//The code to create context and graph 
vx_context context = vxCreateContext();
vx_graph graph = vxCreateGraph(context);

Images, virtual images, arrays, and other data-objects are the means to connect nodes of the graph, as well as communicating the data into and out of the graph:

C++
//The code to create input and output image data objects
vx_image images[] = { 
  vxCreateImage(context, 640, 480,VX_DF_IMAGE_UYVY), 
  vxCreateImage(context, 640, 480,VX_DF_IMAGE_U8),
  vxCreateImage(context, 640, 480,VX_DF_IMAGE_U8),
};
//The code to create virtual images that connect nodes
vx_image virts[] = { 
  vxCreateVirtualImage(graph, 0, 0,VX_DF_IMAGE_VIRT), 
  vxCreateVirtualImage(graph, 0, 0,VX_DF_IMAGE_VIRT),
  vxCreateVirtualImage(graph, 0, 0,VX_DF_IMAGE_VIRT),
  vxCreateVirtualImage(graph, 0, 0,VX_DF_IMAGE_VIRT)
};

Notice that for intermediate (virtual) images you do not need to specify either type or size, as the graph compiler is able to deduce this data automatically. As we discussed earlier, these are actually stubs that have no associated accessible memory.

There are other special types of images, for example, uniform images that hold a single uniform value in all pixels. You can also create an image object that references externally allocated memory (refer to vxCreateImageFromHandle and the Interoperability with other APIs section).
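
For example, a minimal sketch of creating a uniform image (the size and pixel value below are illustrative):

C++
//Create a 640x480 single-channel image where every pixel holds the value 128
vx_pixel_value_t gray;
gray.U8 = 128;
vx_image uniform = vxCreateUniformImage(context, 640, 480, VX_DF_IMAGE_U8, &gray);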

The data flow of the graph is created by adding the nodes to the graph and providing data objects as input and output to those nodes. In the graph shown in the figure above, the example UYVY image is the (non-virtual) input image. It is passed as an argument to the luma-extracting node vxChannelExtractNode, which is then connected to the blur node vxGaussian3x3Node. The blur node is connected to the vxSobel3x3Node node that computes gradients. In turn, the gradients node is connected to both the magnitude node vxMagnitudeNode and the phase node vxPhaseNode.

Notice that both output nodes produce non-virtual images, which otherwise would not be accessible by the host code:

C++
//Create nodes and connect them in the graph
vxChannelExtractNode(graph, images[0],VX_CHANNEL_Y, virts[0]);
vxGaussian3x3Node(graph, virts[0], virts[1]);
vxSobel3x3Node (graph, virts[1], virts[2], virts[3]);
vxMagnitudeNode(graph, virts[2], virts[3], images[1]);
vxPhaseNode (graph, virts[2], virts[3], images[2]);

The OpenVX* graph must go through a validation process before the graph can be executed:

C++
//Verify the graph
vx_status status = vxVerifyGraph(graph);
//Finally, execute the graph and wait for completion:
//Execute the graph synchronously
vxProcessGraph(graph);

The very last step is to free resources associated with graph, context, images and so on:

C++
vxReleaseGraph(&graph);
...
vxReleaseContext(&context);

For a more complete example including error handling, refer to the samples, described in the Intel(R) Computer Vision SDK Sample Applications section.

Notice that to compile the code, you need to include the header files for OpenVX:

C++
#include <VX/vx.h>

You should also link your code with the OpenVX libraries. Refer to the Quick Start Guide for the details on the necessary environment setup.

Wrapping User Code as OpenVX* User Kernels

OpenVX* allows adding your own algorithms to processing pipelines.

To become a true OpenVX kernel, your user code must implement certain APIs, like the validation function that is called during vxVerifyGraph (refer to the Intel(R) Computer Vision SDK Sample Applications section for code examples). A few other callbacks, like kernel initialization/de-initialization, are optional. All these callbacks, along with the kernel function itself, are provided to vxAddUserKernel:

C++
VX_API vx_kernel vxAddUserKernel(vx_context context,
                             const vx_char name[VX_MAX_KERNEL_NAME],
                             vx_enum enumeration,
                             vx_kernel_f func_ptr,
                             vx_uint32 numParams,
                             vx_kernel_validate_f validate,
                             vx_kernel_initialize_f init,
                             vx_kernel_deinitialize_f deinit);

Then, every parameter that kernel accepts is specified via vxAddParameterToKernel, for example:

C++
vx_kernel kernel = vxAddUserKernel(context,…);
vxAddParameterToKernel(kernel, 0, VX_INPUT,  VX_TYPE_IMAGE…);
vxAddParameterToKernel(kernel, 1, VX_OUTPUT, VX_TYPE_IMAGE…);

Certain run-time behavior tweaks for the kernel are possible via vxSetKernelAttribute.

Finally, kernel publication is completed with the vxFinalizeKernel call.
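
Putting these steps together, a hedged sketch of the full registration sequence might look as follows; the kernel name, enumeration value, and callback names are placeholders for your own code:

C++
//Register the user kernel with its callbacks (names below are hypothetical)
vx_kernel kernel = vxAddUserKernel(context,
                                   "com.example.my_div",   //user kernel name
                                   USER_KERNEL_MY_DIV,     //user-defined kernel enum
                                   myDivKernelFunc,        //execution function
                                   3,                      //number of parameters
                                   myDivValidate,          //validation callback (runs in vxVerifyGraph)
                                   NULL,                   //optional initialization callback
                                   NULL);                  //optional de-initialization callback
//Describe the kernel signature
vxAddParameterToKernel(kernel, 0, VX_INPUT,  VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
vxAddParameterToKernel(kernel, 1, VX_INPUT,  VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
vxAddParameterToKernel(kernel, 2, VX_OUTPUT, VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
//Publish the kernel so that nodes can be created from it
vxFinalizeKernel(kernel);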

Refer to the Video Stabilization SDK sample (<SDK_SAMPLES_ROOT>/samples) for a detailed example of implementing user kernels, and to the Intel(R) Computer Vision SDK Sample Applications section.

Adding User Kernels from a User Kernel Library

To enable re-use of user kernels between applications, the kernels are usually grouped into dynamically loaded libraries. Below is a subset of the related OpenVX APIs:

  • vxLoadKernels: Loads the specified module with kernels into the OpenVX* context. The loaded file is typically a dynamically linked library, but the format is implementation dependent.
  • vxPublishKernels: Entry point that should be implemented in modules loaded by vxLoadKernels. Similarly, vxUnpublishKernels is called when the kernel library is unloaded.
  • vxGetKernelByEnum: Obtains a reference to a kernel using its enumeration.
  • vxGetKernelByName: Obtains a reference to a kernel using its name.
  • vxQueryKernel: Gets information about the number of parameters, enum values, and so on.
  • vxCreateGenericNode: Creates a reference to a node object for a given kernel.
  • vxSetParameterByIndex: Sets parameters to complete a node.

The function vxLoadKernels loads the module into the given context and calls the vxPublishKernels to publish the list of added kernels and their parameter information.

Upon kernel publication, you can use vxGetKernelByEnum or vxGetKernelByName to query kernels and then create a generic node for a kernel object. Finally, vxSetParameterByIndex has to be used to set the arguments, while the argument list can also be queried for each kernel:

C++
//Loads the "libmy_ovx" library with user-kernel for division 
vx_status s = vxLoadKernels(ctx, "libmy_ovx") ;
//Get the kernel, by name
vx_kernel kernel = vxGetKernelByName(ctx, "my_div");
//query the number of kernel parameters
vx_uint32 np = 0;
vxQueryKernel(kernel, VX_KERNEL_PARAMETERS, &np, sizeof(np));
printf("Number of Kernel parameters is: %d\n", np);
for(size_t j = 0; j < np; ++j)
{
    vx_parameter p = vxGetKernelParameterByIndex(kernel, j);
    vx_type_e t;
    vxQueryParameter(p,VX_PARAMETER_TYPE, &t, sizeof(t));
    assert(VX_TYPE_DF_IMAGE==t); //the kernel accepts just images 
}
vx_node div_node = vxCreateGenericNode(graph, kernel);
//Set the arguments to the node, connecting the node to the graph
vx_image img_in0 = vxCreateImage(ctx, 640, 480,VX_DF_IMAGE_U8);
vx_image img_in1 = vxCreateImage(ctx, 640, 480,VX_DF_IMAGE_U8);
vx_image img_out = vxCreateImage(ctx, 640, 480,VX_DF_IMAGE_U8);
vxSetParameterByIndex(div_node, 0, (vx_reference)img_in0);
vxSetParameterByIndex(div_node, 1, (vx_reference)img_in1);
vxSetParameterByIndex(div_node, 2, (vx_reference)img_out);

Alternatively, when the user library header file is available, you can create nodes using regular function declaration:

C++
//in my_div.h:
vx_node vxDivNode(vx_graph g,vx_image i0,vx_image i1,vx_image out);
//in main.cpp:
#include "my_div.h"
...
vxDivNode(graph, in0, in1, out);

General Recommendations on User Kernels

The most straightforward way to combine legacy code with OpenVX* is to execute an OpenVX graph, wait for completion, access its data on the host, and execute the user code on that data, then potentially continue with OpenVX execution, and so on. For OpenCV interoperability examples, refer to the Interoperability with other APIs chapter.

Sometimes it makes sense to include your own code as graph nodes. Below are some user kernel tips:

  • Just like with any other kernel in a graph, there might be multiple node instances for the same user-defined kernel. These instances might be potentially executed in parallel. So your user-kernel code should be thread safe.
  • User kernels may act as implicit memory barriers. In most cases the OpenVX implementation should complete all prior computations and update memory objects that the user kernel depends on. This is not entirely performance friendly. So if you see a user kernel in the hotspots (Intel(R) VTune(TM) Tools Support chapter), consider the following:
    • If the kernel is heavy enough, leverage parallelism within the user kernel, for example by making the code multi-threaded. Using Intel® Threading Building Blocks (TBB) is strongly recommended as the API of choice for parallelism, as it guarantees composability with the rest of the OpenVX runtime. Specifically, using TBB within a user kernel helps avoid thread oversubscription.

    • Consider the Advanced Tiling extension (the Intel Extensions to the OpenVX* API: Advanced Tiling chapter). This way you specify the tiled version of the user kernel and let the runtime schedule the work in the best way, including parallelism.

    • Merging kernels so that they combine steps on local data (kernel fusion) often provides better performance than any automatic approach.

Finally, if your legacy code is in OpenCL™, it is relatively straightforward to wrap calls to the OpenCL API into user kernels. Just like with other types of user code in OpenVX, beware of potential issues when multiple instances of the user kernel might exist in parallel. For example, your code would need to keep separate cl_kernel instances per user node; this avoids clashes caused by setting different parameters on the same cl_kernel from multiple threads. Also, overheads associated with user nodes that use OpenCL might be more pronounced, since communication with the GPU is involved. Finally, this approach lacks data sharing, as the OpenCL context and queues are not shared between the OpenVX runtime and your code.

To summarize, instead of wrapping OpenCL calls with user kernels in OpenVX, consider using Custom OpenCL kernels (section Intel Extensions to the OpenVX* API: OpenCL™ Custom Kernels) for leveraging existing OpenCL kernels.

Striving for Performance

Intel OpenVX* Performance Promise

An OpenVX* approach to the performance extends the conventional one-off function acceleration with the notion of graphs. Graphs expose optimization possibilities that might not be available or obvious with traditional approaches. For example, with the Intel OpenVX implementation, different kernels that share data are not forced to use global (slow) memory. Rather, automatic tiling fits data to cache. Similarly, while parallelism itself is not directly expressed in the graph, the independent data flows are extracted from its structure.

OpenVX also creates a logical model where IP blocks of the Intel SoC fully share system resources such as memory; the code can be scheduled seamlessly on the block that is best able to execute it.

In addition to the global graph-level optimizations, performance of the OpenVX vision functions is also resolved via use of optimized implementation with a strong focus on a particular platform. For the CPU, this is leveraged through Intel® Integrated Performance Primitives (Intel® IPP), which has code branches for different architectures. For the GPU, the matured stack of OpenCL* Just-In-Time compilation to the particular architecture is used.

To achieve good performance, the trade-offs implicit to the OpenVX model of computation must be well understood. This section describes general considerations for OpenVX with respect to performance.

Use OpenVX* Graph Mode to Estimate Performance

OpenVX supports a single-function execution model called immediate mode.

NOTE: Use of immediate mode flavors of the vision functions (prefixed with vxu, for example, vxuColorConvert) still implies using graphs (each comprising just a single function) behind the scenes. Thus, graph verification and other costs, like memory allocation, will be included in the timing and not amortized over multiple nodes/iterations.

Still, the immediate mode can be useful as an intermediate step, for example, when porting an application from OpenCV to OpenVX (see the Example Interoperability with OpenCV section).

Beware of Graph Verification Overheads

The graph verification step is a heavy-weight operation and should be avoided during "tight-loop" execution time. Notice that changing the meta-data (for example, size or type) of the graph inputs might invalidate the graph. Refer to the Map/Unmap for OpenVX* Images section for some tips on updating the data.
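
A minimal sketch of the intended usage pattern is shown below; updateInputImage and numFrames are hypothetical placeholders for your own input-update logic:

C++
//Verify once, outside the processing loop
if (vxVerifyGraph(graph) == VX_SUCCESS)
{
    for (int i = 0; i < numFrames; ++i)
    {
        //Refresh the input data in place (e.g. via vxMapImagePatch/vxUnmapImagePatch),
        //keeping the image meta-data unchanged so the graph stays valid
        updateInputImage(inputImage, i);
        vxProcessGraph(graph);
    }
}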

Comparing OpenVX Performance to Native Code

When comparing OpenVX performance with native code, for example, in C/C++ or OpenCL, make sure that both versions are as similar as possible:

  • Wrap exactly the same set of operations.
  • Do not include graph verification when estimating the execution time. Graph verification is intended to be amortized over multiple iterations of graph execution.
  • Track data transfer costs (reading/writing images, arrays, and so on) separately. Also, use data mapping when possible, since this is closer to the way data is passed in regular code (by pointers).
  • Demand the same accuracy.

Enabling Performance Profiling per Node

So far, we discussed overall performance of the graph. In order to get per-node performance data, the OpenVX* 1.1 spec explicitly mandates enabling performance profiling in the application. There is a dedicated directive for that:

C++
vx_status res = vxDirective(context, VX_DIRECTIVE_ENABLE_PERFORMANCE);

NOTE: Per-node performance profiling is enabled on a per-context basis. As it might introduce certain overheads, disable it in the production code and/or when measuring overall graph performance.

When the profiling is enabled, you can get performance information for a node:

C++
vx_perf_t perf;
vxQueryNode(node, VX_NODE_ATTRIBUTE_PERFORMANCE, &perf, sizeof(perf));
printf("Average exec time for the %h node: %0.3lfms\n", node, (vx_float32)perf.avg/1000000.0f);

This is exactly the data that tools like the Intel® Computer Vision SDK Vision Algorithm Designer (VAD) collect and expose for every node; for example, refer to the Auto-Contrast sample (in the Intel(R) Computer Vision SDK Sample Applications section) and to the VAD Developer Guide. Similarly, a few SDK samples (e.g. Lane Detection) also show this data on a simplified timeline.

NOTE: To get the performance data for nodes running on the GPU, you need to set the following environment variable:

export VX_CL_PERFORMANCE_REPORTING=1

General Rules of Thumb

  • Do not deduce final performance conclusions from individual kernels. Graphs allow the runtime and pipeline manager to do certain system-level optimizations that are simply not possible under a single-function paradigm.
  • Make sure OpenVX* is used on the performance critical path. You can use Intel® VTune™ tools (Intel(R) VTune(TM) Tools Support section) to determine hotspots for your original pipeline, which should suggest the first candidates to try with OpenVX.
  • Consider User Kernels to extend OpenVX or re-use your legacy code within the API; refer to the General Recommendations on User Kernels section for performance tips. Also, Custom OpenCL™ kernels (see the Intel Extensions to the OpenVX* API: OpenCL™ Custom Kernels chapter) are the preferable option for leveraging legacy OpenCL code or writing new OpenVX kernels that target the GPU. See the code samples in the Intel(R) Computer Vision SDK Sample Applications section for more details, including performance considerations.
  • For more tips, especially on the heterogeneous scenarios, refer to the General Tips on Heterogeneous Execution section.

Intel's Extensions to the OpenVX* Primitives

OpenVX* Base Vision Kernels

Intel® Computer Vision SDK comes with the OpenVX* standard set of computer vision functions. For the API details of these functions, refer to Khronos* Group – OpenVX™ Specification 1.1, section on Vision Functions.

Both CPU and GPU OpenVX targets support the core vision functions, while extensions are much more device-specific.

Intel's OpenVX* Kernel Extensions

Intel® Computer Vision SDK separates experimental and approved (Intel's "vendor" and/or Khronos*) extensions to the OpenVX. You can find the details like function prototypes and parameter details in the dedicated Intel® Computer Vision SDK Kernels Extensions document (available at the /documentation/ folder in the SDK installation directory).

Note: Unlike true vendor extensions, support for the kernels defined as the experimental extensions is subject to change without notice.

Refer to the <VX/vx_intel_volatile.h> header file for experimental extension declarations.

Support for Convolutional Neural Networks (CNN)

There is significant interest in Convolutional Neural Networks (CNN) for computer vision, and image recognition using CNNs has made big progress in recent years. The CV SDK supports the official Khronos OpenVX* 1.1 Neural Network extension (see the Khronos* Neural Network Extension topic). This enables the recognition (inference) task of a CNN to be accelerated in embedded environments. For the API and new data objects reference, please refer to the documentation in the /documentation/ directory of the SDK.

NOTE: It is strongly recommended that you try graphs with CNN nodes on the GPU target (see the Scheduling Individual Nodes to Different Targets chapter).

Deploying CNNs can be a challenging task, as the end target environment might look different from the training environment. Training is typically done in high-end data centers, using popular training frameworks like Caffe*, Torch*, and so on. Scoring (or inference), in contrast, can take place on embedded devices. These environments are much more limited, so performance can usually be much better with a dedicated scoring API. These limitations make the deployment and training environments look very different in terms of both supported data types and the API used for scoring.

Intel's Model Optimizer tool is designed to address these particular problems. It simplifies deployment of Convolutional Neural Networks with OpenVX. For example, with Model Optimizer you can automatically convert Caffe layouts and data to OpenVX graphs and formats. Refer to the Intel(R) Computer Vision SDK Sample Applications chapter for code samples that use Model Optimizer to produce graphs.

Kernels Extensions Support by Different Targets (Devices)

Notice that while both the CPU and the GPU fully support the core vision functions as defined by the OpenVX* 1.1 spec, they support different sets of extensions. Unless you use heterogeneous support (the Heterogeneous Computing with Intel® Computer Vision SDK chapter), the difference in support is transparent to you: just add the nodes you need to the graph and the runtime allocates them to the supporting devices.

NOTE: Being the fallback device, the CPU normally supports all the extensions. However, this release comes with one notable exception: the Bilateral Filter is available only on the GPU.

NOTE: None of the IPU extensions has a CPU equivalent (thus there is no automatic fallback if the IPU device is not available).

Carefully consider the availability of the devices and supported extensions for platforms that are your actual targets. For example, the supported development systems are missing the IPU device (available only in systems based on Next-Generation Intel® Atom™ Processors, Formerly Apollo Lake).

Khronos* Neural Network Extension

The Khronos OpenVX* Neural Network extension specifies an architecture for executing CNN-based inference in OpenVX graphs. The extension defines a multi-dimensional tensor object data structure which can be used to connect neural network layers, represented as OpenVX nodes, to create flexible CNN topologies. OpenVX neural network layer types include convolution, pooling, fully connected, normalization, soft-max and activation. The extension enables neural network inferencing to be mixed with traditional vision processing operations in the same OpenVX graph.

For the details on the extension API, refer to the Khronos* OpenVX™ Neural Network Extension documentation.

Interoperability with other APIs

The general approach to sharing data between OpenVX* and media/graphics APIs like Intel® Media Server Studio or OpenGL* is based on sharing system memory. That is, in your code you should map or copy the data from the other API to the CPU address space first. Then the resulting pointer can be wrapped as an OpenVX image. Alternatively, the data that the pointer references can be copied to an OpenVX object, to unlock the original data (and return ownership to the other API).

Similarly, OpenVX data can be accessed on the host and its bits copied or wrapped to the data object of another API. Notice, though, that most APIs do not support memory layouts with strides between pixels that are possible in OpenVX, so your data sharing routine should respect the strides.

In any case, to update the OpenVX image or read data from it, you must use map/unmap methods.

For more details, see the following topics:

Wrapping System Memory with OpenVX* Image

Consider the general interoperability flow:

  1. Map a resource to the CPU.
  2. Create an OpenVX image wrapping the memory:
    C++
    // populate the OpenVX structure specifying the image layout
    vx_imagepatch_addressing_t frameFormat;
    frameFormat.dim_x = ...;    // dimension in x, in pixels
    frameFormat.dim_y = ...;    // dimension in y, in pixels
    frameFormat.stride_x = ...; // stride in x, in bytes
    frameFormat.stride_y = ...; // stride in y, in bytes
    void* frameData = ...;
    // create the input image by wrapping the host-side pointer
    vx_image im = vxCreateImageFromHandle(
        context,
        VX_DF_IMAGE_RGB,
        &frameFormat,
        &frameData,
        VX_MEMORY_TYPE_HOST);
  3. Use the buffer in the OpenVX.
  4. Upon completion of the OpenVX graphs that use the mapped resource bits, the OpenVX image should be destroyed and then the resource can be safely unmapped, returning data ownership to another API.

NOTE: While used by OpenVX*, the original resource (and its mapped data) cannot be used by the application.
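
A minimal sketch of the teardown order in step 4, assuming im is the wrapped image created in step 2:

C++
//Destroy the OpenVX image first...
vxReleaseImage(&im);
//...then it is safe to unmap/release the original resource with the other API,
//returning data ownership to it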

Refer to the Auto-Contrast SDK sample, in the Intel(R) Computer Vision SDK Sample Applications section, for the example of sharing the image bits with OpenCV (also below).

Map/Unmap for OpenVX* Images

Your code does not need to re-create an OpenVX* image each time new data (for example, from a camera) arrives. Setting the new image as a graph input might potentially invalidate the graph, so that implicit verification will be triggered.

Also most APIs (that you might want to do interoperability with) provide no guarantees that the pointer returned upon resource mapping will be the same each time. In many cases, it is easier to create a regular OpenVX image instead, and do a data copy each time the image data needs to be updated. Images, just like other OpenVX resources (for example, arrays or matrices), can be manipulated via Map/Unmap:

C++
//create an image in a regular way
vx_image image = vxCreateImage(context, 640, 480,VX_DF_IMAGE_RGB);
vx_imagepatch_addressing_t addr;
vx_map_id map_id;
void* ptr = 0;
vx_rectangle_t rect = { 0, 0, 640, 480 };
//request a pointer to the image data in system memory
vx_status status = vxMapImagePatch(image, &rect, 0, &map_id, &addr, &ptr, VX_READ_AND_WRITE, VX_MEMORY_TYPE_HOST, 0);
//manipulate image data (referenced by the ptr), e.g. update the values
...
//commit back the changes and pass the ownership back to the OpenVX
status = vxUnmapImagePatch(image, map_id);

Example Interoperability with OpenCV

Unlike APIs that use dedicated address space and/or special data layouts (e.g. compressed OpenGL* textures), regular OpenCV data objects like cv::Mat reside in the conventional system memory. That is, the memory can be actually shared with OpenVX* and only data ownership to be transferred:

C++
//Capture from default webcam 
cv::VideoCapture inp;
inp.open(0);
cv::Mat matBGR, matRGB;
//read the first frame so that the matrices allocate their data, etc.
inp.read(matBGR);
//convert the format, as OpenCV’s default is BGR
cv::cvtColor(matBGR, matRGB, cv::COLOR_BGR2RGB);
// wrap into the OpenVX
vx_imagepatch_addressing_t addr;
addr.dim_x = matRGB.cols;
addr.dim_y = matRGB.rows;
addr.stride_x = matRGB.elemSize();
addr.stride_y = matRGB.step;
void *ptr[] = { matRGB.data };
vx_image image=vxCreateImageFromHandle(ctx,…,&addr,ptr, VX_MEMORY_TYPE_HOST);
//Capture frames
while(1)
{
    // Per spec, mapping of the image created from handle should return 
    // same address (and same memory layout)
    void*p = NULL;
    vx_imagepatch_addressing_t addr;
    vx_rectangle_t rect = { 0, 0, matRGB.cols, matRGB.rows };
    //Request pointer to the system mem
    vx_map_id map_id;
    vxMapImagePatch(image, &rect, 0, &map_id, &addr, &p, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST, 0);
    //Read the next image
    inp.read(matBGR);
    cv::cvtColor(matBGR, matRGB, cv::COLOR_BGR2RGB);
    // Notify the OpenVX we are done
    vxUnmapImagePatch(image, map_id);
    //Do processing with OpenVX
    ...
}

Notice that the original cv::Mat cannot be used simultaneously by the application and OpenVX.

Other important caveats:

  • Make sure that OpenCV cv::Mat object is not destroyed (or resized, etc.) before work on the corresponding vx_image object finishes.
  • Notice that by default OpenCV uses the BGR color space for color images, while OpenVX supports only the RGB channel order, so you might need a format conversion (see the example above).

Refer to the SDK samples in the Intel(R) Computer Vision SDK Sample Applications section for OpenCV capturing, OpenVX* processing, and OpenCV rendering.

Example of OpenCV Processing of the OpenVX* Image

Often, the standard OpenVX* kernels and extensions are not enough for implementing a particular computer vision pipeline. In such cases, a vx_image can be accessed with OpenCV using the same vxMapImagePatch/vxUnmapImagePatch functions (see the Example Interoperability with OpenCV section):

C++
void callOpenCVGaussianBlur(vx_image input, vx_image output)
{
    vx_uint32 width = 0, height = 0;
    vxQueryImage(input, VX_IMAGE_WIDTH, &width, sizeof(width));
    vxQueryImage(input, VX_IMAGE_HEIGHT, &height, sizeof(height));
    vx_rectangle_t rect = { 0u, 0u, width, height };
    vx_imagepatch_addressing_t input_addr = {};
    void *input_ptr = nullptr;
    vx_map_id map_id_in;
    vxMapImagePatch(input, &rect, 0, &map_id_in, &input_addr, &input_ptr, VX_READ_ONLY, VX_MEMORY_TYPE_HOST, 0);
    vx_imagepatch_addressing_t output_addr = {};
    void *output_ptr = nullptr;
    vx_map_id map_id_out;
    vxMapImagePatch(output, &rect, 0, &map_id_out, &output_addr, &output_ptr, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST, 0);
    cv::Mat mat_input(input_addr.dim_y * VX_SCALE_UNITY / input_addr.scale_y,
                      input_addr.dim_x * VX_SCALE_UNITY / input_addr.scale_x,
                      CV_8UC1,
                      input_ptr,
                      input_addr.stride_y);
    cv::Mat mat_output(output_addr.dim_y * VX_SCALE_UNITY / output_addr.scale_y,
                       output_addr.dim_x * VX_SCALE_UNITY / output_addr.scale_x,
                       CV_8UC1,
                       output_ptr,
                       output_addr.stride_y);
    cv::GaussianBlur(mat_input, mat_output, cv::Size(5, 5), 1.5);
    vxUnmapImagePatch(input, map_id_in);
    vxUnmapImagePatch(output, map_id_out);
}

Also, you can make the function above fully pluggable into OpenVX graphs. Specifically, OpenVX* supports user-defined functions that are executed as nodes from inside the graph. As detailed in multiple examples (see the Intel(R) Computer Vision SDK Sample Applications chapter), in order to define a user kernel, you just need to implement several callback functions in addition to the execution routine itself.

Interoperability with OpenCV Transparent API

OpenCV 3.0 introduced the new Transparent API (T-API) that enables OpenCL*-based code paths for OpenCV functions. The T-API is based on cv::UMat objects, which abstract the data representation. Notice that all access to cv::UMat objects should be done via cv::UMat methods and not directly via data pointers.

To mix cv::UMat and vx_image objects, you can use the vx_image to cv::Mat mapping approach described above, followed by the OpenCV copyTo method to propagate data between the mapped cv::Mat and cv::UMat objects:

C++
void callOpenCV_TAPI(vx_image input, vx_image output)
{
    vx_uint32 width = 0, height = 0;
    vxQueryImage(input, VX_IMAGE_WIDTH, &width, sizeof(width));
    vxQueryImage(input, VX_IMAGE_HEIGHT, &height, sizeof(height));
    vx_rectangle_t rect = { 0u, 0u, width, height };
    vx_imagepatch_addressing_t input_addr = {};
    void *input_ptr = nullptr;
    vx_map_id map_id_in;
    vxMapImagePatch(input, &rect, 0, &map_id_in, &input_addr, &input_ptr, VX_READ_ONLY, VX_MEMORY_TYPE_HOST, 0);
    vx_imagepatch_addressing_t output_addr = {};
    void *output_ptr = nullptr;
    vx_map_id map_id_out;
    vxMapImagePatch(output, &rect, 0, &map_id_out, &output_addr, &output_ptr, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST, 0);
    cv::Mat mat_input(input_addr.dim_y * VX_SCALE_UNITY / input_addr.scale_y,
                      input_addr.dim_x * VX_SCALE_UNITY / input_addr.scale_x,
                      CV_8UC1,
                      input_ptr,
                      input_addr.stride_y);
    cv::Mat mat_output(
                output_addr.dim_y * VX_SCALE_UNITY / output_addr.scale_y, 
                output_addr.dim_x * VX_SCALE_UNITY / output_addr.scale_x,
                       CV_8UC1,
                       output_ptr,
                       output_addr.stride_y);
    cv::UMat umat_input;
    mat_input.copyTo(umat_input);
    cv::UMat umat_tmp, umat_output;
    cv::GaussianBlur(umat_input, umat_tmp, cv::Size(5, 5), 1.5);
    cv::Canny(umat_tmp, umat_output, 50, 100);
    umat_output.copyTo(mat_output);
    vxUnmapImagePatch(input, map_id_in);
    vxUnmapImagePatch(output, map_id_out);
}

Difference in Interpretation of the OpenCV and OpenVX* Parameters

When porting from OpenCV to OpenVX* there are cases of straightforward parameter matching between the APIs. However, there are cases where the parameter meanings differ. For example, both OpenCV and OpenVX offer WarpAffine and WarpPerspective functions, but with the following subtleties:

  • OpenVX's WarpAffine and WarpPerspective implement the cv::WARP_INVERSE_MAP flavor of the corresponding OpenCV calls.
  • OpenCV and OpenVX use different formulas to access the transformation matrix. In order to convert a transformation matrix for OpenVX, the OpenCV matrix must be transposed (see the sketch after this list).
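
For illustration, below is a hedged sketch of converting an OpenCV 2x3 affine matrix (cvM, of type CV_32F) into an OpenVX matrix for vxWarpAffineNode; double-check the expected row/column layout against your OpenVX implementation:

C++
//Transpose the OpenCV 2x3 affine matrix into a 3x2 buffer for OpenVX
vx_float32 data[3][2];
for (int r = 0; r < 2; ++r)
    for (int c = 0; c < 3; ++c)
        data[c][r] = cvM.at<float>(r, c);
//Create the OpenVX matrix (2 columns x 3 rows) and copy the coefficients
vx_matrix warpMat = vxCreateMatrix(context, VX_TYPE_FLOAT32, 2, 3);
vxCopyMatrix(warpMat, data, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);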

Similarly, OpenCV's Filter2D and OpenVX's Convolution are both filtering routines that can be used interchangeably, subject to the following:

  • The OpenVX Convolution formula differs from OpenCV's Filter2D. In order to get the same result in OpenVX, the original OpenCV "kernel" parameter must be flipped around both axes.
  • OpenVX Convolution uses a 16-bit signed integer data type for the "kernel" parameter, along with an additional scale parameter, so the OpenCV floating-point kernel must be converted accordingly (see the sketch below).
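
A hedged sketch of such a conversion is shown below, assuming a 3x3 floating-point OpenCV kernel cvKernel; the scale value is illustrative and must be a power of two:

C++
//Flip the OpenCV kernel around both axes and scale the coefficients to int16
vx_int16 coeffs[3][3];
vx_uint32 scale = 256; //must be a power of two
for (int r = 0; r < 3; ++r)
    for (int c = 0; c < 3; ++c)
        coeffs[r][c] = (vx_int16)(cvKernel.at<float>(2 - r, 2 - c) * scale);
//Create the OpenVX convolution object, then set the coefficients and the scale
vx_convolution conv = vxCreateConvolution(context, 3, 3);
vxCopyConvolutionCoefficients(conv, coeffs, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);
vxSetConvolutionAttribute(conv, VX_CONVOLUTION_SCALE, &scale, sizeof(scale));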

Intel's Extensions to the OpenVX* API: OpenCL™ Custom Kernels

Custom OpenCL™ Kernels is the short name for the Intel OpenVX* extension that allows using code from regular OpenCL kernels as OpenVX user kernels. For code examples, please refer to the Intel(R) Computer Vision SDK Sample Applications section.

NOTE: This extension is an experimental feature. It is subject to potential changes without notice. OpenCL custom kernels are supported only for the GPU target.

The current extension API limits the OpenVX parameter types that OpenCL kernels can access. The following table shows how each OpenVX object is translated into OpenCL kernel arguments:

OpenVX* Object | OpenCL™ Kernel Argument (OpenCL™ C code)

vx_image:
  global uchar* ptr,
  uint width,       // width of the output image
  uint height,      // height of the output image
  uint pixelStride, // pixel stride in bytes
  uint rowPitch,    // row stride in bytes
  // repeat for every plane (for multi-planar images)

vx_array:
  global uchar* ptr,
  uint32 items,    // number of items
  uint32 itemSize, // size of an item in bytes
  uint32 stride    // item stride in bytes

vx_matrix:
  global uchar* ptr, // pointer to matrix items
  uint32 rows,       // number of rows
  uint32 cols        // number of columns

vx_tensor:
  global dtype* ptr,  // data pointer (short/float/half)
  long offset,        // ptr offset where actual content starts (in bytes)
  int num_dims,       // number of dimensions of the tensor
  global int* dims,   // dimensions sizes
  global int* strides // dimensions strides

vx_scalar (VX_TYPE_INT16): short
vx_scalar (VX_TYPE_INT32): int
vx_scalar (VX_TYPE_INT64): long
vx_scalar (VX_TYPE_UINT16): ushort
vx_scalar (VX_TYPE_UINT32): uint
vx_scalar (VX_TYPE_UINT64): ulong
vx_scalar (VX_TYPE_FLOAT32): float
vx_scalar (VX_TYPE_FLOAT64): double
vx_scalar (VX_ENUM): int

vx_threshold:
  int value_or_upper, // value for VX_THRESHOLD_TYPE_BINARY, upper for VX_THRESHOLD_TYPE_RANGE
  int lower,          // lower, defined for VX_THRESHOLD_TYPE_RANGE
  int true_value,     // VX_THRESHOLD_TRUE_VALUE attribute
  int false_value     // VX_THRESHOLD_FALSE_VALUE attribute

When an application sets an OpenVX data object, like a vx_image or vx_array, as an argument to the node, the OpenVX framework passes the actual data pointer to the OpenCL kernel invocation. All work-item built-in functions like get_global_id() are supported, so it is the programmer's responsibility to ensure correct id-based pointer calculations and legal access to OpenVX data objects from the OpenCL kernel.
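
For illustration, below is a hedged sketch of an OpenCL C kernel body matching the vx_image argument translation above; the kernel name and the one-work-item-per-pixel mapping are assumptions, and each vx_image node parameter expands into the five arguments shown:

OpenCL C
__kernel void my_invert(__global uchar* inPtr,  uint inWidth,  uint inHeight,
                        uint inPixelStride,  uint inRowPitch,
                        __global uchar* outPtr, uint outWidth, uint outHeight,
                        uint outPixelStride, uint outRowPitch)
{
    uint x = get_global_id(0);
    uint y = get_global_id(1);
    if (x >= outWidth || y >= outHeight)
        return; //stay within the image bounds
    //compute the pixel addresses from the strides passed by the runtime
    uchar v = inPtr[y * inRowPitch + x * inPixelStride];
    outPtr[y * outRowPitch + x * outPixelStride] = (uchar)(255 - v);
}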

Refer to the Intel(R) Computer Vision SDK Sample Applications section for examples of the extension API in use. You can also find dedicated tutorials there, which explain the OpenCL custom kernels in further detail.

In the case of Custom OpenCL Kernels which support image tiling (added to the context via vxIntelAddDeviceTilingKernel), the argument translation is different for vx_image:

OpenVX* Object | OpenCL™ Tiling Kernel Argument (C code)

vx_image:
  global uchar* ptr,
  int tile_x,       // x coordinate of the tile
  int tile_y,       // y coordinate of the tile
  int tile_width,   // width of the tile
  int tile_height,  // height of the tile
  int image_width,  // width of the full image
  int image_height, // height of the full image
  int pixelStride,  // pixel stride in bytes
  int rowStride,    // row stride in bytes
  // repeat for every plane (for multi-planar images)

NOTE: ptr corresponds to the starting location of the tile, not the starting location of the full image. Pixels which reside outside of the given tile should not be accessed.

Furthermore, the kernel attribute VX_KERNEL_INPUT_NEIGHBORHOOD_INTEL should be set with a corresponding vx_neighborhood_size_intel_t object. This is used by the runtime to determine the required size and location of the input tile, given an output tile.

For example, the following code snippet demonstrates the required neighborhood that would be set for a 5x5 filter.

C++
vx_neighborhood_size_intel_t n;
n.top = -2;  //requires 2 additional input pixels above
n.left = -2; //requires 2 additional input pixels to the left
n.right = 2; //requires 2 additional input pixels to the right
n.bottom = 2; //requires 2 additional input pixels below
vxSetKernelAttribute(kernel, 
                     VX_KERNEL_INPUT_NEIGHBORHOOD_INTEL, 
                     &n,
                     sizeof(vx_neighborhood_size_intel_t));

NOTE: For input tiles in which the set neighborhood causes a portion of the input tile to reside outside of the valid region of the input image, such as along the edges of the image, the tile is cropped to contain only the portion which intersects with the valid region of the image.

Intel's Extensions to the OpenVX* API: Advanced Tiling

The OpenVX* Advanced Tiling Extension is an Intel API extension. It is designed to support kernels that require complex tile dependencies, beyond the one-to-one mapping of the Khronos OpenVX* User Kernel Tiling Extension.

For a detailed code example, refer to the Census-Transform SDK sample description in the Intel(R) Computer Vision SDK Sample Applications section.

The Advanced Tiling Extension accommodates the following:

  • Kernels requiring non-trivial mapping between input/output tiles, such as rotation or scale.
  • Kernels for which output tiles need to be produced serially (like error diffusion, below).
  • Kernels for which an output tile depends on different tiles from multiple input images.

An advanced tiling kernel is added to an OpenVX context using the following interface:

C++
vx_kernel vxAddAdvancedTilingKernelIntel(vx_context context,
        vx_char name[VX_MAX_KERNEL_NAME], vx_enum enumeration,
        vx_advanced_tiling_kernel_intel_f kernel_func_ptr,
        vx_mapping_intel_f mapping_func_ptr,
        vx_uint32 num_params,
        vx_kernel_validate_f validate,
        vx_kernel_initialize_f initialize,
        vx_kernel_deinitialize_f deinitialize,
        vx_kernel_preprocess_intel_f preprocess,
        vx_kernel_postprocess_intel_f postprocess,
        vx_kernel_set_tile_dimensions_intel_f settiledimensions,
        vx_kernel_tile_dimensions_initialize_intel_f tiledimensionsinit);

For more information, see the following topics:

Mandatory Callbacks for Kernels Supporting Advanced Tiling

Registering an advanced tiling kernel requires defining and specifying the following callback functions for the vxAddAdvancedTilingKernelIntel:

  • Input and output validation function. It is called during vxVerifyGraph to validate every input and output parameter (defined by the index):
    C++
    typedef vx_status (*vx_kernel_validate_f) (vx_node node, const vx_reference parameters[], 
    vx_uint32 num, vx_meta_format metas[]);

    This function is a regular callback required for any user kernel as defined by the OpenVX* 1.1 specification.

  • Advanced Tiling Kernel function. This function is called for each tile contained within the output image, during graph execution.
    C++
    typedef vx_status (*vx_advanced_tiling_kernel_intel_f)(vx_node node, void  * parameters[], 
    vx_uint32 num, void *tile_memory, vx_size tile_memory_size);
  • Tile-Mapping function, which describes dependencies between an output tile and input tiles during graph verification within vxVerifyGraph. The purpose of this function is to set attributes of the input tile (srcRectOut) required to produce the given output tile (dstRectIn, which is passed as a const parameter to the tile-mapping function).

    If the kernel has multiple vx_image input parameters, this tile-mapping function will be called separately for each input. The param_num parameter specifies the input parameter index for which the runtime is requesting a tile mapping.

    C++
    typedef vx_status (*vx_mapping_intel_f) (vx_node node, vx_reference parameters[], 
    const vx_tile_t_attributes_intel_t* dstRectIn, vx_tile_t_attributes_intel_t* srcRectOut,vx_uint32 param_num);

Refer to the Putting it all Together: Example Code topic for a code example of tiled kernels.

Optional Callbacks for Kernels Supporting Advanced Tiling

An advanced tiling kernel can optionally provide the following callback functions, which if set, are called by the OpenVX* runtime within vxVerifyGraph or vxProcessGraph:

  • Initialize
    C++
    typedef vx_status (*vx_kernel_initialize_f)(vx_node node, const vx_reference 
     *parameters, vx_uint32 num); 

    If set, called once during vxVerifyGraph after all input and output parameters have been verified.

  • De-Initialize

    C++
    typedef vx_status (*vx_kernel_deinitialize_f)(vx_node node, const vx_reference 
     *parameters, vx_uint32 num);

    If set, this function is called during node destruction.

    The two functions above are regular callbacks required for any user kernel as defined by the OpenVX* 1.0.1/1.1 specifications. Refer to the standard specification from Khronos* for more details on usage and the Putting it all Together: Example Code section for a code example.

  • Pre-Process

    C++
    typedef vx_status (*vx_kernel_preprocess_intel_f)(vx_node node, const 
     vx_reference *parameters, vx_uint32 num, void * tile_memory[], vx_uint32 
     num_tile_memory_elements,vx_size tile_memory_size);

    If set, this function is called once at the beginning of each user call to vxProcessGraph, before any nodes in the graph have started processing. It is intended to be used to perform any required per-vxProcessGraph initialization, which may be required for certain kernels.

  • Post-Process

    C++
    typedef vx_status (*vx_kernel_postprocess_intel_f)(vx_node node, const 
     vx_reference *parameters, vx_uint32 num, void * tile_memory[], vx_uint32 
     num_tile_memory_elements,vx_size tile_memory_size);

    If set, this function is called once at the end of each user call to vxProcessGraph, after all nodes in the graph have completed processing. It is intended to be used to perform any required per-vxProcessGraph de-initialization or data aggregation that may be required for certain kernels.

    NOTE: As this function is called after all nodes in the graph have completed processing, it is not a good idea to set values of output parameters in this function, if the parameter is further used as an input parameter to another node in the same graph.

  • Set Tile Dimensions

    C++
    typedef vx_status (*vx_kernel_set_tile_dimensions_intel_f)(vx_node node, 
     const vx_reference *parameters, vx_uint32 param_num, const vx_tile_block_size_intel_t 
     *current_tile_dimensions, vx_tile_block_size_intel_t *updated_tile_dimensions);

    If set, this function is called within vxVerifyGraph to give a kernel the ability to override the current output tile size chosen by the runtime with a custom tile size. For example, a kernel may be designed to work only with tile widths that equal the output image size. The function is required to set the updated tile dimensions using a combination of the kernel parameters and current tile dimensions, which is passed in as an input parameter.

  • Initialize (as a function of tile dimensions)

    C++
    typedef vx_status (*vx_kernel_tile_dimensions_initialize_intel_f)(vx_node 
     node, const vx_reference *parameters, vx_uint32 param_num, const vx_tile_block_size_intel_t 
     *current_tile_dimensions);

    If set, this function is called within vxVerifyGraph to give a kernel the ability to perform some initialization as a function of the currently set tile dimensions. For example, a kernel may want to set some node attribute, such as VX_KERNEL_TILE_MEMORY_SIZE_INTEL, as a function of the currently set tile dimensions.

Advanced Tiling Sequence Diagram during vxVerifyGraph

Below is a call sequence diagram for the required and optional advanced tiling kernel callbacks within vxVerifyGraph:

Advanced Tiling Sequence Diagram during vxProcessGraph

Below is a call sequence diagram for the required and optional advanced tiling kernel functions within vxProcessGraph:

Customizing Behavior of the Advanced Tiling Kernels

In addition to the function callbacks passed into vxAddAdvancedTilingKernelIntel upon kernel registration, there are also a number of kernel attributes that can be set with vxSetKernelAttribute.

NOTE: All the attributes described in this section must be set before the call to vxFinalizeKernel.

For example, consider the family of attributes defined by the vx_node_attribute_tiling_intel_e enumeration:

  • VX_KERNEL_SERIAL_TYPE_INTEL which is used to indicate that individual output tiles of this kernel must be produced "in order".

    The specific order is defined by providing a parameter of type vx_serial_type_intel_e:

    C++
    vx_serial_type_intel_e v = <specific order, below>;
            vxSetKernelAttribute(k,VX_KERNEL_SERIAL_TYPE_INTEL,&v,sizeof(v));

    Possible flavors of execution order are the following:

  • VX_SERIAL_LEFT_TO_RIGHT_TOP_TO_BOTTOM_INTEL

    For a given output tile to be produced, the tile to the "left" in the X direction must be completed first. If the given output tile’s x coordinate is 0, the "right" most tile on the previous row of tiles must be completed. The tile with x,y coordinates (0,0) is the only output tile without dependency.

  • VX_SERIAL_LEFTTOP_TO_RIGHTBOTTOM_INTEL

    For a given output tile to be produced, the tile to the "left" in the X direction must be completed, the tile "above" in the Y direction must be completed, and the "upper left" tile must be completed. If the given output tile’s x coordinate is 0, the tile is only dependent on the "above" tile. If the given output tile’s y coordinate is 0, the tile is only dependent on the "left" tile. The tile with x,y coordinates (0,0) is the only output tile without dependency.

In-place Processing with Advanced Tiling Kernels

Two dedicated kernel attributes (VX_KERNEL_INPLACE_KERNEL_INTEL and VX_KERNEL_INPLACE_PARAMS_FUNC_INTEL) together control how intermediate buffers are used.

Usually, the OpenVX* runtime maintains separate intermediate buffers for the input and output of a node. For some kernels, it is possible to save memory and improve performance by using the same buffer. This is called an "in-place" buffer.

  • VX_KERNEL_INPLACE_PARAMS_FUNC_INTEL is set to a function pointer of type vx_inplace_params_intel_f:
    C++
    typedef vx_int32 (*vx_inplace_params_intel_f) (vx_reference parameters[], 
             vx_uint32 output_image_index);

    The intent of this function is to notify the runtime which input vx_image can be shared with a given output vx_image. The index of the given output vx_image parameter is passed in. This function should return the parameter index of the input vx_image which can share a buffer (be in-placed) with the given output vx_image. If the given output vx_image cannot be shared with any input vx_image, the function should return -1.

  • VX_KERNEL_INPLACE_KERNEL_INTEL is set to a function pointer of type vx_advanced_tiling_kernel_intel_f. This is the same function type as the advanced kernel function itself (that is passed in vxAddAdvancedTilingKernelIntel upon kernel registration). In many cases, the same kernel function can be used regardless of whether "in-place" buffers are used or not.

Even if both of these attributes are set, the OpenVX* runtime can ignore the request for "in-place" buffers in certain situations. For example, if the input vx_image is not a virtual image, or if the input buffer is shared with another node.
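
A minimal sketch of such a mapping callback, assuming a hypothetical kernel whose parameter 0 is the input image and parameter 2 is the output image:

C++
static vx_int32 myInplaceParams(vx_reference parameters[], vx_uint32 output_image_index)
{
    (void)parameters; //not needed for this simple mapping
    //The output image at parameter index 2 may share its buffer with the input at index 0
    if (output_image_index == 2)
        return 0;
    //No in-place sharing for any other output
    return -1;
}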

Using Dedicated Thread Scratch Memory with Advanced Tiling Kernels

The kernel function vx_advanced_tiling_kernel_intel_f may be called by multiple runtime threads simultaneously to process many tiles in parallel. A kernel may require some scratch memory to be available, dedicated to each worker thread, to aid in processing, or to be used as thread-specific storage.

If the desired size of dedicated thread scratch memory is known before vxFinalizeKernel is called, the following kernel attribute can be set as the dedicated thread scratch memory size in bytes:

VX_KERNEL_TILE_MEMORY_SIZE_INTEL

If the desired size of dedicated thread scratch memory is not a constant for all instances of a given kernel, and instead is a function of a parameter or an attribute that is not known until vxVerifyGraph, the following node attribute can be set as the dedicated thread scratch memory size in bytes:

VX_NODE_TILE_MEMORY_SIZE_INTEL

If set, the runtime allocates a cache-aligned buffer for each runtime thread which may call vx_advanced_tiling_kernel_intel_f during vxProcessGraph. For each call to the advanced tiling kernel function, the runtime passes in tile_memory, the starting pointer of the thread's dedicated scratch memory buffer, along with tile_memory_size, the allocated size of the buffer.

C++
typedef vx_status (*vx_advanced_tiling_kernel_intel_f)(vx_node node, 
 void * parameters[], vx_uint32 num, void * tile_memory, vx_size tile_memory_size);

If neither of the "tile memory size" attributes described above are set, vx_advanced_tiling_kernel_intel_f is called with a tile_memory pointer equal to NULL, and tile_memory_size equal to 0.
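
As an illustration, a hedged sketch of requesting a fixed amount of per-thread scratch memory at kernel-registration time might look as follows; the 4 KB size and the vx_size attribute value type are assumptions for this example:

C++
// Sketch only: request 4 KB of dedicated scratch memory per runtime thread.
// Must be done before vxFinalizeKernel is called on this kernel.
vx_size scratch_bytes = 4096;
vxSetKernelAttribute(kernel, VX_KERNEL_TILE_MEMORY_SIZE_INTEL,
                     &scratch_bytes, sizeof(scratch_bytes));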

In addition to passing the dedicated thread scratch memory to vx_advanced_tiling_kernel_intel_f, the buffers are also passed to the optional callback functions vx_kernel_preprocess_intel_f and vx_kernel_postprocess_intel_f:

C++
typedef vx_status (*vx_kernel_preprocess_intel_f)(vx_node node,
    const vx_reference *parameters, vx_uint32 num, void * tile_memory[],
    vx_uint32 num_tile_memory_elements, vx_size tile_memory_size);
C++
typedef vx_status (*vx_kernel_postprocess_intel_f)(vx_node node,
    const vx_reference *parameters, vx_uint32 num, void * tile_memory[],
    vx_uint32 num_tile_memory_elements, vx_size tile_memory_size);

As these functions are only called once per vxProcessGraph, the pointers to each of the thread specific buffers are passed in as parameters. tile_memory is an array of pointers. Each element of this array corresponds to a dedicated scratch memory buffer for a particular thread. num_tile_memory_elements specifies the number of elements within the tile_memory array. tile_memory_size specifies the allocated size for the buffers.

Access to the thread-specific scratch buffers during "Pre-Process" and "Post-Process" can be useful in many ways. One specific example is a custom histogram kernel. This histogram kernel may choose to accumulate histogram data into thread-specific histograms, as this is much more efficient than synchronizing access to a single histogram during vxProcessGraph. Within vx_kernel_preprocess_intel_f, the kernel would need to ensure that the thread-specific histograms are initialized with zeros. Within vx_kernel_postprocess_intel_f, the kernel would generate the final histogram by accumulating, bin by bin, the results of every thread-specific histogram.
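
As an illustration of this pattern, below is a minimal, hedged sketch of such a postprocess callback; the 256-bin vx_uint32 layout of each per-thread buffer and the function name are assumptions made for this example only:

C++
// Sketch only: reduce per-thread partial histograms into one final histogram.
// Assumes each thread's scratch buffer holds 256 vx_uint32 bins.
static vx_status HistogramPostProcess(vx_node node, const vx_reference *parameters,
    vx_uint32 num, void * tile_memory[], vx_uint32 num_tile_memory_elements,
    vx_size tile_memory_size)
{
    vx_uint32 final_hist[256] = {0};
    for (vx_uint32 t = 0; t < num_tile_memory_elements; t++)
    {
        const vx_uint32 *thread_hist = (const vx_uint32 *)tile_memory[t];
        for (int bin = 0; bin < 256; bin++)
            final_hist[bin] += thread_hist[bin];   // accumulate bin by bin
    }
    // ...write final_hist into the output distribution parameter here...
    return VX_SUCCESS;
}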

Putting it all Together: Example Code

This section summarizes the code snippets required to implement the callback functions for an advanced tiling user node. The example node implements an image transpose operation.

First, any user node requires a parameter validation function per the specification (error handling is omitted):

C++
static vx_status TranspValidator (vx_node node, const vx_reference parameters[], vx_uint32 num, vx_meta_format metas[])
{
    if (num != 2) //the node accepts a single input (image) and a single output only
        return VX_ERROR_INVALID_PARAMETERS;
    vx_status status = VX_FAILURE;
    vx_df_image format;
    vxQueryImage((vx_image)parameters[0], VX_IMAGE_FORMAT, &format, sizeof(format)); //input image format
    if (format != VX_DF_IMAGE_U8) //the node supports only U8 images
    {
        return VX_ERROR_INVALID_FORMAT;
    }

    vx_uint32 in_width = 0, in_height = 0;
    const size_t sz = sizeof(vx_uint32);
    //Query the width & height of the input image
    vxQueryImage((vx_image)parameters[0], VX_IMAGE_WIDTH,  &in_width,  sz);
    vxQueryImage((vx_image)parameters[0], VX_IMAGE_HEIGHT, &in_height, sz);
    //Swap the width & height as required by transpose
    vx_uint32 out_width  = in_height;
    vx_uint32 out_height = in_width;
    format = VX_DF_IMAGE_U8; //output format is the same as the input
    status  = vxSetMetaFormatAttribute(metas[1], VX_IMAGE_WIDTH,  &out_width,  sz);
    status |= vxSetMetaFormatAttribute(metas[1], VX_IMAGE_HEIGHT, &out_height, sz);
    status |= vxSetMetaFormatAttribute(metas[1], VX_IMAGE_FORMAT, &format, sizeof(format));
    return status;
}

This function verifies that the (single) input image is of the correct format. It also validates the output parameter and sets the "meta format attributes" defining the output image. In the case of transpose, the output width equals the input height, and the output height equals the input width. The validator also specifies that the output image is of the same format as the input (VX_DF_IMAGE_U8).

The tile mapping function is the first callback that is specific to advanced tiling:

C++
static vx_status TranspTileMapping(vx_node node, vx_reference params[],
        const vx_tile_t_attributes_intel_t* dstRectIn,
        vx_tile_t_attributes_intel_t* srcRectOut,
        vx_uint32 param_num)
{
    //Given the output tile attributes (dstRectIn),
    //set the input tile attributes which are required (srcRectOut)
    srcRectOut->x = dstRectIn->y;
    srcRectOut->y = dstRectIn->x;
    srcRectOut->tile_block.width  = dstRectIn->tile_block.height;
    srcRectOut->tile_block.height = dstRectIn->tile_block.width;
    return VX_SUCCESS;
}

This function sets the input tile which is required to produce a given output tile. In the case of transpose, the mapping is trivial (swap).

The last function that is required is the kernel function, which actually performs the computation of the transpose:

C++
static vx_status TranspAdvancedTilingKernel(vx_node node,
                   void * parameters[],
                   vx_uint32 num,
                   void * tile_memory,
                   vx_size tile_memory_size)
{
    vx_tile_intel_t *pInTileIn  = (vx_tile_intel_t *)parameters[0];
    vx_tile_intel_t *pInTileOut = (vx_tile_intel_t *)parameters[1];
    const vx_uint8 *pSrc = pInTileIn->base[0];
    vx_uint8 *pDst = pInTileOut->base[0];
        
    const int srcStride = pInTileIn->addr[0].stride_y;
    const int dstStride = pInTileOut->addr[0].stride_y;
    const int output_tile_width = pInTileOut->addr[0].dim_x;
    const int output_tile_height = pInTileOut->addr[0].dim_y;
    vx_uint8 *pDstRow = pDst;
    //for each output line
    for(int y = 0; y < output_tile_height; y++)
    {
        const vx_uint8 *pSrcColumn = pSrc + y*sizeof(vx_uint8);
        //for each output pixel for this line
        for(int x = 0; x < output_tile_width; x++)
        {
            pDstRow[x] = *pSrcColumn;
            pSrcColumn += srcStride;
        }
        pDstRow += dstStride;
    }
    return VX_SUCCESS;
}

The runtime passes the pointer for the output tile to be processed, along with the pointer for the input tile that was requested through the tile-mapping function. Note the correct usage of the stride_y for the source and destination buffers. The tiling kernel should make no assumptions about the way in which the buffers are laid out in system memory and respect the striding.

C++
//create a kernel via the vxAddAdvancedTilingKernelIntel interface
vx_kernel kernel = vxAddAdvancedTilingKernelIntel
        (context, 
        (vx_char *)"AdvancedTilingTranspose",//name
        TRANSPOSE_ID,                       //enumeration
        TranspAdvancedTilingKernel,         //kernel_func_ptr
        TranspTileMapping,                  //mapping_func_ptr
        2,                                  //num_params
        TranspValidator,               //input/output validate
        NULL,                               //initialize
        NULL,                               //deinitialize
        NULL,                               //preprocess
        NULL,                               //postprocess
        NULL,                               //settiledimensions
        NULL);                              //tiledimensionsinit
//add the parameters for this kernel
vxAddParameterToKernel(kernel, 0, VX_INPUT, VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
vxAddParameterToKernel(kernel, 1, VX_OUTPUT, VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
//done with publishing the kernel
vxFinalizeKernel(kernel);
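
Once published, the kernel can be instantiated in a graph like any other user kernel. Below is a minimal usage sketch; the image dimensions are arbitrary assumptions for illustration:

C++
// Sketch only: instantiate the published transpose kernel in a graph.
vx_graph graph = vxCreateGraph(context);
vx_image in  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
vx_image out = vxCreateImage(context, 480, 640, VX_DF_IMAGE_U8); //transposed dimensions
vx_kernel transp = vxGetKernelByName(context, (vx_char *)"AdvancedTilingTranspose");
vx_node  node    = vxCreateGenericNode(graph, transp);
vxSetParameterByIndex(node, 0, (vx_reference)in);
vxSetParameterByIndex(node, 1, (vx_reference)out);
if (vxVerifyGraph(graph) == VX_SUCCESS)
    vxProcessGraph(graph);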

Advanced Tiling Sample Code for Error Diffusion

Error diffusion is a common technique used to convert a multi-level image to bi-level. In this example, we will implement a version of error diffusion that takes as input a single channel, 8 bit/pixel image (range 0 to 255), and will produce a single channel, 8 bit/pixel image whose pixel values are either 0 or 255.

The basic idea is to compare each input pixel value to a fixed threshold (127 in this example). If the pixel value is above the threshold, the corresponding output pixel is set to 255, otherwise it is set to 0. An "error" value is calculated in the following way:

Error = InputPixelValue – OutputPixelValue

For example, say that a given input pixel value is 96. This value is below the threshold of 127, so the output pixel value is set to 0. The error in this case is (96 – 0) = 96.

The computed error for each pixel is then distributed to neighboring input pixels. For the Floyd-Steinberg error diffusion that we implement here, the "coefficients" used to distribute the error to neighboring pixels are:

          X     7/16
  3/16   5/16   1/16

Using our example above, (7/16)*96 would be distributed to the neighboring input pixel to the right, (5/16)*96 to the pixel directly below, (1/16)*96 to the pixel below and to the right, and so on.

The following images show the input grayscale image on the left and the result of the error diffusion on the right:

The error diffusion technique is a good example to show the need for some of the optional callback functions that the advanced tiling API supports, because:

  • To process a given input pixel, the neighboring input pixels must have completed processing. This means that we need to set the output tile serialization attribute.
  • We need a dedicated "error buffer" that we will allocate within the "initialize" function. Accordingly, we deallocate this buffer within the "deinitialize" function.
  • We need to initialize our error buffer to 0s before each vxProcessGraph; therefore, we have to implement the "preprocess" function.

We skip the input and output validators’ code, as these are similar to what we defined in the previous example. Also, in the case of error diffusion, there is a one-to-one correspondence between input tiles and output tiles; hence, we omit the tile mapping function, which is trivial.

The following code snippet shows the implementation for the "initialize" callback function:

C++
vx_status ErrorDiffusionInitialize(vx_node node,
                                   const vx_reference *parameters, vx_uint32 num)
{
  // we are going to allocate a floating
  // point error buffer, such that there is an error entry
  // for each pixel.
  vx_image input = (vx_image)parameters[0];
  vx_int32 width, height;
  vxQueryImage(input, VX_IMAGE_WIDTH, &width, sizeof(width));
  vxQueryImage(input, VX_IMAGE_HEIGHT, &height, sizeof(height));
  //we pad image with 2 pixels, to prevent memory
  // access violations on the right and left edges of the image.
  width += 2;
  //we add 1 to the height, to prevent memory
  // access violations on the bottom edge of the image
  height += 1;
  vx_float32 *pErrorBuffer = 
        (vx_float32*)malloc(width*height*sizeof(vx_float32));
  if(!pErrorBuffer )
  {
    return VX_ERROR_NO_MEMORY;
  }
  //free previously set local ptr for this node
  vx_float32 *p = 0;
  vxQueryNode(node, VX_NODE_LOCAL_DATA_PTR, &p, sizeof(p));
  if( p) free(p);
  //set the 'local data ptr' for this node to the new errors buffer.
  return vxSetNodeAttribute(node, VX_NODE_LOCAL_DATA_PTR,                                      
        &pErrorBuffer, sizeof(pErrorBuffer));
}

In the case of error diffusion, the initialize function (which is called once per user call to vxVerifyGraph) allocates our errors buffer. The buffer is used at runtime to propagate error to neighboring pixels. We set a node attribute, the "local data pointer", with the allocated buffer. Note that it is a good practice to check if a previously allocated buffer already exists, to prevent memory leaks. This data pointer can be retrieved inside the kernel function using vxQueryNode.

"Deinitialize" is only called upon node destruction, so two successive user calls to vxVerifyGraph with the same graph imply two successive calls to our "initialize" function in which a previously allocated buffer may have already been set.

The following code snippet shows the implementation for the "deinitialize" callback function:

C++
vx_status ErrorDiffusionDeinitialize(vx_node node,
const vx_reference* parameters, vx_uint32 num)
{
  vx_float32 *pErrorBuffer = 0;
  vxQueryNode(node, VX_NODE_LOCAL_DATA_PTR, &pErrorBuffer,
              sizeof(pErrorBuffer));
  if( pErrorBuffer )
  {
    free(pErrorBuffer);
    pErrorBuffer = 0;
    // set the local data ptr to 0
    vxSetNodeAttribute(node, VX_NODE_LOCAL_DATA_PTR, 
                       &pErrorBuffer, sizeof(pErrorBuffer));
  }
  return VX_SUCCESS;
}

The "deinitialize" function is called once upon node destruction. We must free the error buffer that was allocated within the "initialize" function. Note that it is required to set the node’s "local data ptr" back to 0 to prevent the runtime from attempting to also free the pointer.

The following code snippet shows the "set tile dimensions" callback function:

C++
vx_status ErrorDiffusionSetTileDimensions(vx_node node,
                        const vx_reference *parameters,
                        vx_uint32 num,
                        const vx_tile_block_size_intel_t *current_tile_dimensions,
                        vx_tile_block_size_intel_t *updated_tile_dimensions)
{
  vx_image input = (vx_image)parameters[0];
  vx_int32 width;
  vxQueryImage(input, VX_IMAGE_WIDTH, &width, sizeof(width));
  //Set the desired tile width to the entire input image width
  updated_tile_dimensions->width = width;
  //Keep the height as the current tile height
  updated_tile_dimensions->height = current_tile_dimensions->height;
  return VX_SUCCESS;
}

The following code snippet shows the implementation for the "preprocess" callback function:

C++
vx_status ErrorDiffusionPreProcess(vx_node node,
    const vx_reference *parameters, vx_uint32 num, void * tile_memory[],
    vx_uint32 num_tile_memory_elements, vx_size tile_memory_size)
{
  vx_image input = (vx_image)parameters[0];
  vx_int32 width, height;
  vxQueryImage(input, VX_IMAGE_WIDTH, &width, sizeof(width));
  vxQueryImage(input, VX_IMAGE_HEIGHT, &height, sizeof(height));
  vx_float32 *pErrorBuffer = 0;
  vxQueryNode(node, VX_NODE_LOCAL_DATA_PTR, &pErrorBuffer, sizeof(pErrorBuffer));
  if( !pErrorBuffer )
    return VX_ERROR_NOT_ALLOCATED;
  // patch our width & height (following the logic of the Initialize function)
  width  += 2;
  height += 1;
  //initialize our error buffer (one vx_float32 entry per pixel) to all 0's
  memset(pErrorBuffer, 0, width*height*sizeof(vx_float32));
  return VX_SUCCESS;
}

The "preprocess" callback function is called at the beginning of every user call to vxProcessGraph before any nodes have started processing. We use this function to re-initialize our error buffer.

For brevity, the code of the error diffusion kernel itself is skipped:

C++
static vx_status ErrorDiffusionAdvancedTilingKernel(vx_node node,
                         void * parameters[], vx_uint32 num,
                         void * tile_memory, vx_size tile_memory_size)
{
  vx_tile_intel_t *pInTileIn  = (vx_tile_intel_t *)parameters[0];
  vx_tile_intel_t *pInTileOut = (vx_tile_intel_t *)parameters[1];
  ...
  return VX_SUCCESS;
}

Finally, to create the advanced tiling kernel, the vxAddAdvancedTilingKernelIntel interface should be invoked as in the following code snippet:

C++
//create a kernel via the vxAddAdvancedTilingKernelIntel interface
  vx_kernel kernel = vxAddAdvancedTilingKernelIntel(context,
                     (vx_char *)"ErrorDiffusion",           //name
                     ERRORDIFFUSION_ID,                   //enumeration
                     ErrorDiffusionAdvancedTilingKernel, //kernel_func_ptr
                     ErrorDiffusionTileMapping,       //mapping_func_ptr
                     2,                               //num_params
                     ErrorDiffusionInputValidator,    //input validate
                     ErrorDiffusionOutputValidator,   //output validate
                     ErrorDiffusionInitialize,        //initialize
                     ErrorDiffusionDeinitialize,      //deinitialize
                     ErrorDiffusionPreProcess,        //preprocess
                     NULL,                            //postprocess
                     ErrorDiffusionSetTileDimensions, //settiledimensions
                     NULL);                           //tiledimensionsinit
  //specify the parameters for this kernel
  vxAddParameterToKernel(kernel, 0, VX_INPUT, VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
  vxAddParameterToKernel(kernel, 1, VX_OUTPUT, VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
  //set the serial attribute, to produce output tiles serially
  vx_serial_type_intel_e serial_type = VX_SERIAL_LEFTTOP_TO_RIGHTBOTTOM_INTEL;
  vxSetKernelAttribute(kernel, VX_KERNEL_SERIAL_TYPE_INTEL, 
                       &serial_type, sizeof(serial_type));
  //done with publishing
  vxFinalizeKernel(kernel);

Heterogeneous Computing with Intel® Computer Vision SDK

The Intel® Computer Vision SDK (Intel® CV SDK) supports the OpenVX* 1.1 API for scheduling individual nodes of a graph to a particular accelerator or target.

Currently, only the CPU and GPU are supported as full-blown targets. Support for the IPU target is restricted, as the number of kernels that this device supports is very limited.

The Intel® CV SDK comes with a simple utility that lists the supported targets and the kernels they support. For more details, refer to the README file in <SDK_ROOT>/samples/kernel_enumerator.

For more information, see the following topics:

General Note on the IPU Support

The IPU target is supported in a limited way:

  • It supports just a subset of the OpenVX core vision functions, with further restrictions (see the Intel's Extensions to the OpenVX* Primitives section). Specifically, only 3x3 flavors of the following standard kernels are supported: Sobel, Median, Box, Dilate, Erode, Gaussian, and Canny L1/L2. Additionally, Canny 5x5 (L1/L2) is supported. Also, refer to Intel® Computer Vision SDK Kernel Extensions for more information on the IPU-supported extensions.
  • The current implementation is functional only; it is not tuned for performance.
  • The general rule of thumb is that the IPU is relatively well suited for image filters, not for general complex kernels.
  • Using the IPU in heterogeneous scenarios might result in data-communication penalties and redundant synchronization.
  • Using the IPU for compute simultaneously with regular image capturing (for example, via libCamHal) might result in undefined behavior.

System-Level Device Affinities

It is possible to direct execution to a specific device via a shell environment variable. For example, the following line enables only the GPU:

$ export VX_INTEL_ALLOWED_TARGETS="gpu"

NOTE: The Intel® MKL powered implementation of the CNN kernels for the CPU constitutes a separate target, "mkldnn". It accelerates only regular (32-bit) floating point numbers, not half precision (16-bit) floats.

A few important details:

  • By default (when VX_INTEL_ALLOWED_TARGETS is not specified), all targets are allowed.
  • The OpenVX run-time attempts to respect the order in which the targets are defined (for example, by trying the second target only if some extension is not supported by the first target, and so on).
  • It is recommended to try the performance scenario (especially for graphs with CNN nodes, see the Intel's Extensions to the OpenVX* Primitives section) with the GPU target getting the highest priority in the affinity list:

    $ export VX_INTEL_ALLOWED_TARGETS="ipu,cpu,mkldnn,gpu"

  • If your code uses an extension that is not supported by the allowed targets, the graph execution will fail.

NOTE: To avoid failures due to a target not supporting some vision function, always include "cpu" in the list of allowed targets. An alternative is to use the finer-grain (per-node) control described in the next section.

Scheduling Individual Nodes to Different Targets

Intel® CV SDK supports OpenVX* 1.1 "target" API for assigning individual nodes of an OpenVX graph to a particular target, overriding run-time defaults. This enables heterogeneous usages of Intel platforms for better hardware utilization.

You might want to schedule nodes to a particular target to improve performance or power consumption. Also, if there is enough parallelism in the graph, multiple branches of the graph (like individual RGB channels) can be scheduled to different targets and executed simultaneously.

NOTE: For detailed information and code examples for the Targets API, refer to the OpenVX Heterogeneous Basic Sample, located at <SDK_ROOT_DIR>/samples/hetero_basic.

NOTE: As discussed previously, only CPU and GPU are supported as full-blown targets. IPU support is limited.

NOTE: The Intel® MKL powered implementation of the CNN kernels for the CPU constitutes a separate target, "mkldnn". The recommended order of targets for graphs with CNN nodes is as follows:
$ export VX_INTEL_ALLOWED_TARGETS="cpu,mkldnn,gpu"

The following code snippet illustrates the target API in action:

C++
// Instantiate a node in a graph and set its target
node = vxAccumulateImageNode(graph, input, output);
status = vxSetNodeTarget(node, VX_TARGET_GPU_INTEL, NULL);
if (status == VX_ERROR_NOT_SUPPORTED)
{
    // the node is not supported by the target,
    // so the vxSetNodeTarget call had no effect
    // and the implementation falls back to the default target
}

NOTE: All calls to vxSetNodeTarget for the nodes of a graph should happen before vxVerifyGraph for this graph.

The following are important considerations for setting node affinities:

  • If no target is specified for a node, it is targeted to run on the CPU.
  • If setting a target fails (for example, if the node is not supported by the given target), the implementation falls back to the default target (CPU).

General Tips on Heterogeneous Execution

  • As discussed in the Striving for Performance section, deducing performance conclusions from the execution time of individual kernels might be misleading. In most cases, the larger the subgraph you charge an accelerator with, the better the communication costs are amortized.
  • Generally, GPU performance is better on large images. So if the amount of work is too small (less than 1 ms of execution time), run the graph on the CPU instead, or fuse kernels.
  • Notice that using the GPU target introduces a one-time overhead (on the order of a few seconds) for compiling the OpenCL™ kernels. The compilation happens upon OpenVX* context creation and does not affect the execution.
  • A typical strategy to start with is to test the CPU-only and GPU-only scenarios first (see the System-Level Device Affinities section). Beware of situations when some nodes are not supported by a particular target (refer to the Kernel Extensions document for the kernel support matrix). In this case, the only way is to schedule nodes individually and search for an optimal split by scheduling subgraphs (or independent branches of the graph) to different targets.
  • For scenarios where CPU and GPU targets are mixed in the graph, it is recommended to try enabling the GPU tiling option (which is set to OFF by default). That might unleash additional (data) parallelism between the two devices:

    $ export VX_CL_TILED_MODE=1

    For GPU-only scenarios, the option should be reverted back to OFF.

  • It is advised to do performance analysis (see the next chapter) to determine "hotspot" nodes, which should be the first candidates for offloading to additional targets. At the same time, it is often more efficient to offload some reasonably sized sequence of kernels, rather than individual kernels, to minimize scheduling and other run-time overheads.
  • Notice that the GPU can be busy with other tasks (like rendering); similarly, the CPU can be in charge of general OS routines.
  • Device performance can be affected by dynamic frequency scaling. For example, running long kernels on both devices simultaneously might eventually result in one or both devices stopping use of Intel® Turbo Boost Technology. This might result in an overall performance decrease even compared to a single-device scenario.
  • Similarly, even in the GPU-only scenario, a high interrupt rate and frequent synchronization with the host can raise the frequency of the CPU and drag the frequency of the GPU down.

Intel® VTune™ Tools Support

The Intel® VTune™ tools fully support collecting and viewing performance data on development systems running Ubuntu*. Start with the regular Intel® VTune™ Amplifier XE for Linux*.

Intel VTune also offers OpenVX API tracing. This is possible for the CPU (including conventional performance counters like time spent in a function) and the GPU (via the OpenCL™ support in Intel VTune). The SDK comes with a few samples that explain the performance workflow, including analysis in VTune. Refer to the Intel(R) Computer Vision SDK Sample Applications section.

When profiling your application with Intel VTune, make sure to set up all the environment variables (for example, those that point to the CV SDK) for the tool.

Another option is to run Intel VTune directly from the same environment that is used for running the OpenVX app, so that the tool inherits the environment variables entirely.

Example Analysis with Intel® VTune™ (for CPU Target)

To get the fully qualified kernel names for the CPU in VTune, you should set the following environment variable:

$ export VX_INTEL_ENABLE_VTUNE_INSTRUMENTATION=1

Also, in the tool, when selecting the analysis type, make sure to check the Analyze user tasks, events, and counters option. Below is an example for the Basic Hotspots analysis:

Setting "Analyze user tasks, events, and counters" option for the VTune(TM)

Consider the simple OpenVX* graph below (refer to the Auto-Contrast SDK sample for details):

                          ->|channel extract|->Y->|equalize_hist|----
                         /                                           \
RGB(in)->|convert|->NV12-->|channel extract|->U----------------------->|combine|->|convert|->RGB(out)
                         \                                           /
                          ->|channel extract|->V---------------------

According to VTune™, the hotspots when running on the CPU (an Intel® Core™ development machine) for 100 iterations are the following:

CPU time breakdown of the Auto-Contrast SDK Sample, as seen in "Bottom-up" view in Intel® VTune™. Notice that primary grouping is by the "Task domain"

It is relatively easy to correlate the fully qualified OpenVX* kernel names like org.khronos.openvx.color_convert (also attributed as the NV12->RGB conversion) with the particular (first) node in the sample graph. In this simple graph, the color-conversion nodes almost exclusively dominate the execution time.

Notice that "TBB node functor call" is just aggregation of the underlying nodes execution time as the run-time runs them by tiles in parallel with use of Intel® TBB. This is clearly visible on another VTune view:

CPU tiles, mapped to the (TBB) threads in the VTune Platform pane (heavily zoomed)

Note the User Tasks option: if it is unset, displaying the tiling information is disabled completely. The user tasks profiling is enabled by the dedicated knob (see the previous picture). Refer to the Video Stabilization and Census Transform SDK samples for examples of analysis related to tile execution.

If you do not have other kernels to replace the top hotspots (or there is no way to relax the compute demand by changing the kernels' parameters), one possible step is offloading the top hotspots to the GPU target (refer to the Scheduling Individual Nodes to Different Targets section for more details).

Example Analysis with Intel® VTune™ (for GPU Target)

Intel® VTune™ offers rich support for OpenCL™ API tracing and the associated hardware counters. Since the OpenVX* implementation for the GPU relies on the mature OpenCL stack, it is relatively easy to correlate the APIs in the tool.

Specifically, consider the graph example from the previous chapter. When executing it solely on the GPU (refer to the System-Level Device Affinities section) from under VTune, this is the summary that VTune generates:

GPU Hotspots in Intel® VTune™

Here you can see that, similar to the CPU (previous section), the color conversions are the most time-consuming, followed by the histogram. The next level of details (including actual execution time on the device, driver overheads, cost of copies, and so on) can be gathered with OpenCL profiling and queue information. Refer to the general trainings on the OpenCL support in VTune™ for more information.

OpenCL Queue for the GPU Executing the OpenVX Graph (“Platform” view in Intel® VTune™)

Profiling Remote Targets

Remote performance analysis for Yocto* targets is supported using VTune™ Amplifier for Systems 2017, which is part of Intel® System Studio 2017. A few useful instructions for enabling Yocto can be found at https://software.intel.com/en-us/articles/vtune-embedded/#profiling_on_target. Notice that for basic metrics you can skip the SEP building steps from the instructions page.

Intel® Computer Vision SDK Sample Applications

Intel® CV SDK comes with samples, available online at the Intel CV SDK samples page https://software.intel.com/en-us/computer-vision-sdk-support/code-samples.

  • OpenCL™ Custom Kernel in OpenVX* Sample that demonstrates an experimental Intel extension that enables using code of regular OpenCL kernels as OpenVX* user kernels.

    For more details, refer to the OpenCL™ Custom Kernel in OpenVX* Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/ocl_custom_kernel.

  • Heterogeneous Basics with OpenVX* Sample that uses the Targets API to schedule parts of an OpenVX graph to different compute units (for example, CPU, GPU, or IPU).

    For more details, refer to the Heterogeneous Basics with OpenVX* Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/hetero_basic.
  • Auto-Contrast Sample that implements straightforward contrast normalization via histogram equalization. The focus of the sample is the basics of OpenVX development, step by step. It also shows trivial interoperability with OpenCV, which is used for loading, saving, and displaying the images. The sample also explains basic analysis with Intel® VTune™ tools.

    For more details, refer to the Auto-Contrast OpenVX* Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/auto_contrast.

  • CNN AlexNet Sample that builds the OpenVX graph (using the Khronos* CNN OpenVX* extension) for AlexNet by running the Intel CV SDK Model Optimizer tool on the original Caffe* data. Then the graph is used to perform inference for the regions of the given image, and the top-5 results are reported.

    NOTE: This sample targets Linux* only.

    For more details, refer to the CNN AlexNet OpenVX* Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/cnn_alexnet.

  • CNN Custom Kernel Sample that builds the OpenVX graph for the FCN8s* model by running the Intel CV SDK Model Optimizer tool on the original Caffe data. The resulting graph is used to perform semantic segmentation of the input image. The FCN8s model features certain layers that the OpenVX runtime doesn't support by default. The sample explains how to implement the missing layers in OpenCL and register the implementation so that the Model Optimizer generates the right calls upon parsing the Caffe model.

    For more details, refer to the CNN Custom Layer for OpenVX* Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/cnn_custom_kernel.

    Also, <SDK_SAMPLES_ROOT>/samples/cnn_custom_kernels_lib features more examples of kernels (Yolo* and SSD* specific).

    NOTE: This sample targets Linux* only.

  • Video-Stabilization Sample that showcases more advanced OpenVX features like delays and user nodes. The sample implements a pipeline based on Harris corners for feature point detection and optical flow for point tracking. Basics of user nodes are explained. OpenCV is used to read and display input video from a file, for debug visualization, and finally for displaying the result. The sample also showcases the benefits of the heterogeneous scenario when some nodes are scheduled to the GPU, along with a performance discussion, including basic Intel® VTune™ analysis.

    For more details, refer to the Video Stabilization OpenVX* Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/video_stabilization.

  • Census-Transform Sample that implements the well-known CENTRIST visual descriptor calculation algorithm targeted for scene categorization and object recognition tasks. This sample introduces a code example for the Intel OpenVX Advanced Tiling Extension for user nodes. Just like the previous sample, it features basic interoperability with OpenCV through data sharing. OpenCV is used for reading the data from a video file and for results visualization. Additionally, the sample implements CENTRIST as a custom OpenCL kernel (see the Difference in Interpretation of the OpenCV and OpenVX* Parameters topic) to showcase the benefits of running it on the GPU.

    For more details, refer to the Census Transform OpenVX* Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/census_transform.

  • GStreamer*-OpenVX* Interoperability Sample that shows how to develop GStreamer plugins that use OpenVX for typical computer vision tasks. The OpenVX code is wrapped in a GStreamer plugin template processing function to become part of a larger GStreamer media processing pipeline.

    For more details, refer to the GStreamer* - OpenVX* Interoperability Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/gstovx_plugin.

    NOTE: This sample targets Linux* only.

  • Face Detection Sample that demonstrates usage of OpenVX* to implement a face detection algorithm based on cascade classifier detection.

    For more details, refer to the Face Detection OpenVX* Workload – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/face_detection.

  • Motion Detection Sample that uses OpenVX* to develop a motion detection application. Specifically, it implements a simplified motion detection algorithm based on Background Subtraction MOG2, dilate, erode, and connected component labeling.

    For more details, refer to the Motion Detection OpenVX* Workload – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/motion_detection.

  • Lane Detection Sample that implements a simple lane detection pipeline in OpenVX. The sample uses core OpenVX vision functions, combined with Intel's extensions (like HoughLineP), to calculate lane borders from a front-view camera video.

    For more details, refer to the Lane Detection OpenVX* Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/lane_detection.

  • Kernel Enumerator Sample that is a command-line utility for querying the list of kernel extensions, the supported targets, and the list of kernels from the OpenVX* runtime.

    For more details, refer to the README file in the <SDK_SAMPLES_ROOT>/samples/kernel_enumerator.

  • Color Copy Pipeline Sample that is an example of a printing and imaging pipeline that utilizes the IPU and GPU in addition to the CPU to demonstrate the heterogeneous capabilities of the product, along with a performance discussion, including basic Intel® VTune™ analysis.

    For more details, refer to the Copy Pipeline OpenVX Sample – Developer Guide and README in the <SDK_SAMPLES_ROOT>/samples/color_copy_pipeline.

    NOTE: This sample targets Linux* only.

OpenVX* Performance Tips

A rule of thumb for OpenVX* is that general "compute" optimizations apply, as from the optimization point of view OpenVX is just another compute API. Thus, common tips such as providing sufficient parallel slack, avoiding redundant copies or excessive synchronization, and so on, hold true. These are especially pronounced for heterogeneous scenarios, as latencies of communicating with accelerators are involved.

There are specific actions to optimize the performance of an OpenVX application. These actions are as follows:

  • Use virtual images whenever possible, as this unlocks many graph compiler optimizations.
  • Whenever possible, prefer standard nodes and/or extensions over user kernel nodes (which serve as memory and execution barriers, hindering performance). This gives the Pipeline Manager much more flexibility to optimize the graph execution.
  • If you still need to implement a user node, base it on the Advanced Tiling Extensions (see the Intel's Extensions to the OpenVX* API: Advanced Tiling chapter).
  • If the application has independent graphs, run these graphs in parallel using the vxScheduleGraph API call (see the sketch after this list).
  • Provide enough parallel slack to the scheduler; do not break work (for example, images) into too many tiny pieces. Consider kernel fusion.
  • For images, use the smallest data type that fits the application accuracy needs (for example, 32->16->8 bits).
  • Consider heterogeneous execution (see the Heterogeneous Computing with Intel® Computer Vision SDK chapter).
  • You can create an OpenVX image object that references a memory that was externally allocated (vxCreateImageFromHandle). To enable zero-copy with the GPU the externally allocated memory should be aligned. For more details, refer to https://software.intel.com/en-us/node/540453.
  • Beware of the (often prohibitive) vxVerifyGraph latency costs. For example, construct the graph in a way that does not require re-verification upon parameter updates. Notice that unlike Map/Unmap for the input images (see the Map/Unmap for OpenVX* Images section), setting new images with different meta-data (size, type, and so on) almost certainly triggers the verification, potentially adding significant overhead.
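
As referenced in the list above, a minimal, hedged sketch of running two independent graphs in parallel with the asynchronous API (assuming graph1 and graph2 are already created and verified) could look like this:

C++
// Sketch only: schedule two independent, already verified graphs asynchronously
// and wait for both to complete.
vx_status s1 = vxScheduleGraph(graph1);
vx_status s2 = vxScheduleGraph(graph2);
if (s1 == VX_SUCCESS && s2 == VX_SUCCESS)
{
    vxWaitGraph(graph1);   // blocks until graph1 completes
    vxWaitGraph(graph2);   // blocks until graph2 completes
}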

Advanced Face Capabilities in Intel's OpenCV

The Photography Vision Library (PVL) is a set of state-of-the-art computer vision and imaging algorithms developed at Intel. Each component in the PVL is optimized for Intel architecture in both performance and power consumption. The PVL is a part of the OpenCV build that comes with the Intel® CV SDK.

There are several computer vision algorithms in the PVL, all now exposed with OpenCV API:

  • Face Detection: PVL provides fast and lightweight face detection with industry-leading accuracy. A person's face can be automatically detected and tracked, with support for a wide range of rotations and long distances. It also shows top-tier face detection rates in the FDDB benchmark.
  • Blink Detection: A person's blink is detected and indexed in real time with PVL. Blink detection can be used while taking a photo to trigger the capture (on a wink) or to discard a photo in which the subject has blinked. It can also be applied to a driving safety management system for alerting a drowsy driver.
  • Smile Detection: A person's smile intensity can be detected by analyzing the changes and movements of the face with PVL. Smile detection can be applied to digital still cameras and camera phones as a mechanism for the smile auto shutter, which automatically triggers the capture function when the subject is smiling.
  • Face Recognition: PVL provides efficient face recognition in real time with industry-leading accuracy. It shows top-tier face recognition rates in the FRGC benchmark. You can sort or browse photos by person using the PVL face recognition algorithm. It can also be utilized in biometric authentication.

FaceDetector

The FaceDetector class detects face features, including the face rectangle, eye position, mouth position, blink status, and smile status. The face features can be detected selectively. You may detect face rectangles only, or detect face rectangles, eye positions, and blink status. There are dependencies among face features. To detect the eye position or mouth position, the face rectangle is needed. To detect the smile status or blink status, the face rectangle and eye position are also required. Required face features are detected even if they are not explicitly specified for detection.

Many parameters affect face detection results. Depending on the parameters you set, the number of detected faces can vary. As you try to detect more and more faces, the detection takes longer. Detection can be done in two different modes: normal mode and tracking mode. In normal mode, you get all the detected faces at one time. In tracking mode, you need to run detection several times to detect all the faces; in return, face detection is relatively fast in tracking mode. If you want to detect faces in a still image, normal mode should be used. However, if you want to detect faces in a video stream, tracking mode is usually recommended.

  1. First, you must create an instance of the FaceDetector class. FaceDetector has no public constructor; instead, you create the instance by calling the static method:

    cv::Ptr<cv::pvl::FaceDetector> fd = cv::pvl::FaceDetector::create();

    FaceDetector has the following parameters:

    Name                   Description                                 Default Value   Min/Max
    Max Detectable Faces   Maximum number of detectable faces          32              1/32
    RIP                    Rotation-in-Plane                           135             0/180
    ROP                    Rotation-out-of-Plane                       90              0/90
    Blink Threshold        Threshold value used for evaluating blink   50              1/99
    Smile Threshold        Threshold value used for evaluating smile   48              1/99
    Min Face Size          Minimum face size in pixels                 64              32/None
    Tracking Mode          Tracking mode                               False           None

    Each parameter has a get/set method and the min/max value of some parameters are declared as enum type. You can refer to the header file.

    The following examples demonstrate how you can detect face landmarks from the image.

  2. To change the settings of FaceDetector, call the set methods for each parameter:

    C++
    fd->setMaxDetectableFaces(15);
    fd->setRIPAngleRange(45);
    fd->setROPAngleRange(90);
    fd->setBlinkThreshold(80);
    fd->setSmileThreshold(70);
    fd->setMinFaceSize(50);
    fd->setTrackingModeEnabled(false); // normal mode
  3. To detect face rectangles, you need a vector of the Face class. The Face class contains information on face landmarks and face classifications: face rectangle, eye, and mouth information for face landmarks, plus blink and smile information for face classifications. All of this information can be acquired through the FaceDetector's detection methods. It is also possible to set the information manually yourself. If you have your own algorithm for any of these face landmarks and classifications, you can use it along with FaceDetector and FaceRecognizer:

    C++
    std::vector<cv::pvl::Face> faces;

    All the methods of FaceDetector should use a grayscale image, which has only one channel and 8-bit depth.

    C++
    cv::Mat grayImg = cv::imread("image.jpg", cv::IMREAD_GRAYSCALE);
    fd->detectFaceRect(grayImg, faces);
  4. As mentioned earlier, there are dependencies among face features. If you want to get the blink status or smile status, you should detect the eye position first.

    C++
    for (uint i = 0; i < faces.size(); ++i)
    {
            fd->detectEye(grayImg, faces[i]);
            fd->detectMouth(grayImg, faces[i]);
            fd->detectBlink(grayImg, faces[i]);
            fd->detectSmile(grayImg, faces[i]);
    }

Putting It All Together: Example Code for Face Detection

The following code demonstrates how to detect face landmarks in tracking mode using a web cam:

C++
VideoCapture cam(0);
Ptr<FaceDetector> fd = FaceDetector::create();
Mat frame, grayedFrame;
vector<Face> faces;
fd->setTrackingModeEnabled(true); // enable tracking mode
while (true)
{
  cam >> frame;
  cvtColor(frame, grayedFrame, COLOR_BGR2GRAY);
  fd->detectFaceRect(grayedFrame, faces);
  for (uint i = 0; i < faces.size(); ++i)
  {
    fd->detectEye(grayedFrame, faces[i]);
    fd->detectMouth(grayedFrame, faces[i]);
    fd->detectBlink(grayedFrame, faces[i]);
    fd->detectSmile(grayedFrame, faces[i]);
  }
  for (uint i = 0; i < faces.size(); ++i)
  {
    const Face& face = faces[i];
    Rect faceRect = face.get<Rect>(Face::FACE_RECT);
    Point leftEyePos = face.get<Point>(Face::LEFT_EYE_POS);
    Point rightEyePos = face.get<Point>(Face::RIGHT_EYE_POS);
    Point mouthPos = face.get<Point>(Face::MOUTH_POS);
    bool closingLeft = face.get<bool>(Face::CLOSING_LEFT_EYE);
    bool closingRight = face.get<bool>(Face::CLOSING_RIGHT_EYE);
    bool smiling = face.get<bool>(Face::SMILING);
  }
  if (cv::waitKey(5) > 0)
    break;
}

In the last for loop block, you can see how to get the values of the Face class using the get template method. You can also set values using the set template method. The following are the values of the Face class that you can set and get.

Enum Type Description
FACE_RECT cv::Rect Rectangle of the face. There is no default value.
RIP_ANGLE int Rotation-in-plane angle of the face. Default value is 0.
ROP_ANGLE int Rotation-out-of-plane (yawing) angle of the face.
FACE_RECT_CONFIDENCE int Value indicating how much the face is close to the typical face. Default value is 100.
TRAKING_ID int Tracking ID of the face. Default value is -1.
LEFT_EYE_POS cv::Point Left eye position. There is no default value.
RIGHT_EYE_POS cv::Point Right eye position. There is no default value.
EYE_POS_CONFIDENCE int Value indicating how much the eye position is close to the typical eye position. Default value is 100.
MOUTH_POS cv::Point Mouth position. There is no default value.
MOUTH_POS_CONFIDENCE int Value indicating how much the mouth position is close to the typical mouth position. Default value is 100.
CLOSING_LEFT_EYE bool Indicates whether the person is closing the left eye. There is no default value.
CLOSING_RIGHT_EYE bool Indicates whether the person is closing the right eye. There is no default value.
LEFT_BLINK_SCORE int Left eye score. The value has range from 0 to 100. 0 means wide opened eye, while 100 means fully closed eye. There is no default value.
RIGHT_BLINK_SCORE int Right eye score. The value has range from 0 to 100. 0 means wide opened eye, while 100 means fully closed eye. There is no default value.
SMILING bool Indicates whether the person is smiling. There is no default value.
SMILE_SCORE int Smile score between 0 and 100. 0 means non-smile, while 100 means full smile. There is no default value.

The Face class provides other methods to set member variables, which are convenient when you want to call only a specific detect method of FaceDetector.

The following example demonstrates how to detect the eye position when the face rectangle has already been detected by another face detection algorithm:

C++
cv::Rect faceRect;
// the face rectangle acquired by your own algorithm
cv::Ptr<cv::pvl::FaceDetector> fd = cv::pvl::FaceDetector::create();
cv::pvl::Face face;
face.setFaceRectInfo(faceRect);
fd->detectEye(grayImg, face);

The setFaceRectInfo() method of the Face class has many parameters, such as the RIP angle, ROP angle, confidence, and tracking ID. If you do not specify these values, they are set to default values.

FaceRecognizer

The FaceRecognizer class recognizes the given faces in the input image or frame. To recognize faces, an internal face database should be constructed first. You can register faces to the internal database of this class's instance. You can also deregister the registered faces from the internal database. Registered faces can be stored in a file, and you can load the stored file when you create an instance. Unless you explicitly store registered faces into a file, the internal database is removed when the instance's destructor is invoked.

Face recognition runs in two different modes: normal mode and tracking mode. In normal mode, it always tries to process the whole input image to recognize faces. In tracking mode, if the given faces are the same as those in previous frames and the faces have already been recognized, it will not recognize them again. If performance is an important factor, it is recommended to use the tracking mode.

  1. You should create an instance of the FaceRecognizer class through its static factory method. FaceRecognizer does not provide any public constructors.
    C++
    cv::Ptr<cv::pvl::FaceRecognizer> fr = cv::pvl::FaceRecognizer::create();

    If you already have a saved DB xml file, you may load it instead; loading includes the instance creation.

    C++
    cv::Ptr<cv::pvl::FaceRecognizer> fr = cv::Algorithm::load<cv::pvl::FaceRecognizer>("saved_db.xml");
  2. You can register faces to the internal database of this class instance. To register a face, face detection is needed to get Face values.

    C++
    cv::Mat imgGray;
    
    std::vector<cv::pvl::Face> faces;
    int personID = fr->createNewPersonID();
    ...
    //fill imgGray
    //need face detect process to fill faces
    ...
    fr->registerFace(imgGray, faces, personID);

    If you want to save the face data into an external xml file, you should set the fourth parameter of registerFace() to true.

    C++
    fr->registerFace(imgGray, faces, personID, true);
  3. The purpose of face recognition is to get the personID that matches an entry in the registered DB. If a face is recognized, the personID has a positive value. If a face is NOT recognized, the personID is -10000, which means an unknown person.

    C++
    cv::Mat imgGray;
    std::vector<cv::pvl::Face> faces;
    std::vector<int> personIDs;
    ...
    //fill imgGray
    //need face detect process to fill faces
    ...
    fr->recognize(imgGray, faces, personIDs);
  4. You may recognize faces across sequential frames, such as from a camera or a video file. In this case, recognition speed might be more important. You can enable tracking mode for this sequential input. Both the face detector and the face recognizer should have tracking mode enabled.

    C++
    fd->setTrackingModeEnabled(true);
    fr->setTrackingModeEnabled(true);
    //need face detect process to fill faces
    ...
    fr->recognize(imgGray, faces, personIDs);
  5. When you want to save the face data in an external file, you can use the save() API. This API saves only the registered faces for which the saveToFile parameter was true in registerFace().

    C++
    fr->save("save_db.xml");
  6. You can set the maximum number of faces that can be recognized in one frame in tracking mode. This value may not exceed MAX_SUPPORTED_FACES_FOR_RECOGNITION_IN_TRACKING (8), which is also the default value.

    C++
    fr->setMaxFacesInTracking(5);

Putting It All Together: Sample Code for Face Recognition

The sample code below shows the basic use case for camera input:

C++
VideoCapture cam(0);
Mat frame, grayedFrame;
Ptr<FaceDetector> fd = FaceDetector::create();
Ptr<FaceRecognizer> fr = FaceRecognizer::create();
vector<Face> faces;
Mat image = imread("my_face.jpg", cv::IMREAD_GRAYSCALE);
vector<int> personIDs;
// Register a face by one image with my face
fd->detectFaceRect(image, faces);
fr->registerFace(image, faces[0], fr->createNewPersonID(), true);
//turn on tracking
fd->setTrackingModeEnabled(true);
fr->setTrackingModeEnabled(true);
while (true)
{
    cam >> frame;
    cvtColor(frame, grayedFrame, COLOR_BGR2GRAY);
    // Detect faces first and then try to recognize the faces
    fd->detectFaceRect(grayedFrame, faces);
    if (!faces.empty())
    {
        fr->recognize(grayedFrame, faces, personIDs);
        for (int i = 0; i < personIDs.size(); ++i)
        {
            // recognized
        }
        if (waitKey(5) > 0)
            break;
    }
    fr->save("frdb.xml");
}

Related Information

For information on Intel® Computer Vision SDK requirements, known issues and limitations, refer to the Intel® Computer Vision SDK Release Notes document.

Getting Help and Support

You can submit issues at the Online Service Center: http://www.intel.com/supporttickets. Use Request Support, specifying the Product ID: computer vision.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

