Bolt
1.1
C++ template library with support for OpenCL
Bolt is a C++ template library optimized for GPUs. Bolt is designed to provide high-performance library implementations for common algorithms such as scan, reduce, transform, and sort. The Bolt interface was modeled on the C++ Standard Template Library (STL). Developers familiar with STL will recognize many of the Bolt APIs and customization techniques.
C++ templates can be used to customize the algorithms with new types (for example, the Bolt sort can operate on int, float, or any custom type defined by the user). Additionally, Bolt lets users customize the template routines using function objects (functors) written in OpenCL™; for example, to provide a custom comparison operation for sort, or a custom reduction operation.
Bolt can directly interface with host memory structures such as std::vector or host arrays (e.g. float*). On today's GPU systems, the host memory is mapped or copied automatically to the GPU. On future systems that support the Heterogeneous System Architecture, the GPU will directly access the host data structures. Bolt also provides a bolt::cl::device_vector that can be used to allocate and manage device-local memory for higher performance on discrete GPU systems. Bolt APIs can accept either host memory or the device vector.
This document introduces the architecture of Bolt and also provides a reference for the Bolt APIs.
The simple example below shows how to use Bolt to sort a random array of 8192 integers.
The code will be familiar to programmers who have used the C++ Standard Template Library; the only differences are the include file (bolt/cl/sort.h) and the bolt::cl namespace before the sort call. Bolt developers do not need to learn a new device-specific programming model to leverage the power and performance advantages of GPU computing.
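A minimal sketch of such a sort follows. It uses std::sort so that it compiles without Bolt installed; per the text above, the Bolt version would differ only in the include file (bolt/cl/sort.h) and the bolt::cl namespace. The helper name make_sorted_input is hypothetical and exists only for this illustration.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Hypothetical helper for this sketch: fills a vector with pseudo-random
// integers and returns it sorted in ascending order.
std::vector<int> make_sorted_input(std::size_t length = 8192)
{
    std::vector<int> a(length);
    std::generate(a.begin(), a.end(), std::rand);

    // STL call, used here as a stand-in. According to the text, the Bolt
    // form includes <bolt/cl/sort.h> and calls
    // bolt::cl::sort(a.begin(), a.end()) on the same host vector.
    std::sort(a.begin(), a.end());
    return a;
}
```

The host std::vector is passed by iterator range in both forms, which is why no device allocation appears in the code.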
The example demonstrates two important features of Bolt. First, host data (such as a std::vector) is passed directly to the sort routine, without a need to explicitly allocate and manage GPU device memory. Second, the bolt::cl::sort call submits to the platform, rather than a specific device. The Bolt runtime selects the best device to run the sort, potentially running it on the CPU in the event a GPU is not available or the sort size is too small to benefit from GPU acceleration.
The example below shows how to use functors with Bolt. For a user-defined data type to work with a Bolt functor, the user has to wrap the type definition in the BOLT_FUNCTOR macro. This makes the user-defined class available as a string to the OpenCL compiler, which is invoked during a call to clBuildProgram(). Consider the following example.
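The idea of making one class definition available both to the host compiler and as a string can be sketched with a small preprocessor macro. DEMO_FUNCTOR below is a hypothetical stand-in written for this illustration; it is not Bolt's BOLT_FUNCTOR, whose actual definition and generated names may differ. MyType and demo_source_MyType are likewise assumed names.

```cpp
#include <string>

// Hypothetical macro (NOT Bolt's BOLT_FUNCTOR). It does two things:
//   1. emits the wrapped tokens unchanged, so the host compiler sees the class;
//   2. stringizes the same tokens, so they could later be handed to a
//      runtime compiler as source text.
#define DEMO_FUNCTOR(NAME, ...)                                        \
    __VA_ARGS__                                                        \
    inline std::string demo_source_##NAME() { return #__VA_ARGS__; }

DEMO_FUNCTOR(MyType,
struct MyType {
    int a;
    bool operator<(const MyType& rhs) const { return a < rhs.a; }
};
)
```

After expansion, MyType is an ordinary C++ struct, and demo_source_MyType() returns its definition as text. In Bolt's case, the text above says the string form is what reaches the OpenCL compiler during clBuildProgram().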
After defining the class, the user needs to register it with Bolt. Internally, Bolt algorithms make use of the device_vector class, but this is defined only for basic data types such as int, float, and unsigned int. For user-defined data types, it is the responsibility of the application developer to create a definition of device_vector for each such type. See the example below.
In the example above, the bolt::cl::greater functor is defined in include/bolt/cl/functional.h. For user-defined data types, the application developer is responsible for providing such definitions, as shown above, for each functor used from include/bolt/cl/functional.h.
Putting it all together, the sample code below performs a sort:
Users can also work with their own functors, as shown below:
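As a sketch of what a user-defined functor looks like, the following uses std::sort as a stand-in so that it builds without Bolt. Point2D, ComparePoints, and sort_points are hypothetical names. With Bolt, per the text above, the type and functor definitions would additionally be wrapped in BOLT_FUNCTOR, and the call would go through bolt::cl::sort.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical user-defined data type.
struct Point2D {
    int x;
    int y;
};

// User-defined comparison functor: orders points by x, then by y.
struct ComparePoints {
    bool operator()(const Point2D& lhs, const Point2D& rhs) const {
        return (lhs.x != rhs.x) ? (lhs.x < rhs.x) : (lhs.y < rhs.y);
    }
};

// Sorts in place, passing the functor as the comparison operation.
inline void sort_points(std::vector<Point2D>& pts) {
    std::sort(pts.begin(), pts.end(), ComparePoints());
}
```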
Users can also use the templated version of a Bolt functor with user-defined data types, as shown below:
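A templated functor lets one definition serve several element types, including user-defined ones. The sketch below again relies on std::sort as a stand-in; Version, Descending, and sort_descending are hypothetical names, and the Bolt-specific macro wrapping and registration are omitted.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical user-defined data type with a greater-than operator.
struct Version {
    int major_no;
    int minor_no;
};

inline bool operator>(const Version& a, const Version& b) {
    return (a.major_no != b.major_no) ? (a.major_no > b.major_no)
                                      : (a.minor_no > b.minor_no);
}

// Templated functor: works for any T that provides operator>.
template <typename T>
struct Descending {
    bool operator()(const T& lhs, const T& rhs) const { return lhs > rhs; }
};

// Sorts any vector<T> from largest to smallest using the templated functor.
template <typename T>
void sort_descending(std::vector<T>& values) {
    std::sort(values.begin(), values.end(), Descending<T>());
}
```

The same Descending template is instantiated once for int and once for Version in the usage `sort_descending(ints)` and `sort_descending(versions)`.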
To use a templated functor with more than one data type, use the other variants of the macro, such as BOLT_TEMPLATE_FUNCTOR2, BOLT_TEMPLATE_FUNCTOR3, and so on, from the include/bolt/cl/clcode.h file.
List of Supported Functions and Code Paths.
Each Bolt function is designed to be executed through one of four code paths (OpenCL™, C++ AMP, Multicore CPU, and Serial CPU). The default mode is "Automatic": the runtime tries the code paths in a fixed order and moves on to the next one only if the current one is not found. The selection order is the OpenCL™ GPU path first, then Multicore CPU (Intel TBB), then Serial CPU.
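The fallback order for Automatic mode (GPU, then Multicore CPU, then Serial CPU) can be sketched as a simple chain of checks. All names below (CodePath, Availability, select_automatic) are hypothetical and chosen for this illustration; they are not Bolt's identifiers, and the real runtime may apply further criteria.

```cpp
// Hypothetical names for this sketch; not Bolt's own API.
enum class CodePath { OpenCLGpu, MultiCoreCpu, SerialCpu };

// Which optional paths were found on this system.
struct Availability {
    bool openclGpu;     // an OpenCL GPU device is present
    bool multiCoreCpu;  // a multicore path (e.g. Intel TBB) is present
};

// Automatic mode: take the first path that is available, in the order
// given in the text; Serial CPU is the final fallback.
inline CodePath select_automatic(const Availability& avail) {
    if (avail.openclGpu) {
        return CodePath::OpenCLGpu;
    }
    if (avail.multiCoreCpu) {
        return CodePath::MultiCoreCpu;
    }
    return CodePath::SerialCpu;
}
```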
Forcing the mode to a particular device runs the function on that device only. There are two ways in Bolt to force control to a specific device.
If the debug log is enabled, Bolt records the actual code path taken for execution. It could be OpenCL™ GPU/CPU, Multicore TBB, or Serial. Users need to initialize the log object before the Bolt call and query it after the call to find out which paths have been executed.
Bolt uses an OpenCL™ implementation that supports the static C++ kernel features; specifically, C++ template support for OpenCL™ kernels. Currently, AMD OpenCL™ SDK 2.7 and later versions are designed to provide this support; other vendors may adopt this feature in the future. If you encounter a problem with Bolt, please log it on the Bolt Issues page.
Note: If the user has installed both Visual Studio 2012 and Visual Studio 2010, the latter should be updated to SP1.
Note: Bolt pre-built binaries for Linux are built with GCC 4.7.3. The same version should be used to build the application; otherwise, the user has to build Bolt from source with GCC 4.6.3 or higher.
The latest Catalyst package contains the most recent OpenCL runtime. The recommended Catalyst package is 13.11 Beta V1. Catalyst 13.4 and higher is supported.
Note: Catalyst 13.9 is not supported.
AMD APU Family with AMD Radeon™ HD Graphics
AMD Radeon™ HD Graphics
Compiled binary Windows packages (zip packages) for Bolt may be downloaded from the Bolt landing page hosted on AMD's Developer Central website.