__kernel
kernel
__attribute__((vec_type_hint(<typen>)))
__attribute__((work_group_size_hint(X, Y, Z)))
__attribute__((reqd_work_group_size(X, Y, Z)))
The __kernel (or kernel) qualifier declares a function to be a kernel that can be executed by an application on an OpenCL device (or devices). The following rules apply to functions declared with this qualifier:
- It can be executed on the device only.
- It can be called by the host.
- If a __kernel function is called by another kernel function, it is just a regular function call.
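To illustrate the last rule, here is a sketch (kernel names are hypothetical) in which one kernel calls another; the inner call behaves as an ordinary function call, not a new kernel launch:

```
__kernel void helper( __global float *p )
{
    p[0] += 1.0f;
}

__kernel void caller( __global float *p )
{
    helper( p );   // just a regular function call
}
```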
The __kernel qualifier can be used with the keyword __attribute__ to declare additional information about the kernel function, as described below.
The optional __attribute__((vec_type_hint(<typen>))) is a hint to the compiler. It is intended to represent the computational width of the __kernel and should serve as the basis for calculating processor bandwidth utilization when the compiler is looking to autovectorize the code. <typen> shall be one of the built-in scalar or vector data types described in tables 6.1 and 6.2. If vec_type_hint(<typen>) is not specified, the default is __attribute__((vec_type_hint(int))).
For example, where the developer has specified a width of float4, the compiler should assume that the computation usually uses up to 4 lanes of a float vector, and may decide to merge work-items, or possibly even split one work-item into many threads, to better match the hardware capabilities. A conforming implementation is not required to autovectorize code, but shall support the hint; a compiler may also autovectorize even if no hint is provided. If an implementation merges N work-items into one thread, it is responsible for correctly handling cases where the number of global or local work-items in any dimension modulo N is not zero.
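The leftover-work-item arithmetic can be made concrete with a small host-side sketch (the helper names are hypothetical, not part of the OpenCL API): when N work-items are merged per thread, global_size / N threads run fully merged and global_size % N work-items remain to be handled separately.

```c
#include <assert.h>

/* Hypothetical helpers illustrating the modulo-N case an
 * implementation must handle when merging N work-items. */
unsigned merged_threads(unsigned global_size, unsigned n)
{
    return global_size / n;   /* fully merged threads */
}

unsigned leftover_items(unsigned global_size, unsigned n)
{
    return global_size % n;   /* remaining work-items, e.g. run scalar */
}
```

When the global size divides evenly by N there are no leftovers; otherwise the implementation must still execute the remainder correctly, for example one work-item at a time.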
For example, if a __kernel is declared with __attribute__((vec_type_hint(float4))) (meaning that most operations in the __kernel are explicitly vectorized using float4) and the kernel is running on Intel® Advanced Vector Extensions (Intel® AVX), which implements an 8-float-wide vector unit, the autovectorizer might choose to merge two work-items into one thread, running the second work-item in the high half of the 256-bit AVX register.
As another example, a Power4 machine has two scalar double-precision floating-point units with a 6-cycle-deep pipe. An autovectorizer for the Power4 might choose to interleave six __attribute__((vec_type_hint(double2))) __kernels into one hardware thread, to ensure that there is always 12-way parallelism available to saturate the FPUs. It might also choose to merge 4 or 8 work-items (or some other number) if it concludes that these are better choices, due to resource utilization concerns or a preference for divisibility by 2.
The optional __attribute__((work_group_size_hint(X, Y, Z))) is a hint to the compiler and is intended to specify the work-group size that may be used, i.e., the value most likely to be specified by the local_work_size argument to clEnqueueNDRangeKernel. For example, __attribute__((work_group_size_hint(1, 1, 1))) is a hint to the compiler that the kernel will most likely be executed with a work-group size of 1.
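As a sketch (the kernel name and the size 64 are hypothetical), a kernel expected, but not required, to run with 64x1x1 work-groups would be declared:

```
__kernel __attribute__((work_group_size_hint(64, 1, 1)))
void scale( __global float *data )
{
    data[get_global_id(0)] *= 2.0f;
}
```

Unlike reqd_work_group_size, this hint does not constrain the local_work_size the host may actually pass; it only guides compiler optimization.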
The optional __attribute__((reqd_work_group_size(X, Y, Z))) specifies the work-group size that must be used as the local_work_size argument to clEnqueueNDRangeKernel. This allows the compiler to optimize the generated code appropriately for this kernel. If specified, reqd_work_group_size(X, Y, Z) must be (1, 1, 1) if the kernel is executed via clEnqueueTask. If Z is one, the work_dim argument to clEnqueueNDRangeKernel can be 2 or 3. If Y and Z are one, the work_dim argument to clEnqueueNDRangeKernel can be 1, 2 or 3.
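The work_dim rules above can be sketched as a small host-side check (a hypothetical helper, not part of the OpenCL API): given a kernel declared with reqd_work_group_size(x, y, z), it reports whether a particular work_dim value passed to clEnqueueNDRangeKernel is permitted.

```c
#include <stdbool.h>
#include <assert.h>

/* Hypothetical helper encoding the work_dim rules for a kernel
 * declared with reqd_work_group_size(x, y, z). */
bool work_dim_allowed(unsigned x, unsigned y, unsigned z, unsigned dim)
{
    (void)x;                                  /* X places no constraint */
    if (dim == 3) return true;                /* always allowed */
    if (dim == 2) return z == 1;              /* requires Z == 1 */
    if (dim == 1) return y == 1 && z == 1;    /* requires Y == Z == 1 */
    return false;
}
```

For instance, a kernel with reqd_work_group_size(16, 16, 1) may be enqueued with work_dim of 2 or 3, but not 1.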
Implicit in autovectorization is the assumption that any libraries called from the __kernel must be recompilable at run time to handle cases where the compiler decides to merge or separate work-items. This probably means that such libraries can never be hard-coded binaries, or that hard-coded binaries must be accompanied either by source or by some retargetable intermediate representation. This may be a code security question for some.
// Autovectorize assuming float4 as the
// basic computation width.
__kernel __attribute__((vec_type_hint(float4)))
void foo( __global float4 *p )
{
    ....
}

// Autovectorize assuming double as the
// basic computation width.
__kernel __attribute__((vec_type_hint(double)))
void foo( __global float4 *p )
{
    ....
}

// Autovectorize assuming int (default)
// as the basic computation width.
__kernel
void foo( __global float4 *p )
{
    ....
}