![]() |
OpenVX User Kernel Tiling Extension
r29628
|
The User Kernel Tiling facility enables optimizations of the user nodes (e.g. locality of execution or parallelism) when performing computation on the image data. Modern processors have a diverse memory hierarchy that varies from relatively small but fast and expensive memory to relatively large but slow and inexpensive memory. Image data is typically too large to fit into the fast but small memory. The ability to break the image data into smaller-sized rectangular units allows for optimized computation on these smaller units with fast memory access or parallel execution of a user node on multiple image ``tiles'' simultaneously. The OpenVX Graph Manager possesses the knowledge about the memory hierarchy of the target and is hence in a position to break the image data into smaller units for memory optimization. Knowledge of the memory access pattern of an algorithm is key for the graph manager to enable optimizations.
As with conventional User Kernels, Tiling User Kernels will typically be loaded and executed on HLOS/CPU-compatible targets, not on remote processors or other accelerators — the intent of this extension is to allow efficient scheduling and memory transfer between accelerators and user functions, not to provide code to run on an accelerator, since this capability is necessarily vendor-specific. This specification does not mandate what constitutes compatible platforms.
The purpose of having User Kernel Tiling Functions is to:
The user kernel tiling facility is largely composed of two components:
The creator of the tiled user kernel must provide a kernel function to perform the actual computation. Optionally, the kernel creator may provide a second function with different edge-condition checks.
If the kernel creator only provides the fast function, some of the output image (near the edges) will not be computed if the fast function cannot be safely called at the edge of these images. If the kernel creator only provides the flexible function, it will be called at all appropriate locations of the image, including near the edges. The OpenVX implementation should still attempt to call the flexible function with regular tile sizes and alignments where possible. If both functions are provided, the fast function should be used as much as possible given its restrictions, and the flexible function will be called elsewhere.
Terms and Definitions of the Extension:
vx_tiling_kernel_f. VX_BORDER_MODE_SELF — which may rely on a neighborhood that corresponds to pixels beyond the bounds of the input images. The user function should be capable of running at arbitrary coordinate alignment. If only the flexible function is provided, this will be invoked on tile-block aligned tiles as well as unaligned tiles. The user can expect aligned regions to be invoked preferentially even when only a flexible function is provided. At least one of the fast and flexible functions must be provided. The flexible function is a function pointer of type vx_tiling_kernel_f. VX_BORDER_MODE_SELF, the flexible function will be invoked on tiles near the image border that would require neighborhood access beyond the image bounds, and this function is responsible for accessing any pixels whose coordinates are invalid. If the border mode is VX_BORDER_MODE_UNDEFINED, the graph manager will ensure that the user functions are not invoked on coordinates that, because of neighborhood, require access outside the image area. VX_KERNEL_ATTRIBUTE_TILE_MEMORY_SIZE attribute (which may be zero) multiplied by the number of tile blocks in the tile being processed, and the allocation will be passed to the user-provided functions through the tile_memory parameter. The size allocated for that function invocation is provided via the tile_memory_size parameter. The valid lifetime of the given pointer is per invocation of the user function that processes a tile. This allocation is provided per tile — and therefore two parallel invocations of the user tiling functions will receive different allocations. This is distinct from the per-node memory managed via the VX_KERNEL_ATTRIBUTE_LOCAL_DATA_SIZE and VX_KERNEL_ATTRIBUTE_LOCAL_DATA_PTR kernel attributes.The following tables describe the circumstances under which each user function will be invoked, categorized by the border mode settings of the node.
| Functions provided | VX_BORDER_MODE_UNDEFINED | VX_BORDER_MODE_SELF |
|---|---|---|
| If only the FAST function is defined... | The Fast function is invoked on the largest aligned image subset that avoids out-of-bounds accesses. Other pixels are not processed. | Prohibited (graph validation fails). |
| If only the FLEXIBLE function is defined... | The flexible function is invoked on the largest image subset that ensures neighborhood accesses are within bounds, irrespective of alignment. Other pixels are not processed. | The flexible function is invoked for all pixels in the output image irrespective of alignment, including pixels that correspond to neighborhood accesses beyond the image bounds. |
| If both the FAST and FLEXIBLE functions are defined... | The fast function is invoked on the largest aligned image subset that avoids out-of-bounds neighborhood accesses. The flexible function is invoked for the remaining pixels that do not require out-of-bounds accesses. | The fast function is invoked on the largest aligned image subset that avoids out-of-bounds neighborhood accesses. The flexible function is invoked on remaining pixels. |
Note that a missing flexible function when using VX_BORDER_MODE_SELF is treated as an error even if the fast function might never be expected to be called (for example, if the image size is divisible by the tile block size and the neighborhood is 0).
| Pixels to process | VX_BORDER_MODE_UNDEFINED | VX_BORDER_MODE_SELF |
|---|---|---|
| Pixels that require access beyond the image bounds (due to neighborhood) | Not processed by any function | Processed by the flexible function, which must be provided |
| Pixels that require access only within the image, but in an unaligned tile | Processed by the flexible function (if provided) and not processed otherwise | Processed by the flexible function, which must be provided |
| Pixels within an aligned tile that do not access beyond the image bounds | Processed by the fast function if provided, otherwise by the flexible function | Processed by the fast function if provided, otherwise by the flexible function |
Users can add simple filter-style kernels similar to this example of a Gaussian Blur.
Note that this function has a non-zero neighborhood, and performs no special checking for running at the image borders. Therefore it must be used as a “fast” function and, unless a corresponding “flexible” function is provided, must be used with VX_BORDER_MODE_UNDEFINED.
Users can also add more complex addressing functionality in order to optimize their code in this example which uses two inputs:
With a tile block width and height of one, combined with the zero neighborhood, this function can be used as a “flexible” function. With a larger tile block size, this is a “fast” function.
For better performance, the user might split this into two functions. Firstly, a “fast” function:
This fast function has a fixed 16 by 16 tile block size, the loops of which the compiler may be able to unroll or otherwise optimize — the pixel access macro is designed to give the compiler the best chance of doing this. Less portable code may be able to rely on byte alignment and fast memory accesses, and perform SIMD operations in the fast function to maximize the host CPU performance.
In addition to performance optimization, many algorithms have tile block size requirements due to spatial effects on pixels. For example, ordered dithering, UV upsampling and debayering all require pixels to be processed differently according to location within a repeating block.
This would be complemented by a corresponding “flexible” function:
This function ignores the tile block and processes the tile given to it one pixel at a time. It is expected that the flexible function should be called for much less of the image than the fast function, so this overhead should be considered acceptable. This applies whether the flexible function is invoked because of tile alignment or in order to support out-of-bounds accesses, and users writing portable OpenVX code should be able to devote effort to the fast function. However, it is legal for an implementation with minimal tiling support to process an entire image in a single, possibly unaligned, tile for portability reasons.
The simplest user functions (that depend on any inputs at all) will read only the pixel in the input image(s) that correspond(s) to the pixel in the output image(s). Such a function requires a neighborhood size of 0 in each direction. If only a single image is processed at a time, the tile block size should be set to 1 by 1. Here we show the tile block — and therefore minimum tile size — in blue (trivially).
A more complex user function may require source pixels surrounding the coordinate of the output pixel in order to work. For example, a Sobel filter requires the a three by three block centered on the current pixel. Here we show the neighborhood in cyan around the single pixel to be processed, in blue — each square represents one pixel. This neighborhood would be specified as “-1,1,-1,1”.
The user may choose to optimize the same function by making it work on blocks of 4 by 4 pixels, rather than one pixel at a time. In this case, the tile block should be set to 4 by 4. Depending on the implementation, the neighborhood may be unmodified. If a fast function is defined that uses this 4 by 4 tile block, the user should still offer a flexible function that can process tiles that are not aligned to this 4 by 4 grid, since otherwise some portions of the image area may be left unprocessed. Here we show the 4 by 4 tile block with a single pixel neighborhood, specified in the same way as before.
Tiles processed using the “fast” function will be aligned to a 4x4 grid, and some multiple of this tile block size; the fast function will be able to access valid pixels from the neighborhood area (which will surround the complete tile by a single pixel).
Consider a simple function that has no neighborhood - let us say a function which posterizes the image by replicating the top four bits of an 8-bit input into the lower four bits:
out(x,y) = (in(x,y) & 0xF0) | (in(x,y) >> 4)
If the implementation guarantees 32-bit alignment for the start of each row — not something guaranteed by OpenVX, but likely common — then the user could write a “fast” function which processes four pixels at once, using a 32-bit integer. This function would then have a 4 by 1 “tile block”.
In processing a 12 by 6 image using this function, the graph manager might choose to break the image into three rectangular “tiles”. An example in shown in different colors below. The fast function can be called for each of these tiles. It is up to a graph manager to decide different arrangements, and to decide the order in which the fast function would be invoked for these tiles — including whether the tiles are evaluated in parallel — but the tile block size guarantees that the tiles will be multiples of 4 by 1 pixels.
Next, consider what happens if the image is 14 by 6 pixels:
In this case, the 2-by-6 strip at the right edge of the image cannot be processed by tile-block-aligned tiles, and therefore the fast function cannot process these pixels. If the user has provided a “flexible” function, that function will be used to process these pixels (still provided with rectangular tiles of data).
We might imagine a function that is similar to the above, but which requires access to all the source pixels surrounding the coordinate of the destination pixel in order to operate. While it may process a single pixel, the user may choose to perform a similar optimization to that discussed in the previous section, resulting in a 4-by-1 tile block with a one-pixel border neighborhood around it.
With the user node's border mode set to VX_BORDER_MODE_UNDEFINED, the graph manager might choose to break the same 14 by 8 image into the following tiles for the user functions to process:
In this case, the white pixels at the image edge are not processed at all. The red tile is aligned to tile blocks, and can be processed with a fast function, if one is provided (otherwise the flexible function is used). The blue and green tiles do not start on (blue tile) or end on (green tile) a 4 by 1 tile block grid, so these pixels must be processed by the flexible function. If no flexible function is provided, these pixels will not be processed.
With the user node's border mode set to VX_BORDER_MODE_SELF, the graph manager might choose to break the same 14 by 8 image into the following tiles for the user functions to process:
With VX_BORDER_MODE_SELF, the flexible function must be provided. In this example, only the red tile can be processed by the fast function (if provided), since only this tile is aligned to a tile block boundary and has no requirements on pixels beyond the image borders. The blue, cyan and yellow regions are aligned to the tile block boundary, but must be processed by the flexible function because the neighborhood of these pixels extends beyond the image boundary. The green tile is additionally not aligned to the tile block size.
Tiled user kernel functions should be coded in a very specific manner to maintain the requirement of data independence in that all blocks should be computed in a functionally-independent manner. This allows the implementation greater flexibility in optimization choices. These informal requirements include: