Read Get Moving with Alveo Example 0: Loading an Alveo Image
Full source for the examples in this article can be found here: https://github.com/xilinx/get_moving_with_alveo
The FPGA image that we’ve loaded contains a very simple vector addition core. It takes two buffers of arbitrary length as inputs and produces a buffer of equal length as an output. As the name implies, during the process it adds them together.
Our code has not really been optimized to run well in an FPGA. It’s mostly equivalent to putting the algorithm in listing 3.3 directly into the FPGA fabric. This isn’t particularly efficient. We can process one addition operation on each tick of the clock, but we’re still only processing one 32-bit output at a time.
Listing 3.3: Vector Addition Algorithm
#include <cstdint>

// Software reference: element-by-element addition of two input buffers
void vadd_sw(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t size)
{
    for (uint32_t i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}
It’s very important to note that at this point there is no way this code will beat the processor. The clock in the FPGA fabric is significantly slower than the CPU clock. This is expected, though - thinking back to our earlier example, we’re only loading a single passenger into each car on the train. We also have overhead to pass the data over PCIe, set up DMA, etc. For the next few examples, we’ll look at how to efficiently manage the buffers for our inputs and outputs to this function. Only after that will we start to take advantage of the acceleration we can get from the Alveo™ Data Center accelerator card.
This example is the first time we’re going to actually run something on the FPGA, modest though it may be. In order to run something on the card there are four things that we must do:

1. Allocate and populate buffers in host memory.
2. Transfer those buffers down to the Alveo card’s global memory.
3. Run the kernel on the card.
4. Transfer the results back to host memory so that the CPU can access them.
As you can see, only one of those things actually takes place on the card. Memory management will make or break your application’s performance, so let’s start to take a look at that.
If you haven’t done acceleration work before, you may be tempted to jump in and just use normal calls to malloc() or new to allocate your memory. In this example we’ll do just that, allocating a series of buffers to transfer between the host and the Alveo card. We’ll allocate four buffers: two input buffers to add together, one output buffer for the Alveo to use, and an extra buffer for a software implementation of our vadd function. This allows us to see something interesting: how we allocate memory for Alveo also impacts how efficiently the processor will run.
Buffers are allocated simply, as in listing 3.4. In our case, BUFSIZE is 24 MiB, or 6 × 1024 × 1024 values of type uint32_t. Any code not mentioned here is either identical or functionally equivalent to the previous examples.
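For reference, a definition of BUFSIZE consistent with that description would look like the following (an assumption on our part; the actual sources may define it differently):

#define BUFSIZE (6 * 1024 * 1024) // 6 Mi values of 4 bytes each = 24 MiB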
Listing 3.4: Simple Buffer Allocation
uint32_t* a = new uint32_t[BUFSIZE];
uint32_t* b = new uint32_t[BUFSIZE];
uint32_t* c = new uint32_t[BUFSIZE];
uint32_t* d = new uint32_t[BUFSIZE];
This will allocate memory that is virtual, paged, and, most importantly, non-aligned. In particular it’s this last one that is going to cause some problems, as we’ll soon see.
Once we allocate the buffers and populate them with initial test vectors, the next acceleration step is to send them down to the Alveo global memory. We do that by creating OpenCL buffer objects using the flag CL_MEM_USE_HOST_PTR. This tells the API that rather than allocating its own buffer, we are providing our own pointers. This isn’t necessarily bad, but because we haven’t taken care when allocating our pointers it’s going to hurt our performance.
Listing 3.5 contains the code mapping our allocated buffers to OpenCL buffer objects.
Listing 3.5: Mapping OCL Buffers with Host Memory Pointers
std::vector<cl::Memory> inBufVec, outBufVec;
cl::Buffer a_to_device(context,
                       static_cast<cl_mem_flags>(CL_MEM_READ_ONLY |
                                                 CL_MEM_USE_HOST_PTR),
                       BUFSIZE * sizeof(uint32_t),
                       a,
                       NULL);
cl::Buffer b_to_device(context,
                       static_cast<cl_mem_flags>(CL_MEM_READ_ONLY |
                                                 CL_MEM_USE_HOST_PTR),
                       BUFSIZE * sizeof(uint32_t),
                       b,
                       NULL);
cl::Buffer c_from_device(context,
                         static_cast<cl_mem_flags>(CL_MEM_WRITE_ONLY |
                                                   CL_MEM_USE_HOST_PTR),
                         BUFSIZE * sizeof(uint32_t),
                         c,
                         NULL);
inBufVec.push_back(a_to_device);
inBufVec.push_back(b_to_device);
outBufVec.push_back(c_from_device);
What we’re doing here is allocating cl::Buffer objects, which are recognized by the API, and passing in the pointers a, b, and c from our previously allocated buffers. The additional flags CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY specify to the runtime the visibility of these buffers from the perspective of the kernel. In other words, a and b are written to the card by the host; to the kernel they are read-only. Then, c is read back from the card to the host; to the kernel it is write-only. We additionally add these buffer objects to vectors so that we can transfer multiple buffers at once (note that we’re essentially adding pointers to the vectors, not the data buffers themselves).
Next, we can transfer the input buffers down to the Alveo card using the code in listing 3.6.
Listing 3.6: Migrating Host Memory to Alveo
cl::Event event_sp;
q.enqueueMigrateMemObjects(inBufVec, 0, NULL, &event_sp);
clWaitForEvents(1, (const cl_event *)&event_sp);
In this code snippet the “main event” is the call to enqueueMigrateMemObjects(). We pass in our vector of buffers, the 0 indicates that this is a transfer from host to device, and we also pass in a cl::Event object.
This is a good time to segue briefly into synchronization. When we enqueue the transfer we’re adding it to the runtime’s “to-do list”, if you will, but not actually waiting for it to complete. By registering a cl::Event object, we can then decide to wait on that event at any point in the future. In general this isn’t a point where you would necessarily want to wait, but we’ve done this at various points throughout the code to more easily instrument it to display the time taken for various operations. This adds a small amount of overhead to the application, but again, this is a learning exercise and not an example of optimizing for maximum performance.
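As an aside, if you prefer to stay within the C++ wrapper API, the same wait can be expressed with cl::Event::wait() instead of the C-style clWaitForEvents() call. A minimal sketch, assuming the same queue and buffer vector as above:

cl::Event event_sp;
q.enqueueMigrateMemObjects(inBufVec, 0, NULL, &event_sp);
event_sp.wait(); // blocks until the migration has completed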
We now need to tell the runtime what to pass to our kernel, and we do that in listing 3.7. Recall that our argument list looked like this:
(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t size)
In our case a is argument 0, b is argument 1, and so on.
Listing 3.7: Setting Kernel Arguments
krnl.setArg(0, a_to_device);
krnl.setArg(1, b_to_device);
krnl.setArg(2, c_from_device);
krnl.setArg(3, BUFSIZE);
Next, we add the kernel itself to the command queue so that it will begin executing. Generally speaking, you would enqueue the transfers and the kernel such that they’d execute back-to-back rather than synchronizing in between. The line of code that adds the execution of the kernel to the command queue is in listing 3.8.
Listing 3.8: Enqueue Kernel Run
q.enqueueTask(krnl, NULL, &event_sp);
If you don’t want to wait at this point you can again pass in NULL instead of a cl::Event object.
And, finally, once the kernel completes we want to transfer the memory back to the host so that we can access the new values from the CPU. This is done in listing 3.9.
Listing 3.9: Transferring Data Back to the Host
q.enqueueMigrateMemObjects(outBufVec, CL_MIGRATE_MEM_OBJECT_HOST, NULL, &event_sp);
clWaitForEvents(1, (const cl_event *)&event_sp);
In this instance we do want to wait for synchronization. This is important; recall that when we call these enqueue functions, we’re placing entries onto the command queue in a non-blocking manner. If we then attempt to access the buffer immediately after enqueuing the transfer, it may not yet have finished reading back in.
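Once that wait returns, the host can safely read c. For instance, this is roughly how you might check the Alveo results against the software reference computed into d (a sketch, assuming vadd_sw(a, b, d, BUFSIZE) from listing 3.3 was run earlier):

bool matches = true;
for (uint32_t i = 0; i < BUFSIZE; i++) {
    if (c[i] != d[i]) { // compare Alveo output against software output
        matches = false;
        break;
    }
}
std::cout << (matches ? "TEST PASSED" : "TEST FAILED") << std::endl;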
Excluding the FPGA configuration from example 0, the new additions in order to run the kernel are:

1. Map our host memory buffers to cl::Buffer objects.
2. Migrate the input buffers down to the Alveo card.
3. Set the kernel arguments.
4. Enqueue the kernel run.
5. Migrate the output buffer back to the host.

Only one synchronization would be needed were this a real application; as previously mentioned, we’re using several to better report on the timing of various operations in the workflow.
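To illustrate what that single synchronization could look like, here is a minimal sketch that chains the enqueues from listings 3.6 through 3.9 with event wait lists instead of intermediate waits (reusing the names from above, and assuming the kernel arguments are already set as in listing 3.7):

std::vector<cl::Event> xfer_done(1), krnl_done(1);
cl::Event results_ready;

// Each step depends only on the completion of the previous one
q.enqueueMigrateMemObjects(inBufVec, 0, NULL, &xfer_done[0]);
q.enqueueTask(krnl, &xfer_done, &krnl_done[0]);
q.enqueueMigrateMemObjects(outBufVec, CL_MIGRATE_MEM_OBJECT_HOST,
                           &krnl_done, &results_ready);

// The single synchronization point: wait for results in host memory
results_ready.wait();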
With the XRT initialized, run the application from the build directory with the following command:
./01_simple_malloc alveo_examples
The program will output a message similar to this:
-- Example 1: Vector Add with Malloc() --
Loading XCLBin to program the Alveo board:
Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: './alveo_examples.xclbin'
Running kernel test with malloc()ed buffers
WARNING: unaligned host pointer '0x154f7909e010' detected,
this leads to extra memcpy
WARNING: unaligned host pointer '0x154f7789d010' detected,
this leads to extra memcpy
WARNING: unaligned host pointer '0x154f7609c010' detected,
this leads to extra memcpy
Simple malloc vadd example complete!
--------------- Key execution times ---------------
OpenCL Initialization: 247.371 ms
Allocating memory buffer: 0.030 ms
Populating buffer inputs: 47.955 ms
Software VADD run: 35.706 ms
Map host buffers to OpenCL buffers: 64.656 ms
Memory object migration enqueue: 24.829 ms
Set kernel arguments: 0.009 ms
OCL Enqueue task: 0.064 ms
Wait for kernel to complete: 92.118 ms
Read back computation results: 24.887 ms
Note that we have some warnings about unaligned host pointers. Because we didn’t take care with our allocation, none of our buffers that we’re transferring to or from the Alveo card are aligned to the 4 KiB boundaries needed by the Alveo DMA engine. Because of this, we need to copy the buffer contents so they’re aligned before transfer, and that operation is quite expensive.
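Later examples will fix this properly, but as a preview, a 4 KiB-aligned allocation could be sketched with posix_memalign() like this (one option among several; the alignment requirement is the key point):

uint32_t *a = nullptr;
// Request BUFSIZE values aligned to the 4 KiB boundary the DMA engine needs
if (posix_memalign(reinterpret_cast<void **>(&a), 4096,
                   BUFSIZE * sizeof(uint32_t)) != 0) {
    throw std::bad_alloc(); // allocation failed
}
// ... use a, then release it with free(a) rather than delete[]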
From this point on in our examples, let’s keep a close eye on these numbers. While there will be some variability in the latency from run to run, generally speaking we are looking for deltas in each particular area. For now, let’s establish a baseline in table 3.1.
Table 3.1: Timing Summary - Example 1
| Operation | Example 1 |
|---|---|
| OCL Initialization | 247.371 ms |
| Buffer Allocation | 30 μs |
| Buffer Population | 47.955 ms |
| Software VADD | 35.706 ms |
| Buffer Mapping | 64.656 ms |
| Write Buffers Out | 24.829 ms |
| Set Kernel Args | 9 μs |
| Kernel Runtime | 92.118 ms |
| Read Buffer In | 24.887 ms |
| ∆Alveo→CPU | −418.228 ms |
| ∆Alveo→CPU (algorithm only) | −170.857 ms |
That’s certainly... not great. But are we going to give up? Surely there must be some reason Xilinx built this thing, right? OK, let’s see if we can do better!
Some things to try to build on this experiment:

- Vary BUFSIZE and watch how each of the timing numbers changes.
- Compare the software VADD runtime against the total Alveo runtime for different buffer sizes.
Rob Armstrong leads the AI and Software Acceleration technical marketing team at AMD, bringing the power of adaptive compute to bear on today’s most exciting challenges. Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.