Overview

You didn’t think you were getting out of here quite so fast, did you?   As I said at the beginning: vadd will never be at the processor.  It’s too simple; if you don’t have to transfer data and you can burn through local cache, the CPU will always win in the end, assuming you haven’t hauled your uncle’s dusty 386 Turbo out of the basement to run this test.

The results from the previous session look good - on paper.  I get it, I’m in marketing.  I want to call it a day.  But I seem to recall also having a degree with the word “engineering” on it gathering dust somewhere, so - as a fellow engineer - let me break it to you.


Key Code

For simple algorithms, an accelerator just won’t win.  Let’s use OpenMP to parallelize the processor loop.  We include the header omp.h, and then apply an OpenMP pragma to the CPU code as in listing 3.19.

Listing 3.19: SParallelize the Software with OpenMP

    
void vadd_sw(uint32_t* a, uint32_t *b, uint32_t* c, uint32_t size)
{
#pragma omp parallel for
      for (inti = 0; i < size; i++) {
          c[i] = a[i] + b[i];
      }
}

And that’s it.  There are some command line flags to pass to GCC, but CMake will take care of those (assuming you have OpenMP installed), so we can directly build and run.  The code for this example is otherwise identical to the code from Example 5.


Running the Application

With the XRT initialized, run the application by running the following command from the build directory:

./06_wide_processor alveo_examples

The program will output a message similar to this:

-- Example 6: VADD with OpenMP --

Loading XCLBin to program the Alveo board:

Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: ’./alveo_examples.xclbin’

-- Running kernel test with XRT-allocated contiguous buffers and wide VADD (16 values/clock), with software OpenMP

OCL-mapped contiguous buffer example complete!

--------------- Key execution times ---------------
OpenCL Initialization: 253.898 ms
Allocate contiguous OpenCL buffers: 907.183 ms
Map buffers to userspace pointers:  0.307 ms
Populating buffer inputs: 1188.315 ms
Software VADD run : 157.226 ms
Memory object migration enqueue : 1.429 ms
Wait for kernel to complete : 618.231 ms
-- Example 5: Pipelining Kernel Execution --

Table 3.7 shows these results compared to the previous run:

Table 3.7: 1 GiB Buffer Timing Summary - Sequential vs. OpenMP

Operation Example 5 Example 6 ∆5→6
Software VADD 1166.471 ms 157.226 ms −1009.245 ms
Hardware VADD (Total) 692.172 ms 618.231 ms −73.94 ms
∆Alveo→CPU −503.402 ms 461.005 ms 964.407 ms

The accelerator runtime fluctuation is primarily a result of running these tests in a virtualized cloud environment, but that’s not the point of the exercise.


Extra Exercises

Some things to try to build on this experiment:

  • Try to beat the processor at vector addition!
  • Play with the OpenMP pragmas; how many CPU cores are needed to beat a hardware accelerator?

Key Takeaways

  • I’ve said it before and I’ll say it again: simple O(N) will never win.
But despair not!  Now it’s time to look at something real.
 
 

About Rob Armstrong

About Rob Armstrong

Rob Armstrong leads the AI and Software Acceleration technical marketing team at AMD, bringing the power of adaptive compute to bear on today’s most exciting challenges.  Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.