AI Engine-ML Intrinsics User Guide  (v2023.2)
Multiply Accumulate

Overview

Intrinsics allowing you to perform MUL/MAC operations and a few of their variants.

For integer datatypes, a matrix A of size MxN is multiplied with a matrix B of size NxP. The naming convention for these operations is: [operation][_MxN_NxP]{_Cch}{_conf} or [operation]_conv_MxN{_Cch}{_conf}. Properties in [] are mandatory; properties in {} are optional. In this naming, conv indicates a convolutional operation, conf indicates the use of sub, zero or shift masks, and C gives the number of channels.
For an MxN vector multiply convolution operation, the calculation performed is:

\[ \text{mul_conv_MxN}(F,G)(x) = \sum_{u=0}^{\text{N}-1}{G(u) F(x+u)}, \qquad x = 0,\ldots,\text{M}-1 \]

where the vector \(F\) has length \(\text{M}+\text{N}-1\), and the vector \(G\) has length \(\text{N}\).
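
As a reading aid, the following is a minimal scalar reference model of this formula in plain C++; the helper name and element types are illustrative, not an actual intrinsic:

```cpp
#include <cstdint>
#include <vector>

// Scalar reference for mul_conv_MxN: F holds the M+N-1 input samples, G holds
// the N kernel taps; output lane x is sum_{u=0}^{N-1} G[u] * F[x+u].
std::vector<int32_t> mul_conv_ref(const std::vector<int8_t> &F,
                                  const std::vector<int8_t> &G, int M) {
    const int N = static_cast<int>(G.size());
    std::vector<int32_t> out(M, 0);
    for (int x = 0; x < M; ++x)
        for (int u = 0; u < N; ++u)
            out[x] += static_cast<int32_t>(G[u]) * F[x + u];
    return out;
}
```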

For element-wise operations, the naming is [operation_elem_C]{_N}. Here, C is the number of channels and N is the number of columns of matrix A/rows of matrix B; N is either two or it is omitted. The element-wise operations are executed channel by channel. The output will also be a matrix with C channels.

For complex datatypes, a multiplication of two matrices with complex elements is performed. The naming convention for these operations is [operation_elem_8]{_conf} for Multiply-accumulate of 32b x 16b complex integer datatypes and [operation_elem_8_2]{_conf} for Multiply-accumulate of 16b x 16b complex integer datatypes. Here, eight is the number of channels and two is the number of columns of matrix A/rows of matrix B. The matrix multiplication is performed individually for each channel of the input matrices. The output will also be a matrix with eight channels.

The following table shows the matrix multiplications that can be completed within a single cycle.

| Precision Mode | Channels | Matrix A | Matrix B | Matrix C |
|---|---|---|---|---|
| 8-bit x 4-bit = 32-bit | 1 | 4x16 | 16x8 | 4x8 |
| 8-bit x 4-bit = 32-bit | 1 | 4x32 | 32x8 (sparse) | 4x8 |
| 8-bit x 8-bit = 32-bit | 1 | 4x8 | 8x8 | 4x8 |
| 8-bit x 8-bit = 32-bit | 32 | 1x2 | 2x1 | 1x1 |
| 8-bit x 8-bit = 32-bit | 8 | 4x4 (convolution) | 4x1 | 4x1 |
| 8-bit x 8-bit = 32-bit | 4 | 8x8 (convolution) | 8x1 | 8x1 |
| 8-bit x 8-bit = 32-bit | 1 | 32x8 (convolution) | 8x1 | 32x1 |
| 8-bit x 8-bit = 32-bit | 1 | 4x16 | 16x8 (sparse) | 4x8 |
| 16-bit x 8-bit = 32-bit | 1 | 4x4 | 4x8 | 4x8 |
| 16-bit x 8-bit = 32-bit | 2 | 4x4 | 4x4 | 4x4 |
| 16-bit x 16-bit = 32-bit | 1 | 4x2 | 2x8 | 4x8 |
| 16-bit x 16-bit = 32-bit | 32 | 1x1 | 1x1 | 1x1 |
| 16-bit x 8-bit = 64-bit | 1 | 2x8 | 8x8 | 2x8 |
| 16-bit x 8-bit = 64-bit | 1 | 4x8 | 8x4 | 4x4 |
| 16-bit x 8-bit = 64-bit | 1 | 2x16 | 16x8 (sparse) | 2x8 |
| 16-bit x 16-bit = 64-bit | 1 | 2x4 | 4x8 | 2x8 |
| 16-bit x 16-bit = 64-bit | 1 | 4x4 | 4x4 | 4x4 |
| 16-bit x 16-bit = 64-bit | 16 | 1x2 | 2x1 | 1x1 |
| 16-bit x 16-bit = 64-bit | 1 | 16x4 (convolution) | 4x1 | 16x1 |
| Complex 16-bit x Complex 16-bit = 64-bit | 8 | 1x2 | 2x1 | 1x1 |
| 16-bit x 16-bit = 64-bit | 1 | 2x8 | 8x8 (sparse) | 2x8 |
| 32-bit x 16-bit = 64-bit | 1 | 4x2 | 2x4 | 4x4 |
| Complex 32-bit x Complex 16-bit = 64-bit | 8 | 1x1 | 1x1 | 1x1 |
| bfloat16 x bfloat16 = fp32 | 1 | 4x8 | 8x4 | 4x4 |
| bfloat16 x bfloat16 = fp32 | 16 | 1x2 | 2x1 | 1x1 |
| bfloat16 x bfloat16 = fp32 | 1 | 4x16 | 16x4 (sparse) | 4x4 |

Matrix mult intrinsics

We can summarize the MUL and MAC operations like this:

MAC: res = acc_in1 + (X_vec x Y_vec)
MUL: res = (X_vec x Y_vec)

Here, 'x' denotes the matrix multiplication operator. In the same way we can summarize the MSC, NEGMUL and MACMUL operations, as well as the MAC/MSC variants that take an additional acc_in2 input:

MSC: res = acc_in1 - (X_vec x Y_vec)
NEGMUL: res = - (X_vec x Y_vec)
MACMUL: res = (zero_acc1 ? 0 : acc_in1) + (X_vec x Y_vec)
ADDMAC: res = acc_in1 + acc_in2 + (X_vec x Y_vec)
ADDMSC: res = acc_in1 + acc_in2 - (X_vec x Y_vec)
SUBMAC: res = acc_in1 - acc_in2 + (X_vec x Y_vec)
SUBMSC: res = acc_in1 - acc_in2 - (X_vec x Y_vec)
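
For illustration, here is a sketch of a tiled 8-bit MUL/MAC sequence matching the "4x8 times 8x8" row of the table above. The mul_4x8_8x8/mac_4x8_8x8 names and the v32int8/v64int8/v32acc32 types are assumptions derived from the naming convention and may differ between tool versions:

```cpp
// Hypothetical sketch: accumulate k_tiles products of 4x8 (8-bit) A tiles
// with 8x8 (8-bit) B tiles into one 4x8 accumulator of 32-bit lanes.
v32acc32 tile_matmul(const v32int8 *A, const v64int8 *B, int k_tiles) {
    v32acc32 acc = mul_4x8_8x8(A[0], B[0]);     // MUL: acc  = A0 x B0
    for (int k = 1; k < k_tiles; ++k)
        acc = mac_4x8_8x8(acc, A[k], B[k]);     // MAC: acc += Ak x Bk
    return acc;
}
```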

The convolve variants

The convolve variants of these intrinsics differ in that they apply a convolution on the vectors instead of a matrix multiplication; here '*' denotes the vector convolution operator. In this case X_vec holds the data matrix and Y_vec the kernel.

MAC: res = acc_in1 + (X_vec * Y_vec)
MUL: res = (X_vec * Y_vec)
MSC: res = acc_in1 - (X_vec * Y_vec)
NEGMUL: res = - (X_vec * Y_vec)
MACMUL: res = (zero_acc1 ? 0 : acc_in1) + (X_vec * Y_vec)
ADDMAC: res = acc_in1 + acc_in2 + (X_vec * Y_vec)
ADDMSC: res = acc_in1 + acc_in2 - (X_vec * Y_vec)
SUBMAC: res = acc_in1 - acc_in2 + (X_vec * Y_vec)
SUBMSC: res = acc_in1 - acc_in2 - (X_vec * Y_vec)
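
A corresponding hypothetical sketch for the convolve variants, matching the "32x8 (convolution)" table row; the mul_conv_32x8/mac_conv_32x8 names follow the [operation]_conv_MxN convention, while the operand types and widths are assumptions:

```cpp
// Hypothetical sketch: convolve two windows of input samples with the same
// 8-tap kernel, accumulating the second result on top of the first.
// Each x vector holds the samples F (M+N-1 = 39 of its lanes are used).
v32acc32 conv_pair(v64int8 x0, v64int8 x1, v32int8 kernel) {
    v32acc32 acc = mul_conv_32x8(x0, kernel);   // acc[m]  = sum_u G[u]*F0[m+u]
    return mac_conv_32x8(acc, x1, kernel);      // acc[m] += sum_u G[u]*F1[m+u]
}
```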

Zeroing, sign and negation masks

Some variants allow passing masks that determine the sign, zeroing and negation of vector or accumulator lanes. These masks are the following:

int sgn_x: Sign mask of matrix X. If it is one, matrix X is interpreted as signed; otherwise it is treated as unsigned.
int sgn_y: Sign mask of matrix Y. If it is one, matrix Y is interpreted as signed; otherwise it is treated as unsigned.
int zero_acc1: Zeroing of acc1. If it is one, acc1 is zeroed.
int zero_acc2: Zeroing of acc2. If it is one, acc2 is zeroed.
int sub_mul: Negation mask of the matrix multiplication result. If it is one, the result of the multiplication is negated.
int sub_acc1: Negation mask of acc1. If it is one, acc1 is negated.
int sub_acc2: Negation mask of acc2. If it is one, acc2 is negated.
int shift16: Shift mask of acc1. If a bit is set, the corresponding <<16 shift is applied to acc1.
int sub_mask: Negation mask of complex multiplications. Each bit negates one term of a complex multiplication.
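
Concretely, a scalar model of how these flags combine on a single accumulator lane, matching the ADDMAC/SUBMSC definitions above; this is an illustrative helper, not an intrinsic, and the ordering of the shift relative to the negation is an assumption:

```cpp
#include <cstdint>

// One accumulator lane under the _conf masks described above.
int64_t conf_lane(int64_t acc1, int64_t acc2, int64_t prod,
                  int zero_acc1, int zero_acc2, int sub_acc1, int sub_acc2,
                  int sub_mul, int shift16_bit) {
    int64_t a1 = zero_acc1 ? 0 : acc1;
    int64_t a2 = zero_acc2 ? 0 : acc2;
    if (shift16_bit) a1 <<= 16;      // this lane's bit taken from the shift16 mask
    if (sub_acc1)    a1 = -a1;
    if (sub_acc2)    a2 = -a2;
    return a1 + a2 + (sub_mul ? -prod : prod);
}
```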

Complex multiplications require some terms to be negated in order to implement conjugation and multiplication by minus j. This is done through the sub_mask. The following examples show how this mask is used when two complex numbers, X and Y, are multiplied to get an output O. For Multiply-accumulate of 16b x 16b complex integer datatypes, two complex products are post-added; they are indicated by the postfix 0/1:

O[re] = (-1)^sub_mask[0] * X[re0] * Y[re0] + (-1)^sub_mask[1] * X[im0] * Y[im0]
      + (-1)^sub_mask[2] * X[re1] * Y[re1] + (-1)^sub_mask[3] * X[im1] * Y[im1]
O[im] = (-1)^sub_mask[4] * X[re0] * Y[im0] + (-1)^sub_mask[5] * X[im0] * Y[re0]
      + (-1)^sub_mask[6] * X[re1] * Y[im1] + (-1)^sub_mask[7] * X[im1] * Y[re1]

For Multiply-accumulate of 32b x 16b complex integer datatypes there is no post-adding and only four unique terms are needed. However, all 8 bits must still be specified appropriately: in the following equations, the two index bits used for one term must be set to the same value.

O[re] = (-1)^sub_mask[0|2] * X[re] * Y[re] + (-1)^sub_mask[1|3] * X[im] * Y[im]
O[im] = (-1)^sub_mask[4|6] * X[re] * Y[im] + (-1)^sub_mask[5|7] * X[im] * Y[re]
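
A scalar model of the 16b x 16b case, transcribing the equations above (illustrative helpers, not intrinsics):

```cpp
#include <cstdint>

struct CInt { int64_t re, im; };

// Returns -1 or +1 depending on bit `bit` of sub_mask.
static int64_t s(unsigned sub_mask, int bit) {
    return ((sub_mask >> bit) & 1) ? -1 : 1;
}

// 16b x 16b complex multiply with post-add of the two products (postfix 0/1).
CInt cmul2(int16_t xre0, int16_t xim0, int16_t xre1, int16_t xim1,
           int16_t yre0, int16_t yim0, int16_t yre1, int16_t yim1,
           unsigned sub_mask) {
    CInt o;
    o.re = s(sub_mask, 0) * xre0 * yre0 + s(sub_mask, 1) * xim0 * yim0
         + s(sub_mask, 2) * xre1 * yre1 + s(sub_mask, 3) * xim1 * yim1;
    o.im = s(sub_mask, 4) * xre0 * yim0 + s(sub_mask, 5) * xim0 * yre0
         + s(sub_mask, 6) * xre1 * yim1 + s(sub_mask, 7) * xim1 * yre1;
    return o;
}
// Example: sub_mask = 0b00001010 negates the two X[im]*Y[im] terms, giving the
// standard complex product (re = re*re - im*im, im = re*im + im*re) per pair.
```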

Multiplication of matrices with multiple channels

Some intrinsics are used for multiplications of matrices with a given number of channels. Each MxN matrix is stored in row-major and channel-minor fashion. The following example shows the resulting layout of elements in the vector for a 4x4 matrix with two channels. The indexes of each element are given as (m,n,c).

[a(0,0,0) a(0,0,1) a(0,1,0) a(0,1,1) a(0,2,0) a(0,2,1) a(0,3,0) a(0,3,1)
a(1,0,0) a(1,0,1) a(1,1,0) a(1,1,1) a(1,2,0) a(1,2,1) a(1,3,0) a(1,3,1)
a(2,0,0) a(2,0,1) a(2,1,0) a(2,1,1) a(2,2,0) a(2,2,1) a(2,3,0) a(2,3,1)
a(3,0,0) a(3,0,1) a(3,1,0) a(3,1,1) a(3,2,0) a(3,2,1) a(3,3,0) a(3,3,1)]
Note
Matrices with multiple channels are used for convolutional and element-wise operations. Element-wise operations are performed along the channels; e.g. an element-wise multiplication of two matrices with 32 channels performs a matrix multiplication for each individual channel, and the output again has 32 channels.
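
Under this layout, the flat index of an element follows directly; a minimal sketch (hypothetical helper, plain C++):

```cpp
// Row-major, channel-minor addressing for an MxN matrix with C channels:
// element (m, n, c) lives at flat index (m*N + n)*C + c.
inline int flat_index(int m, int n, int c, int N, int C) {
    return (m * N + n) * C + c;
}
// In the 4x4, two-channel example above, a(2,3,1) maps to (2*4 + 3)*2 + 1 = 23.
```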

Element-wise multiplication

The elem variants allow you to perform element-wise operations. The operations are performed along the channels. For example, if you perform a (1x1x32) x (1x1x32) operation, a multiplication is done between the elements of the same channel: the elements of channel zero are multiplied, the elements of channel one are multiplied, and so on. The end result again has 32 channels.

Some of the elem variants perform matrix multiplications along the channels. For those cases the multiplication (1x2xC) x (2x1xC) is performed, and the end result is a (1x1xC) matrix. Despite the name, this is not a true element-wise multiplication.
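
A scalar model of this (1x2xC) x (2x1xC) case, assuming the channel-minor layout described above (illustrative helper, not an intrinsic):

```cpp
#include <cstdint>

// For each channel c, a 1x2 row of A is multiplied with a 2x1 column of B:
// out[c] = A(0,0,c)*B(0,0,c) + A(0,1,c)*B(1,0,c).
void elem_matmul_2(const int16_t *a, const int16_t *b, int32_t *out, int C) {
    for (int c = 0; c < C; ++c) {
        // channel-minor layout: column n of A (and row n of B) starts at offset n*C
        out[c] = int32_t(a[0 * C + c]) * b[0 * C + c]
               + int32_t(a[1 * C + c]) * b[1 * C + c];
    }
}
```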

Convolution operation

Convolutional operations work similarly to element-wise multiplication: in every step, the kernel is multiplied with the matrix before being shifted to the next position, and the same is done for each channel. The difference from a regular element-wise multiplication is that, after the multiplications for each channel have completed, the resulting matrices are added together, so the final result has only one channel.
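
A scalar model of this channel-summing convolution, again assuming the channel-minor layout (illustrative helper, not an intrinsic):

```cpp
#include <cstdint>
#include <vector>

// Multi-channel convolution: per-channel products are summed across the C
// channels, so the M output lanes form a single-channel result.
std::vector<int32_t> conv_channels(const std::vector<int8_t> &F, // (M+N-1)xC samples, channel-minor
                                   const std::vector<int8_t> &G, // NxC kernel taps, channel-minor
                                   int M, int N, int C) {
    std::vector<int32_t> out(M, 0);
    for (int x = 0; x < M; ++x)
        for (int u = 0; u < N; ++u)
            for (int c = 0; c < C; ++c)
                out[x] += int32_t(G[u * C + c]) * F[(x + u) * C + c];
    return out;
}
```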

Considerations when using bfloat16 data type

When multiplying with a scalar bfloat16, the scalar is internally cast to float, which influences the rounding behaviour under negation. The following example shows how this affects the multiplication. Because the cast involves a rounding operation, it matters whether the negation is performed before or after the cast: in the first case, the rounding is applied to the positive result before the negation; in the second and third cases, the negation happens before the rounding, which can lead to a different result.

bfloat16 a, b;
auto v1 = -(a * b); // does not match v2/v3: the product is rounded to the positive result before the negation
auto v2 = (-a * b); // an operand is negated before the multiply, so rounding sees the negated value
auto v3 = (a * -b); // same behaviour as v2

Considerations when using emulated FP32 Intrinsics

The element-wise multiplication and matrix multiplication intrinsics for the fp32 input type are emulated using the bfloat16 data path. There are three options to choose from.

### _accuracy_safe intrinsics

The default and most accurate, but slowest, option: each input fp32 number is split into 3 bfloat16 numbers to extract all the bits of the mantissa. With 3 bfloat16 terms per operand, a multiplication a*b requires 9 MAC operations.

### _accuracy_fast intrinsics

Fast and accurate option, selected with the application compile-time flag AIE2_FP32_EMULATION_ACCURACY_FAST. Each input fp32 number is again split into 3 bfloat16 numbers to extract all the bits of the mantissa, so 9 MAC operations would be needed to emulate the fp32 multiplication; of these, the MAC operations involving only LSBs (the 3 last terms) are ignored, so a*b requires 6 MAC operations. This improves the cycle count of the multiplication with the least impact on the accuracy of the result.

### _accuracy_low intrinsics

Fastest and least accurate option, selected with the application compile-time flag AIE2_FP32_EMULATION_ACCURACY_LOW. Each input fp32 number is split into only 2 bfloat16 numbers, so not all bits of the mantissa can be used. Of the 4 MAC operations that would be needed to emulate the fp32 multiplication, the MAC operation involving only LSBs (the last term) is ignored, so a*b requires 3 MAC operations. This improves the cycle count of the multiplication further.
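
The following host-side C++ sketch illustrates the splitting scheme described above; bf16_truncate, split3 and emulated_mul_safe are hypothetical helpers that model the idea (truncating splits, then summing the cross products), not the actual emulation code:

```cpp
#include <cstdint>
#include <cstring>

// Truncate a float to bfloat16 precision by zeroing the low 16 bits
// (keeps the sign, exponent and top 7 explicit mantissa bits).
static float bf16_truncate(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFF0000u;
    std::memcpy(&x, &u, sizeof u);
    return x;
}

// Peel an fp32 value into 3 bfloat16-sized chunks, most significant first,
// so that out[0] + out[1] + out[2] reproduces (approximately) all 24
// mantissa bits of x.
static void split3(float x, float out[3]) {
    out[0] = bf16_truncate(x);
    out[1] = bf16_truncate(x - out[0]);
    out[2] = x - out[0] - out[1];   // remaining low-order bits
}

// _accuracy_safe-style product: all 9 cross terms are accumulated.
float emulated_mul_safe(float a, float b) {
    float as[3], bs[3], acc = 0.0f;
    split3(a, as);
    split3(b, bs);
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            acc += as[i] * bs[j];
    return acc;
}
```

Under this model, the _accuracy_fast variant corresponds to skipping the three smallest cross terms (i + j >= 3), and _accuracy_low to splitting each operand into two chunks and skipping the single smallest term.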

Modules

 Emulated Multiply-accumulate of 16b x 32b datatypes
 Matrix multiplications in which matrix A has data elements of 16 bit and matrix B has data elements of 32 bit. These operations are emulated on top of Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance.
 
 Emulated Multiply-accumulate of 32b x 16b datatypes
 Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 16 bit. These operations are emulated on top of Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance.
 
 Emulated Multiply-accumulate of 32b x 32b datatypes
 Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 32 bit. These operations are emulated on top of Multiply-accumulate of 32b x 16b integer datatypes and Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance.
 
 Emulated Multiply-accumulate of Complex 32b x Complex 32b datatypes
 Matrix multiplications in which matrix A has data elements of complex 32 bit and matrix B has data elements of complex 32 bit. These operations are emulated on top of Multiply-accumulate of 32b x 16b complex integer datatypes and might not have optimal performance.
 
 Multiply-accumulate of 16b x 16b complex integer datatypes
 Matrix multiplications in which matrix A and matrix B have complex data elements of 16 bit. For an explanation of how these operations work, see Multiply Accumulate.
 
 Multiply-accumulate of 16b x 16b integer datatypes
 Matrix multiplications in which matrix A and matrix B have data elements of 16 bit.
 
 Multiply-accumulate of 16b x 8b integer datatypes
 Matrix multiplications in which matrix A has data elements of 16 bit and matrix B has data elements of 8 bit.
 
 Multiply-accumulate of 32b x 16b complex integer datatypes
 Matrix multiplications in which matrix A has complex data elements of 32 bit and matrix B has complex data elements of 16 bit.
 
 Multiply-accumulate of 32b x 16b integer datatypes
 Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 16 bit.
 
 Multiply-accumulate of 8b x 4b datatypes
 Matrix multiplications in which matrix A has data elements of 8 bit and matrix B has data elements of 4 bit.
 
 Multiply-accumulate of 8b x 8b integer datatypes
 Matrix multiplications in which matrix A and matrix B have data elements of 8 bit.
 
 Multiply-accumulate of bfloat16 datatypes
 Matrix multiplications in which matrix A and B have bfloat16 data elements.
 
 Multiply-accumulate of fp32 x fp32 datatypes
 Element-wise multiplication and matrix multiplication using the bfloat16 data path. Two options are available: with or without calling set_rnd(0) for truncation before using these intrinsics. Use the AIE2_FP32_EMULATION_SET_RND_MODE flag to set the rounding mode to truncation. For an explanation of how these operations work, see Multiply Accumulate.
 
 Multiply-accumulate with a sparse matrix
 Matrix multiplications in which matrix B is a sparse matrix.
 