AI Engine-ML Intrinsics User Guide  (v2023.2)
Multiply-accumulate of fp32 x fp32 datatypes

Overview

Elementwise multiplication and matrix multiplication of fp32 data, emulated on the bfloat16 datapath. Two options are available: with or without set_rnd(0) (truncation) before using these intrinsics. Define the AIE2_FP32_EMULATION_SET_RND_MODE flag to set the rounding mode to truncation. For an explanation of how these operations work, see Multiply Accumulate.
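
As a rough illustration of why truncation fits this emulation, the following host-side C++ sketch (plain C++, not AIE code) splits an fp32 value into three bfloat16-precision terms by repeated truncation and checks that they sum back to the original value. The helper names trunc_to_bf16, split3, rebuild and check_roundtrip are hypothetical and are not part of the intrinsics API; how the hardware actually extracts the terms is an assumption here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Truncate an fp32 value to bfloat16 precision (round toward zero) by
// clearing the low 16 bits. The result is still held in a float.
inline float trunc_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}

// Three bfloat16-precision terms, most significant first (hypothetical type).
struct Bf16Split3 {
    float hi;
    float mid;
    float lo;
};

// Peel off 8 significant bits at a time: 3 x 8 = 24 bits, which is the
// full fp32 significand, so nothing is lost for finite inputs.
inline Bf16Split3 split3(float x) {
    Bf16Split3 s;
    s.hi = trunc_to_bf16(x);
    float r = x - s.hi;              // exact: only the low 16 bits remain
    s.mid = trunc_to_bf16(r);
    s.lo = trunc_to_bf16(r - s.mid); // at most 8 significant bits are left
    return s;
}

// Add the terms back together, smallest first.
inline float rebuild(const Bf16Split3& s) {
    return (s.lo + s.mid) + s.hi;
}

// Self-check for a finite input: the three terms reproduce x exactly.
inline void check_roundtrip(float x) {
    assert(rebuild(split3(x)) == x);
}
```

Calling check_roundtrip(3.14159274f), for example, exercises the split and the reconstruction for one value.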

Element-wise multiplication using bf16 data-path


v16accfloat mul_elem_16 (v16float v1, v16float v2)
 Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). By default, mul_elem_16 behaves the same as mul_elem_16_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_16 to mul_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_16 to mul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat mul_elem_16_accuracy_low (v16float v1, v16float v2)
 Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, which involves the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_16 to mul_elem_16_accuracy_low.
 
v16accfloat mul_elem_16_accuracy_fast (v16float v1, v16float v2)
 Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are discarded to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_16 to mul_elem_16_accuracy_fast.
 
v16accfloat mul_elem_16_accuracy_safe (v16float v1, v16float v2)
 Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_16 behaves the same as mul_elem_16_accuracy_safe.
 
v8caccfloat mul_elem_8 (v8float v1, v8cfloat v2)
 Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mul_elem_8 (v8cfloat v1, v8float v2)
 Elementwise multiplication of cfloat and fp32 data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mul_elem_8 (v8cfloat v1, v8cfloat v2)
 Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mul_elem_8_accuracy_low (v8float v1, v8cfloat v2)
 Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, which involves the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low.
 
v8caccfloat mul_elem_8_accuracy_low (v8cfloat v1, v8float v2)
 Elementwise multiplication of cfloat and fp32 data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, which involves the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low.
 
v8caccfloat mul_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2)
 Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, which involves the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low.
 
v8caccfloat mul_elem_8_accuracy_fast (v8float v1, v8cfloat v2)
 Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are discarded to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast.
 
v8caccfloat mul_elem_8_accuracy_fast (v8cfloat v1, v8float v2)
 Elementwise multiplication of cfloat and fp32 data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are discarded to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast.
 
v8caccfloat mul_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2)
 Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are discarded to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast.
 
v8caccfloat mul_elem_8_accuracy_safe (v8float v1, v8cfloat v2)
 Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe.
 
v8caccfloat mul_elem_8_accuracy_safe (v8cfloat v1, v8float v2)
 Elementwise multiplication of cfloat and fp32 data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe.
 
v8caccfloat mul_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2)
 Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe.
 
v16accfloat negmul_elem_16 (v16float v1, v16float v2)
 Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of the result. By default, negmul_elem_16 behaves the same as neg(mul_elem_16_accuracy_safe). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_16 to negmul_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_16 to negmul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat negmul_elem_8 (v8float v1, v8cfloat v2)
 Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat negmul_elem_8 (v8cfloat v1, v8float v2)
 Elementwise multiplication of cfloat and fp32 data elements (emulation using bf16 datapath) and negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat negmul_elem_8 (v8cfloat v1, v8cfloat v2)
 Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat negmul_elem_16_accuracy_low (v16float v1, v16float v2)
 Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, which involves the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_16 to negmul_elem_16_accuracy_low.
 
v8caccfloat negmul_elem_8_accuracy_low (v8float v1, v8cfloat v2)
 Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, which involves the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low.
 
v8caccfloat negmul_elem_8_accuracy_low (v8cfloat v1, v8float v2)
 Elementwise multiplication of cfloat and fp32 data elements (emulation using bf16 datapath) and negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, which involves the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low.
 
v8caccfloat negmul_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2)
 Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, which involves the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low.
 
v16accfloat negmul_elem_16_accuracy_fast (v16float v1, v16float v2)
 Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are discarded to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_16 to negmul_elem_16_accuracy_fast.
 
v8caccfloat negmul_elem_8_accuracy_fast (v8float v1, v8cfloat v2)
 Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are discarded to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast.
 
v8caccfloat negmul_elem_8_accuracy_fast (v8cfloat v1, v8float v2)
 Elementwise multiplication of cfloat and fp32 data elements (emulation using bf16 datapath) and negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are discarded to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast.
 
v8caccfloat negmul_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2)
 Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are discarded to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast.
 
v16accfloat negmul_elem_16_accuracy_safe (v16float v1, v16float v2)
 Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_16 behaves the same as negmul_elem_16_accuracy_safe.
 
v8caccfloat negmul_elem_8_accuracy_safe (v8float v1, v8cfloat v2)
 Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_8 behaves the same as negmul_elem_8_accuracy_safe.
 
v8caccfloat negmul_elem_8_accuracy_safe (v8cfloat v1, v8float v2)
 Elementwise multiplication of cfloat and fp32 data elements (emulation using bf16 datapath) and negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_8 behaves the same as negmul_elem_8_accuracy_safe.
 
v8caccfloat negmul_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2)
 Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_8 behaves the same as negmul_elem_8_accuracy_safe.
 
v16accfloat mac_elem_16 (v16float v1, v16float v2, v16accfloat acc)
 Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). By default, mac_elem_16 behaves the same as mac_elem_16_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_16 to mac_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_16 to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8 (v8float v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8 (v8cfloat v1, v8float v2, v8caccfloat acc)
 Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using bf16 datapath). By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8 (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath). By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat mac_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc)
 Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). By default, mac_elem_16 behaves the same as mac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_safe (v8float v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_safe (v8cfloat v1, v8float v2, v8caccfloat acc)
 Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat mac_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc)
 Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_16 to mac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_fast (v8float v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_fast (v8cfloat v1, v8float v2, v8caccfloat acc)
 Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat mac_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc)
 Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_16 to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_low (v8float v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_low (v8cfloat v1, v8float v2, v8caccfloat acc)
 Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat mac_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmac_elem_16 (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
 Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). By default, addmac_elem_16 behaves the same as addmac_elem_16_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map addmac_elem_16 to addmac_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map addmac_elem_16 to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmac_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
 Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). By default, addmac_elem_16 behaves the same as addmac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmac_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
 Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map addmac_elem_16 to addmac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmac_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
 Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map addmac_elem_16 to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat msc_elem_16 (v16float v1, v16float v2, v16accfloat acc)
 Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). By default, msc_elem_16 behaves the same as msc_elem_16_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 to msc_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 to msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8 (v8float v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8 (v8cfloat v1, v8float v2, v8caccfloat acc)
 Elementwise multiply-subtract of cfloat and fp32 data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8 (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat msc_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc)
 Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). By default, msc_elem_16 behaves the same as msc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_safe (v8float v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_safe (v8cfloat v1, v8float v2, v8caccfloat acc)
 Elementwise multiply-subtract of cfloat and fp32 data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat msc_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc)
 Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 to msc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_fast (v8float v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_fast (v8cfloat v1, v8float v2, v8caccfloat acc)
 Elementwise multiply-subtract of cfloat and fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat msc_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc)
 Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 to msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_low (v8float v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_low (v8cfloat v1, v8float v2, v8caccfloat acc)
 Elementwise multiply-subtract of cfloat and fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v8caccfloat msc_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
 Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmsc_elem_16 (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
 Elementwise multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). By default, addmsc_elem_16 behaves the same as addmsc_elem_16_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmsc_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
 Elementwise multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). By default, addmsc_elem_16 behaves the same as addmsc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmsc_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
 Elementwise multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmsc_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
 Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_elem_16 intrinsic to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
 

Matrix multiplication using bf16 data-path


v16accfloat mul_4x8_8x4 (v32float v1, v32float v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of mul_4x8_8x4 is the same as mul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mul_4x8_8x4 intrinsic to mul_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mul_4x8_8x4 intrinsic to mul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v4caccfloat mul_2x8_8x2 (v16float v1, v16cfloat v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat mul_4x8_8x4_accuracy_safe (v32float v1, v32float v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the mul_4x8_8x4 intrinsic is the same as the mul_4x8_8x4_accuracy_safe intrinsic.
 
v4caccfloat mul_2x8_8x2_accuracy_safe (v16float v1, v16cfloat v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat mul_4x8_8x4_accuracy_fast (v32float v1, v32float v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mul_4x8_8x4 intrinsic to the mul_4x8_8x4_accuracy_fast intrinsic.
 
v4caccfloat mul_2x8_8x2_accuracy_fast (v16float v1, v16cfloat v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat mul_4x8_8x4_accuracy_low (v32float v1, v32float v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath; 16 bits of the mantissa are used). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mul_4x8_8x4 intrinsic to mul_4x8_8x4_accuracy_low.
 
v4caccfloat mul_2x8_8x2_accuracy_low (v16float v1, v16cfloat v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat negmul_4x8_8x4 (v32float v1, v32float v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of negmul_4x8_8x4 is the same as negmul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_4x8_8x4 intrinsic to negmul_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_4x8_8x4 intrinsic to negmul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat negmul_4x8_8x4_accuracy_safe (v32float v1, v32float v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the negmul_4x8_8x4 intrinsic is the same as the negmul_4x8_8x4_accuracy_safe intrinsic.
 
v16accfloat negmul_4x8_8x4_accuracy_fast (v32float v1, v32float v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_4x8_8x4 intrinsic to the negmul_4x8_8x4_accuracy_fast intrinsic.
 
v16accfloat negmul_4x8_8x4_accuracy_low (v32float v1, v32float v2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_4x8_8x4 intrinsic to the negmul_4x8_8x4_accuracy_low intrinsic.
 
v16accfloat mac_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of mac_4x8_8x4 is the same as mac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_4x8_8x4 intrinsic to mac_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_4x8_8x4 intrinsic to mac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat mac_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the mac_4x8_8x4 intrinsic is the same as the mac_4x8_8x4_accuracy_safe intrinsic.
 
v16accfloat mac_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_4x8_8x4 intrinsic to the mac_4x8_8x4_accuracy_fast intrinsic.
 
v16accfloat mac_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_4x8_8x4 intrinsic to the mac_4x8_8x4_accuracy_low intrinsic.
 
v16accfloat addmac_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of addmac_4x8_8x4 is the same as addmac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_4x8_8x4 intrinsic to addmac_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_4x8_8x4 intrinsic to addmac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmac_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the addmac_4x8_8x4 intrinsic is the same as the addmac_4x8_8x4_accuracy_safe intrinsic.
 
v16accfloat addmac_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_4x8_8x4 intrinsic to the addmac_4x8_8x4_accuracy_fast intrinsic.
 
v16accfloat addmac_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_4x8_8x4 intrinsic to the addmac_4x8_8x4_accuracy_low intrinsic.
 
v16accfloat msc_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of msc_4x8_8x4 is the same as msc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the msc_4x8_8x4 intrinsic to msc_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the msc_4x8_8x4 intrinsic to msc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat msc_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the msc_4x8_8x4 intrinsic is the same as the msc_4x8_8x4_accuracy_safe intrinsic.
 
v16accfloat msc_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the msc_4x8_8x4 intrinsic to the msc_4x8_8x4_accuracy_fast intrinsic.
 
v16accfloat msc_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the msc_4x8_8x4 intrinsic to the msc_4x8_8x4_accuracy_low intrinsic.
 
v16accfloat addmsc_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of addmsc_4x8_8x4 is the same as addmsc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_4x8_8x4 intrinsic to addmsc_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_4x8_8x4 intrinsic to addmsc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
 
v16accfloat addmsc_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the addmsc_4x8_8x4 intrinsic is the same as the addmsc_4x8_8x4_accuracy_safe intrinsic.
 
v16accfloat addmsc_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_4x8_8x4 intrinsic to the addmsc_4x8_8x4_accuracy_fast intrinsic.
 
v16accfloat addmsc_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
 Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_4x8_8x4 intrinsic to the addmsc_4x8_8x4_accuracy_low intrinsic.
 

Function Documentation

◆ addmac_4x8_8x4()

v16accfloat addmac_4x8_8x4 ( v32float  v1,
v32float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of addmac_4x8_8x4 is the same as addmac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_4x8_8x4 intrinsic to addmac_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_4x8_8x4 intrinsic to addmac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1    Vector v1
v2    Vector v2
acc1  Accumulator 1 input
acc2  Accumulator 2 input
Returns
Result of operation
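
As a rough illustration of the data shapes involved, the following scalar sketch multiplies a 4x8 matrix by an 8x4 matrix and adds the product to the sum of two 4x4 accumulators. It is not the intrinsic: the name addmac_4x8_8x4_ref, the row-major layout, and the reading of the operation as acc1 + acc2 + v1 x v2 (by analogy with the acc1+acc2-mul_out note on addmsc_elem_16) are all assumptions, and the arithmetic is plain fp32 rather than the bf16 emulation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar model; dimensions follow the 4x8_8x4 suffix.
constexpr std::size_t M = 4, K = 8, N = 4;

// Assumes row-major storage: a is 4x8 (32 values), b is 8x4 (32 values),
// and each accumulator is 4x4 (16 values).
std::array<float, M * N> addmac_4x8_8x4_ref(const std::array<float, M * K>& a,
                                            const std::array<float, K * N>& b,
                                            const std::array<float, M * N>& acc1,
                                            const std::array<float, M * N>& acc2) {
    std::array<float, M * N> out{};
    for (std::size_t r = 0; r < M; ++r) {
        for (std::size_t c = 0; c < N; ++c) {
            // Start from the sum of both accumulators (assumed semantics).
            float sum = acc1[r * N + c] + acc2[r * N + c];
            for (std::size_t k = 0; k < K; ++k) {
                sum += a[r * K + k] * b[k * N + c];
            }
            out[r * N + c] = sum;
        }
    }
    return out;
}
```

A model like this can serve as a host-side reference when comparing the outputs of the safe, fast, and low variants.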

◆ addmac_4x8_8x4_accuracy_fast()

v16accfloat addmac_4x8_8x4_accuracy_fast ( v32float  v1,
v32float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_4x8_8x4 intrinsic to the addmac_4x8_8x4_accuracy_fast intrinsic.

Parameters
v1    Vector v1
v2    Vector v2
acc1  Accumulator 1 input
acc2  Accumulator 2 input
Returns
Result of operation

◆ addmac_4x8_8x4_accuracy_low()

v16accfloat addmac_4x8_8x4_accuracy_low ( v32float  v1,
v32float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_4x8_8x4 intrinsic to the addmac_4x8_8x4_accuracy_low intrinsic.

Parameters
v1    Vector v1
v2    Vector v2
acc1  Accumulator 1 input
acc2  Accumulator 2 input
Returns
Result of operation

◆ addmac_4x8_8x4_accuracy_safe()

v16accfloat addmac_4x8_8x4_accuracy_safe ( v32float  v1,
v32float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the addmac_4x8_8x4 intrinsic is the same as the addmac_4x8_8x4_accuracy_safe intrinsic.

Parameters
v1    Vector v1
v2    Vector v2
acc1  Accumulator 1 input
acc2  Accumulator 2 input
Returns
Result of operation

◆ addmac_elem_16()

v16accfloat addmac_elem_16 ( v16float  v1,
v16float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the addmac_elem_16 intrinsic is the same as the addmac_elem_16_accuracy_safe intrinsic. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_elem_16 intrinsic to addmac_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_elem_16 intrinsic to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc1  Accumulator 1 input
acc2  Accumulator 2 input
v1    Vector v1
v2    Vector v2
Returns
Result of the elementwise multiplication and accumulation
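
The flag-based mapping described above can be pictured as a compile-time selection. The sketch below only reports which variant name the generic addmac_elem_16 would resolve to; the helper selected_accuracy_variant is hypothetical, and the precedence chosen when both flags are defined is an assumption, since the text does not specify it.

```cpp
#include <string>

// Hypothetical helper: returns the name of the variant that the generic
// intrinsic would map to under the flags visible at compile time.
// Precedence when both flags are defined is an assumption.
inline std::string selected_accuracy_variant() {
#if defined(AIE2_FP32_EMULATION_ACCURACY_FAST)
    return "addmac_elem_16_accuracy_fast";
#elif defined(AIE2_FP32_EMULATION_ACCURACY_LOW)
    return "addmac_elem_16_accuracy_low";
#else
    // No flag defined: the default maps to the accuracy_safe variant.
    return "addmac_elem_16_accuracy_safe";
#endif
}
```

The flags would typically be supplied as preprocessor definitions before the intrinsics header is included, or on the compiler command line if the toolchain accepts such definitions.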

◆ addmac_elem_16_accuracy_fast()

v16accfloat addmac_elem_16_accuracy_fast ( v16float  v1,
v16float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_elem_16 intrinsic to addmac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc1  Accumulator 1 input
acc2  Accumulator 2 input
v1    Vector v1
v2    Vector v2
Returns
Result of the elementwise multiplication and accumulation

◆ addmac_elem_16_accuracy_low()

v16accfloat addmac_elem_16_accuracy_low ( v16float  v1,
v16float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_elem_16 intrinsic to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc1  Accumulator 1 input
acc2  Accumulator 2 input
v1    Vector v1
v2    Vector v2
Returns
Result of the elementwise multiplication and accumulation

◆ addmac_elem_16_accuracy_safe()

v16accfloat addmac_elem_16_accuracy_safe ( v16float  v1,
v16float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the addmac_elem_16 intrinsic is the same as the addmac_elem_16_accuracy_safe intrinsic. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc1  Accumulator 1 input
acc2  Accumulator 2 input
v1    Vector v1
v2    Vector v2
Returns
Result of the elementwise multiplication and accumulation

◆ addmsc_4x8_8x4()

v16accfloat addmsc_4x8_8x4 ( v32float  v1,
v32float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of addmsc_4x8_8x4 is the same as addmsc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_4x8_8x4 intrinsic to addmsc_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_4x8_8x4 intrinsic to addmsc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1    Vector v1
v2    Vector v2
acc1  Accumulator 1 input
acc2  Accumulator 2 input
Returns
Result of operation

◆ addmsc_4x8_8x4_accuracy_fast()

v16accfloat addmsc_4x8_8x4_accuracy_fast ( v32float  v1,
v32float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_4x8_8x4 intrinsic to the addmsc_4x8_8x4_accuracy_fast intrinsic.

Parameters
v1    Vector v1
v2    Vector v2
acc1  Accumulator 1 input
acc2  Accumulator 2 input
Returns
Result of operation

◆ addmsc_4x8_8x4_accuracy_low()

v16accfloat addmsc_4x8_8x4_accuracy_low ( v32float  v1,
v32float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_4x8_8x4 intrinsic to the addmsc_4x8_8x4_accuracy_low intrinsic.

Parameters
v1    Vector v1
v2    Vector v2
acc1  Accumulator 1 input
acc2  Accumulator 2 input
Returns
Result of operation

◆ addmsc_4x8_8x4_accuracy_safe()

v16accfloat addmsc_4x8_8x4_accuracy_safe ( v32float  v1,
v32float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the addmsc_4x8_8x4 intrinsic is the same as the addmsc_4x8_8x4_accuracy_safe intrinsic.

Parameters
v1    Vector v1
v2    Vector v2
acc1  Accumulator 1 input
acc2  Accumulator 2 input
Returns
Result of operation

◆ addmsc_elem_16()

v16accfloat addmsc_elem_16 ( v16float  v1,
v16float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the addmsc_elem_16 intrinsic is the same as the addmsc_elem_16_accuracy_safe intrinsic. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_elem_16 intrinsic to addmsc_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_elem_16 intrinsic to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc1  Accumulator 1 input
acc2  Accumulator 2 input
v1    Vector v1
v2    Vector v2
Returns
Result of the elementwise multiplication subtracted from the accumulator sum (acc1 + acc2 - mul_out)
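
The return expression acc1 + acc2 - mul_out can be written out per lane. This is a minimal scalar sketch in plain fp32, assuming 16 independent lanes as suggested by the v16float and v16accfloat types; the names Vec16 and addmsc_elem_16_ref are hypothetical and the bf16 emulation itself is not modeled.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 16;
using Vec16 = std::array<float, kLanes>;  // stand-in for v16float / v16accfloat

// Hypothetical per-lane reference: out = acc1 + acc2 - v1 * v2.
Vec16 addmsc_elem_16_ref(const Vec16& v1, const Vec16& v2,
                         const Vec16& acc1, const Vec16& acc2) {
    Vec16 out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        out[i] = acc1[i] + acc2[i] - v1[i] * v2[i];
    }
    return out;
}
```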

◆ addmsc_elem_16_accuracy_fast()

v16accfloat addmsc_elem_16_accuracy_fast ( v16float  v1,
v16float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_elem_16 intrinsic to addmsc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc1  Accumulator 1 input
acc2  Accumulator 2 input
v1    Vector v1
v2    Vector v2
Returns
Result of the elementwise multiplication subtracted from the accumulator sum (acc1 + acc2 - mul_out)

◆ addmsc_elem_16_accuracy_low()

v16accfloat addmsc_elem_16_accuracy_low ( v16float  v1,
v16float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_elem_16 intrinsic to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc1  Accumulator 1 input
acc2  Accumulator 2 input
v1    Vector v1
v2    Vector v2
Returns
Result of the elementwise multiplication subtracted from the accumulator sum (acc1 + acc2 - mul_out)

◆ addmsc_elem_16_accuracy_safe()

v16accfloat addmsc_elem_16_accuracy_safe ( v16float  v1,
v16float  v2,
v16accfloat  acc1,
v16accfloat  acc2 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the addmsc_elem_16 intrinsic is the same as addmsc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc1  Accumulator 1 input
acc2  Accumulator 2 input
v1    Vector v1
v2    Vector v2
Returns
Result of the elementwise multiplication subtracted from the accumulator sum (acc1 + acc2 - mul_out)

◆ mac_4x8_8x4()

v16accfloat mac_4x8_8x4 ( v32float  v1,
v32float  v2,
v16accfloat  acc 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of mac_4x8_8x4 is the same as mac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_4x8_8x4 intrinsic to mac_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_4x8_8x4 intrinsic to mac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1   Vector v1
v2   Vector v2
acc  Accumulator input
Returns
Result of operation

◆ mac_4x8_8x4_accuracy_fast()

v16accfloat mac_4x8_8x4_accuracy_fast ( v32float  v1,
v32float  v2,
v16accfloat  acc 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_4x8_8x4 intrinsic to the mac_4x8_8x4_accuracy_fast intrinsic.

Parameters
v1   Vector v1
v2   Vector v2
acc  Accumulator input
Returns
Result of operation

◆ mac_4x8_8x4_accuracy_low()

v16accfloat mac_4x8_8x4_accuracy_low ( v32float  v1,
v32float  v2,
v16accfloat  acc 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_4x8_8x4 intrinsic to the mac_4x8_8x4_accuracy_low intrinsic.

Parameters
v1   Vector v1
v2   Vector v2
acc  Accumulator input
Returns
Result of operation

◆ mac_4x8_8x4_accuracy_safe()

v16accfloat mac_4x8_8x4_accuracy_safe ( v32float  v1,
v32float  v2,
v16accfloat  acc 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the mac_4x8_8x4 intrinsic is the same as the mac_4x8_8x4_accuracy_safe intrinsic.

Parameters
v1   Vector v1
v2   Vector v2
acc  Accumulator input
Returns
Result of operation

◆ mac_elem_16()

v16accfloat mac_elem_16 ( v16float  v1,
v16float  v2,
v16accfloat  acc 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the mac_elem_16 intrinsic is the same as the mac_elem_16_accuracy_safe intrinsic. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_16 intrinsic to mac_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_16 intrinsic to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Result of the elementwise multiplication and accumulation

◆ mac_elem_16_accuracy_fast()

v16accfloat mac_elem_16_accuracy_fast ( v16float  v1,
v16float  v2,
v16accfloat  acc 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_16 intrinsic to mac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Result of the elementwise multiplication and accumulation

◆ mac_elem_16_accuracy_low()

v16accfloat mac_elem_16_accuracy_low ( v16float  v1,
v16float  v2,
v16accfloat  acc 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_16 intrinsic to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Result of the elementwise multiplication and accumulation

◆ mac_elem_16_accuracy_safe()

v16accfloat mac_elem_16_accuracy_safe ( v16float  v1,
v16float  v2,
v16accfloat  acc 
)

Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of the mac_elem_16 intrinsic is the same as the mac_elem_16_accuracy_safe intrinsic. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Result of the elementwise multiplication and accumulation

◆ mac_elem_8() [1/3]

v8caccfloat mac_elem_8 ( v8cfloat  v1,
v8cfloat  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). The default behavior of the mac_elem_8 intrinsic is the same as the mac_elem_8_accuracy_safe intrinsic. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_8 intrinsic to mac_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_8 intrinsic to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8() [2/3]

v8caccfloat mac_elem_8 ( v8cfloat  v1,
v8float  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). The default behavior of the mac_elem_8 intrinsic is the same as the mac_elem_8_accuracy_safe intrinsic. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_8 intrinsic to mac_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_8 intrinsic to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8() [3/3]

v8caccfloat mac_elem_8 ( v8float  v1,
v8cfloat  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). The default behavior of the mac_elem_8 intrinsic is the same as the mac_elem_8_accuracy_safe intrinsic. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_8 intrinsic to mac_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_8 intrinsic to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result
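
The three mac_elem_8 overloads differ only in which operand is complex. A scalar sketch using std::complex<float> makes the operand combinations concrete. The type aliases and the function name mac_elem_8_ref are hypothetical, the semantics acc + v1 * v2 per lane are assumed from the mac naming, and the bf16 emulation is not modeled.

```cpp
#include <array>
#include <complex>
#include <cstddef>

using cfloat = std::complex<float>;          // stand-in for one cfloat lane
constexpr std::size_t kLanes8 = 8;
using CVec8 = std::array<cfloat, kLanes8>;   // stand-in for v8cfloat / v8caccfloat
using FVec8 = std::array<float, kLanes8>;    // stand-in for v8float

// [1/3] complex x complex
CVec8 mac_elem_8_ref(const CVec8& v1, const CVec8& v2, const CVec8& acc) {
    CVec8 out{};
    for (std::size_t i = 0; i < kLanes8; ++i) out[i] = acc[i] + v1[i] * v2[i];
    return out;
}

// [2/3] complex x real
CVec8 mac_elem_8_ref(const CVec8& v1, const FVec8& v2, const CVec8& acc) {
    CVec8 out{};
    for (std::size_t i = 0; i < kLanes8; ++i) out[i] = acc[i] + v1[i] * v2[i];
    return out;
}

// [3/3] real x complex
CVec8 mac_elem_8_ref(const FVec8& v1, const CVec8& v2, const CVec8& acc) {
    CVec8 out{};
    for (std::size_t i = 0; i < kLanes8; ++i) out[i] = acc[i] + v1[i] * v2[i];
    return out;
}
```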

◆ mac_elem_8_accuracy_fast() [1/3]

v8caccfloat mac_elem_8_accuracy_fast ( v8cfloat  v1,
v8cfloat  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8_accuracy_fast() [2/3]

v8caccfloat mac_elem_8_accuracy_fast ( v8cfloat  v1,
v8float  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8_accuracy_fast() [3/3]

v8caccfloat mac_elem_8_accuracy_fast ( v8float  v1,
v8cfloat  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8_accuracy_low() [1/3]

v8caccfloat mac_elem_8_accuracy_low ( v8cfloat  v1,
v8cfloat  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8_accuracy_low() [2/3]

v8caccfloat mac_elem_8_accuracy_low ( v8cfloat  v1,
v8float  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8_accuracy_low() [3/3]

v8caccfloat mac_elem_8_accuracy_low ( v8float  v1,
v8cfloat  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8_accuracy_safe() [1/3]

v8caccfloat mac_elem_8_accuracy_safe ( v8cfloat  v1,
v8cfloat  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8_accuracy_safe() [2/3]

v8caccfloat mac_elem_8_accuracy_safe ( v8cfloat  v1,
v8float  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ mac_elem_8_accuracy_safe() [3/3]

v8caccfloat mac_elem_8_accuracy_safe ( v8float  v1,
v8cfloat  v2,
v8caccfloat  acc 
)

Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc  Accumulator input
v1   Vector v1
v2   Vector v2
Returns
Elementwise multiplication result

◆ msc_4x8_8x4()

v16accfloat msc_4x8_8x4 ( v32float  v1,
v32float  v2,
v16accfloat  acc 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). The default behavior of msc_4x8_8x4 is the same as msc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the msc_4x8_8x4 intrinsic to msc_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the msc_4x8_8x4 intrinsic to msc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1   Vector v1
v2   Vector v2
acc  Accumulator input
Returns
Result of operation
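
For the multiply-subtract form, a scalar sketch of the assumed semantics may help when building a host-side check. Here the matrix product is subtracted from the accumulator, by analogy with the (acc - mul_out) note on msc_elem_16. The function name msc_4x8_8x4_ref and the row-major layout are assumptions, and the arithmetic is plain fp32 rather than the bf16 emulation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar model: out = acc - (A x B), with A 4x8 and B 8x4,
// both assumed to be stored row-major.
std::array<float, 16> msc_4x8_8x4_ref(const std::array<float, 32>& a,
                                      const std::array<float, 32>& b,
                                      const std::array<float, 16>& acc) {
    constexpr std::size_t rows = 4, inner = 8, cols = 4;
    std::array<float, 16> out{};
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            float dot = 0.0f;
            for (std::size_t k = 0; k < inner; ++k) {
                dot += a[r * inner + k] * b[k * cols + c];
            }
            out[r * cols + c] = acc[r * cols + c] - dot;
        }
    }
    return out;
}
```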

◆ msc_4x8_8x4_accuracy_fast()

v16accfloat msc_4x8_8x4_accuracy_fast ( v32float  v1,
v32float  v2,
v16accfloat  acc 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the msc_4x8_8x4 intrinsic to the msc_4x8_8x4_accuracy_fast intrinsic.

Parameters
v1   Vector v1
v2   Vector v2
acc  Accumulator input
Returns
Result of operation

◆ msc_4x8_8x4_accuracy_low()

v16accfloat msc_4x8_8x4_accuracy_low ( v32float  v1,
v32float  v2,
v16accfloat  acc 
)

Matrix multiplication of fp32 data elements (emulation using the bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the msc_4x8_8x4 intrinsic to the msc_4x8_8x4_accuracy_low intrinsic.

Parameters
v1   Vector v1
v2   Vector v2
acc  Accumulator input
Returns
Result of operation

◆ msc_4x8_8x4_accuracy_safe()

v16accfloat msc_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc)

Matrix multiply-subtract of fp32 data elements (emulation using bf16 datapath). By default, msc_4x8_8x4 behaves the same as msc_4x8_8x4_accuracy_safe.

Parameters
v1: Input vector v1
v2: Input vector v2
acc: Accumulator input
Returns
Result of the operation

◆ msc_elem_16()

v16accfloat msc_elem_16 (v16float v1, v16float v2, v16accfloat acc)

Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). By default, msc_elem_16 behaves the same as msc_elem_16_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 onto msc_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 onto msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)
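
The documented result (acc - mul_out) can be written as a per-lane host reference. This is a minimal sketch in native fp32, with the hypothetical name msc_elem_16_ref and std::array standing in for the vector types; it does not model the bf16 emulation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side reference for the documented result (acc - mul_out).
// std::array<float, 16> stands in for v16float / v16accfloat.
std::array<float, 16> msc_elem_16_ref(const std::array<float, 16>& v1,
                                      const std::array<float, 16>& v2,
                                      const std::array<float, 16>& acc) {
    std::array<float, 16> out{};
    for (std::size_t lane = 0; lane < 16; ++lane) {
        // Each lane is independent: subtract the product from the accumulator.
        out[lane] = acc[lane] - v1[lane] * v2[lane];
    }
    return out;
}
```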

◆ msc_elem_16_accuracy_fast()

v16accfloat msc_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc)

Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 onto msc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_16_accuracy_low()

v16accfloat msc_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc)

Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 onto msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_16_accuracy_safe()

v16accfloat msc_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc)

Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). By default, msc_elem_16 behaves the same as msc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8() [1/3]

v8caccfloat msc_elem_8 (v8cfloat v1, v8cfloat v2, v8caccfloat acc)

Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 onto msc_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 onto msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)
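
For the complex overload, each lane involves a complex product before the subtraction. The sketch below spells this out on the host; msc_elem_8_ref is a hypothetical name, std::complex<float> stands in for cfloat, and native fp32 arithmetic is used instead of the bf16 emulation.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

using cfloat_ref = std::complex<float>;  // host stand-in for cfloat

// Hypothetical host-side reference: out = acc - v1 * v2 per complex lane.
std::array<cfloat_ref, 8> msc_elem_8_ref(const std::array<cfloat_ref, 8>& v1,
                                         const std::array<cfloat_ref, 8>& v2,
                                         const std::array<cfloat_ref, 8>& acc) {
    std::array<cfloat_ref, 8> out{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        const float ar = v1[lane].real(), ai = v1[lane].imag();
        const float br = v2[lane].real(), bi = v2[lane].imag();
        // A complex product needs four real multiplications.
        const float prod_re = ar * br - ai * bi;
        const float prod_im = ar * bi + ai * br;
        out[lane] = cfloat_ref(acc[lane].real() - prod_re,
                               acc[lane].imag() - prod_im);
    }
    return out;
}
```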

◆ msc_elem_8() [2/3]

v8caccfloat msc_elem_8 (v8cfloat v1, v8float v2, v8caccfloat acc)

Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 onto msc_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 onto msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8() [3/3]

v8caccfloat msc_elem_8 (v8float v1, v8cfloat v2, v8caccfloat acc)

Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 onto msc_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 onto msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_fast() [1/3]

v8caccfloat msc_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2, v8caccfloat acc)

Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 onto msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_fast() [2/3]

v8caccfloat msc_elem_8_accuracy_fast (v8cfloat v1, v8float v2, v8caccfloat acc)

Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 onto msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_fast() [3/3]

v8caccfloat msc_elem_8_accuracy_fast (v8float v1, v8cfloat v2, v8caccfloat acc)

Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 onto msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_low() [1/3]

v8caccfloat msc_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2, v8caccfloat acc)

Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 onto msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_low() [2/3]

v8caccfloat msc_elem_8_accuracy_low (v8cfloat v1, v8float v2, v8caccfloat acc)

Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 onto msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_low() [3/3]

v8caccfloat msc_elem_8_accuracy_low (v8float v1, v8cfloat v2, v8caccfloat acc)

Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 onto msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_safe() [1/3]

v8caccfloat msc_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2, v8caccfloat acc)

Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_safe() [2/3]

v8caccfloat msc_elem_8_accuracy_safe (v8cfloat v1, v8float v2, v8caccfloat acc)

Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ msc_elem_8_accuracy_safe() [3/3]

v8caccfloat msc_elem_8_accuracy_safe (v8float v1, v8cfloat v2, v8caccfloat acc)

Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
acc: Accumulator input
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiply and subtract result (acc - mul_out)

◆ mul_2x8_8x2()

v4caccfloat mul_2x8_8x2 (v16float v1, v16cfloat v2)

Matrix multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_2x8_8x2 behaves the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_2x8_8x2 onto mul_2x8_8x2_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_2x8_8x2 onto mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation
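
The operand sizes suggest a real 2x8 matrix multiplied by a complex 8x2 matrix, giving a complex 2x2 result. The sketch below is a host-side reference under that reading: mul_2x8_8x2_ref is a hypothetical name, row-major storage of both operands and of the result is an assumption, and std::complex<float> stands in for cfloat.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

using cfloat_ref = std::complex<float>;  // host stand-in for cfloat

// Hypothetical host-side reference, not the AIE intrinsic itself.
// Assumptions: v1 is a row-major real 2x8 matrix, v2 a row-major complex
// 8x2 matrix, and the result is a row-major complex 2x2 matrix.
std::array<cfloat_ref, 4> mul_2x8_8x2_ref(const std::array<float, 16>& v1,
                                          const std::array<cfloat_ref, 16>& v2) {
    std::array<cfloat_ref, 4> out{};
    for (std::size_t r = 0; r < 2; ++r) {
        for (std::size_t c = 0; c < 2; ++c) {
            cfloat_ref sum(0.0f, 0.0f);
            for (std::size_t k = 0; k < 8; ++k) {
                // Real times complex scales both components.
                sum += v1[r * 8 + k] * v2[k * 2 + c];
            }
            out[r * 2 + c] = sum;
        }
    }
    return out;
}
```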

◆ mul_2x8_8x2_accuracy_fast()

v4caccfloat mul_2x8_8x2_accuracy_fast (v16float v1, v16cfloat v2)

Matrix multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_2x8_8x2 onto mul_2x8_8x2_accuracy_fast (better performance at the risk of a slight reduction in accuracy). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ mul_2x8_8x2_accuracy_low()

v4caccfloat mul_2x8_8x2_accuracy_low (v16float v1, v16cfloat v2)

Matrix multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_2x8_8x2 onto mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ mul_2x8_8x2_accuracy_safe()

v4caccfloat mul_2x8_8x2_accuracy_safe (v16float v1, v16cfloat v2)

Matrix multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_2x8_8x2 behaves the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ mul_4x8_8x4()

v16accfloat mul_4x8_8x4 (v32float v1, v32float v2)

Matrix multiplication of fp32 data elements (emulation using bf16 datapath). By default, mul_4x8_8x4 behaves the same as mul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_4x8_8x4 onto mul_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_4x8_8x4 onto mul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation
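
The flag-based mapping described above can be pictured as a compile-time selection. The sketch below only illustrates that idea and is not taken from the intrinsics headers: variant_safe, variant_fast, variant_low, and selected_variant are hypothetical stand-ins, and rejecting the case where both flags are defined is a choice made for this sketch because the text does not state a precedence.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the three accuracy variants of one intrinsic.
inline std::string variant_safe() { return "safe"; }
inline std::string variant_fast() { return "fast"; }
inline std::string variant_low()  { return "low"; }

// The text does not say what happens when both flags are defined,
// so this sketch rejects that combination.
#if defined(AIE2_FP32_EMULATION_ACCURACY_FAST) && \
    defined(AIE2_FP32_EMULATION_ACCURACY_LOW)
#error "Define at most one accuracy flag in this sketch"
#endif

// Generic name that resolves to one variant at compile time.
inline std::string selected_variant() {
#if defined(AIE2_FP32_EMULATION_ACCURACY_FAST)
    return variant_fast();
#elif defined(AIE2_FP32_EMULATION_ACCURACY_LOW)
    return variant_low();
#else
    return variant_safe();  // documented default
#endif
}
```

With no flag defined, selected_variant() resolves to the safe variant. Passing -DAIE2_FP32_EMULATION_ACCURACY_FAST on the compiler command line would select the fast variant in this sketch.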

◆ mul_4x8_8x4_accuracy_fast()

v16accfloat mul_4x8_8x4_accuracy_fast (v32float v1, v32float v2)

Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_4x8_8x4 onto mul_4x8_8x4_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ mul_4x8_8x4_accuracy_low()

v16accfloat mul_4x8_8x4_accuracy_low (v32float v1, v32float v2)

Matrix multiplication of fp32 data elements (emulation using bf16 datapath; 16 mantissa bits are used). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_4x8_8x4 onto mul_4x8_8x4_accuracy_low.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ mul_4x8_8x4_accuracy_safe()

v16accfloat mul_4x8_8x4_accuracy_safe (v32float v1, v32float v2)

Matrix multiplication of fp32 data elements (emulation using bf16 datapath). By default, mul_4x8_8x4 behaves the same as mul_4x8_8x4_accuracy_safe.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ mul_elem_16()

v16accfloat mul_elem_16 (v16float v1, v16float v2)

Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). By default, mul_elem_16 behaves the same as mul_elem_16_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_16 onto mul_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_16 onto mul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_16_accuracy_fast()

v16accfloat mul_elem_16_accuracy_fast (v16float v1, v16float v2)

Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. The results of the 3 least significant MAC operations are ignored to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_16 onto mul_elem_16_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result
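
The effect of dropping the 3 least significant of the 9 partial products can be explored on a host with scalar code. The sketch below is an illustration only: truncate_to_bf16, split3, mul_fast_sketch, and relative_error are hypothetical helpers, the split is assumed to use truncation, and the dropped terms are assumed to be the ones pairing the two lower pieces (mid*lo, lo*mid, lo*lo).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Keep only the upper 16 bits of an fp32 value, i.e. a bfloat16-representable
// number. Assumption: truncation, not rounding, is used for the split.
float truncate_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

// Split x into three bfloat16-representable pieces, most significant first.
std::array<float, 3> split3(float x) {
    std::array<float, 3> p{};
    p[0] = truncate_to_bf16(x);
    p[1] = truncate_to_bf16(x - p[0]);
    p[2] = truncate_to_bf16(x - p[0] - p[1]);
    return p;
}

// Sum six of the nine partial products pa[i] * pb[j]. The three terms with
// i + j >= 3 are the least significant ones and are skipped.
float mul_fast_sketch(float a, float b) {
    const std::array<float, 3> pa = split3(a);
    const std::array<float, 3> pb = split3(b);
    float sum = 0.0f;
    for (int order = 2; order >= 0; --order) {  // smallest terms first
        for (int i = 0; i <= order; ++i) {
            sum += pa[i] * pb[order - i];
        }
    }
    return sum;
}

// Relative error of an fp32 result against a double-precision reference.
double relative_error(float got, double ref) {
    return std::fabs(static_cast<double>(got) - ref) / std::fabs(ref);
}
```

Under these assumptions the skipped terms are each roughly 2^-24 of the full product or smaller, so the result stays close to fp32 accuracy.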

◆ mul_elem_16_accuracy_low()

v16accfloat mul_elem_16_accuracy_low (v16float v1, v16float v2)

Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. The last MAC operation, which involves only the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_16 onto mul_elem_16_accuracy_low.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result
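
The two-piece split described above can be sketched in scalar host code. This is an illustration under assumptions, not the intrinsic: truncate_to_bf16, mul_low_sketch, and relative_error are hypothetical helpers, and the split is assumed to use truncation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Keep only the upper 16 bits of an fp32 value, i.e. a bfloat16-representable
// number. Assumption: truncation, not rounding, is used for the split.
float truncate_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

// Split each operand into a high and a low bfloat16 piece and sum three of
// the four partial products. The a_lo * b_lo term (LSBs only) is skipped.
float mul_low_sketch(float a, float b) {
    const float a_hi = truncate_to_bf16(a);
    const float a_lo = truncate_to_bf16(a - a_hi);
    const float b_hi = truncate_to_bf16(b);
    const float b_lo = truncate_to_bf16(b - b_hi);
    return a_hi * b_hi + (a_hi * b_lo + a_lo * b_hi);
}

// Relative error of an fp32 result against a double-precision reference.
double relative_error(float got, double ref) {
    return std::fabs(static_cast<double>(got) - ref) / std::fabs(ref);
}
```

With only two pieces, about 16 mantissa bits of each operand take part, so in this sketch the relative error is on the order of 1e-4 or smaller rather than fp32 precision.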

◆ mul_elem_16_accuracy_safe()

v16accfloat mul_elem_16_accuracy_safe (v16float v1, v16float v2)

Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_16 behaves the same as mul_elem_16_accuracy_safe.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result
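
Three bfloat16 pieces are enough because fp32 carries 24 significant bits and bfloat16 carries 8. The scalar host sketch below shows the split and the full 9-term sum; truncate_to_bf16, split3, mul_safe_sketch, and relative_error are hypothetical helpers, and truncation is assumed for the split.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Keep only the upper 16 bits of an fp32 value, i.e. a bfloat16-representable
// number. Assumption: truncation, not rounding, is used for the split.
float truncate_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

// Split x into three bfloat16-representable pieces, most significant first.
// For normal (non-denormal) inputs the pieces add back up to x exactly.
std::array<float, 3> split3(float x) {
    std::array<float, 3> p{};
    p[0] = truncate_to_bf16(x);
    p[1] = truncate_to_bf16(x - p[0]);
    p[2] = truncate_to_bf16(x - p[0] - p[1]);
    return p;
}

// Sum all nine partial products pa[i] * pb[j], smallest terms first.
float mul_safe_sketch(float a, float b) {
    const std::array<float, 3> pa = split3(a);
    const std::array<float, 3> pb = split3(b);
    float sum = 0.0f;
    for (int order = 4; order >= 0; --order) {
        for (int i = 0; i < 3; ++i) {
            const int j = order - i;
            if (j < 0 || j > 2) continue;
            sum += pa[i] * pb[j];
        }
    }
    return sum;
}

// Relative error of an fp32 result against a double-precision reference.
double relative_error(float got, double ref) {
    return std::fabs(static_cast<double>(got) - ref) / std::fabs(ref);
}
```

Each partial product of two bfloat16-representable values fits exactly in fp32, so in this sketch the only rounding comes from the fp32 additions.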

◆ mul_elem_8() [1/3]

v8caccfloat mul_elem_8 (v8cfloat v1, v8cfloat v2)

Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 onto mul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 onto mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8() [2/3]

v8caccfloat mul_elem_8 (v8cfloat v1, v8float v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 onto mul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 onto mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result
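
For the mixed cfloat and fp32 overload, each lane scales a complex value by a real one. A minimal host-side reference, with the hypothetical name mul_elem_8_ref, std::complex<float> standing in for cfloat, and native fp32 arithmetic instead of the bf16 emulation, could look like this:

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

using cfloat_ref = std::complex<float>;  // host stand-in for cfloat

// Hypothetical host-side reference: complex lane times real lane.
std::array<cfloat_ref, 8> mul_elem_8_ref(const std::array<cfloat_ref, 8>& v1,
                                         const std::array<float, 8>& v2) {
    std::array<cfloat_ref, 8> out{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        // Only two real multiplications are needed per lane.
        out[lane] = cfloat_ref(v1[lane].real() * v2[lane],
                               v1[lane].imag() * v2[lane]);
    }
    return out;
}
```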

◆ mul_elem_8() [3/3]

v8caccfloat mul_elem_8 (v8float v1, v8cfloat v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 onto mul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 onto mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_fast() [1/3]

v8caccfloat mul_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2)

Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. The results of the 3 least significant MAC operations are ignored to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 onto mul_elem_8_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_fast() [2/3]

v8caccfloat mul_elem_8_accuracy_fast (v8cfloat v1, v8float v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. The results of the 3 least significant MAC operations are ignored to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 onto mul_elem_8_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_fast() [3/3]

v8caccfloat mul_elem_8_accuracy_fast (v8float v1, v8cfloat v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. The results of the 3 least significant MAC operations are ignored to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 onto mul_elem_8_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_low() [1/3]

v8caccfloat mul_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2)

Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. The last MAC operation, which involves only the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 onto mul_elem_8_accuracy_low.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_low() [2/3]

v8caccfloat mul_elem_8_accuracy_low (v8cfloat v1, v8float v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. The last MAC operation, which involves only the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 onto mul_elem_8_accuracy_low.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_low() [3/3]

v8caccfloat mul_elem_8_accuracy_low (v8float v1, v8cfloat v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. The last MAC operation, which involves only the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 onto mul_elem_8_accuracy_low.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_safe() [1/3]

v8caccfloat mul_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2)

Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_safe() [2/3]

v8caccfloat mul_elem_8_accuracy_safe (v8cfloat v1, v8float v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ mul_elem_8_accuracy_safe() [3/3]

v8caccfloat mul_elem_8_accuracy_safe (v8float v1, v8cfloat v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Elementwise multiplication result

◆ negmul_4x8_8x4()

v16accfloat negmul_4x8_8x4 (v32float v1, v32float v2)

Matrix multiplication of fp32 data elements (emulation using bf16 datapath) with negation of the result. By default, negmul_4x8_8x4 behaves the same as negmul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_4x8_8x4 onto negmul_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_4x8_8x4 onto negmul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ negmul_4x8_8x4_accuracy_fast()

v16accfloat negmul_4x8_8x4_accuracy_fast (v32float v1, v32float v2)

Matrix multiplication of fp32 data elements (emulation using bf16 datapath) with negation of the result. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_4x8_8x4 onto negmul_4x8_8x4_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ negmul_4x8_8x4_accuracy_low()

v16accfloat negmul_4x8_8x4_accuracy_low (v32float v1, v32float v2)

Matrix multiplication of fp32 data elements (emulation using bf16 datapath) with negation of the result. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_4x8_8x4 onto negmul_4x8_8x4_accuracy_low.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ negmul_4x8_8x4_accuracy_safe()

v16accfloat negmul_4x8_8x4_accuracy_safe (v32float v1, v32float v2)

Matrix multiplication of fp32 data elements (emulation using bf16 datapath) with negation of the result. By default, negmul_4x8_8x4 behaves the same as negmul_4x8_8x4_accuracy_safe.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Result of the operation

◆ negmul_elem_16()

v16accfloat negmul_elem_16 (v16float v1, v16float v2)

Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) with negation of the result. By default, negmul_elem_16 behaves the same as neg(mul_elem_16_accuracy_safe). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_16 onto negmul_elem_16_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_16 onto negmul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result
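
The negated product can be checked against a simple per-lane host reference. The sketch below uses the hypothetical name negmul_elem_16_ref, std::array in place of the vector types, and native fp32 arithmetic rather than the bf16 emulation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side reference: out = -(v1 * v2) per lane.
// std::array<float, 16> stands in for v16float / v16accfloat.
std::array<float, 16> negmul_elem_16_ref(const std::array<float, 16>& v1,
                                         const std::array<float, 16>& v2) {
    std::array<float, 16> out{};
    for (std::size_t lane = 0; lane < 16; ++lane) {
        out[lane] = -(v1[lane] * v2[lane]);
    }
    return out;
}
```

Numerically this matches subtracting the product from a zero accumulator, apart from the sign of a zero result.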

◆ negmul_elem_16_accuracy_fast()

v16accfloat negmul_elem_16_accuracy_fast (v16float v1, v16float v2)

Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) with negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. The results of the 3 least significant MAC operations are ignored to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_16 onto negmul_elem_16_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_16_accuracy_low()

v16accfloat negmul_elem_16_accuracy_low (v16float v1, v16float v2)

Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) with negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. The last MAC operation, which involves only the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_16 onto negmul_elem_16_accuracy_low.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_16_accuracy_safe()

v16accfloat negmul_elem_16_accuracy_safe (v16float v1, v16float v2)

Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) with negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_16 behaves the same as negmul_elem_16_accuracy_safe.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8() [1/3]

v8caccfloat negmul_elem_8 (v8cfloat v1, v8cfloat v2)

Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) with negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 onto negmul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 onto negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8() [2/3]

v8caccfloat negmul_elem_8 (v8cfloat v1, v8float v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) with negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 onto negmul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 onto negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8() [3/3]

v8caccfloat negmul_elem_8 (v8float v1, v8cfloat v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) with negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 onto negmul_elem_8_accuracy_fast. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 onto negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8_accuracy_fast() [1/3]

v8caccfloat negmul_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2)

Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) with negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. The results of the 3 least significant MAC operations are ignored to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 onto negmul_elem_8_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8_accuracy_fast() [2/3]

v8caccfloat negmul_elem_8_accuracy_fast (v8cfloat v1, v8float v2)

Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) with negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. The results of the 3 least significant MAC operations are ignored to save cycles. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 onto negmul_elem_8_accuracy_fast.

Parameters
v1: Input vector v1
v2: Input vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8_accuracy_fast() [3/3]

v8caccfloat negmul_elem_8_accuracy_fast (v8float v1, v8cfloat v2)

Elementwise multiplication of fp32 (v1) and cfloat (v2) data elements (emulation using the bf16 datapath), with negation of the output. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. The implementation skips the 3 least significant of these MAC operations to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_8 intrinsic onto negmul_elem_8_accuracy_fast.

Parameters
v1  Vector v1
v2  Vector v2
Returns
Negated elementwise multiplication result
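
As a rough illustration of the accuracy_fast description above, the following host-side C++ sketch models a single real lane of the emulation. It is not the intrinsic's implementation. The helper names (bf16_trunc, split3, negmul_model), the truncation-based split, the choice of which 3 terms count as least significant (those with i + j > 2), and the double accumulator are all assumptions made for illustration. For cfloat operands, the same model would apply to each real product that makes up the complex result.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Truncate an fp32 value to bfloat16 precision by clearing its low 16 bits.
inline float bf16_trunc(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Hypothetical helper: three bf16-representable parts, most significant first.
struct Split3 { float part[3]; };

inline Split3 split3(float x) {
    assert(std::isfinite(x));  // the sketch only covers finite inputs
    Split3 s;
    s.part[0] = bf16_trunc(x);
    s.part[1] = bf16_trunc(x - s.part[0]);
    s.part[2] = bf16_trunc(x - s.part[0] - s.part[1]);
    return s;
}

// Negated product built from partial products part_a[i] * part_b[j]
// with i + j <= max_order.
//   max_order = 4 keeps all 9 terms (3*3).
//   max_order = 2 keeps 6 terms and drops (1,2), (2,1) and (2,2).
inline double negmul_model(float a, float b, int max_order) {
    const Split3 sa = split3(a);
    const Split3 sb = split3(b);
    double acc = 0.0;  // assumption: wide accumulator, for clarity only
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            if (i + j <= max_order)
                acc += static_cast<double>(sa.part[i]) *
                       static_cast<double>(sb.part[j]);
    return -acc;
}
```

Under these assumptions, negmul_model(a, b, 4) keeps all 9 partial products, while negmul_model(a, b, 2) drops the 3 smallest ones and stays within a relative error of roughly 2^-20 of the full product.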

◆ negmul_elem_8_accuracy_low() [1/3]

v8caccfloat negmul_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2)

Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath), with negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. The last MAC operation, which involves only the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic onto negmul_elem_8_accuracy_low.

Parameters
v1  Vector v1
v2  Vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8_accuracy_low() [2/3]

v8caccfloat negmul_elem_8_accuracy_low (v8cfloat v1, v8float v2)

Elementwise multiplication of cfloat (v1) and fp32 (v2) data elements (emulation using the bf16 datapath), with negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. The last MAC operation, which involves only the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic onto negmul_elem_8_accuracy_low.

Parameters
v1  Vector v1
v2  Vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8_accuracy_low() [3/3]

v8caccfloat negmul_elem_8_accuracy_low (v8float v1, v8cfloat v2)

Elementwise multiplication of fp32 (v1) and cfloat (v2) data elements (emulation using the bf16 datapath), with negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. The last MAC operation, which involves only the LSBs, is skipped to improve cycle count. Define the AIE2_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic onto negmul_elem_8_accuracy_low.

Parameters
v1  Vector v1
v2  Vector v2
Returns
Negated elementwise multiplication result
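
The accuracy_low description above can be sketched on the host in the same spirit. The code below is only a model of one real lane, not the intrinsic's implementation: the names (bf16_trunc, split2, negmul_low_model), the truncation-based two-way split, and the double accumulator are assumptions for illustration. It forms 3 of the 4 partial products and skips the lo * lo term, which involves only the LSBs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Truncate an fp32 value to bfloat16 precision by clearing its low 16 bits.
inline float bf16_trunc(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Hypothetical helper: two bf16-representable parts of an fp32 value.
// Mantissa bits below the second part are not represented at all.
struct Split2 { float hi; float lo; };

inline Split2 split2(float x) {
    assert(std::isfinite(x));  // the sketch only covers finite inputs
    Split2 s;
    s.hi = bf16_trunc(x);
    s.lo = bf16_trunc(x - s.hi);
    return s;
}

// Negated product from 3 of the 4 partial products; lo * lo is skipped.
inline double negmul_low_model(float a, float b) {
    const Split2 sa = split2(a);
    const Split2 sb = split2(b);
    double acc = 0.0;  // assumption: wide accumulator, for clarity only
    acc += static_cast<double>(sa.hi) * static_cast<double>(sb.hi);
    acc += static_cast<double>(sa.hi) * static_cast<double>(sb.lo);
    acc += static_cast<double>(sa.lo) * static_cast<double>(sb.hi);
    return -acc;
}
```

In this model the error comes from two sources: the mantissa bits that the two-way split leaves out, and the skipped lo * lo term. Together they keep the relative error below about 2^-12, which is noticeably coarser than the three-way split.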

◆ negmul_elem_8_accuracy_safe() [1/3]

v8caccfloat negmul_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2)

Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath), with negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). The default behavior of the negmul_elem_8 intrinsic is the same as negmul_elem_8_accuracy_safe.

Parameters
v1  Vector v1
v2  Vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8_accuracy_safe() [2/3]

v8caccfloat negmul_elem_8_accuracy_safe (v8cfloat v1, v8float v2)

Elementwise multiplication of cfloat (v1) and fp32 (v2) data elements (emulation using the bf16 datapath), with negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). The default behavior of the negmul_elem_8 intrinsic is the same as negmul_elem_8_accuracy_safe.

Parameters
v1  Vector v1
v2  Vector v2
Returns
Negated elementwise multiplication result

◆ negmul_elem_8_accuracy_safe() [3/3]

v8caccfloat negmul_elem_8_accuracy_safe (v8float v1, v8cfloat v2)

Elementwise multiplication of fp32 (v1) and cfloat (v2) data elements (emulation using the bf16 datapath), with negation of the result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). The default behavior of the negmul_elem_8 intrinsic is the same as negmul_elem_8_accuracy_safe.

Parameters
v1  Vector v1
v2  Vector v2
Returns
Negated elementwise multiplication result
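
To make the three overloads concrete, here is a scalar host-side reference for the value each lane is expected to hold, namely -(v1[i] * v2[i]). This is a sketch of the semantics only. The type aliases Lanes8c and Lanes8f and the function name negmul_ref are hypothetical stand-ins for v8cfloat, v8float and the intrinsics, and the arithmetic uses ordinary fp32 rather than the bf16 emulation, so low-order bits may differ from the hardware result.

```cpp
#include <array>
#include <complex>
#include <cstddef>

using Lanes8c = std::array<std::complex<float>, 8>;  // stand-in for v8cfloat
using Lanes8f = std::array<float, 8>;                // stand-in for v8float

// Overload shape [1/3]: cfloat * cfloat, negated.
inline Lanes8c negmul_ref(const Lanes8c& v1, const Lanes8c& v2) {
    Lanes8c out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = -(v1[i] * v2[i]);
    return out;
}

// Overload shape [2/3]: cfloat * float, negated.
inline Lanes8c negmul_ref(const Lanes8c& v1, const Lanes8f& v2) {
    Lanes8c out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = -(v1[i] * v2[i]);
    return out;
}

// Overload shape [3/3]: float * cfloat, negated.
inline Lanes8c negmul_ref(const Lanes8f& v1, const Lanes8c& v2) {
    Lanes8c out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = -(v1[i] * v2[i]);
    return out;
}
```

A reference like this is mainly useful for checking kernel output against an expected tolerance when switching between the safe, fast and low accuracy variants.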