[1] YANG Z Z, DU Z D, WEN Y B. Optimization of Vector Function Library Based on Domestic PuDianNao Chip[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(01): 31-37.

Volume: 44

Issue: 2023, No. 01

Pages: 31-37

Date: 2022-12-06

Article Info

Title:
Optimization of Vector Function Library Based on Domestic PuDianNao Chip

Author(s):

YANG Z Z, DU Z D, WEN Y B

Abstract:
At present, vector math functions on PuDianNao, a domestic artificial intelligence processor chip, can only be implemented by calling scalar functions in a loop, and this approach performs poorly. Three optimization methods based on the PuDianNao chip were proposed. The first two were an interpolation method and a SIMD masking method. The third, building on a hardware array structure in PuDianNao, used VLIW instructions to operate each processing unit in the array and encapsulated the SIMT programming model in software. The accuracy and performance of the three methods were tested, and the experimental results showed that the third method was the best on both counts. The third method was then used to implement PuDianNao-VecMath, a vector math function library for the domestic PuDianNao chip, which solved the problem that the multi-branch structure of math functions is difficult to vectorize. The library showed good accuracy, stable functionality, and correct operation. Its interfaces included rounding functions, transcendental functions, comparison functions, activation functions, and other common basic math library functions. For accuracy, every value in each function's domain interval was used as input, and the results were compared with those of the scalar functions running on an i7 CPU. The results showed that the maximum ULP value was 2, and the maximum ULP value of the half-precision version was 1. Performance improved substantially over the scalar loop: the single-precision version had an average speed-up ratio of 18.26 and a maximum speed-up ratio of 35.90, and the half-precision version had an average speed-up ratio of 15.65 and a maximum speed-up ratio of 30.11.
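The accuracy figures above are stated in ULPs (units in the last place). As a rough illustration of how such a comparison against a scalar reference could be made, the following is a minimal sketch for single precision. The names `ulp_distance` and `max_ulp_error` are hypothetical and do not come from the paper, whose actual test harness is not described here.

```python
import struct


def _ordered_int32(x: float) -> int:
    """Round x to float32 and map its bit pattern to an integer whose
    ordering matches the ordering of the float values."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:
        # Negative values: mirror the magnitude below zero so that
        # -0.0 and +0.0 both map to 0 and the mapping stays monotonic.
        return -(bits & 0x7FFFFFFF)
    return bits


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float32 steps between a and b (hypothetical helper)."""
    return abs(_ordered_int32(a) - _ordered_int32(b))


def max_ulp_error(results, references) -> int:
    """Largest ULP distance between paired results and reference values."""
    return max(ulp_distance(r, ref) for r, ref in zip(results, references))


if __name__ == "__main__":
    # Illustrative inputs only, not data from the paper.
    reference = [1.0, 2.0]
    computed = [1.0 + 2**-23, 2.0 + 2**-21]
    print(max_ulp_error(computed, reference))
```

A half-precision variant would follow the same idea with the `"<e"` format code and a 16-bit sign mask, again as an assumption rather than the authors' procedure.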