[1] YANG Z Z, DU Z D, WEN Y B. Optimization of Vector Function Library Based on Domestic PuDianNao Chip[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(01): 31-37.

Optimization of Vector Function Library Based on Domestic PuDianNao Chip (基于国产PuDianNao芯片的向量函数库优化)

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:
44
Issue:
2023, No. 01
Pages:
31-37
Publication date:
2022-12-06

Article Info

Title:
Optimization of Vector Function Library Based on Domestic PuDianNao Chip
Author(s):
YANG Z Z (杨指政), DU Z D (杜子东), WEN Y B (文渊博)
Document code:
A
Abstract:
Vector math functions on PuDianNao, a domestic artificial intelligence processor, can currently be implemented only by calling scalar functions in a loop, which performs poorly. Three optimization methods are proposed for the PuDianNao chip. The first is an interpolation method. The second is a SIMD-with-mask method. The third builds on PuDianNao's hardware array structure: VLIW instructions operate each processing unit in the array, a SIMT programming model is encapsulated on top of them, and two programming techniques, exposing branch ranges and branch flattening, are proposed. Accuracy and performance tests of the three methods show that the third has the best accuracy and performance. The third method is used to implement PuDianNao-VecMath, a vector math function library for the domestic PuDianNao chip, which solves the problem that the multi-branch structure of math functions is difficult to vectorize. The library offers good accuracy and performance, is functionally stable, and runs correctly. Its interfaces cover common basic math library functions, including rounding, transcendental, comparison, and activation functions. For accuracy, every value in each function's domain interval is used as input, and the results are compared with those of the scalar functions running on an i7 CPU. The maximum error is 2 ULP for the single-precision version and 1 ULP for the half-precision version. Performance improves substantially over scalar loops: the single-precision version achieves an average speedup of 18.26 and a maximum speedup of 35.90, and the half-precision version achieves an average speedup of 15.65 and a maximum speedup of 30.11.
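The accuracy figures in the abstract are given as maximum ULP (unit in the last place) values relative to a scalar reference. As a rough illustration of how such a figure can be computed, and not a description of the authors' test harness, the sketch below measures the ULP distance between two single-precision values in Python. The helper names `ordered_bits`, `ulp_distance`, and `max_ulp_error` are hypothetical, and NaN and infinity handling is left out for brevity.

```python
import struct


def ordered_bits(x: float) -> int:
    """Map x, rounded to float32, to an integer that preserves float ordering.

    Hypothetical helper: adjacent float32 values map to adjacent integers,
    and +0.0 and -0.0 both map to 0.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    magnitude = bits & 0x7FFFFFFF
    # A set sign bit means a negative value: mirror it below zero.
    return -magnitude if bits & 0x80000000 else magnitude


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float32 steps between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))


def max_ulp_error(candidate, reference, inputs) -> int:
    """Largest ULP distance between two functions over the given inputs."""
    return max(ulp_distance(candidate(x), reference(x)) for x in inputs)


if __name__ == "__main__":
    # Toy check: a "candidate" that is always one float32 step above the
    # reference on [1, 2), where one step is 2**-23.
    worst = max_ulp_error(lambda x: x + 2.0 ** -23, lambda x: x, [1.0, 1.5])
    print("max ULP error:", worst)
```

In a full-domain test of the kind the abstract describes, `inputs` would enumerate every representable value in the function's domain interval rather than a short list.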
Last Update: 2022-12-07