[1] Yang Zhizheng, Du Zidong, Wen Yuanbo. Optimization of Vector Function Library Based on Domestic PuDianNao Chip[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(01): 31-37. [doi:10.13705/j.issn.1671-6833.2023.01.013]

Optimization of Vector Function Library Based on Domestic PuDianNao Chip

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:
44
Issue:
No. 01, 2023
Pages:
31-37
Column:
Publication date:
2022-12-06

Article Info

Title:
Optimization of Vector Function Library Based on Domestic PuDianNao Chip
Author(s):
Yang Zhizheng1, Du Zidong2, Wen Yuanbo3
1. Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou, Henan 450001; 2. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 3. School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230026

Keywords:
CLC number:
TP311
DOI:
10.13705/j.issn.1671-6833.2023.01.013
Document code:
A
Abstract:
On the domestic artificial intelligence processor PuDianNao, vector math functions can currently be implemented only by calling scalar functions in a loop, which performs poorly. Three optimization methods are proposed for the PuDianNao chip. Method 1 is interpolation. Method 2 is SIMD with masking. Method 3 builds on the hardware array structure of PuDianNao: VLIW instructions operate each processing unit in the array, a SIMT programming model is encapsulated on top of them, and programming techniques that expose branch ranges and flatten branches are proposed. Accuracy and performance tests of the three methods show that Method 3 has the best accuracy and performance. Method 3 is used to implement PuDianNao-VecMath, a vector math function library for the domestic PuDianNao chip, which solves the problem that the multi-branch structure of math functions is difficult to vectorize. The library offers good accuracy and performance, is functionally stable, and runs correctly; its interfaces include rounding functions, transcendental functions, comparison functions, activation functions, and other common basic math library functions. For accuracy, all values in each function's domain interval are used as input, and the results are compared with those of the scalar functions running on an i7 CPU. The maximum ULP value is 2 for the single-precision version and 1 for the half-precision version. Performance improves substantially over the scalar loop: the single-precision version achieves an average speedup of 18.26 and a maximum speedup of 35.90, and the half-precision version achieves an average speedup of 15.65 and a maximum speedup of 30.11.
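The branch flattening and ULP comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the PuDianNao-VecMath API or the authors' code: the ELU-style function, the names `elu_scalar`, `elu_flattened`, and `ulp_distance`, and the use of a Python list to stand in for hardware lanes are all assumptions made for illustration. The ULP count shown is one common definition (the number of representable float32 steps between two values) and may differ from the definition used in the paper.

```python
import math
import struct


def f32_ordinal(x: float) -> int:
    """Round x to float32 and map it to an integer so adjacent floats differ by 1."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # float32 is stored sign-magnitude: mirror negative values below zero.
    return -(bits & 0x7FFFFFFF) if bits & 0x80000000 else bits


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float32 steps between a and b."""
    return abs(f32_ordinal(a) - f32_ordinal(b))


def elu_scalar(x: float) -> float:
    """Reference version: one element at a time, with a data-dependent branch."""
    if x > 0.0:
        return x
    return math.exp(x) - 1.0


def elu_flattened(xs: list) -> list:
    """Flattened version: evaluate both paths on every lane, then select by mask."""
    mask = [1.0 if x > 0.0 else 0.0 for x in xs]   # lane predicate (a compare on hardware)
    pos = xs                                       # path taken when x > 0
    neg = [math.exp(x) - 1.0 for x in xs]          # path taken otherwise
    # Arithmetic select: no lane ever takes a branch.
    return [m * p + (1.0 - m) * n for m, p, n in zip(mask, pos, neg)]


# Inputs stay in a small range so that both paths are finite on every lane.
xs = [i / 8.0 for i in range(-32, 33)]
reference = [elu_scalar(x) for x in xs]
flattened = elu_flattened(xs)
max_ulp = max(ulp_distance(r, f) for r, f in zip(reference, flattened))
print("max ULP difference:", max_ulp)
```

Because the flattened form computes both paths for every element, each path must be safe to evaluate on inputs that belong to the other path, which is why the sketch restricts the input range.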

Last Update: 2022-12-07