
Performance Optimization of the Generalized Conjugate Residual Algorithm Based on CPU+GPU Heterogeneous Parallelism


Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume: 43
Issue: 2022, No. 06

Pages: 15-21

Publication date: 2022-09-02

Article Info

Title: Performance Optimization of GCR in GRAPES Based on CPU+GPU Heterogeneous Parallel

Author(s): HUANG Dongqiang; HUANG Jianqiang; JIA Jinfang; WU Li; LIU Lingbin; WANG Xiaoying; Department of Computer Technology and Application, Qinghai University, Xining 810016, China

Keywords: GRAPES; generalized conjugate residual method; GPU; heterogeneous parallel

CLC number: TP311.1; O244

DOI: 10.13705/j.issn.1671-6833.2022.03.018

Document code: A

Abstract: To improve the computational efficiency of the GRAPES (Global/Regional Assimilation and Prediction System) numerical weather prediction model and the performance of its dynamical framework, this work addressed the high cost of solving the Helmholtz equation with the generalized conjugate residual (GCR) algorithm in GRAPES and proposed a preconditioned GCR algorithm based on CPU+GPU heterogeneous parallelism. First, incomplete LU (ILU) factorization was used to precondition the coefficient matrix and reduce the number of iterations. On this basis, fine-grained OpenMP parallelism and coarse-grained MPI parallelism were implemented. The OpenMP version mainly applied loop unrolling and compiler directives to loops without data dependences. The MPI version partitioned the data among processes and used non-blocking communication and a reduced communication volume to improve the scalability of the parallel program. An MPI+CUDA heterogeneous version was also implemented, in which MPI was responsible for inter-node process communication and iteration control, while CUDA handled the compute-intensive tasks. The time-consuming matrix computations in GCR were ported to the GPU, and memory-access and data-transfer optimizations were adopted to reduce the data transfer overhead between CPU and GPU. The experimental results showed that, relative to the serial program, the speedup was 2.24 for OpenMP, 3.32 for MPI, and 4.69 for MPI+CUDA. The performance of the generalized conjugate residual algorithm was thus optimized on a heterogeneous platform, and the computational efficiency of the program was improved.
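To make the core idea of the abstract concrete, the following is a minimal serial sketch of a right-preconditioned GCR iteration with an ILU(0) preconditioner, written in plain Python. It is not the paper's implementation: the function names (`helmholtz_2d`, `ilu0`, `ilu_solve`, `pgcr`), the dense list-of-lists storage, and the small 2D five-point test operator are all assumptions chosen for illustration, whereas the GRAPES Helmholtz operator is a much larger 3D system. The comments mark where, in a parallel version, the matrix-vector product and the dot products would be the natural targets for OpenMP, MPI, or CUDA.

```python
# Toy sketch: right-preconditioned GCR with an ILU(0) preconditioner.
# All names below are illustrative assumptions, not taken from GRAPES.

def dot(u, v):
    # In an MPI version this would be a local sum plus a global reduction.
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    # The compute-intensive kernel; the candidate for GPU offloading.
    return [dot(row, x) for row in A]

def helmholtz_2d(m, shift=0.5):
    """Dense 5-point operator on an m x m grid: 4 + shift on the diagonal."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for iy in range(m):
        for ix in range(m):
            k = iy * m + ix
            A[k][k] = 4.0 + shift
            if ix > 0:
                A[k][k - 1] = -1.0
            if ix < m - 1:
                A[k][k + 1] = -1.0
            if iy > 0:
                A[k][k - m] = -1.0
            if iy < m - 1:
                A[k][k + m] = -1.0
    return A

def ilu0(A):
    """Incomplete LU with zero fill-in: updates only entries nonzero in A."""
    n = len(A)
    LU = [row[:] for row in A]
    pattern = [[A[i][j] != 0.0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue
            LU[i][k] /= LU[k][k]              # multiplier, stored in L part
            for j in range(k + 1, n):
                if pattern[i][j]:
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

def ilu_solve(LU, r):
    """Apply M^{-1} r with M = L U (L unit lower, U upper, both in LU)."""
    n = len(r)
    y = r[:]
    for i in range(n):                        # forward substitution
        y[i] -= sum(LU[i][j] * y[j] for j in range(i))
    for i in reversed(range(n)):              # backward substitution
        y[i] = (y[i] - sum(LU[i][j] * y[j] for j in range(i + 1, n))) / LU[i][i]
    return y

def pgcr(A, b, precond=None, tol=1e-10, max_iter=200):
    """Full (non-restarted) GCR; returns (solution, iterations used)."""
    apply_m = precond if precond is not None else (lambda v: v[:])
    x = [0.0] * len(b)
    r = b[:]
    bnorm = dot(b, b) ** 0.5 or 1.0
    ps, aps = [], []                          # directions p_j and A p_j
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * bnorm:
            return x, it
        p = apply_m(r)
        ap = matvec(A, p)
        # Make A p orthogonal to all previous A p_j.
        for pj, apj in zip(ps, aps):
            beta = dot(ap, apj) / dot(apj, apj)
            p = [u - beta * v for u, v in zip(p, pj)]
            ap = [u - beta * v for u, v in zip(ap, apj)]
        alpha = dot(r, ap) / dot(ap, ap)      # minimizes the residual norm
        x = [u + alpha * v for u, v in zip(x, p)]
        r = [u - alpha * v for u, v in zip(r, ap)]
        ps.append(p)
        aps.append(ap)
    return x, max_iter

if __name__ == "__main__":
    A = helmholtz_2d(6)
    b = [1.0 + (k % 5) for k in range(36)]
    LU = ilu0(A)
    _, it_pre = pgcr(A, b, lambda v: ilu_solve(LU, v))
    _, it_plain = pgcr(A, b)
    print("iterations with ILU(0):", it_pre, "without:", it_plain)
```

Running the script prints the iteration counts with and without the preconditioner for this toy problem, which is one way to observe the effect the abstract attributes to ILU preconditioning. The sketch stores every search direction; a production solver would more likely restart or truncate the recurrence to bound memory.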


Last Update: 2022-10-03
