HUANG D Q, HUANG J Q, JIA J F, et al. Performance Optimization of GCR in GRAPES Based on CPU+GPU Heterogeneous Parallel[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(06): 15-21. doi:10.13705/j.issn.1671-6833.2022.03.018

Performance Optimization of GCR in GRAPES Based on CPU+GPU Heterogeneous Parallel

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume: 43
Issue: 2022, No. 06
Pages: 15-21
Publication date: 2022-09-02

Article Information

Title:
Performance Optimization of GCR in GRAPES Based on CPU+GPU Heterogeneous Parallel
Author(s):
黄东强, 黄建强, 贾金, 吴利, 刘令斌, 王晓英
HUANG D Q, HUANG J Q, JIA J F, et al.
Department of Computer Technology and Application, Qinghai University

Keywords:
CLC number:
TP311.1; O244
DOI:
10.13705/j.issn.1671-6833.2022.03.018
Document code:
A
Abstract:
To improve the computational efficiency of the GRAPES (Global/Regional Assimilation and Prediction System) numerical weather prediction model and the performance of its dynamical framework, and to address the fact that solving the Helmholtz equation with the generalized conjugate residual (GCR) algorithm is one of the most time-consuming parts of GRAPES, a preconditioned GCR algorithm based on CPU+GPU heterogeneous parallelism was proposed. First, incomplete LU factorization was used to precondition the coefficient matrix and reduce the number of iterations. On this basis, fine-grained OpenMP parallelism and coarse-grained MPI parallelism were implemented: the OpenMP version improved performance by applying compiler directives, together with loop unrolling, to loop bodies without data dependence, while the MPI version partitioned the data among processes and improved the scalability of the parallel program through non-blocking communication and by reducing the volume of inter-process communication. An MPI+CUDA heterogeneous parallel version was then implemented, in which MPI handled inter-node process communication and iteration control and CUDA handled the compute-intensive work: the time-consuming matrix computations in GCR were offloaded to the GPU, and memory-access and data-transfer optimizations were applied to reduce the data-transfer overhead between CPU and GPU. Experimental results showed that, compared with the serial program, the OpenMP version achieved a speedup of 2.24, the MPI version 3.32, and the MPI+CUDA version 4.69, realizing performance optimization of the GCR algorithm on a heterogeneous platform and improving the computational efficiency of the program.
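The abstract outlines three levels of parallelism around the preconditioned GCR solver. As a rough, illustrative sketch (not the authors' code), the following C program shows only the fine-grained OpenMP level: it assumes a CSR sparse-matrix layout and a toy tridiagonal system, and it applies compiler directives to the dependence-free loops that dominate a GCR iteration (sparse matrix-vector product, dot product, axpy). The ILU preconditioner, the MPI data partitioning, and the CUDA offload described above are omitted, and all identifiers (spmv_csr, dot, axpy) are hypothetical.

/* Minimal sketch, assuming CSR storage and a toy tridiagonal system:
 * OpenMP compiler directives on the dependence-free loops that dominate
 * a GCR iteration. Not the authors' implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* y = A*x for a CSR matrix; rows are independent, so the outer loop parallelizes. */
static void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* Dot product: iterations carry no data dependence, so a reduction clause suffices. */
static double dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* y = y + alpha*x, the vector update used repeatedly inside GCR. */
static void axpy(int n, double alpha, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

int main(void)
{
    const int n = 1000000;
    /* Toy tridiagonal matrix in CSR form: 2 on the diagonal, -1 off the diagonal. */
    int *row_ptr = malloc((n + 1) * sizeof *row_ptr);
    int *col_idx = malloc(3 * (size_t)n * sizeof *col_idx);
    double *val  = malloc(3 * (size_t)n * sizeof *val);
    double *x = malloc(n * sizeof *x);
    double *r = malloc(n * sizeof *r);

    int nnz = 0;
    for (int i = 0; i < n; i++) {
        row_ptr[i] = nnz;
        if (i > 0)     { col_idx[nnz] = i - 1; val[nnz] = -1.0; nnz++; }
        col_idx[nnz] = i; val[nnz] = 2.0; nnz++;
        if (i < n - 1) { col_idx[nnz] = i + 1; val[nnz] = -1.0; nnz++; }
        x[i] = 1.0;
    }
    row_ptr[n] = nnz;

    double t0 = omp_get_wtime();
    spmv_csr(n, row_ptr, col_idx, val, x, r);   /* r   = A*x      */
    double rho = dot(n, r, r);                  /* rho = <r, r>   */
    axpy(n, -0.5, r, x);                        /* x   = x - r/2  */
    double t1 = omp_get_wtime();

    printf("threads=%d  <r,r>=%.3e  time=%.4f s\n",
           omp_get_max_threads(), rho, t1 - t0);

    free(row_ptr); free(col_idx); free(val); free(x); free(r);
    return 0;
}

A plausible build command would be something like gcc -O2 -fopenmp gcr_kernels.c, with OMP_NUM_THREADS controlling the thread count; the MPI and CUDA levels described in the abstract would sit around kernels of this kind rather than replace them.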
Last update: 2022-10-03