Performance optimization of matrix multiplication and FFT in GPU

Chinese title: GPU矩阵乘法和FFT算法的性能优化
Journal: Modern Electronics Technique (现代电子技术)
Authors: LI Xiao-wen (李晓雯) 1; CUI Xiang (崔翔) 2
Affiliations: 1. Department of Command and Control, Air Defense Forces Academy, Zhengzhou 450000, China; 2. College of Computer & Information Engineering, Henan University, Kaifeng 475003, China
Keywords: GPU programming; matrix multiplication; fast Fourier transform (FFT); performance optimization technique
Publication date: 2013-02-15
Year: 2013
Issue: 04
Publisher: Modern Electronics Technique

Abstract

The architecture of current GPUs offers good programmability for high-performance computing. To derive general methods for designing high-performance programs on many-core GPUs, this paper investigates GPU performance optimization techniques and summarizes the authors' experience with high-performance GPU programming. GPU performance metrics are obtained through benchmarking and used to guide program design. Two key algorithms are optimized using CUDA: single-precision matrix-matrix multiplication (SGEMM of BLAS) and single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth-intensive. On an NVIDIA GeForce GTX280 GPU, the matrix multiplication reaches a peak speed of 393 Gflop/s, about 5% faster than the CUBLAS 2.0 library. Good FFT performance is also obtained for a range of dimensions. Some common principles for the design and implementation of many-core algorithms are discussed.

References

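[1] WU Enhua, LIU Youquan. General-purpose computation on graphics processing units (GPU) [J]. Journal of Computer-Aided Design & Computer Graphics, 2004, 16(5): 601-612. (in Chinese)
[2] WU Enhua. Techniques, state of the art, and challenges of graphics processors for general-purpose computation [J]. Journal of Software, 2004, 15(10): 101-112. (in Chinese)
[3] SU Chaoshi, ZHAO Mingchang, ZHANG Xiangwen. GPU-accelerated octree-based volume rendering algorithm [J]. Journal of Computer Applications, 2008(10): 51-62. (in Chinese)
[4] CHU Jingjun, YANG Xin, GAO Yan. Ray-casting volume rendering algorithm using GPU programming [J]. Journal of Computer-Aided Design & Computer Graphics, 2007, 19(5): 257-262. (in Chinese)
[5] ZHU Yaping, YANG Huizhu, DONG Yuan, et al. Application of OpenGL in seismic data visualization [J]. Oil Geophysical Prospecting, 2000(8): 35-42. (in Chinese)
[6] XU Pin, LAN Shanzhen, LIU Lanlan. Research on general-purpose numerical computation using GPUs [J]. Journal of Communication University of China (Science and Technology), 2009(4): 101-112. (in Chinese)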
[7] NVIDIA Corporation. NVIDIA CUDA programming guide Version 0.1 [EB/OL]. [2009-04-28]. http://developer.nvidia.com/cuda.
[8] COPPERSMITH D, WINOGRAD S. Matrix multiplication via arithmetic progressions [C]// Proceedings of the nineteenth annual ACM symposium on Theory of computing. New York, NY, USA: ACM, 1987: 1-6.
[9] RYOO S. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA [C]// Proceedings of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. [S.l.]: ACM, 2008: 73-82.
[10] VOLKOV V, DEMMEL J W. Benchmarking GPUs to tune dense linear algebra SC08 [C]// Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. [S.l.]: IEEE Press, 2008: 1-11.
[11] GOTO K, VAN DE GEIJN R A. Anatomy of high performance matrix multiplication [J]. ACM Transactions on Mathematical Software, 2008, 3(43): 12-25.
[12] HWU W M, RODRIGUES C, RYOO S, et al. Compute unified device architecture application suitability [J]. Computing in Science and Engineering, 2009, 11(3): 16-26.