Performance Optimization of GPU Matrix Multiplication and FFT Algorithms
Abstract
Current GPU architectures offer good programmability for high-performance computing. To derive general methods for designing high-performance programs on many-core GPUs, this paper investigates GPU performance optimization techniques and summarizes the authors' experience with high-performance GPU programming. Benchmark tests are used to obtain GPU performance metrics, which then guide program design. Two key algorithms are optimized using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is compute-intensive, while the latter is memory-bandwidth- or communication-intensive. On an NVIDIA GeForce GTX280 GPU, the matrix multiplication reaches a peak speed of 393 Gflop/s, about 5% faster than the CUBLAS 2.0 library. Good FFT performance is also obtained for a range of dimensions. Some common principles for the design and implementation of many-core algorithms are discussed.
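The contrast between a compute-intensive SGEMM and a bandwidth-intensive FFT can be made concrete with a rough arithmetic-intensity estimate (flops per byte of memory traffic). The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, the `5 * n * log2(n)` FFT flop count is the conventional estimate rather than a measured value, and the byte counts assume each operand crosses the memory bus exactly once, which real kernels only approximate.

```python
import math

BYTES_PER_FLOAT = 4  # single precision


def sgemm_flops(m, n, k):
    """Flops for C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    return 2 * m * n * k  # one multiply and one add per inner-product term


def sgemm_intensity(m, n, k):
    """Flops per byte, assuming A and B are read once and C is written once."""
    bytes_moved = BYTES_PER_FLOAT * (m * k + k * n + m * n)
    return sgemm_flops(m, n, k) / bytes_moved


def fft_flops(n):
    """Conventional 5 * n * log2(n) estimate for a length-n complex FFT."""
    return 5 * n * math.log2(n)


def fft_intensity(n):
    """Flops per byte, assuming one read and one write of n complex floats."""
    bytes_moved = 2 * n * 2 * BYTES_PER_FLOAT  # (read + write) * (re + im)
    return fft_flops(n) / bytes_moved


def gflops(flops, seconds):
    """Achieved rate in Gflop/s for a run that took `seconds`."""
    return flops / seconds / 1e9


if __name__ == "__main__":
    n = 1024
    print("SGEMM flops/byte:", sgemm_intensity(n, n, n))
    print("FFT   flops/byte:", fft_intensity(n))
```

Under these assumptions the intensity of a square SGEMM grows linearly with the matrix dimension, while that of the FFT grows only logarithmically with its length, which is consistent with the abstract's classification of the two algorithms.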
References
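[1] WU Enhua, LIU Youquan. General-purpose computation on graphics processing units (GPU) [J]. Journal of Computer-Aided Design & Computer Graphics, 2004, 16(5): 601-612. (in Chinese)
[2] WU Enhua. Techniques, current status, and challenges of using graphics processors for general-purpose computation [J]. Journal of Software, 2004, 15(10): 101-112. (in Chinese)
[3] SU Chaoshi, ZHAO Mingchang, ZHANG Xiangwen. GPU-accelerated octree-based volume rendering acceleration algorithm [J]. Journal of Computer Applications, 2008(10): 51-62. (in Chinese)
[4] CHU Jingjun, YANG Xin, GAO Yan. Ray-casting volume rendering algorithm using GPU programming [J]. Journal of Computer-Aided Design & Computer Graphics, 2007, 19(5): 257-262. (in Chinese)
[5] ZHU Yaping, YANG Huizhu, DONG Yuan, et al. Application of OpenGL technology to seismic data visualization [J]. Oil Geophysical Prospecting, 2000(8): 35-42. (in Chinese)
[6] XU Pin, LAN Shanzhen, LIU Lanlan. Research on general-purpose numerical computation using the GPU [J]. Journal of Communication University of China: Natural Science Edition, 2009(4): 101-112. (in Chinese)
[7] NVIDIA Corporation. NVIDIA CUDA programming guide Version 0.1 [EB/OL]. [2009-04-28]. http://developer.nvidia.com/cuda.
[8] COPPERSMITH D, WINOGRAD S. Matrix multiplication via arithmetic progressions [C]//Proceedings of the nineteenth annual ACM symposium on Theory of computing. New York, NY, USA: ACM, 1987: 1-6.
[9] RYOO S. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA [C]//Proceedings of 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. [S.l.]: ACM, 2008: 73-82.
[10] VOLKOV V, DEMMEL J W. Benchmarking GPUs to tune dense linear algebra SC08 [C]//Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. [S.l.]: IEEE Press, 2008: 1-11.
[11] GOTO K, VAN DE GEIJN R A. Anatomy of high performance matrix multiplication [J]. ACM Transactions on Mathematical Software, 2008, 3(43): 12-25.
[12] HWU W M, RODRIGUES C, RYOO S, et al. Compute unified device architecture application suitability [J]. Computing in Science and Engineering, 2009, 11(3): 16-26.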
