From Graphics Processors to GPU-Based General-Purpose Computing

English title: From Graphic Processing Unit to General Purpose Graphic Processing Unit
Journal title (Chinese): 武汉大学学报(理学版)
Journal title (English): Journal of Wuhan University (Natural Science Edition)
Authors: 刘金硕; 刘天晓; 吴慧; 曾秋梅; 任梦菲; 顾宜淳
Authors (English): LIU Jinshuo (1); LIU Tianxiao (1); WU Hui (1); ZENG Qiumei (1); REN Mengfei (2); GU Yichun (2). Affiliations: (1) School of Computer, Wuhan University, Wuhan 430072, Hubei, China; (2) International School of Software, Wuhan University, Wuhan 430072, Hubei, China
Keywords (Chinese): GPGPU; CUDA architecture; parallel computing
Keywords (English): GPGPU (general-purpose graphics processing unit); CUDA framework; parallel computing
Publication date: 2013-04-24
Institutions: School of Computer, Wuhan University; International School of Software, Wuhan University
Year: 2013
Issue: 02
Publisher: Journal of Wuhan University (Natural Science Edition)

Abstract

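This paper defines the GPU (graphics processing unit), general-purpose computing on GPUs (general-purpose GPU, GPGPU), and GPU-based programming models and environments. It divides the development of the GPU into four stages and describes how GPU architecture changed from the non-unified rendering architecture to the unified rendering architecture, and then to the new-generation Fermi architecture. It compares the GPGPU architecture with multi-core CPU architecture and distributed cluster architecture in terms of both software and hardware. The analysis shows that multi-core multithreaded parallelism suits medium-grained, thread-level, data-intensive parallel computation; cluster parallelism suits coarse-grained, network-intensive parallel computation; and GPGPU parallelism suits fine-grained, compute-intensive parallel computation. Finally, the paper presents future research hotspots and directions for GPGPU, namely automatic parallelization for GPGPU, CUDA support for multiple languages, and CUDA performance optimization, and introduces some typical GPGPU applications.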
This paper outlines the GPU (graphics processing unit), general-purpose computation on GPUs, and the programming model and environment for GPUs. It then describes the evolution from GPU to GPGPU (general-purpose graphics processing unit), detailing the change from the non-unified rendering architecture to the unified rendering architecture and then to the newer Fermi architecture. Next, it compares the GPGPU architecture with multi-core CPU architecture and distributed cluster architecture from the perspective of software and hardware. For medium-grained, thread-level, data-intensive parallel computing, multi-core multithreading should be used; for coarse-grained network computing, cluster computing should be used; for fine-grained, compute-intensive parallel computing, general-purpose GPU computation should be adopted. Finally, the paper presents research hotspots and future directions for GPGPU, which are automatic parallelization of GPGPU, multi-language support in CUDA, and CUDA performance optimization, and introduces some classic applications of GPGPU.
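
The granularity distinction drawn in the abstract can be illustrated with a small sketch. The Python code below is not from the paper: it only models how the same computation (a SAXPY, `a * x[i] + y[i]`) might be decomposed either into one logical thread per element, as a GPU kernel launch would do, or into a few contiguous chunks handled by worker threads, as on a multi-core CPU. All function names are hypothetical, and nothing here runs on a GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_element(i, a, x, y):
    """Work of one fine-grained logical thread: a single output element."""
    return a * x[i] + y[i]


def saxpy_fine_grained(a, x, y):
    """One logical thread per element (the GPU-style decomposition).

    Assumption: executed sequentially here for illustration; on a GPU each
    index would be handled by its own hardware thread.
    """
    return [saxpy_element(i, a, x, y) for i in range(len(x))]


def saxpy_medium_grained(a, x, y, workers=4):
    """One worker thread per contiguous chunk (the multi-core-style decomposition)."""
    n = len(x)
    chunk = max(1, (n + workers - 1) // workers)  # ceiling division
    ranges = [(start, min(start + chunk, n)) for start in range(0, n, chunk)]

    def run_chunk(bounds):
        start, stop = bounds
        return [saxpy_element(i, a, x, y) for i in range(start, stop)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(run_chunk, ranges))  # results keep chunk order
    return [value for part in parts for value in part]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
fine = saxpy_fine_grained(2.0, x, y)
medium = saxpy_medium_grained(2.0, x, y, workers=2)
```

Both decompositions produce the same result; they differ only in how much work each unit of parallelism receives, which is the property the abstract uses to match a workload to a platform.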
