Abstract
To improve the performance of the core algorithm of generalized matrix multiplication (GEMM), this paper proposes a direct memory access (DMA) controller structure that supports blocking segment transmission. When multiple cores are transferring data from core-internal storage to external memory, the blocking segment transmission mechanism replaces software-lock synchronization: it automatically checks the status of these transactions and starts the segment transmission once all of them have finished. Simulation results on the NC-Verilog platform show that blocking segment transmission has two advantages over software-lock synchronization. First, for plain data transfers, the segment transmission starts at least 50 cycles earlier. Second, for the GEMM core algorithm, the run time is reduced by at least 10 000 cycles.
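The difference between the two synchronization styles described above can be illustrated with a toy cycle-count model. This is a minimal sketch, assuming that each core's transfer has a known finish cycle, that the hardware-style path releases the segment transfer in the cycle the last transfer completes, and that the software-style path uses a shared lock, a completion counter and a polling master core. The function names `blocking_start` and `software_lock_start`, the lock-and-poll protocol, and every cycle cost are hypothetical and chosen only for illustration. None of them is taken from the paper's DMA design or from its NC-Verilog measurements.

```python
"""Toy cycle-count model: when can a segment transfer start?

Illustrative sketch only. Names, protocol and cycle costs are assumptions,
not the design or the measurements reported in the paper.
"""
from typing import Sequence


def _check_cycles(finish_cycles: Sequence[int]) -> None:
    if any(c < 0 for c in finish_cycles):
        raise ValueError("finish cycles must be non-negative")


def blocking_start(finish_cycles: Sequence[int]) -> int:
    """Completion-triggered start (hardware-style blocking).

    The segment transfer is queued in advance and is released in the
    cycle in which the last outstanding transaction completes.
    """
    _check_cycles(finish_cycles)
    if not finish_cycles:
        return 0  # nothing outstanding, so nothing to wait for
    return max(finish_cycles)


def software_lock_start(
    finish_cycles: Sequence[int],
    lock_cost: int,
    poll_interval: int,
    launch_cost: int,
) -> int:
    """Lock-and-poll start (assumed software protocol).

    - After its transfer finishes, each core holds a shared lock for
      `lock_cost` cycles to update a completion counter; cores that
      arrive while the lock is held wait their turn.
    - A master core reads the counter every `poll_interval` cycles,
      at cycles 0, poll_interval, 2 * poll_interval, ...
    - Once it reads the final count, it spends `launch_cost` cycles
      configuring and starting the segment transfer.
    """
    _check_cycles(finish_cycles)
    if lock_cost < 0 or launch_cost < 0:
        raise ValueError("cycle costs must be non-negative")
    if poll_interval <= 0:
        raise ValueError("poll_interval must be positive")

    lock_free_at = 0
    for done in sorted(finish_cycles):
        # The core takes the lock when both it and the lock are ready.
        lock_free_at = max(lock_free_at, done) + lock_cost

    # First poll tick at or after the last counter update.
    ticks = -(-lock_free_at // poll_interval)  # ceiling division
    seen_at = ticks * poll_interval
    return seen_at + launch_cost


if __name__ == "__main__":
    finish = [120, 95, 130, 110]  # hypothetical per-core finish cycles
    print(blocking_start(finish))  # 130
    print(software_lock_start(finish, lock_cost=6,
                              poll_interval=10, launch_cost=20))  # 160
```

With these illustrative inputs the modeled gap is 30 cycles, and it comes entirely from the assumed lock, polling and launch costs. In a real system the gap depends on the actual synchronization and DMA configuration overheads, so the model shows only where the saving comes from, not how large it is.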