Communication Scalability of Large-Scale Parallel Computing: Analysis, Optimization, and Simulation
Abstract
With the growth of system scale and the rising performance of processing nodes, communication has become the key bottleneck that limits the scalability of parallel computing. The communication scalability problem, which analyzes what factors affect communication and to what extent their growing influence limits system scalability, is one of the most challenging academic problems in the field of parallel computing.
     Aiming at the communication scalability problem, this paper quantifies for the first time the communication wall of parallel computing from the perspective of performance speedup, and builds a model of communication scalability. Based on the analytical results of the model, this paper proposes a program optimization technique guided by message independency and an optimization technique for multi-job allocation, addressing the program optimization problem and the task allocation optimization problem respectively. Finally, this paper designs and implements a performance prediction simulator for large-scale parallel computing, which can be used to validate both the correctness of the model and the scalability of various system optimization techniques.
     Specifically, the main work and contributions of this paper are as follows:
     1. Building the model of communication scalability (Chapter 2)
     Currently, understanding of the communication scalability problem is largely qualitative; systematic quantitative research is still lacking. This paper proposes for the first time a quantitative description of the communication wall and establishes a theorem on the existence of the communication wall. On this basis, the paper builds the model of communication scalability, proposes a system measurement method and a model-based classification of parallel systems, and quantifies the degree of communication scalability and of general communication scalability. Combined with specific cases, we analyze the effects of the program, the interconnect topology, and common optimization methods on communication scalability, compare the general communication scalability of common supercomputer topologies, and point out directions for optimizing the communication scalability and general communication scalability of parallel systems.
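The speedup-based view of the communication wall can be sketched with a minimal fixed-workload model (an Amdahl-style illustration, not the thesis's actual formulation). Let $T(1)$ be the sequential execution time and $T_{\mathrm{comm}}(p)$ the communication time on $p$ processors; then

```latex
S(p) \;=\; \frac{T(1)}{T(1)/p + T_{\mathrm{comm}}(p)}
\;\le\; \frac{T(1)}{T_{\mathrm{comm}}(p)} .
```

If $T_{\mathrm{comm}}(p)$ is non-decreasing in $p$ (e.g. collectives whose cost grows like $\log p$ or $p$), this bound eventually falls below the ideal speedup $p$, so $S(p)$ peaks and then degrades; in this sense communication "walls off" further scaling.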
     2. Proposing the program optimization technique guided by message independency (Chapter 3)
     The communication hiding technique based on instruction reordering is one of the main methods for optimizing program performance. Besides the problems inherent to the technique itself, it also incurs serious network resource contention among messages. By analyzing the causes of network resource contention, we propose for the first time the concept of message independency and study its precise meaning. For MPI (Message Passing Interface) programs, we then build a message-independency-guided program optimization model based on instruction reordering. Using this model, we design and implement a program optimization approach that reduces network resource contention among messages while preserving maximal communication hiding. Experimental results show that, for parallel CFD (Computational Fluid Dynamics) applications, this approach greatly reduces communication overhead and improves program performance.
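The core idea can be illustrated with a small sketch (the contention model, names, and greedy grouping below are assumptions for illustration, not the thesis's actual algorithm): messages whose routes share a network link contend, while link-disjoint ("independent") messages can be issued together and overlapped with computation without contending.

```python
# Illustrative sketch: group messages into rounds of mutually independent
# (link-disjoint) messages. Each round could then be issued as a batch of
# nonblocking operations and overlapped with computation, avoiding
# intra-round network resource contention.

def independent(m1, m2):
    """Two messages are independent if their routes share no network link."""
    return not (set(m1["links"]) & set(m2["links"]))

def schedule_rounds(messages):
    """Greedily place each message into the first contention-free round."""
    rounds = []
    for msg in messages:
        for r in rounds:
            if all(independent(msg, other) for other in r):
                r.append(msg)
                break
        else:
            rounds.append([msg])  # no compatible round: open a new one
    return rounds

msgs = [
    {"id": "A", "links": [(0, 1), (1, 2)]},
    {"id": "B", "links": [(1, 2), (2, 3)]},  # shares link (1, 2) with A
    {"id": "C", "links": [(4, 5)]},          # independent of both
]
rounds = schedule_rounds(msgs)
print([[m["id"] for m in r] for r in rounds])  # → [['A', 'C'], ['B']]
```

A and C are independent and overlap in one round; B contends with A on link (1, 2) and is deferred to a second round.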
     3. Proposing the optimization technique for multi-job allocation (Chapter 4)
     Allocating computing resources to multiple jobs so as to satisfy their performance needs is greatly desired by users of large-scale parallel computing systems. In this paper, we propose for the first time to decompose the multi-job allocation optimization problem into two sub-problems: multi-job assignment optimization and single-job task mapping optimization. For the multi-job assignment optimization problem, we propose for the first time the closed-minimal graph-partitioning model, which transforms the assignment problem into a closed-minimal graph-partitioning problem. For the single-job task mapping optimization problem, we analyze the impact of communication protocols on communication overhead and propose for the first time a protocol-aware process mapping model for MPI programs, PaPP. Based on these two models, we design and implement the multi-job allocation optimization approach. Experimental results show that, for the NPB (NAS Parallel Benchmarks) applications, this approach yields substantial performance improvements.
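To make the task-mapping sub-problem concrete, here is a generic greedy heuristic (an illustrative sketch only; it is not the closed-minimal graph-partitioning model or PaPP): process pairs with the heaviest traffic are co-located on the same node first, keeping their communication on fast intra-node paths.

```python
# Illustrative sketch of single-job task mapping: co-locate the most heavily
# communicating process pairs, subject to per-node capacity.
from collections import defaultdict

def greedy_mapping(n_procs, comm, node_size):
    """comm: {(i, j): traffic volume}. Returns {process: node id}."""
    mapping = {}
    load = defaultdict(int)

    def place(proc, preferred):
        if proc in mapping:
            return
        if preferred is not None and load[preferred] < node_size:
            node = preferred          # join the partner's node if it has room
        else:
            node = 0
            while load[node] >= node_size:  # else first node with capacity
                node += 1
        mapping[proc] = node
        load[node] += 1

    # Heaviest-traffic pairs first, each co-located when capacity allows.
    for (i, j), _vol in sorted(comm.items(), key=lambda kv: -kv[1]):
        place(i, mapping.get(j))
        place(j, mapping.get(i))
    for p in range(n_procs):          # processes with no recorded traffic
        place(p, None)
    return mapping

mapping = greedy_mapping(4, {(0, 1): 10, (2, 3): 8, (1, 2): 1}, node_size=2)
print(mapping)  # → {0: 0, 1: 0, 2: 1, 3: 1}
```

The two heavy pairs (0, 1) and (2, 3) each land on one node, so only the light (1, 2) traffic crosses the network; a protocol-aware model would additionally weight pairs by which protocol (e.g. eager vs. rendezvous) their message sizes trigger.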
     4. Designing and implementing a virtual-actual combined execution-driven simulator, VACED-SIM (Chapter 5)
     Discrete event simulation is one of the most common performance prediction approaches for large-scale parallel computing. Through an in-depth analysis of discrete event simulation approaches, we propose the concepts of virtual simulation and actual simulation. By comparing virtual with actual simulation, and trace-driven with execution-driven methods, this paper categorizes for the first time discrete event simulation approaches into four kinds along two orthogonal axes (simulation mechanism and event-driven mechanism). Based on the characteristics of scalability prediction for large-scale parallel computing, we propose for the first time a model for the fourth kind: the virtual-actual combined execution-driven (VACED) simulation approach. With this model, we design and implement a lightweight virtual-actual combined execution-driven simulator, VACED-SIM. In this simulator, we propose and adopt for the first time fine-grained activity and event definitions, which improve the precision of the simulation. Experiments on a Tianhe-1A sub-system show that VACED-SIM achieves high accuracy and efficiency.
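A discrete-event core of the kind such simulators build on can be sketched in a few lines (a minimal generic engine; VACED-SIM's actual activities, event types, and virtual/actual split are not reproduced here). Events sit in a priority queue ordered by simulated time, and a popped event's callback may schedule further events, e.g. a compute phase scheduling the arrival of the message it sends.

```python
# Minimal discrete-event simulation engine: a time-ordered event queue whose
# callbacks can schedule new events, advancing a virtual clock.
import heapq
import itertools

class Simulator:
    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker for simultaneous events
        self.log = []

    def schedule(self, delay, name, callback=None):
        heapq.heappush(self._queue,
                       (self.now + delay, next(self._seq), name, callback))

    def run(self):
        while self._queue:
            time, _, name, callback = heapq.heappop(self._queue)
            self.now = time            # jump the clock to the event time
            self.log.append((time, name))
            if callback:
                callback(self)

sim = Simulator()
# A compute phase of 5 time units, then a send whose message arrives 2 later.
sim.schedule(5.0, "compute_done",
             lambda s: s.schedule(2.0, "msg_arrives"))
sim.run()
print(sim.log)  # → [(5.0, 'compute_done'), (7.0, 'msg_arrives')]
```

In the paper's terminology, a "virtual" event would only advance this clock by a modeled duration, while an "actual" event would derive its duration from really executing the corresponding code; the engine itself is the same either way.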
