Study of Some Key Problems of High Performance Computer (高性能计算机若干关键问题研究)


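English title: Study of Some Key Problems of High Performance Computer
Author: Li Hui (李晖)
Thesis level: Doctoral
Discipline: Computer Architecture
Chinese keywords (translated): cache coherence; multi-core processor; high performance computer interconnection network; high performance computer
English keywords: Cache Coherence ; Chip Multiprocessor ; Parallel Computing ; High Performance Computer
Degree year: 2009
Advisor: Chen Guoliang (陈国良)
Discipline code: 081201
Degree-granting institution: University of Science and Technology of China
Thesis submission date: 2009-05-01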

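Abstract

Cache coherence affects not only the correctness of a system but also, to a significant degree, its performance. For multi-core processors, cache coherence protocols are more complex to design and harder to verify. Building large-scale parallel computing systems from multi-core processors has become mainstream. In this setting the coherence protocol must handle more coherence transactions, covering coherence among the multiple caches within a processor chip, between on-chip and off-chip caches, and between processor chips. Research on cache coherence for multi-core processors therefore has both academic significance and practical relevance. First, this thesis studies cache coherence protocols in multi-core processors, focusing on the MOESI protocol, which scales well and suits the characteristics of multi-core processors, and on its implementation, and it optimizes the protocol. Second, it studies cache coherence protocols in parallel computing systems built from multi-core processors; experiments show that this work effectively reduces the number of on-chip cache misses (by 13% to 30%) and improves system performance (run time is reduced by up to about 30%). Finally, it studies inclusive and non-inclusive policies for on-chip caches and proposes an on-chip cache system design based on a non-inclusive policy, which improves the utilization of on-chip cache capacity and the performance of the multi-core processor.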
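For readers unfamiliar with MOESI, the following is a minimal sketch of the textbook snooping form of the protocol for a single memory block. It is not the optimized protocol developed in the thesis, whose details the abstract does not give; the class and method names (`State`, `Line`, `read`, `write`) are illustrative assumptions, and data transfer and write-backs are omitted.

```python
from enum import Enum


class State(Enum):
    M = "Modified"   # only copy, dirty
    O = "Owned"      # dirty, but other caches may hold Shared copies
    E = "Exclusive"  # only copy, clean
    S = "Shared"     # possibly one of several copies
    I = "Invalid"    # no valid copy


class Line:
    """State of one memory block across n private caches (textbook MOESI)."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches

    def read(self, cid):
        """Cache `cid` reads the block."""
        if self.states[cid] is not State.I:
            return  # read hit: no state change
        others = [i for i, s in enumerate(self.states)
                  if i != cid and s is not State.I]
        if not others:
            self.states[cid] = State.E  # no other copy exists
            return
        for i in others:
            if self.states[i] is State.M:
                # The dirty holder becomes the owner and keeps supplying
                # data instead of writing back to memory first.
                self.states[i] = State.O
            elif self.states[i] is State.E:
                self.states[i] = State.S
        self.states[cid] = State.S

    def write(self, cid):
        """Cache `cid` writes the block: all other copies are invalidated."""
        for i in range(len(self.states)):
            if i != cid:
                self.states[i] = State.I
        self.states[cid] = State.M


line = Line(3)
line.write(0)   # cache 0: M
line.read(1)    # cache 0: M -> O, cache 1: S
line.read(2)    # cache 2: S
print([s.name for s in line.states])  # ['O', 'S', 'S']
```

The Owned state is what distinguishes MOESI from MESI: a dirty block can be shared without an immediate write-back to memory, which is one reason the protocol is commonly considered for multi-core designs.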
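     High performance computers are an important strategic resource for a country, and the degree to which they are domestically produced is a concentrated reflection of its overall national strength. Several research institutions have put on their agenda the construction of high performance computers from Loongson (龙芯) multi-core processors, for which China holds fully independent intellectual property rights. First, targeting scientific computing applications in particular, this thesis evaluates the performance of the on-chip cache system of Loongson-architecture multi-core processors and identifies some of their performance characteristics in scientific computing. Second, it explores the design space on that basis. Experiments show, among other results, that in some cases the hit rate of the on-chip L2 cache can be improved by more than 50%.
     The interconnection network of a high performance computer is crucial to the machine's overall performance. First, this thesis studies an advanced new network topology, MPU, including its mathematical model, network topology, and routing algorithm. Second, it compares MPU theoretically with other current advanced high performance computer networks. Finally, it describes the principles, architecture, and workflow of MPUS, a large-scale parallel simulator developed for MPU. Experiments show that the MPU design is correct and scales well.
     KD-50-I is the first domestically built teraflops computer based on the Loongson general-purpose high performance processor. First, centered on the architectural design of KD-50-I, this thesis implements diskless booting for KD-50-I, builds an efficient operating system and file system for it, and optimizes its communication library, improving system performance and usability and supporting wider adoption of KD-50-I. Second, it studies the use on KD-50-I of a scanning electron microscopy imaging simulation program commonly used in physics research, and optimizes it. This work improves the application's run-time efficiency and provides an example of applying KD-50-I in different fields.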
The cache coherence protocol is a key component for the correctness and efficiency of a computer system. Designing and verifying a cache coherence protocol for a Chip Multiprocessor (CMP) is more difficult, because the protocol must handle interactions both within a CMP and between CMPs. High performance computers are an important resource for our country. We have studied cache coherence techniques for CMPs. Taking scalability and performance into account, our work focuses on the MOESI protocol and its implementation. We have also studied cache coherence protocols in multi-CMP (M-CMP) systems. Experiments show that our work can improve system performance by up to 1.5X and reduce the number of on-chip cache misses by 13% to 30%. Efficient use of the on-chip cache is key to CMP performance. We have studied the performance of inclusive and non-inclusive on-chip caches and proposed an on-chip cache architecture based on an exclusive policy.
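The capacity argument behind inclusive versus exclusive on-chip caches can be illustrated with a short calculation. This is a sketch under simplifying assumptions, not the design proposed in the thesis: the function name and the cache sizes below are hypothetical, and the result is only an upper bound on the number of distinct kilobytes that can be held on chip.

```python
def max_unique_capacity_kb(l1_kb_per_core, cores, l2_kb, policy):
    """Upper bound on distinct data held on chip (private L1s + shared L2).

    inclusive: every L1 block is also kept in L2, so L1 adds no unique data.
    exclusive: a block lives in L1 or L2 but never both, so capacities add.
    """
    l1_total_kb = l1_kb_per_core * cores
    if policy == "inclusive":
        return l2_kb
    if policy == "exclusive":
        return l1_total_kb + l2_kb
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical configuration: 4 cores, 64 KB private L1 each, 512 KB shared L2.
print(max_unique_capacity_kb(64, 4, 512, "inclusive"))  # 512
print(max_unique_capacity_kb(64, 4, 512, "exclusive"))  # 768
```

A non-inclusive cache that allows but does not require duplication falls between these two bounds. The gap grows with the ratio of total L1 size to L2 size, which is why the policy choice matters more as the core count rises.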
     Some companies and research institutions have decided to construct high performance computers based on Loongson multi-core CPUs. To understand the characteristics of the Loongson CPU under scientific applications more precisely, we profiled its performance. Based on this study, we proposed some ideas for optimizing its architecture. Experiments showed that our work can reduce the L2 cache miss rate by up to 50%.
     The network of a high performance computer is key to its performance. We have studied a new network, MPU, including its mathematical model, topology, and routing algorithm. We proved theoretically that MPU performs better than some other modern networks. We then studied MPUS, a large-scale parallel simulator for MPU, including its architecture and working procedures. Experiments showed that the design of MPU is correct and scales well.
     KD-50-I is the first entirely made-in-China teraflops high performance computer based on the Loongson CPU. We have studied its architecture and its optimization. Our work contributes to the development and use of entirely made-in-China high performance computers.

