Research on Fault-Tolerance Techniques for Heterogeneous Parallel Computers
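Abstract
Parallel computing is the principal means of achieving ultra-high-performance computing. As GPGPU performance continues to improve, heterogeneous parallel systems built from CPUs and GPUs have become a research focus in high-performance computing. However, as parallel computing systems keep growing in scale, high-performance computers face severe challenges. Because heterogeneous parallel systems have a more complex architecture and distinctive properties of their own, and because commodity GPGPUs offer only weak fault tolerance, the reliability problem of large-scale CPU-GPU heterogeneous parallel systems is especially acute, and practical fault-tolerance techniques are still lacking.
     This work studies fault-tolerance techniques for heterogeneous parallel computers. Taking the propagation behavior of hardware faults through software in heterogeneous parallel systems as its theoretical basis, it studies how to reduce the amount of data saved by application-level checkpointing; analyzes the problem of minimizing the global overhead of multiple checkpoints in heterogeneous parallel systems and proposes placement schemes; and proposes RB-TMR, a new GPU-oriented multi-replica fault-tolerance technique for heterogeneous parallel systems, whose key mechanisms are studied, designed, and implemented in detail. The main contributions are as follows: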
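     1. An acceptance model for general computing systems is proposed. Definitions are established for the acceptance of a program's execution result and for its acceptance degree, and then extended to the acceptance and acceptance degree of multiple executions; on this basis, theorems and corollaries about the acceptance degree are derived. These theorems and corollaries are extended to heterogeneous parallel systems, and an acceptance model for such systems is built. Case studies then analyze how two common fault-tolerance techniques, checkpoint/restart and TMR, affect the acceptance model when applied to heterogeneous parallel systems, yielding guidance and optimization methods for fault-tolerance mechanisms.
     2. Based on the theory of inter-procedural dependence, a method is proposed for describing how hardware faults propagate through software in heterogeneous parallel systems composed of CPUs and GPUs; we call it the fault propagation model. Guided by this model, a checkpointing mechanism is designed for such systems, and the amount of data saved per checkpoint, one of the main factors in checkpoint/restart overhead, is optimized. Experiments show that the optimization effectively reduces overhead and improves fault-tolerance performance.
     3. The problem of minimizing the global overhead of multiple checkpoints in heterogeneous parallel systems is studied in depth, and optimized placement methods that minimize this overhead are proposed for both synchronous and asynchronous mechanisms. Two basic problems in optimizing the placement of multiple checkpoints are first posed. Then, from an analysis of the architecture and program characteristics of heterogeneous parallel systems, two checkpoint placement methods are proposed, one synchronous and one asynchronous. For each of the two placement problems, both mechanisms are analyzed and modeled mathematically, and corresponding solution algorithms are given.
     4. RB-TMR, a fault-tolerance technique that combines a rollback mechanism with TMR, is proposed. It tolerates both fail-stop faults and transient faults. We give an implementation method for the technique and analyze the design of its key mechanisms in light of the architecture and programming model of heterogeneous parallel systems. A source-to-source compilation tool for RB-TMR is also designed and implemented; it helps users apply RB-TMR to CUDA programs and reduces the implementation burden. Experimental results show that RB-TMR achieves high error detection and correction rates and effectively lowers the probability that rollback recovery is needed; in an overall assessment, it offers better fault-tolerance performance than conventional checkpointing and TMR.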
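The idea behind the fourth contribution, combining TMR voting with rollback, can be illustrated with a minimal host-side sketch. This is not the RB-TMR implementation described in the abstract, which targets CUDA programs and also covers fail-stop faults; it only assumes that a computation is executed three times from a saved input, that a single disagreeing replica is outvoted, and that a three-way disagreement forces re-execution from the checkpoint. All names here (`majority_vote`, `run_with_rollback_tmr`, `make_faulty_step`) are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Set, Tuple


def majority_vote(results: Sequence[int]) -> Optional[int]:
    """Return the value that at least two replicas agree on, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None


def run_with_rollback_tmr(
    step: Callable[[int], int], checkpoint: int, max_rollbacks: int = 3
) -> Tuple[int, int]:
    """Execute `step` three times from the saved input and vote on the results.

    Returns (voted_result, rollbacks_used). One faulty replica is outvoted
    without any rollback; only a three-way disagreement triggers
    re-execution from the checkpoint.
    """
    for rollbacks in range(max_rollbacks + 1):
        results = [step(checkpoint) for _ in range(3)]  # three redundant runs
        voted = majority_vote(results)
        if voted is not None:
            return voted, rollbacks
        # No two replicas agree: discard the results and restart from the checkpoint.
    raise RuntimeError("no majority after the allowed number of rollbacks")


def make_faulty_step(faulty_calls: Set[int]) -> Callable[[int], int]:
    """Hypothetical fault injector: corrupts the result of the listed call numbers."""
    calls = 0

    def step(x: int) -> int:
        nonlocal calls
        calls += 1
        result = x * x + 1
        if calls in faulty_calls:
            result += calls  # a different wrong value for each faulty call
        return result

    return step
```

With one injected fault the vote corrects the result directly, while two simultaneous faults leave no majority and cost one rollback. This is the sense in which voting lowers the probability that rollback recovery is needed.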
Parallel computing is the principal means of achieving ultra-high-performance computing. As the performance of GPGPU (general-purpose computation on graphics processing units) keeps improving, heterogeneous parallel systems built from CPUs and GPUs have become a hot research topic in high-performance computing. However, as parallel computing systems grow in size, high-performance computers face serious challenges. Because heterogeneous parallel systems have a more complicated architecture and unique features, and because commodity GPGPUs have weak fault tolerance, large-scale heterogeneous parallel systems built from CPUs and GPUs face an acute reliability problem for which practical fault-tolerance techniques are still lacking.
     This paper studies fault-tolerance techniques for heterogeneous parallel systems. Based on how hardware errors propagate through software in such systems, it reduces the checkpoint size of application-level checkpointing, optimizes the global overhead of multiple checkpoints and proposes a placement scheme, and presents RB-TMR, a GPU-oriented multi-replica fault-tolerance technique. The main contributions of the paper are summarized as follows:
     1. An acceptance model for general computing systems is proposed. The acceptance of a program's execution result and its acceptance degree are defined first, followed by the acceptance and acceptance degree of multiple executions. From these definitions, theorems and corollaries about the acceptance degree are derived. The theorems and corollaries are then extended to heterogeneous parallel systems, and an acceptance model for such systems is established. Case studies analyze how two common fault-tolerance techniques, checkpoint/restart and TMR, affect the acceptance model when applied to heterogeneous parallel systems, yielding guidance and optimization methods for fault-tolerance mechanisms.
     2. Based on the theory of inter-procedural dependence, an error propagation model is proposed that describes how hardware errors propagate through software in CPU-GPU heterogeneous parallel systems. Using this model, a checkpointing mechanism for such systems is designed and the checkpoint size is optimized. Experimental results show that the method effectively reduces overhead and improves fault-tolerance performance.
     3. To minimize the global overhead of multiple checkpoints, this paper proposes placement optimization methods for both synchronous and asynchronous mechanisms in heterogeneous parallel systems. First, two basic problems in optimizing the placement of multiple checkpoints are posed. Second, based on an analysis of architecture and program features, two checkpoint placement methods for heterogeneous parallel systems are proposed: synchronous checkpoint placement and asynchronous checkpoint placement. Finally, for each of the two problems, both methods are analyzed and modeled, and solution algorithms are given.
     4. RB-TMR, a fault-tolerance technique that combines a rollback mechanism with TMR, is proposed. It tolerates both fail-stop faults and transient faults. We implement the technique according to the architecture and program features of heterogeneous parallel systems. In addition, a source-to-source compilation tool is designed for RB-TMR; it helps users apply the technique to CUDA programs and reduces their implementation burden. Experimental results show that RB-TMR achieves high error detection and correction rates and lowers the probability of rollback. Overall, RB-TMR offers better fault-tolerance performance than conventional checkpointing and TMR.
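The trade-off behind the checkpoint overhead problem in the third contribution can be shown with the classical single-level baseline, Young's first-order approximation of the optimum checkpoint interval. This is not the synchronous or asynchronous multi-checkpoint model proposed in the paper; it is only a sketch of why an optimum exists, assuming a fixed cost per checkpoint and failures with a known mean time between failures (MTBF). The function names and the example numbers are illustrative assumptions.

```python
import math


def expected_overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order overhead per unit of useful work.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework, since a failure loses on
                                 average half an interval of computation
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Interval minimizing the first-order overhead: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    cost, mtbf = 60.0, 86400.0  # assumed: 60 s per checkpoint, one failure per day
    tau = young_interval(cost, mtbf)
    print(f"interval = {tau:.1f} s, overhead = {expected_overhead_fraction(tau, cost, mtbf):.4f}")
```

Checkpointing more often than this interval spends more time saving state than it saves in rework, and checkpointing less often does the opposite. Placement methods for heterogeneous systems have to balance the same two terms across CPU and GPU work.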
