Research on Optimizing the Cache Subsystem of Chip Multiprocessors

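English title: Optimizations of Memory Subsystem for Chip Multiprocessor Systems
Author: Li Jianhua (李建华)
Degree level: Doctoral
Discipline: Computer Software and Theory
English keywords: Chip Multiprocessor ; On-Chip Networks ; Cache Coherence ; STT-RAM ; Multicast Routing ; Cache Memory ; Network Partition ; Hybrid Cache
Degree year: 2013
Advisors: Xu Yinlong (许胤龙) ; Xue Chun (薛春)
Discipline code: 081202
Degree-granting institution: University of Science and Technology of China
Thesis submission date: 2013-05-01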

Abstract

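Today's chip multiprocessors (CMPs) need large-capacity cache systems to narrow the performance gap between fast processors and slow off-chip main memory. This thesis argues that the characteristics of CMPs can be exploited to optimize the performance and power consumption of their cache subsystems, and it studies several mechanisms for doing so. Specifically, the research covers three topics: 1) designing efficient multicast routing algorithms to improve on-chip network performance; 2) using emerging non-volatile memories to design low-power caches for CMPs; and 3) exploiting thread progress information to design more efficient cache coherence protocols.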
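     For the first topic, we propose an efficient multicast routing mechanism for on-chip networks. As CMPs integrate more and more cores, the on-chip network provides an efficient, scalable communication infrastructure. One-to-many communication is common in on-chip networks for multicore architectures. Without effective multicast routing support, conventional unicast-based on-chip networks handle such multicast traffic inefficiently. This thesis proposes a multicast routing mechanism based on network partitioning, called DPM. DPM effectively reduces both the average packet latency and the power consumption of the on-chip network. Specifically, DPM selects routes dynamically according to the current level of load balance in the network and the link-sharing characteristics of the multicast traffic.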
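     The second topic is the use of an emerging non-volatile memory, spin-transfer torque random access memory (STT-RAM), to design low-power caches for CMPs. STT-RAM offers fast access, high storage density, and negligible leakage power. However, large-scale adoption of STT-RAM as a multicore cache is constrained by its long write latency and high write power. Recent studies show that reducing the data retention time of the STT-RAM storage cell, the magnetic tunnel junction (MTJ), effectively improves write performance. STT-RAM with reduced retention time is volatile, however, and its cells must be refreshed periodically to avoid data loss. When such STT-RAM is used as the last-level cache (LLC) of a multicore processor, frequent refresh operations both increase energy consumption and degrade system performance. This thesis proposes an efficient refresh scheme, called CCear, that minimizes refresh operations on this kind of STT-RAM. CCear eliminates unnecessary refreshes mainly by interacting with the cache coherence protocol and the cache management algorithm.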
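     Finally, we propose an efficient coherence adaptation mechanism to improve the performance of parallel programs running on CMPs. A main goal of CMPs is to keep improving application performance by exploiting thread-level parallelism. For multithreaded programs running on such systems, however, uneven task assignment and contention for shared resources usually cause different threads to progress at different rates. This uneven progress is one of the biggest bottlenecks for multithreaded program performance. Because of the synchronization mechanisms inherent in multithreaded programs, such as memory barriers and locks, cores running faster threads must stop and wait for cores running slower ones. This idle waiting not only lowers system performance but also wastes power. This thesis proposes a thread-progress-aware coherence adaptation mechanism, called TEACA. TEACA uses thread progress information to dynamically adjust the coherence policy of each thread, with the aim of using on-chip network bandwidth more efficiently and reducing power consumption. Specifically, TEACA dynamically divides threads into two classes, leader threads and laggard threads, and then applies a class-specific coherence policy to each thread's coherence requests.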
Modern chip multiprocessors (CMPs) employ large cache memories to reduce the performance gap between processors and off-chip memory. This thesis argues that the particular characteristics of CMP systems can be exploited to improve energy efficiency and performance in the memory hierarchy. The research presented here investigates several mechanisms for optimizing the CMP memory system. Specifically, we target three problems: 1) designing efficient multicast routing algorithms to improve on-chip network performance, 2) exploiting emerging non-volatile memories to design low-power cache memories for CMP systems, and 3) exploiting thread progress information to design high-performance cache coherence protocols.
     For the first research topic, we propose an efficient multicast routing mechanism for on-chip networks. For CMP systems with increasing core counts, the on-chip network provides an efficient and scalable interconnection paradigm, and one-to-many (multicast) communication is common on such platforms. Without efficient multicast support, traditional unicast on-chip networks handle such multicast communication inefficiently. In this thesis, we propose dual partitioning multicasting (DPM), which significantly reduces packet latency and on-chip network power dissipation. Specifically, the DPM scheme makes routing decisions adaptively, based on the network load-balance level and on the link-sharing patterns characterized by the distribution of the multicast destinations.
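     The cost of serving multicast traffic with unicast packets can be illustrated with a small sketch. It is not the DPM algorithm, whose details the abstract does not give; it only counts link traversals on a 2D mesh under dimension-ordered (XY) routing, once with a separate packet per destination and once with links shared across destinations. The function names and the mesh coordinates are assumptions made for illustration.

```python
def xy_route(src, dst):
    """Return the directed links (node, next_node) on the XY path from src to dst."""
    x, y = src
    dst_x, dst_y = dst
    links = []
    # Travel along the X dimension first.
    while x != dst_x:
        next_x = x + (1 if dst_x > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    # Then travel along the Y dimension.
    while y != dst_y:
        next_y = y + (1 if dst_y > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def unicast_link_traversals(src, dests):
    """One packet per destination: every packet pays for its full path."""
    return sum(len(xy_route(src, d)) for d in dests)


def multicast_link_traversals(src, dests):
    """One packet replicated at branch points: each shared link is crossed once."""
    shared_links = set()
    for d in dests:
        shared_links.update(xy_route(src, d))
    return len(shared_links)


if __name__ == "__main__":
    source = (0, 0)
    destinations = [(3, 0), (3, 1), (3, 2)]
    print(unicast_link_traversals(source, destinations))    # 12
    print(multicast_link_traversals(source, destinations))  # 5
```

     In this example the three destinations share the path along the first row, so link sharing cuts the traversals from 12 to 5. How much sharing is available depends on how the destinations are distributed, which is the property the abstract says DPM takes into account.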
     For our second research topic, we propose to exploit an emerging non-volatile memory, spin-transfer torque RAM (STT-RAM), to design low-power cache memories. STT-RAM has fast read access, high storage density, and negligible leakage power. However, the wide adoption of STT-RAM as cache memory is impeded by its long write latency and high write power. The write performance of STT-RAM can be improved by relaxing the retention time of its cell, the magnetic tunnel junction (MTJ). The resulting volatile STT-RAM must be refreshed periodically to prevent data loss. When it is used as the large last-level cache in CMP systems, the frequent refresh operations can dissipate significant extra energy. In addition, the refreshes can conflict severely with normal read and write operations and degrade overall system performance. In this thesis, we propose cache coherence enabled adaptive refresh (CCear) to minimize the number of refresh operations for volatile STT-RAM. CCear does so by interacting with the cache coherence protocol and the cache management policy.
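     The idea of using coherence and cache management information to avoid refreshes can be sketched as follows. This is a hypothetical policy written for illustration, not the CCear design: it assumes MESI coherence states and a `predicted_dead` hint from the cache management policy, neither of which is specified in the abstract.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    state: str            # assumed MESI coherence state: "M", "E", "S" or "I"
    predicted_dead: bool  # hypothetical hint: line is unlikely to be reused


def refresh_plan(lines):
    """Decide, per line, whether to 'refresh', 'writeback' or 'skip'."""
    plan = []
    for line in lines:
        if line.state == "I":
            # Invalid lines hold no useful data, so nothing can be lost.
            plan.append("skip")
        elif line.predicted_dead and line.state == "M":
            # Dirty data must be saved to memory before the cell expires.
            plan.append("writeback")
        elif line.predicted_dead:
            # A clean line can be refetched from memory if it is needed again.
            plan.append("skip")
        else:
            # Live data: keep it by refreshing the cell.
            plan.append("refresh")
    return plan


if __name__ == "__main__":
    lines = [
        CacheLine("I", False),
        CacheLine("M", True),
        CacheLine("S", True),
        CacheLine("E", False),
        CacheLine("M", False),
    ]
    print(refresh_plan(lines))
```

     Under these assumptions only two of the five lines are refreshed; the others are either already invalid or can be allowed to expire safely.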
     Finally, we propose an efficient coherence adaptation mechanism to improve the performance of cache coherence protocols in CMP systems. One primary objective of CMP systems is to speed up application execution by exploiting thread-level parallelism. In such systems, threads typically exhibit unbalanced progress stemming from unequal cache misses or task assignment. Load imbalance is one of the biggest roadblocks for parallel application performance. Because of inherent synchronization primitives such as barriers and locks, cores running fast threads have to waste precious cycles waiting for slow cores. In this thesis, we propose thread progress aware coherence adaption (TEACA), which uses thread progress information as a hint for adapting hybrid coherence protocols. Specifically, TEACA fuses memory system statistics to estimate the progress of threads. Based on the estimated thread progress, TEACA dynamically categorizes threads into leader threads and laggard threads. The categorization decisions are then leveraged for efficient coherence adaptation in hybrid coherence protocols.
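     The categorization step can be sketched in a few lines. This is a minimal illustration, not TEACA's estimator: it assumes that progress is approximated by a weighted sum of per-thread cache miss counts and that threads above the median stall estimate are laggards. The penalty weights, the median cutoff, and all names are hypothetical.

```python
from statistics import median


def estimate_stall(l1_misses, l2_misses, l1_penalty=10, l2_penalty=100):
    """Fuse two miss counters into one stall estimate (weights are assumed)."""
    return l1_misses * l1_penalty + l2_misses * l2_penalty


def categorize(stats):
    """Map thread id -> 'leader' or 'laggard'.

    stats maps thread id -> (l1_misses, l2_misses). Threads whose estimated
    stall exceeds the median are treated as laggards (an assumed cutoff).
    """
    stall = {tid: estimate_stall(*misses) for tid, misses in stats.items()}
    cutoff = median(stall.values())
    return {
        tid: ("laggard" if cycles > cutoff else "leader")
        for tid, cycles in stall.items()
    }


if __name__ == "__main__":
    per_thread_misses = {0: (100, 5), 1: (400, 40), 2: (120, 8), 3: (90, 2)}
    print(categorize(per_thread_misses))
    # {0: 'leader', 1: 'laggard', 2: 'laggard', 3: 'leader'}
```

     A coherence controller could then look up the class of the requesting thread and pick a policy per class, which is the role the abstract assigns to the categorization decisions.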

