Cross-Layered Application Layer Optimization Framework for Networks-on-Chips

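English title: Cross-Layered Application Layer Optimization Framework for Networks-on-Chips
Author: Wang Xiaohang (王小航)
Degree level: Doctoral
Discipline: Signal and Information Processing
Chinese keywords: 片上网络 ; 应用映射 ; 多播路由 ; 死锁避免
English keywords: networks-on-chips ; application mapping ; multicasting ; deadlock avoidance
Degree year: 2011
Advisors: Liu Peng (刘鹏) ; Yang Mei (杨梅)
Discipline code: 081002
Degree-granting institution: Zhejiang University
Thesis submission date: 2011-04-01
Defense committee chair: Shi Jiaoying (石教英)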

Abstract

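With the advance of integrated circuit technology, the number of transistors integrated on a chip grows exponentially. Once process technology scales below 65 nm, wire delay no longer decreases as the feature size shrinks, and chip performance cannot improve in proportion to the growth in device count and clock frequency. As application complexity increases, multiple processor cores, memories, and other intellectual property cores (IP cores) can be integrated on a single chip and interconnected by a network-on-chip (NoC). NoC design must address requirements and challenges including low energy consumption, high bandwidth, low transmission latency, scalability, and reliability. Considering these challenges and requirements from a system perspective, and building on the layered interactive building block methodology, this thesis focuses on three key problems in NoC design that urgently need solving: application mapping, multicast routing, and deadlock caused by message dependencies.
     First, for 2D NoCs, this thesis proposes an application mapping algorithm that is aware of both application communication characteristics and topology: template-aware efficient mapping (TEM). Based on the characteristics of an application's communication trace graph, TEM divides applications into two classes: 1) applications with communication hot spots and 2) applications with relatively uniform communication. For the first class, TEM maps the hot spots and the nodes connected to them onto nearby resource nodes; for the second class, TEM maps by graph partitioning. TEM can be used with 2D mesh, torus, and butterfly fat tree topologies. Using the TEM mapping result as the initial population of a genetic algorithm (GA) yields a further optimized mapping algorithm, GA+TEM. With communication traces from the SPLASH-2 benchmarks as input to the Noxim NoC simulator, experiments show that GA+TEM reduces communication energy by 5% to 20% compared with GA alone.
     This thesis further proposes an incremental application mapping algorithm for 3D integrated NoCs: the energy efficient run-time incremental mapping framework (ERIM). Based on the task graph of a newly arrived application, ERIM classifies the application as 1) communication-intensive or 2) computation-intensive. For both types, ERIM first finds a cuboid region of resource nodes, to reduce the impact on applications that have not yet arrived. For communication-intensive applications, ERIM lowers communication energy by exploiting the additional connectivity in the vertical direction. For computation-intensive applications, ERIM tries to balance the temperature among the processor cores running tasks on each vertical stack so that the temperature does not exceed the threshold. Experiments show that the mappings produced by ERIM consume up to 15% less energy than those of two greedy mapping algorithms.
     Next, we consider how to perform multicast routing when multiple applications are allocated to the same NoC system and the sub-region of each application may be irregular. This thesis proposes a multicast routing strategy for irregular sub-regions, which works as follows: starting from an existing multicast routing algorithm, such as multicast XY routing, when the network node connected to an output port is found not to lie in the same sub-region, another direction (called the alternative direction) is selected. Based on this strategy, the thesis proposes an alternative multicasting XY routing algorithm (AL+XY) for 2D sub-regions and an alternative multicasting XYZ routing algorithm (AL+XYZ) for 3D sub-regions. Experiments show that AL+XY has lower energy consumption and latency than both multiple unicast and intra-region broadcast. At a multicast-to-unicast ratio of 0.3 and an injection rate of 0.4 flit/cycle, the energy consumption of multiple unicast and of sub-region broadcast is 2.2 times and 2 times that of AL+XY, respectively. At the same ratio and injection rate, their latency is 11 times and 1.2 times that of AL+XY, respectively. AL+XY can be extended to AL+XYZ multicast routing for sub-region multicast in 3D integrated NoCs. The AL+XY and AL+XYZ routers were synthesized in a TSMC 65 nm process and operate at 800 MHz. The AL+XY router is 3% larger in area than a unicast router for a 2D mesh NoC, and the AL+XYZ router is 7% larger than a unicast router for a 3D integrated NoC.
     Finally, this thesis proposes a method for avoiding deadlock caused by request-request type message dependencies, which can arise in peer-to-peer streaming systems. Message-dependent deadlock occurs because messages in the network cannot be consumed by their destination nodes and remain in the network, where they depend on one another and thus form a deadlock. The thesis proves theoretically a sufficient condition for avoiding request-request type message-dependent deadlock, and proposes avoiding such deadlock by adding non-uniform virtual channels (that is, the number of virtual channels configured at each router port may differ). Building on this theory, the thesis further proves that finding the minimum number of non-uniform virtual channels is an NP-complete problem, and proposes an approximation algorithm based on linear programming: path selection and minimum virtual channel allocation (PSMV). PSMV can be integrated with existing application mapping algorithms to produce deadlock-free mappings. The results produced by PSMV have low latency and little additional buffer overhead.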
With the continuous scaling of CMOS technologies, the number of transistors integrated on a chip grows exponentially. When CMOS technology scales below 65 nm, wire delay no longer decreases even though the feature size keeps shrinking, so chip performance cannot improve along with the increase in transistor count and clock frequency. With the growing number of transistors on a chip, multiple processor cores, memory units, and other intellectual property cores (IP cores) can be integrated into a single chip and connected by a network-on-chip (NoC). NoC design raises several challenges, including low energy consumption, high bandwidth, low transmission latency, scalability, and reliability. To meet these challenges in a systematic manner, we propose a layered interactive building block (LIB) methodology. Focusing on the application layer of the LIB methodology, we solve three outstanding problems in NoC design: application mapping, multicasting, and message-dependent deadlock avoidance.
     This thesis first proposes a mapping algorithm for 2D NoCs that is aware of both application communication behavior and topology, named template-aware efficient mapping (TEM). TEM classifies applications into two types according to their communication trace graphs: 1) applications with communication hot nodes and 2) applications without communication hot nodes. For the first type, TEM maps the hot nodes and their neighbors to tiles with minimum hop-count distance. For the second type, TEM partitions both the NoC and the application communication trace graph during mapping. TEM can be used with 2D mesh, torus, and butterfly fat tree topologies. The result from TEM can also serve as the initial population of a genetic algorithm (GA) to obtain a further optimized algorithm, GA+TEM. Evaluated with communication traces from the SPLASH-2 benchmarks on the Noxim NoC simulator, GA+TEM reduces communication energy by 5% to 20% compared with GA alone.
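The hot-node branch of TEM can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the traffic-share threshold, the centre-outward greedy placement, the hop-weighted cost, and every function name below are assumptions chosen for illustration on a 2D mesh.

```python
from itertools import product


def manhattan(a, b):
    """Hop count between two tiles of a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def node_traffic(comm):
    """Total volume sent or received per task; comm maps (src, dst) -> volume."""
    totals = {}
    for (src, dst), vol in comm.items():
        totals[src] = totals.get(src, 0) + vol
        totals[dst] = totals.get(dst, 0) + vol
    return totals


def find_hot_nodes(comm, share=0.5):
    """Tasks involved in more than `share` of all traffic (assumed threshold)."""
    total = sum(comm.values())
    return sorted(t for t, v in node_traffic(comm).items() if v > share * total)


def map_hotspot_app(comm, width, height):
    """Greedy placement: busiest task on the centre tile, the rest outward."""
    totals = node_traffic(comm)
    if len(totals) > width * height:
        raise ValueError("more tasks than tiles")
    center = (width // 2, height // 2)
    # Tiles ordered by distance from the centre, ties broken by coordinates.
    tiles = sorted(product(range(width), range(height)),
                   key=lambda t: (manhattan(t, center), t))
    # Tasks ordered by decreasing traffic, ties broken by task id.
    order = sorted(totals, key=lambda t: (-totals[t], t))
    return dict(zip(order, tiles))


def comm_cost(comm, placement):
    """Hop-weighted traffic, a common proxy for communication energy."""
    return sum(vol * manhattan(placement[s], placement[d])
               for (s, d), vol in comm.items())
```

For a star-shaped trace graph in which one task exchanges most of the traffic, this placement puts the hub on the centre tile and its partners on adjacent tiles, which is the intuition behind mapping hot nodes and their neighbors close together. A placement like this could also seed a GA population, as the abstract describes for GA+TEM.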
     This thesis also proposes a run-time incremental mapping heuristic for 3D NoCs, named the energy efficient run-time incremental mapping framework (ERIM). ERIM classifies applications into two types: 1) communication-centric and 2) computation-centric. For communication-centric applications, ERIM exploits the increased connectivity in the vertical direction to reduce communication energy consumption. For computation-centric applications, ERIM balances the temperature of the processors running tasks to avoid thermal violations. The experimental results confirm that mappings produced by ERIM reduce energy consumption by up to 15% compared with two other greedy heuristics.
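The balancing idea for computation-centric applications can be sketched as follows. This is only an illustration under stated assumptions, not ERIM itself: per-task power is used as a stand-in for temperature, the greedy largest-first rule is a generic load-balancing heuristic, and the function name `balance_stacks` is hypothetical.

```python
def balance_stacks(task_power, n_stacks):
    """Assign tasks to vertical stacks so per-stack total power stays even.

    task_power maps task id -> estimated power (assumed proxy for heat).
    Returns (assignment, totals): task -> stack index, and power per stack.
    """
    totals = [0.0] * n_stacks
    assignment = {}
    # Place the most power-hungry tasks first, each on the coolest stack so far.
    for task in sorted(task_power, key=lambda t: (-task_power[t], t)):
        stack = min(range(n_stacks), key=lambda s: (totals[s], s))
        assignment[task] = stack
        totals[stack] += task_power[task]
    return assignment, totals
```

A real thermal-aware mapper would use a thermal model rather than summed power, since tiers closer to the heat sink cool more easily; the sketch only shows the balancing objective.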
     Next, this thesis studies multicasting in irregular regions of a NoC system when multiple applications are allocated to the same NoC. It proposes a multicasting strategy for irregular regions based on the following idea: starting from an existing multicasting algorithm, e.g. multicasting XY, when the output channel of a node leads to a node that is not in the same region, an alternative direction is selected. Based on this strategy, a 2D region-oriented alternative multicasting XY routing algorithm (AL+XY) is proposed. The experimental results confirm that AL+XY reduces both communication energy consumption and average latency compared with multiple unicasting and region-based broadcasting. When the injection rate is 0.4 flit/cycle and the multicasting-to-unicasting ratio is 0.3, the communication energy consumption of multiple unicasting and of region-based broadcasting is 2.2x and 2x that of AL+XY, respectively. At the same ratio and injection rate, the average latency of multiple unicasting and of region-based broadcasting is 11x and 1.3x that of AL+XY, respectively. AL+XY can be extended to AL+XYZ, which supports region-based multicasting in 3D NoCs. The AL+XY and AL+XYZ routers were synthesized with a TSMC 65 nm library and can operate at 800 MHz. The area of an AL+XY router is 3% larger than that of a 2D unicasting router, while the area of an AL+XYZ router is 7% larger than that of a 3D unicasting router.
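The alternative-direction strategy can be sketched in Python. The abstract does not say how the alternative direction is chosen, so this sketch assumes the simplest rule: try the X direction first as in XY routing, and fall back to the productive Y direction when the X neighbor lies outside the region. The names `next_direction` and `multicast_links` are hypothetical, and the sketch makes no claim about deadlock freedom or about region shapes where both productive directions are blocked.

```python
DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def step(cur, direction):
    """Neighbor tile reached by leaving `cur` through `direction`."""
    dx, dy = DIRS[direction]
    return (cur[0] + dx, cur[1] + dy)


def next_direction(cur, dst, region):
    """XY choice if its neighbor is in the region, else the other productive one."""
    if cur == dst:
        return None  # packet is delivered locally
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    candidates = []
    if dx:
        candidates.append("E" if dx > 0 else "W")  # XY routing prefers X first
    if dy:
        candidates.append("N" if dy > 0 else "S")  # assumed alternative direction
    for direction in candidates:
        if step(cur, direction) in region:
            return direction
    raise ValueError(f"no productive in-region direction from {cur} to {dst}")


def multicast_links(src, dests, region):
    """Links used by a tree multicast that replicates only where routes diverge."""
    links = []
    pending = [(src, list(dests))]
    while pending:
        cur, remaining = pending.pop()
        groups = {}
        for dst in remaining:
            direction = next_direction(cur, dst, region)
            if direction is not None:
                groups.setdefault(direction, []).append(dst)
        for direction, group in groups.items():
            nxt = step(cur, direction)
            links.append((cur, nxt))  # one copy of the packet per output port
            pending.append((nxt, group))
    return links
```

On the L-shaped region `{(0,0), (1,0), (2,0), (0,1), (0,2)}`, a packet at `(0,2)` bound for `(2,0)` cannot go east and is sent south instead. Multicasting from `(0,2)` to `(2,0)`, `(1,0)`, and `(0,0)` traverses 4 links in this sketch, whereas three separate unicasts traverse 9, which illustrates why sharing links can save energy. These counts describe only this toy example and do not reproduce the thesis's measurements.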
     Finally, this thesis proposes a method for avoiding request-request type message-dependent deadlock in peer-to-peer streaming systems. Message-dependent deadlock arises when messages cannot be consumed by their destinations and remain in the network, so that the dependencies among these messages cause deadlock. This thesis proves a sufficient condition for avoiding request-request type message-dependent deadlock by adding non-uniform virtual channels (i.e., the number of virtual channels at each router input port may differ). Based on this theory, the thesis proves that finding the minimum number of non-uniform virtual channels is an NP-complete problem, and therefore proposes an integer linear programming based algorithm, named path selection and minimum virtual channel allocation (PSMV). PSMV can be integrated with existing mapping algorithms to generate deadlock-free mapping results. The results from PSMV have low latency and low additional buffer cost.

