并行计算平台上的数据索引技术研究

设为首页

收藏本站

网站地图 | English | 公务邮箱

远程访问

NSTL服务站

并行计算平台上的数据索引技术研究

详细信息本馆镜像全文| 推荐本文 | | 获取CNKI官网全文

英文题名：Data Index Technology Research Based on Parallel Computing Platform
作者：龙柏
论文级别：博士
学科专业名称：计算机软件与理论
中文关键词：并行计算平台 ; 数据索引 ; 一体化研究方法 ; 高效能 ; 功耗
英文关键词：Parallel computing platform ; data index ; integration of research methods ; inverted index ; high efficient ; low power
学位年度：2011
导师：陈国良
学科代码：081202
学位授予单位：中国科学技术大学
论文提交日期：2011-04-01

摘要

随着计算机的普及和网络规模的不断扩大,数字化信息爆炸式的增长。信息的产生、传播、搜集与查询成为当今世界人类最基本生活需求。如何在浩瀚的数据信息中,为用户更快、更准确的索引信息成为亟待解决的问题。与此同时,人类社会对高性能计算的应用需求也在不断推动并行计算平台的普及和提高。数据检索最主要的方法就是构建各种适合的索引,并通过对索引高效的检索数据信息。随着检索数据和用户查询的规模增大,信息检索系统要提升处理能力和处理规模,需要通过对于并行计算平台的良好应用来解决。那么如何充分的利用并行计算平台来提高信息检索系统的性能,提高系统处理数据能力,成为该领域研究的重要问题。
     随着并行计算平台的发展,将索引应用于并行计算平台是一个必然的趋势。本文采用并行计算“结构—算法—编程—应用”一体化的研究方法,围绕如何解决并行计算平台上的数据索引系统的运行效率、如何完成数据索引系统从串行到并行的良好过渡、如何实现数据索引系统的高效能运作等问题展开深入分析研究；有效的解决了此类问题从串行计算时代到并行计算时代过渡中出现的障碍,有助于建立并行计算的科学研究体系,增强并行计算平台的实际应用能力。本文针对目前互联网上应用最广泛,最普遍的高维数据、文本数据和时间序列数据,分别提出了基于并行计算平台的HKD-tree混合索引结构、并行计算平台上的可实时更新索引结构和基于并行计算平台的时间序列索引结构。通过将KD-tree和LSH的有效结合提出一个有效的混合索引结构HKD-tree,并且适时予以并行化使其与SMP机群系统结构相匹配,从而提出了并行计算平台上的高维数据索引问题的一个有效解决方案。通过改进传统倒排索引结构的单一模式,利用由主、辅倒排索引和内容过滤索引构,满足了索引的实时性要求,并在一定程度上实现了索引过程的高效能。通过对时间序列数据的分析,提出一种可应用于并行计算平台的时间序列索引结构并进行相应的功耗分析,打破了传统索引方式只注重索引效率而忽视索引效能的单一思路,实现了系统索引过程的高效能低功耗。
     综上所述,本文针对并行计算平台上的数据索引技术的研究,可以有效地提高数据索引的运行效率和并行性能,充分发挥并行计算平台的计算能力,具有一定的理论意义和广泛的应用前景。具体而言,本文的主要研究成果、贡献和创新点可概括为以下几点：
     1.基于并行计算平台提出HKD-tree混合索引结构。该结构将KD-tree和LSH两种索引结构进行组合,利用KD-tree作为上层结构的主干而LSH充当叶子节点,从而可以利用多核机群系统的层次并行结构特性,与传统的索引结构相比,该混合索引结构具有高效并行处理、可扩展性好等特点,适于SMP机群系统平台上的高维数据索引。
     2.基于国产并行计算平台KD60提出一种可实时更新的倒排索引结构。该方案打破了传统倒排索引结构的单一模式,由主、辅倒排索引和内容过滤索引构,满足了索引的实时性要求。该方案成功应用于国产万亿次高性能绿色计算平台KD60,实现了索引过程的低功耗,同时解决了搜索引擎的功耗问题。实验证明,基于KD60平台的倒排索引结构很好的解决了索引的实时更新问题,并在绿色计算的实际应用中具有良好的高效能表现。
     3.基于并行计算平台的时间序列索引。提出一种基于并行计算平台的时间序列索引,并针对该索引结构提出相应的并行功耗分析模型。该方案在解决时间序列索引问题的同时,打破了传统索引方式只注重索引效率而忽视索引效能的单一思路,实现了系统索引过程的高效能低功耗。实验证明,基于并行计算平台的时间序列索引很好的解决了时间序列索引的性能问题,在索引的功耗方面也有良好表现。
With the popularity of personal computers and the expansion of computer networks, digital information grows explosively. The generation, dissemination, collection and retrieval of information are now the most basic need of the human life. As is known, data index is the critical component of information retrieval system, which is used to manage and query data effectively and efficiently. However, with the requirement of energy-efficient computing as well as the growth of data and user queries, conventional index technique which improves the processing capacity by means of increasing the number of ordinary computers encounters an unavoidable bottleneck. Meanwhile, the high productivity parallel computing systems are greatly pushed forward due to the development of high performance computing applications related with people's daily lives, such as web search engines and social networks. Hence, how to make use of low-power parallel computing platforms to improve the performance and capacity of information retrieval systems attracts the growing attention of researchers in this field.
     With the development of parallel computing platforms, there is an inevitable trend that building index application on the parallel computing platforms. In this paper, we use the integration research methods of parallel "Architecture - Algorithms Programming - Application", and deeply study how to promote the performance of data intensive index, how to parallelize the data intensive index system form the serial version, and how to achieve the good system cost-efficiency. We overcome the research barriers that appeared in the transition from the serial era to the parallel era, and our methods are helpful in refining the parallel computing research system and enhance the application power of parallel computing platforms. Meanwhile, using high-dimensional data indexing as a typical application in parallel computing platform, we propose the HKD-tree hybrid index structure and the time series index, which solve data index problems on parallel computing platform. On the other hand, focusing the hotspot in data retrieval system, we deeply research the real-time timeliness, the performance and power consumption problem. By improving the traditional inverted index and solve the real-time problem in data-intensive system on parallel computing, the high effectiveness of index process is achieved. In summary, this paper study data index on parallel computing platforms, which can enhance the index efficiency and parallel performance, so that the parallel computing power is fully utilized. Our methods have some significance and wide application prospects. The main research contributions and innovations can be summarized as follows:
     1. HKD-tree hybrid index structure based on parallel computing platform. This structure combines KD-tree and Local Sensitive Hash (LSH), uses KD-tree structure as the upper trunk and LSH structures as leaf, which can take advantage of the hierarchical feature of parallel structure of multi-core cluster system. Compared with the traditional index structure, this hybrid index structure has good parallel efficiency and scalability, and is suitable for multi-core cluster system and high-dimensional data indexing problem.
     2. A real-time updating inverted index based on KD-60 domestic parallel platform. This scheme uses a main and an auxiliary inverted index together with the content filtering index, and realizes the real-time query and the index of the combination of content filtering, enabling a search process in real time. At the same time, we applied this index on high-performance green computing platform KD-60, so that the high cost-efficiency is achieved in a certain degree.
     3. A time series index structure based on parallel computing platform. We proposed a time series index on parallel computing platform and the power analysis model of it. Besides solve the time series index problem, this scheme realizes high efficiency and low power consume, instead of consider performance only. Experiments show that the time series index on parallel computing platform gives a good solution to the performance problem of time series indexing, as well as low power consumption.

引文

[1]陈国良,“并行计算——结构·算法·编程,”[M].北京：高等教育出版社,1999.
    [2]B. Schroeder and G. A. Gibson, "A large-scale study of failures in high-performance computing systems," Dependable and Secure Computing, IEEE Transactions on, vol.7, pp.337-351,2010.
    [3]A. B. M. Kristensen, "Parallel Computing Platform," 2011.
    [4]A. Grama and S. T. B.O.. Introduction to parallel computing vol.752: Addison-Wesley,2003.
    [5]P. S. Pacheco, Parallel programming with MPI:Morgan Kaufmann,1997.
    [6]G. Golub and J. M. Ortega, Scientific computing:an introduction with parallel computing:Academic Press Professional, Inc. San Diego, CA, USA, 1993.
    [7]J. J. Dongarra, et al., Sourcebook of parallel computing:Morgan Kaufmann Publishers,2003.
    [8]陈国良,并行计算机体系结构：高等教育出版社,2002.
    [9]A. K. Jain, et al., "Data clustering:a review," ACM Computing Surveys (CSUR), vol.31, pp.264-323,1999.
    [10]R. F. Rosin, "Von Neumann machine," 2003, p.1842.
    [11]P. A. Wilkinson, et al., "Floating point for simid array machine," ed:Google Patents,1998.
    [12]夏培肃and唐志敏,“高度并行计算机体系结构,”计算机世界,vol.440,1993.
    [13]Z. Zhang, et al., "PVP protective mechanism of ultrafine silver powder synthesized by chemical reduction processes," Journal of Solid State Chemistry, vol.121, pp.105-110,1996.
    [14]李国杰and陈鸿安,“曙光一号并行计算机,”计算机学报,vol.17,pp.882-889,1994.
    [15]孙凝晖and刘宏,“曙光1000大规模并行计算机系统软件的设计,”计算机学报,vol.20,pp.259-268,1997.
    [16]孙凝晖and徐志伟,“曙光2000超级计算机系统软件的设计,”计算机学报,vol.23,pp.9-20,2000.
    [17]D. E. Culler, et al., "Parallel computing on the Berkeley NOW," 1997.
    [18]W. M. Cardoza, et al., "Design of the TruCluster multicomputer system for the Digital UNIX environment," Digital Technical Journal, vol.8, pp.5-17,1996.
    [19]陈国良,并行算法的设计与分析.高等教育出版社,2009.
    [20]J. P. Nolan, "Numerical calculation of stable densities and distribution functions," Stochastic Models, vol.13, pp.759-774,1997.
    [21]X. Wang, et al., "Flexible cycle synchronized algorithm in parallel and distributed simulation," Distributed Computing-IWDC 2004, pp.40-45,2005.
    [22]P. E. Heegaard, et al., "Distributed asynchronous algorithm for cross-entropy-based combinatorial optimization," Rare Event Simulation and Combinatorial Optimization (RESIM/COP 2004), p.7¨C8.
    [23]陈国良,并行算法实践：高等教育出版社,2004.
    [24]W. Stallings, Computer organization and architecture:desianina for performance:Prentice hall,2009.
    [25]S. Ghemawat, et al., "The Google file system," 2003, pp.29-43.
    [26]J. Dean and S. Ghemawat, "MapReduce:Simplified data processing on large clusters," Communications of the ACM, vol.51, pp.107-113,2008.
    [27]F. Chang, et al., "Bigtable:A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol.26, pp.1-26, 2008.
    [28]M. Isard, et al., "Dryad:distributed data-parallel programs from sequential building blocks," 2007, pp.59-72.
    [29]陈国良,et al.,“并行算法研究方法学,”计算机学报，vol.31,pp.1493-1502,2008.
    [30]G. Chen, "A partitioning selection algorithm on multiprocessors," Journal of Computer Science and Technology, vol.3, pp.241-250,1988.
    [31]陈国良and梁维发，“并行图论算法研究进展,”计算机研究与发展，vol.32,pp.1-16,1995.
    [32]安虹and陈国良,“并行程序设计模型和语言,”软件学报，vol.13,pp.118-124,2002.
    [33]毛睿and黄刘生,“淮河中上游群库联合优化调度算法及并行实现,”小型微型计算机系统，vol.21,pp.603-607,2000.
    [34]G. Chen, et al., Proceedings of the Inaugural Symposium on Parallel Algorithms, Architectures and Programming:Sept.16-Sept.18,2008, Hefei, China:University of Science and Technology of China Press,2008.
    [35]R. Datta, et al., "Image retrieval:Ideas, influences, and trends of the new age," ACM Computing Surveys (CSUR), vol.40, p.5,2008.
    [36]薛向阳and罗航哉,“LIFT：一种用于高维数据的索引结构,”电子学报，vol.29,pp.192-195,2001.
    [37]M. Gschwind, et al., "Synergistic processing in cell's multicore architecture," Micro, IEEE, vol.26, pp.10-24,2006.
    [38]A. Hung, et al., "Symmetric multiprocessing on programmable chips made easy," 2005.
    [39]J. L. Bentley, "Multidimensional binary search trees in database applications," Software Engineering, IEEE Transactions on, pp.333-340, 1979.
    [40]A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," 2006.
    [41]M. Datar, et al., "Locality-sensitive hashing scheme based on p-stable distributions," 2004, pp.253-262.
    [42]P. Indyk, "Stable distributions, pseudorandom generators, embeddings, and data stream computation," Journal of the ACM (JACM), vol.53, pp. 307-323,2006.
    [43]J. T. Robinson, "The KDB-tree:a search structure for large multidimensional dynamic indexes," 1981, pp.10-18.
    [44]B. Yu, et al., "Kdb kd-tree:A compact kdb-tree structure for indexing multidimensional data," 2003.
    [45]T. Sellis, et al., "The R-tree:A dynamic index for multi-dimensional objects," The VLDB Journal, p.507¨C518,1987.
    [46]Y. Theodoridis and T. Sellis, "A model for the prediction of R-tree performance," 1996, pp.161-171.
    [47]U. Deppisch, "S-tree:a dynamic balanced signature index for office retrieval," 1986, pp.77-87.
    [48]M. Hrkic and J. Lillis, "S-Tree:a technique for buffered routing tree synthesis," 2002.
    [49]T. K. Dang, et al., "The sh-tree:A super hybrid index structure for multidimensional data," 2001, p.340.
    [50]D. T. Khanh, "The sh-tree:A novel and flexible super hybrid index structure for similarity search on multidimensional data," International Journal of Computer Science & Applications, vol.3,2006.
    [51]王韬and李晓明,"SMPCluster:如何开发两级并行,”计算机工程与科学，vol.24,pp.78-80,2002.
    [52]孙琦,et al.,"特征向量自适应估计算法及在谱估计中的应用,”吉林大学学报：工学版，vol.36,pp.766-771,2006.
    [53]L. Dagum and R. Menon, "OpenMP:an industry standard API for shared-memory programming," Computational Science & Engineering, IEEE, vol.5, pp.46-55,1998.
    [54]朱霞,“线程级并行的硬件技术研究,”西北工业大学,2003.
    [55]P. Indyk and R. Motwani, "Approximate nearest neighbors:towards removing the curse of dimensionality," 1998, pp.604-613.
    [56]R. Chandra, Parallel programming in OpenMP:Morgan Kaufmann,2001.
    [57]L. Smith and M. Bull, "Development of mixed mode MPI/OpenMP applications," Scientific Programming, vol.9, pp.83-98,2001.
    [58]F. Cappello and D. Etiemble, "MPI versus MPI+ OpenMP on the IBM SP for the NAS Benchmarks," 2000.
    [59]刘忠伟and章毓晋,“利用局部累加直方图进行彩色图象检索,”中国国家图形学报：A 辑，vol.3,pp.533-537,1998.
    [60]段瑞玲,et al.,“图像边缘检测方法研究综述,”光学技术，vol.31,pp.415-419,2005.
    [61]刘明霞,et al.,“新的纹理图像特征提取方法,”计算机应用，pp.3434-3436,2009.
    [62]S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine* 1," Computer networks and ISDN systems, vol.30, pp. 107-117,1998.
    [63]S. Chakrabarti, et al., "Focused crawling:a new approach to topic-specific Web resource discovery," Computer Networks, vol.31, pp.1623-1640,1999.
    [64]Y. K. Lee, et al., "Index structures for structured documents," 1996, pp. 91-99.
    [65]C. Consel and O. Danvy, "For a better support of static data flow," 1991, pp. 496-519.
    [66]A. Rogers, et al., "Supporting dynamic data structures on distributed-memory machines," ACM Transactions on Program Languages and Systems (TOPLAS), vol.17, pp.233-263,1995.
    [67]E. Santos Jr, et al., "I-FGM as a real time information retrieval tool for E-governance,"2010.
    [68]J. S. Culpepper and A. Moffat, "Efficient set intersection for inverted indexing," ACM Transactions on Information Systems (TOIS), vol.29, p.1, 2010.
    [69]D. Zhang, et al., "Semantic image retrieval using region based inverted file," 2009 Digital Image Computing:Techniques and Applications, pp.242-249, 2009.
    [70]J. W. Barrus, et al., "Document collection manipulation," ed:Google Patents,2010.
    [71]S. Campinas, et al., "SkipBlock:Self-indexing for Block-Based Inverted List," Advances in Information Retrieval, pp.555-561,2011.
    [72]J. Zobel and A. Moffat, "Inverted files for text search engines," ACM Computing Surveys (CSUR), vol.38, pp.6-es,2006.
    [73]E. W. Brown, et al., "Fast incremental indexing for full-text information retrieval," 1994, pp.192-192.
    [74]A. Tomasic, et al., "Incremental updates of inverted lists for text document retrieval," 1994, pp.289-300.
    [75]R. Lempel, et al., "Just in time indexing for up to the second search," 2007, pp.97-106.
    [76]T. Chiueh and L. Huang, "Efficient real-time index updates in text retrieval systems," 1999.
    [77]J. Battelle, "The search:How Google and its rivals rewrote the rules of business and transformed our culture," 2005.
    [78]X. Meng, "OVERVIEW OF GOOGLE SEARCH ENGINE."
    [79]P. Kurp, "Green computing," Communications of the ACM, vol.51, pp. 11-13,2008.
    [80]D. Brooks, et al., "Wattch:a framework for architectural-level power analysis and optimizations," 2000, pp.83-94.
    [81]D. H. Nowlin Jr, "CPU power management in non-APM systems," ed: Google Patents,1998.
    [82]J. H. Ewertz, "Method of dynamically changing the lowest sleeping state in ACPI," ed:Google Patents,2002.
    [83]赵荣彩and G. R. Gao,"编译指导的多线程低功耗技术研究,”计算机研究与发展,vol.39,pp.1572-1579,2002.
    [84]R. R. Schaller, "Moore's law:past, present and future," Spectrum, IEEE, vol.34, pp.52-59,1997.
    [85]王握文and陈明,”“天河一号”超级计算机系统研制,”国防科技pp.1-4,2009.
    [86]张俊霞,et al.,”基于龙芯2F的国产万亿次高性能计算机KD-50-I的研制,”中国科学技术大学学报，vol.38,pp.105-108,2008.
    [87]李毅,et al.,“多核龙芯3A上二级BLAS库的优化,”计算机系统应用，pp.163-167,2011.
    [88]秦玲,et al.,“浮动车交通信息处理与应用系统核心功能及实现,”公路交通科技，vol.7,p.44,2006.
    [89]谭衢霖and邵芸,“雷达遥感图像分类新技术发展研究,”国土资源遥感，pp.1-7,2001.
    [90]T. Austin, et al., "SimpleScalar:An infrastructure for computer system modeling," Computer, vol.35, pp.59-67,2002.
    [91]M. Nourani and J. Chin, "Power-time tradeoff in test scheduling for SoCs," 2003.
    [92]S. Mohagheghi, et al., "Optimal neuro-fuzzy external controller for a STATCOM in the 12-bus benchmark power system," Power Delivery, IEEE Transactions on, vol.22, pp.2548-2558,2007.
    [93]R. E. Bellman, Dynamic programming:Courier Dover Publications,2003.
    [94]I. Oseledets and E. Tyrtyshnikov, "Breaking the curse of dimensionality, or how to use SVD in many dimensions," SIAM J. Sci. Comput, vol.31, pp. 3744-3759,2009.
    [95]M. Chen, et al., "Efficient gamma index calculation using fast Euclidean distance transform," Physics in medicine and biology, vol.54, p.2037,2009.
    [96]X. L. Dong, et al., "Research on shape-based time series similarity measure," 2009, pp.1253-1258.
    [97]D. Ramanan and S. Baker, "Local distance functions:A taxonomy, new algorithms, and an evaluation," IEEE Transactions on Pattern Analysis and Machine Intelligence,2010.
    [98]D. Sart, et al., "Accelerating Dynamic Time Warping Subsequence Search with GPUs and FPGAs," 2010, pp.1001-1006.
    [99]M. Wollmer, et al., "A multidimensional dynamic time warping algorithm for efficient multimodal fusion of asynchronous data streams," Neurocomputing, vol.73, pp.366-380,2009.
    [100]E. Keogh and C. A. Ratanamahatana, "Exact indexing of dynamic time warping," Knowledge and Information Systems, vol.7, pp.358-386,2005.
    [101]D. Papadias, et al., Topological relations in the world of minimum bounding rectangles:a study with R-trees vol.24:ACM,1995.
    [102]N. Beckmann, et al., The R*-tree:an efficient and robust access method for points and rectangles vol.19:ACM,1990.
    [103]R. Buyya, "High Performance Cluster Computing:Architectures and Systems, Volume 1," Prentice Hall PTR, vol.82, pp.327-350,1999.
    [104]M. Sato, et al., "Design of OpenMP compiler for an SMP cluster," 1999, pp.32-39.
    [105]D. Bovet, et al., Understanding the Linux kernel:O'Reilly & Associates, Inc.,2002.
    [106]B. Hill, "Linux Operating System Software Debugging Techniques with Xilinx Embedded Development Platforms," 2009.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700