用户名: 密码: 验证码:
基于MapReduce的信息检索相关算法并行化研究与实现
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着Internet的日益普及与迅速发展,互联网上的信息量呈几何级数增长,信息爆炸已成为当今网络时代的特征之一。作为访问互联网的重要入口,搜索引擎在帮助用户从浩如烟海的Internet中快速准确地获得所需信息方面起到了日益重要的作用,人们的生产生活已经越来越依赖搜索引擎。搜索引擎检索的对象是整个互联网上的全部数据,包括网页、图片、音乐、视频、FTP资源等。这些海量的数据对信息检索系统的高效运行提出了新的挑战:一方面,单台计算机的处理能力受到CPU时钟频率、内存容量、磁盘读写速度和网络带宽等因素的制约,无法在理想的时间内独自处理全部的数据;另一方面,这些海量数据并非存储在单台计算机上或者单个数据库中,而是分布在整个Internet上,这就需要成千上万台计算机以“相互合作”的方式对这些海量数据进行处理。因此,为搜索引擎设计能够高效地处理海量Internet数据的并行算法成为了学术界和工业界共同的研究方向与追求目标。在过去的数十年中,并行计算领域的研究取得了长足的进步,一些经典的并行计算平台相继出现,如MPI、OpenMP、OpenCL、CUD A等,特别是Google于2004年提出的MapReduce并行计算模型,以其良好的可扩展性、可靠性和易用性,为并行计算提供了简单、高效的计算模型和运行环境,降低了并行计算从理论向应用转化的难度,为并行计算的实际应用提供了一个简单易用的平台。
     信息检索领域的传统算法发展至今已日趋成熟,然而,有些算法并非是专为并行环境设计的,面临着无法直接处理大规模的海量数据或者无法在有效的时间内完成对海量数据的计算的窘境。因此,如果能够将这些算法加以改造,使其能够分布在多台计算机上并行地运行,则可以大大提高对海量数据的处理效率,更加快速地响应人们的搜索需求,改善用户的搜索体验。在信息检索领域中,查询推荐(Query Suggestion)与网页排序(Page Rank)是两项重要的研究内容:查询推荐可以帮助用户更加精确有效地查询并节省搜索时间,而网页排序则可以改善搜索质量、帮助用户更容易地找到所需的网页。如果能够对这两个领域中的一些串行算法进行并行化改造,使其能够并行地运行于计算机集群中,则能够有效提升搜索引擎对大规模数据的处理能力,加快搜索引擎在查询推荐和网页排序方面的更新速度,提高用户对检索的满意度。
     本文研究了查询推荐领域的QUBIC算法和基于频繁项集挖掘的网页排序算法,以对海量Internet数据的并行处理作为研究背景,基于MapReduce并行计算模型对QUBIC算法和基于频繁项集挖掘的网页排序算法进行了并行化改造,使得QUBIC算法和基于频繁项集挖掘的网页排序算法能够运行于MapReduce并行计算框架之中,并利用Hadoop并行计算软件框架实现了一个原型系统。具体而言,本文的主要研究工作包含以下方面:
     (1)对QUBIC算法进行基于MapReduce模型的并行化改造,提出了数据分布和并行计算的具体方法,包括:搜索引擎日志文件的分布存储,Query-URL二部图的构造,Jaccard相似系数的计算,QAG的生成,QAG中连通分量的计算以及对Query的排序。
     (2)对传统的SON频繁项集挖掘算法进行基于MapReduce模型的并行化改造,提出频繁项集并行挖掘的PSON算法,并将其应用于对频繁URL的挖掘。在计算出搜索引擎返回结果中关联性较大的一组URL后,按照其重要程度降序呈现给用户。
     本文在Hadoop并行计算平台上实现了本文对原算法进行并行化改造的思想,并进行了实验。实验表明,本文提出的对相关算法进行并行化改造的方法是行之有效的,并且具有良好的可扩展性能和加速比性能。最后,本文实现了一个原型系统,从整体上演示了QUBIC并行算法和频繁URL并行挖掘算法的运行效果,验证了这两类算法的正确性和有效性。
With the growing population and fast development of Internet, the scale of data on Internet is increasing exponentially and information explosion has been one of the epochal features. As an important access point of Internet, search engine is playing an more and more important role in helping users find information they need precisely from massive scale of Internet, and people depend increasingly on search engine in life and production. The searching object of search engine is the total Internet, including web pages, graphs, music, vedio, FTP resource and so on. The massive data on Internet has brought a big challenge for the efficient running of information retrieval system:on one side, limited by the CPU clock frequency, memory capacity, disk I/O rate, and network band width, etc.,a single computer can not finish processing all the data alone; on the other side, the massive data is distributed over the whole Internet instead of stored in a single computer or database, which means that thousands or even tens of thousands of computers need to process these data together in a coorperational way. As a result, proposing parallel algorithms for search engine that can process massive data on Internet efficiently has become a common target of academia and industry. The research in parallel computing domain has made a great progress in the past decades, and some classical parallel computing platforms have been proposed, including MPI, OpenMP, OpenCL and CUDA, etc. Among these platforms, the MapReduce parallel computing paradigm proposed by Google in2004provides parallel computing a simple and efficient computing paradigm and environment by its excellent scalability, reliability and ease of use. MapReduce makes it easier to transform parallel computation from theory to application.
     The traditional algorithms in information retrieval domain have become increasingly mature, however, some algorithms are not designed especially for parallel computing environment, which means they cannot process massive data directly or finish processing massive data in an acceptable time period. They will run on multiple computers in parallel when parallelized, which can enhance the efficiency of processing massive data, respond to the searching needs of users faster and improve uses'searching experience. Query suggestion and page rank are two important research issues in information retrieval domain:query suggestion can help users get search results more precisely in a shorter time period, while page rank can help improve searching experience and help users find their target pages easily. If we can parallelize some of these serial algorithms in information retrieval domain and make them run in a cluster in parallel, then the processing capacity of search engine for massive data can be enhanced effectively, the update cycle of query suggestion and page rank can be shorter, and users will be more satisfied with the searching process.
     In this paper, we studied the QUBIC algorithm in query suggestion domain and page rank algorithms based on frequent sets mining in the research background of parallel processing of massive Internet data, and parallelized QUBIC and frequent sets mining based page rank algorithm based upon MapReduce parallel computing paradigm, which makes QUBIC and page rank algorithms be able to run in MapReduce parallel computing framework. Besides we implemented a prototype system in Hadoop. Specifically, our work mainly consists of these parts below:
     (1) We parallelized QUBIC algorithm based on MapReduce parallel computing paradigm and proposed specific methods for data distribution and parallel computation, including the distributed storage of search engine log files, the construction of Query-URL bipartite, the computation of Jaccard similarity coefficient, the generation of QAG, the computation of connected components in QAG and the ranking of Query.
     (2) We parallelized traditional SON frequent sets mining algorithm based on MapReduce parallel computing
     paradigm and proposed PSON algorithm which is later applied to mine frequent URLs by computing a
     set of related URLs from searching results and displaying them according to their order of importance.
     We implemented our parallelization design of original algorithms in Hadoop parallel computing platform and conducted experiments. Experimental results proved that our parallelization of original algorithms is effective which can achieve good performance in scale-up and speed-up. Finally we implemented a prototype system where we demonstrated the running procedure of parallel QUBIC algorithm and frequent URL sets mining algorithm, which proved the correctness and efficiency of these two algorithms.
引文
[1]Gray J. What next?:A dozen information-technology research goals[J]. Journal of the ACM.2003,50(1):2-3.
    [2]http://www.cs.indiana.edu/-cgiannel/assoc_gen.html. IBM Quest Market-Basket Synthetic Data Generator.
    [3]Computing Parallel. http://en.wikipedia.org/wiki/Parallel_computing[M]. Wikimedia Foundation, Inc.,2012.
    [4]武华北.混合并行计算环境多级并行化编程模式的研究[D].博士,天津大学,2009.
    [5]陈国良.并行计算——结构·算法·编程[M].3.北京:高等教育出版社,2011.182.185.
    [6]张华杰.基于MPI的并行算法的研究[D].硕士,浙江师范大学,2010.
    [7]Dean J, Ghemawat S. MapReduce:Simplified Data Processing on Large Clusters[J]. Communications of the ACM-50th anniversary issue:1958-2008.2004,51(1):107-113.
    [8]薛振华,杨艳娟.分布式搜索引擎的研究[J].计算机信息与技术.2008(7).
    [9]Distributed search engine. http://en.wikipedia.org/wiki/Distributed_search_engine[M]. Wikimedia Foundation, Inc, 2012.
    [10]Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval[M]. New York:ACM Press,1999.
    [11]王树梅.信息检索相关技术研究[D].博士,南京理工大学,2007.
    [12]Cao H, Jiang D, Pei J, et al. Context-Aware Query Suggestion by Mining Click-Through and Session Data[J]. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008.
    [13]Baeza-Yates R, Hurtado C, Mendoza M. Query Recommendation Using Query Logs in Search Engines[C]. In: CURRENT TRENDS IN DATABASE TECHNOLOGY-EDBT 2004 WORKSHOPS.2004.395-397.
    [14]Wen J, Nie J, Zhang H. Clustering User Queries of a Search Engine[Z]. Hong Kong:2001162-168.
    [15]Beeferman D, Berge A. Agglomerative Clustering of A Search Engine Query Log[C]. In:Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Boston, MA, USA:2000.
    [16]Fonseca B M, Golgher P B, de Moura E S, et al. Using Association Rules to Discover Search Engines Related Queries[Z].2003.
    [17]Zaiane O R, Strilets A. Finding Similar Queries to Satisfy Searches Based on Query Traces[C]. In:ADVANCES IN OBJECT-ORIENTED INFORMATION SYSTEMS.2002.349-359.
    [18]Li H, Wang Y, Zhang D, et al. Pfp:parallel fp-growth for query recommendation[J]. Proceedings of the 2008 ACM conference on Recommender systems.2008.
    [19]Brin S, Page L. The anatomy of alarge-scale hypertextual Web search engine[J]. Proceedings of the Seventh International World Wide Web Conference.1998,30(1-7):107-117.
    [20]Han J, Pei J, Yin Y. Mining Frequent Patterns without Candidate Generation[J]. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.2000:1-20.
    [21]张砚明.基于链接结构分析的Web页面排序算法[J].西安电子科技大学.2010:15.16.
    [22]Yu P S, Li X, Liu B. On the Temporal Dimension of Search[J]. Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters.2004.
    [23]王崝.基于时间链接分析的页面排序优化算法研究[D].硕士,江苏大学,2008.
    [24]Kleinberg J M. Authoritative Sources in a Hyperlinked Environment[J]. Journal of the ACM.1999,46(5).
    [25]Lin J, Dyer C. Data-Intensive Text Processing with MapReduce[M]. Maryland:University of Maryland, College Park,2010.
    [26]Kohlschiitter C, Chirita P, Nejdl W. Efficient Parallel Computation of PageRank[J].28th European Conference on IR Research.2006(3936):241-252.
    [27]Li L, Yang Z, Liu L, et al. Query-URL Bipartite Based Approach to Personalized Query Recommendation[J]. Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence.2008.
    [28]Cormen T H, Leiserson C E, Rivest R L, et al. Introduction to Algorithms[M].2 ed.55 Hayward Street, Cambridge, MA 02142-1493, USA:The MIT Press,2002.1083.
    [29]Zhang B, Li H, Liu Y, et al. Improving Web Search Results Using Affinity Graph[J]. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval.2005.
    [30]Rajaraman A, Ullman J D. Mining of Massive Datasets[M]. West Nyack, NY:Cambridge University Press,2011. 75-76.
    [31]Baase S, Van Gelder A. Computer Algorithms:Introduction to Design and Analysis[M].3 ed. Beijing:PEARSON EDUCATION NORTH AISA LIMITED and HIGHER EDUCATION PRESS,2001.
    [32]Ackermann W. Zum Hilbertschen Aufbau der reellen Zahlen[J]. Mathematische Annalen.1928(99):118-133.
    [33]Hirschberg D S. Parallel Algorithms for the Transitive Closure and the Connected Component Problems[J]. STOC '76 Proceedings of the Eighth Annual ACM Symposium on Theory of Computing.1976:55-57.
    [34]屈碗玲,耿素云,张立昂.离散数学[M].2.北京:清华大学出版社,2008.
    [35]B. J D, P. M. Connected Components in O(log(3/2)|V|) Parallel Time for the CREW PRAM[J]. Proceedings of the FOCS'91.1991:688-697.
    [36]Chong, W. K, Lam, et al. Connected Components in O(lognloglogn) Time on CREW PRAM[J]. Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms.1993:11-20.
    [37]许胤龙,万颖瑜,顾晓东,等.虫孔路由Mesh上的连通分量算法及其应用[J].软件学报.2001,12(2):233.240.
    [38]Kantardzic M. Data Mining:Concepts, Models, Methods, and Algorithms[M]. Wiley-IEEE Press,2002.360.
    [39]Goethals B. Efficient Frequent Pattern Mining[D]. PhD, Kennistechnologie, Informatica, Wiskunde, ICT,2002.
    [40]Goethals B, Zaki M J. Advances in frequent itemset mining implementations:report on FIMI'03[Z]. New York, USA:2004.
    [41]R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules, in Proceedings of 20th International Conference on Very Large Data Bases[Z]. Chile:1994.
    [42]Han J, Pei J, Yin Y. Mining Frequent Patterns without Candidate Generation[Z].2000.
    [43]Savasere A, Omiecinski E R, Navathe S B. An Efficient Algorithm for Mining Association Rules in Large Databases[Z]. Swizerland:1995.
    [44]Venner J. Pro Hadoop[M].1 ed. New York, USa:Springer-Verlag New York Inc.,2009.407.
    [45]http://www.almaden.ibm.com/cs/disciplines/iis/SassocSynData. Intelligent Information System.2012

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700