Research on WAP Page Deduplication Techniques for Mobile Search
Abstract
With the development of science and technology and of network communications, and especially the spread of the Internet, electronic information resources have grown explosively, with data volumes climbing from the gigabyte to the terabyte and now to the petabyte scale. While this mass of data makes information easier to obtain, much of it is duplicated: because information comes from diverse sources and is tailored to different user groups, the same content may appear in different forms on different pages, leaving the Web with a large amount of duplicate information. These duplicate pages seriously degrade the browsing experience and raise the cost of the Internet, so duplicate page removal has become a pressing problem.
     Existing deduplication techniques mainly target Web pages accessed from PC terminals; techniques for WAP (Wireless Application Protocol) pages viewed on mobile phones and other mobile terminals have rarely been addressed. In recent years, however, the mobile Internet has developed rapidly, mobile devices have become ubiquitous, and the number of WAP pages has grown enormously, making deduplication for WAP pages especially urgent.
     Starting from the characteristics of WAP pages and the specific requirements of WAP page deduplication, this paper proposes a feature extraction method tailored to different categories of WAP pages, combines it with the SimHash algorithm to obtain a WAP-oriented deduplication method, and applies that method to real-world data.
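     To make the extraction step concrete, the sketch below shows one way category-wise extraction could look in Python. It assumes the page has already been segmented into visual lines (the line blocks a VIPS-style algorithm [24] yields) and carries a category label; the category names and the per-category filtering rules are illustrative placeholders, not the rules proposed in this paper.

    import re

    # Hypothetical per-category filters over the visual lines of a page.
    # Which lines to keep for each category is a placeholder choice here.
    def filter_lines(lines, category):
        if category == "news":
            # For news pages, assume body text matters; drop short navigation lines.
            return [ln for ln in lines if len(ln) > 10]
        if category == "forum":
            # For forum pages, assume quoted-reply boilerplate should be dropped.
            return [ln for ln in lines if not ln.lstrip().startswith((">", "Re:"))]
        return lines

    def extract_features(lines, category, n=4):
        # Character n-gram shingles over the filtered, whitespace-stripped text:
        # a moderate-granularity feature that preserves some local semantics.
        text = re.sub(r"\s+", "", "".join(filter_lines(lines, category)))
        return [text[i:i + n] for i in range(len(text) - n + 1)]

The shingle features produced here are what the SimHash step, sketched after the contribution list below, would consume.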
     The main contributions of this paper are as follows:
     1. A feature extraction method for WAP pages, consisting of two steps: a feature extraction step that draws features from the page according to its category, and a filtering step that, for each category of WAP page, filters those features using the line information identified by the Vision-based Page Segmentation algorithm (VIPS). The method reflects how the notion of duplication differs across page categories while still taking semantic information into account, and the extracted features have moderate granularity, low computational cost, and good representativeness;
     2. A deduplication method for WAP pages that integrates the feature extraction method above with the SimHash algorithm for computing page similarity, as sketched after this list. Alongside it, an evaluation criterion for deduplication quality is designed to guide the setting of the inter-page similarity threshold, so that duplicate WAP pages are removed effectively;
     3. The proposed method is applied to a real-world data set, on which it achieves excellent performance, validating the effectiveness of both the feature extraction algorithm and the overall deduplication method.
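     To make contribution 2 concrete, here is a minimal sketch of a Charikar-style 64-bit SimHash fingerprint and the Hamming-distance test for deciding duplication. The 64-bit width and the threshold k = 3 are assumptions borrowed from common practice (e.g., Manku et al. [23]); in this paper the threshold would instead be tuned with the proposed evaluation criterion.

    import hashlib
    from collections import Counter

    def _hash64(feature: str) -> int:
        # Stable 64-bit hash of a feature: first 8 bytes of its MD5 digest.
        return int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[:8], "big")

    def simhash(features) -> int:
        # Charikar-style SimHash: each feature votes on every bit position,
        # weighted by its frequency; the sign of each tally sets that bit.
        v = [0] * 64
        for feat, w in Counter(features).items():
            h = _hash64(feat)
            for i in range(64):
                v[i] += w if (h >> i) & 1 else -w
        fp = 0
        for i in range(64):
            if v[i] > 0:
                fp |= 1 << i
        return fp

    def hamming(a: int, b: int) -> int:
        # Number of bit positions where two fingerprints differ.
        return bin(a ^ b).count("1")

    def is_duplicate(fp_a: int, fp_b: int, k: int = 3) -> bool:
        # Assumed rule: fingerprints within k bits are near-duplicates.
        return hamming(fp_a, fp_b) <= k

In use, simhash(extract_features(lines, category)) fingerprints each page once; pairs whose fingerprints differ in at most k bits are reported as duplicates, and k is chosen on labeled pairs using the precision/recall-style criterion described above.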
References
[1]McKnight J, Asaro T, Babineau B. Digital archiving: End-user survey and market forecast 2006-2010.2006.
    [2]Broder A Z. On the resemblance and containment of documents. In:Carpentieri B, De Santis A, Vaccaro U, Storer JA, eds. Proc. of the Compression and Complexity of SEQUENCES 1997. Washington:IEEE Computer Society Press.1997.21-29.
    [3]Broder A Z, Glassman S C, Manasse M S. Syntactic Clustering of the Web. Proceedings of the 6th International Web Conference.1997.29(8).1157-1166.
    [4]Rabin M O. Fingerprinting by random polynomials. Center for Research in Computing Technology, Harvard University, Report TR-15-81.1981.
    [5]Chowdhury A, Frieder O. Collection Statistics for Fast Duplicate Document Detection. ACM Trans. on Information System.2002.20(2).171-191.
    [6]Bobbarjung D R, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Trans. on Storage.2006.2(4).424-448.
    [7]Rivest R. The MD5 message-digest algorithm.1992. http://www.python.org/doc/current/lib/module-md5.html.
    [8]U.S. National Institute of Standards and Technology (NIST). Federal Information Processing Standards (FIPS) Publication 180-1:Secure Hash Standard.1995.
    [9]U.S. National Institute of Standards and Technology (NIST). Federal Information Processing Standards (FIPS) Publication 180-2:Secure Hash Standard.2002.
    [10]Muthitacharoen A, Chen B, and Mazieres D. A low-bandwidth network file system. Proceedings of the eighteenth ACM symposium on Operating systems principles.2001. 174-187.
    [11]Kulkarni P, Douglis F, LaVoie J D, and Tracey J M. Redundancy elimination within large collections of files. In USENIX Annual Technical Conference, General Track.2004.59-72.
    [13]Broder A Z, Charikar M, and Frieze A M. Min-Wise Independent Permutations. Journal of Computer and System Sciences 60.2000.630-659.
    [14]Jain N, Dahlin M, Tewari R. Taper:Tiered approach for eliminating redundancy in replica synchronization. In:Proc. of the 4th Usenix Conf. on File and Storage Technologies (FAST 2005). Berkeley:USENIX Association.2005.281-294.
    [15]Han B, Keleher P. Implementation and performance evaluation of fuzzy file block matching. In: Proc. of the 2007 USENIX Annual Technical Conf. (USENIX 2007). Berkeley:USENIX Association.2007.199-204.
    [16]Broder A Z. Identifying and filtering near-duplicate documents. In:Giancarlo R, Sankoff D, eds. Proc. of the 11th Annual Symp. on Combinatorial Pattern Matching. London:Springer-Verlag. 2000.1-10.
    [17]Ouyang Z, Memon N, Suel T, Trendafilov D. Cluster-Based delta compression of a collection of files. In:Proc. of the 3rd Int'l Conf. on Web Information Systems Engineering. Washington: IEEE Computer Society Press.2006.257-266.
    [19]Bloom B H. Space/time trade-offs in Hash coding with allowable errors. In Communications of the ACM.1970.422-426.
    [20]Broder A. Z., Mitzenmacher M. Network applications of bloom filters:A survey. In Allerton'02.2002.
    [21]Jain N, Dahlin M, Tewari R. Using Bloom Filters to Refine Web Search Results. Eighth International Workshop on the Web and Databases (WebDB 2005).2005. Baltimore, Maryland.
    [22]Charikar M. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual Symposium on Theory of Computing(STOC).2002.
    [23]Manku G S, Jain A, Sarma A D. Detecting Near-Duplicates for Web Crawling. In Proceedings of 16th International World Wide Web Conference (WWW).Canada.2007.
    [24]Cai D, Yu S P, Wen J R, Ma W Y. VIPS: A vision-based page segmentation algorithm. Technical Report MSR-TR-2003-79.2003.
    [25]Henzinger M R. Finding near-duplicate web pages:a large-scale evaluation of algorithms. In SIGIR 2006.2006.284-291.
    [26]Zhang G, Liu T, Zheng S F, et al. A fast deduplication algorithm for large-scale web pages. Proc. of the 20th Anniversary Academic Conference of the Chinese Information Processing Society of China (Supplement).2001.18-25.
    [27]Chen J Y, Sun J Z, Zhang Y P. A Fourier transform based web page deduplication algorithm. Journal of Computer Applications.2008.28(4).948-949.
    [28]China Internet Network Information Center (CNNIC). Statistical Report on Internet Development in China.2010.
    [29]Manber U. Finding similar files in a large file system. Presented at Usenix Winter 1994 Technical Conference, San Francisco.1994.
    [30]Breiman L. Random Forests. Machine Learning.2001.45(1).5-32.
    [31]Brin S, Davis J, Garcia-Molina H. Copy Detection Mechanisms for Digital Documents. Proceedings of the ACM SIGMOD Annual Conference.1995.
    [32]G. Salton, A. Wong, C. Yang. A vector space model for automatic indexing. Communications of the ACM.1975.18 (11):613-620.
    [33]Shivakumar N, Garcia-Molina H. Building a Scalable and Accurate Copy Detection Mechanism. Proceedings of the 1st ACM Conference on Digital Libraries.1996.
    [34]Shivakumar N, Garcia-Molina H. Finding near-replicas of documents on the web. International Workshop on the World Wide Web and Databases, WebDB.1998.204-212
    [35]Garcia-Molina H, Gravano L, Shivakumar N. dSCAM:Finding document copies across multiple databases. Proceedings of the 4th International Conference on Parallel and Distributed Information Systems.1996.
    [36]Heintze N. Scalable Document Fingerprinting. Proceedings of the 2nd USENIX Workshop on Electronic Commerce.1996.
    [37]Si A, Leong H V, Lau R W H. CHECK:A document plagiarism detection system. Proceedings of the ACM Symposium on Applied Computing.1997.70-77.
    [38]Monostori K, Zaslavsky A, Schmidt H. Match Detect Reveal:Finding Overlapping and Similar Digital Documents. Proceedings of the Information Resource Management Association International Conference.2000.
    [39]Monostori K, Zaslavsky A, Vajk I. Suffix vector:A space-efficient representation of a suffix tree. Technical Report.2001.
    [40]Yang H, Callan J. Near-Duplicate Detection for eRulemaking. Language Technology Institute Carnegie Mellon University.2005.
    [41]Minsky M, Papert S. Perceptrons. MIT Press.1969.
    [42]Lian H, Liu Y, Xu H B, Cheng X Q. An improved Boolean model based web page duplicate detection algorithm. Application Research of Computers.2007.36-39.
