Research on Web Page Ranking Algorithms Based on Link Similarity
Abstract
This paper discusses algorithms for ranking web pages, with a focus on link analysis techniques.
     First, it introduces the basic principles of web page ranking and compares several commonly used ranking techniques. It then examines two representative link analysis algorithms, PageRank and HITS, and analyzes the strengths and weaknesses of each.
     The main drawback of the PageRank algorithm is that it divides a page's PageRank value equally among all of its out-links. It takes little account of semantic information, so it is easily affected by irrelevant links and is prone to topic drift.
     This paper designs a simple computational model to improve PageRank. Building on PageRank's equal distribution, the model takes link similarity into account and uses a naive Bayes model to evaluate it. Because the similarity between an out-link and its target page is considered, pages of little value (such as advertisement pages) receive a smaller share of PageRank, while genuinely valuable pages receive a larger share.
     Finally, the paper applies this model in a simulated search engine. The simulated system includes nearly all the functions of a search engine, and a group of users tested it in a real Internet environment to validate the algorithm. The results of this small-scale user test indicate that incorporating link similarity improved user satisfaction with the search results.
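The reweighting idea described in the abstract can be illustrated with a small sketch. The abstract does not give the model's formula, so the code below is only one plausible reading: each page's rank is split among its out-links in proportion to a similarity score instead of equally. The function name `similarity_weighted_pagerank`, the damping factor 0.85, the toy graph, and the similarity values are all assumptions made for illustration. In the paper the similarity is estimated with a naive Bayes model, whereas here the scores are simply supplied by hand.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]                 # page -> pages it links to
Similarity = Dict[Tuple[str, str], float]    # (source, target) -> score


def similarity_weighted_pagerank(
    links: Graph,
    similarity: Optional[Similarity] = None,
    damping: float = 0.85,      # assumed value, not taken from the paper
    iterations: int = 50,
) -> Dict[str, float]:
    """Power iteration for PageRank with optional per-link weights.

    With similarity=None every out-link gets an equal share (the classic
    behaviour). Otherwise each out-link gets a share proportional to its
    similarity score (hypothetical reading of the improved model).
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            if not targets:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
                continue
            if similarity is None:
                weights = [1.0] * len(targets)
            else:
                weights = [similarity.get((src, t), 0.0) for t in targets]
            total = sum(weights)
            if total == 0.0:
                # No usable scores: fall back to equal distribution.
                weights = [1.0] * len(targets)
                total = float(len(targets))
            for t, w in zip(targets, weights):
                new_rank[t] += damping * rank[src] * w / total
        rank = new_rank
    return rank


# Toy graph: "home" links to a relevant article and to an advertisement.
toy_links: Graph = {
    "home": ["article", "ad"],
    "article": ["home"],
    "ad": ["home"],
}
# Hand-picked scores standing in for the naive Bayes estimate.
toy_similarity: Similarity = {("home", "article"): 0.9, ("home", "ad"): 0.1}

uniform = similarity_weighted_pagerank(toy_links)
weighted = similarity_weighted_pagerank(toy_links, toy_similarity)
print("uniform: ", uniform)
print("weighted:", weighted)
```

In this toy graph the equal split gives the article and the advertisement the same rank, while the weighted split moves rank from the advertisement to the article. The total rank still sums to 1, because the weights are normalized per source page.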
