The Research of Web Site Classification Based on Key Resources (基于关键资源的网站分类研究)


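English title: The Research of Web Site Classification Based on Key Resources
Author: 郜鑫博
Degree level: Master's
Discipline: Computer Science and Technology
Chinese keywords: 网站分类; 关键资源; 网页内容; 网站拓扑结构
English keywords: web site classification; key resources; contents of web pages; topology of web site
Degree year: 2009
Advisor: 王宇颖
Discipline code: 081203
Degree-granting institution: Harbin Institute of Technology (哈尔滨工业大学)
Thesis submission date: 2009-06-01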

Abstract

The Internet contains a vast amount of useful information, and that volume grows rapidly every day, making the Internet an extremely valuable information source. However, because Internet information is massive, heterogeneous, volatile, and non-semantic, it is not easy for people to find what they need quickly and accurately. People therefore urgently need automated tools that help them obtain information from the Internet effectively.
     For this reason, automatic web site classification has broad application prospects, yet little research on it has been done so far, either in China or abroad. Many methods represent web pages as ordinary text and then apply text classification techniques to classify web sites. Unlike ordinary text, however, a web site is a whole formed by a large number of web pages connected through hyperlinks, so text classification methods applied directly to web site classification are not well suited to the task.
     The main existing methods for web site classification are the superpage method and the topic frequency vector method. The shortcoming of both is that they need the information of all pages of a web site at classification time, which reduces classification efficiency.
     This thesis proposes a web site classification method based on key resources and, from the two perspectives of web page content and web site topology, focuses on two different methods for selecting a site's key resources:
     (1) Key resource selection based on web page content: using the structural features of web sites and web pages, the web site is represented as a multi-granularity (multiscale) tree, and a suitable pruning strategy selects the pages with distinct class characteristics as the key resources of the site.
     (2) Key resource selection based on web site topology: the topology of the web site is described as a directed graph, the pages of the site are ranked with link analysis techniques, and the top k most important pages are selected as the key resources of the site.
     Experimental results show that the key-resource-based web site classification method achieves high classification accuracy while selecting only a subset of the pages of a web site.
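
The topology-based selection in method (2) can be illustrated with a minimal sketch. The abstract does not say which link analysis technique is used, so the sketch assumes PageRank computed by power iteration. The function names `pagerank` and `select_key_resources`, the damping factor 0.85, the fixed iteration count, and the toy link graph are all illustrative assumptions, not the thesis's implementation.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a site's internal link graph.

    `links` maps each page URL to the pages it links to (a directed graph).
    The damping factor and iteration count are assumed defaults.
    """
    # Collect every page that appears as a link source or a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def select_key_resources(links: Dict[str, List[str]], k: int) -> List[str]:
    """Return the k highest-ranked pages as the site's key resources."""
    rank = pagerank(links)
    # Sort by descending score; break ties by URL so the result is stable.
    return sorted(rank, key=lambda page: (-rank[page], page))[:k]


# Hypothetical six-page site used only for illustration.
site = {
    "/": ["/products", "/about", "/contact"],
    "/products": ["/", "/products/a", "/products/b"],
    "/products/a": ["/products", "/"],
    "/products/b": ["/products", "/"],
    "/about": ["/"],
    "/contact": ["/"],
}
key_pages = select_key_resources(site, k=2)
```

In this toy graph the home page and the product index collect the most link weight, so they are returned as the key resources. A classifier would then examine only these k pages rather than every page of the site, which is the efficiency argument the abstract makes.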

