Research on the Design and Key Techniques of Web Mining System

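Original title (Chinese): Web数据挖掘系统的设计及关键技术研究
Author: Qiao Zhiyong (乔智勇)
Thesis level: Master's
Discipline: Computer Application Technology
Chinese keywords: 信息检索 (information retrieval); 数据挖掘 (data mining); Web
English keywords: IR (Information Retrieval); DM (Data Mining); Web
Degree year: 2002
Advisor: Liu Zhijing (刘志镜)
Discipline code: 081203
Degree-granting institution: Xidian University (西安电子科技大学)
Thesis submission date: 2002-01-15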

Abstract

Web mining is the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services. Applying Web mining can improve both the speed and the precision with which people obtain information. Building on a study of Web mining techniques in China and abroad, this thesis presents a general framework for a Web mining system and, in line with the characteristics of the Web, designs and implements a smart web crawler. The crawler uses a new URL ordering strategy that takes both Web content and Web structure into account, so that the pages it retrieves are the ones the user wants. For evaluating query results, the thesis analyzes existing methods and proposes a new ranking method that takes full account of the characteristics of the Web and of text itself, with fairly good results. Finally, multimedia mining techniques are discussed.
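The abstract does not give the details of the URL ordering strategy, so the following is only a minimal sketch of the general idea of ranking unvisited URLs by a mix of a content signal and a structure signal. Everything in it is an assumption made for illustration, not the thesis's method: the class name `Frontier`, the weight `alpha`, the use of anchor-text term overlap as the content signal, and the use of the in-link count seen so far as the structure signal.

```python
import heapq
from collections import defaultdict


def content_score(anchor_text, query_terms):
    """Content signal (assumed): fraction of query terms found in the anchor text."""
    if not query_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    hits = sum(1 for term in query_terms if term.lower() in words)
    return hits / len(query_terms)


class Frontier:
    """Hypothetical crawl frontier: unvisited URLs ordered by a combined score."""

    def __init__(self, query_terms, alpha=0.5):
        self.query_terms = query_terms
        self.alpha = alpha                      # assumed weight: content vs. structure
        self.inlinks = defaultdict(int)         # structure signal: in-links seen so far
        self.best_content = defaultdict(float)  # best content score seen per URL
        self.visited = set()
        self.heap = []                          # min-heap of (-score, url)

    def add_link(self, url, anchor_text):
        """Record one discovered link to `url` and (re)queue it with its new score."""
        if url in self.visited:
            return
        self.inlinks[url] += 1
        self.best_content[url] = max(
            self.best_content[url], content_score(anchor_text, self.query_terms)
        )
        # Squash the in-link count into [0, 1) so both signals share a scale.
        structure = 1.0 - 1.0 / (1 + self.inlinks[url])
        score = self.alpha * self.best_content[url] + (1 - self.alpha) * structure
        # Scores for a URL never decrease, so older heap entries are simply stale.
        heapq.heappush(self.heap, (-score, url))

    def pop(self):
        """Return the highest-scoring unvisited URL, or None if the frontier is empty."""
        while self.heap:
            _, url = heapq.heappop(self.heap)
            if url not in self.visited:
                self.visited.add(url)
                return url
        return None


frontier = Frontier(["data", "mining"])
frontier.add_link("http://example.com/a", "contact us")
frontier.add_link("http://example.com/b", "data mining tutorial")
frontier.add_link("http://example.com/a", "home")
first = frontier.pop()
second = frontier.pop()
third = frontier.pop()
```

In this example the page linked with query-related anchor text is fetched first, even though the other page has more in-links. Raising or lowering `alpha` shifts the balance between the two signals.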

