Deep Web Mining Based on User Browsing Behavior
Abstract
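In recent years, with the rapid development of the Internet, the Web has come to hold a massive amount of information, and that amount is still growing at a striking rate. Generally, information on the Internet is published mainly as static Web pages, each of which contains a number of static hyperlinks pointing to other static pages. Traditional search engines rely on exactly these hyperlinks to collect, index and display the pages and information that users are interested in. Beyond this, however, a large share of the information on the Internet exists in the form of dynamic data sources. This information does not reside in static pages; it is stored in online databases behind Web sites, and the pages presenting it are generated dynamically, in real time, in response to the user's keywords. Because too few static hyperlinks point to these dynamic pages, traditional search engines can hardly discover and index them, so this information is "hidden" from users. The collection of this "hidden" information is called the Deep Web (also known as the Invisible Web or Hidden Web). Correspondingly, the collection of static Web pages is called the Surface Web.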
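     Today the Deep Web holds far more information than the Surface Web; its high-quality data in particular amounts to more than 2000 times that of the Surface Web. Making effective and full use of this high-quality data nevertheless remains a major challenge, and the most important problems are the discovery of Deep Web data sources and the crawling of Deep Web data sources. Existing work on both problems has shortcomings: some methods require human involvement and others depend on a specific domain, so they are hard to apply at large scale. This dissertation therefore centers on Deep Web mining and focuses on these two problems, the discovery and the crawling of Deep Web data sources, in order to make Deep Web information easier for users to exploit and to further promote the development of the Deep Web. By carefully analyzing the browsing behavior that is specific to users in the Deep Web, we summarize the browsing path that is specific to the Deep Web and, based on this path, propose fully automatic, domain-independent and efficient methods for discovering and crawling Deep Web data sources, making large-scale Deep Web mining possible.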
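     This dissertation makes three main contributions:
     1. An in-depth analysis of user browsing behavior in the Deep Web
     We first analyze users' browsing logs and browsing behavior in the Surface Web and the Deep Web, convert them into a more intuitive graphical representation (the browsing map), and carefully compare the differences between the two. Then, drawing on the function and layout of Deep Web pages and on their URL rules, we propose a model browsing path for users in the Deep Web: form page → list page → object page. This path captures well what is distinctive about user browsing behavior in the Deep Web. To the best of our knowledge, this dissertation is the first to propose such a concept.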
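     2. An efficient method for discovering Deep Web data sources
     Based on the distinctive browsing path of users in the Deep Web, we propose an efficient method for discovering Deep Web data sources from browsing logs. Exploiting the URL characteristics of the Deep Web, the method first uses URL clustering to group the form pages, list pages and object pages that users have visited, and then rebuilds the users' browsing map from the transitions between pages during browsing. It then detects browsing paths in the rebuilt map in order to discover Deep Web data sources. Because the method uses URL clustering in place of page clustering, it greatly improves the efficiency of Deep Web data source discovery and does not depend on a specific topic. In addition, searching for Deep Web data sources in user browsing logs further reduces the cost, raises both the precision of discovery and the probability of finding high-quality Deep Web data sources, and lowers the risk of finding low-quality ones.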
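     3. An efficient method for crawling Deep Web data sources
     Based on the distinctive browsing path of users in the Deep Web, we propose an efficient method for crawling Deep Web data sources. Since a user's browsing process is itself a process of visiting a large number of object pages, we try to simulate user browsing behavior and follow the users' browsing path in the Deep Web to obtain a large number of object pages. Starting from the form page, the method first collects a certain number of list pages; it then uses DOM tree alignment and the layout characteristics of object URLs to detect object URLs on the list pages; after that, it uses the characteristics of page-flipping URLs to detect page-flipping URLs on both list pages and object pages. Once enough URLs have been collected, the method learns their URL rules and uses the learned rules to crawl the target Deep Web data source, which improves crawling efficiency.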
In recent years, with the rapid development of the Internet, the World Wide Web has come to contain a tremendous amount of valuable information, and that information is still growing rapidly. Generally, information on the Web is published mainly via static Web pages, and each static page contains a number of outgoing URLs pointing to other static pages. Traditional search engines use these outgoing URLs to collect, index and display pages and information. However, besides the static Web pages, a large proportion of the information on the Web is stored in online Web databases. Such information does not exist in static pages; instead it is retrieved dynamically and displayed to users as dynamic Web pages according to the queries they provide. Because few static URLs point to such dynamic pages, it is hard for traditional search engines to discover them, and thus such information is "hidden" from users. The collection of such "hidden" information is called the Deep Web (also named the Invisible Web or Hidden Web). Correspondingly, the collection of static Web pages is called the Surface Web.
     Today, the Deep Web holds much more information than the Surface Web; its high-quality information in particular is more than 2000 times that of the Surface Web. However, it is still a huge challenge to exploit the high-quality information in the Deep Web effectively and completely, and the most important problems are Deep Web discovery and Deep Web crawling. There has been some research on these two problems, but the existing methods are hard to apply at large scale because each has its own disadvantages: for example, some need human interaction and some depend on specific topics. In this dissertation, centered on the problem of Deep Web mining, we focus on Deep Web discovery and Deep Web crawling, in order to make it convenient for users to exploit Deep Web information and to encourage the development of the Deep Web. After carefully investigating user browsing behavior and summarizing the specific user browsing path in the Deep Web, we propose automatic, topic-independent and efficient methods for Deep Web discovery and Deep Web crawling respectively, which make large-scale Deep Web mining possible.
     The main contents and contributions of this dissertation are as follows:
     1. An in-depth investigation of user browsing behavior in the Deep Web
     We first investigate user browsing behavior in the Deep Web and the Surface Web in depth, transform it into a visual graph (the browsing map), and carefully compare the behavior in the two. After that, based on the function and layout of pages and the URL rules in the Deep Web, we propose a model user browsing path: Form Page → List Page → Object Page. This browsing path captures the specific characteristics of user browsing behavior in the Deep Web well. To the best of our knowledge, this is the first time that such a concept has been proposed.
     2. An efficient method for Deep Web discovery
     Based on the specific user browsing path in the Deep Web, we propose an efficient method to discover Deep Web sites from browsing logs. The method first clusters the form pages, list pages and object pages through URL clustering, and rebuilds the browsing map based on the jumps between pages. It then tries to detect the specific user browsing path in the browsing map. If a user browsing path is detected and it satisfies certain requirements, the site is considered a Deep Web site. The method is very efficient and also topic independent, because it uses URL clustering instead of fetching the pages and clustering them. In addition, discovering Deep Web sites from browsing logs further reduces the cost, and increases both the precision of Deep Web discovery and the probability of discovering high-quality Deep Web sites.
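     The discovery steps described here (URL clustering, rebuilding the browsing map from jumps, then detecting a Form Page → List Page → Object Page path) can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the clustering rule in `url_pattern`, the log format of (referrer, visited) pairs, the thresholds `min_jumps` and `min_objects`, and the `example.com` URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Map a URL to a coarse cluster key (assumed rule, not from the source):
    digit runs in the path become '*', query values are dropped, keys kept."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "*", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    return parts.netloc + path + ("?" + "&".join(keys) if keys else "")


def build_browse_map(log):
    """Rebuild a browsing map from (referrer_url, visited_url) pairs.

    Returns the jump counts between URL clusters and the distinct URLs
    seen in each cluster.
    """
    edges = defaultdict(int)
    members = defaultdict(set)
    for src, dst in log:
        a, b = url_pattern(src), url_pattern(dst)
        members[a].add(src)
        members[b].add(dst)
        if a != b:
            edges[(a, b)] += 1
    return edges, members


def find_browsing_paths(edges, members, min_jumps=2, min_objects=3):
    """Return (form, list, object) cluster triples.

    Assumed heuristic: both jumps are frequent enough, the clusters grow
    along the path, and the last cluster holds many distinct pages.
    """
    paths = []
    for (a, b), n_ab in edges.items():
        for (b2, c), n_bc in edges.items():
            if b2 != b or c == a:
                continue
            enough_jumps = n_ab >= min_jumps and n_bc >= min_jumps
            fan_out = len(members[a]) <= len(members[b]) <= len(members[c])
            if enough_jumps and fan_out and len(members[c]) >= min_objects:
                paths.append((a, b, c))
    return sorted(paths)


# Hypothetical browsing log for a single site.
log = [
    ("http://example.com/search", "http://example.com/results?q=a"),
    ("http://example.com/search", "http://example.com/results?q=b"),
    ("http://example.com/results?q=a", "http://example.com/item/1"),
    ("http://example.com/results?q=a", "http://example.com/item/2"),
    ("http://example.com/results?q=b", "http://example.com/item/3"),
]
edges, members = build_browse_map(log)
print(find_browsing_paths(edges, members))
```

     Because only URL strings are compared, no page has to be fetched, which is the efficiency argument made above for URL clustering over page clustering.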
     3. An efficient method for Deep Web crawling
     Based on the specific user browsing path in the Deep Web, we propose an efficient method to crawl Deep Web sites. Observing that users visit a large number of object pages during their browsing, we try to simulate user browsing to collect as many object pages as possible. Starting from the form page, the method first collects a number of list pages; then it uses an HTML DOM tree alignment technique and the layout of object URLs to detect object URLs in the collected list pages; next, it takes advantage of the characteristics of page-flipping URLs to detect page-flipping URLs in both list pages and object pages. After collecting enough URLs, the method learns URL rules from the detected URLs and uses the learned rules to crawl the target Deep Web sites, which increases crawling efficiency.
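     The last step, learning URL rules from the detected URLs and then using them to pick links during crawling, can be sketched as follows. The abstract does not say how a URL rule is represented, so this sketch assumes a rule is a regular expression generalized from same-shaped sample URLs; the function names `learn_url_rule` and `select_links` and the sample URLs are hypothetical.

```python
import re

# Split a URL into alphanumeric runs and single separator characters.
TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")


def learn_url_rule(urls):
    """Generalize same-shaped URLs into one compiled regular expression.

    Tokens shared by every sample stay literal; tokens that vary become
    a digit or alphanumeric wildcard.
    """
    token_lists = [TOKEN.findall(u) for u in urls]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("sample URLs must split into the same number of tokens")
    parts = []
    for column in zip(*token_lists):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))
        elif all(tok.isdigit() for tok in column):
            parts.append(r"\d+")
        elif all(tok.isalnum() for tok in column):
            parts.append(r"[A-Za-z0-9]+")
        else:
            raise ValueError("sample URLs differ in their separators")
    return re.compile("".join(parts))


def select_links(rule, links):
    """Keep only the links that fully match a learned rule."""
    return [link for link in links if rule.fullmatch(link)]


# Hypothetical page-flipping URLs detected on list pages.
samples = [
    "http://example.com/thread?id=12&page=1",
    "http://example.com/thread?id=345&page=2",
]
rule = learn_url_rule(samples)
candidates = [
    "http://example.com/thread?id=9&page=10",
    "http://example.com/user?id=9&page=10",
]
print(select_links(rule, candidates))
```

     Once a rule is learned, a crawler only needs a cheap pattern match per link instead of repeating DOM tree alignment on every page, which is one plausible reading of why the learned rules increase crawling efficiency.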
