Research on Web Page Classification Based on Web Mining Technology



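English title: Web Page Classification Based on Web Mining
Author: Gong Chang (龚畅)
Degree level: Master's
Discipline: Computer Technology
Chinese keywords: 社会化标签 ; 娱乐意向 ; 元数据 ; 网页分类 ; 虚拟文档
English keywords: Social annotation ; Entertainment intention ; Metadata ; Web page classification ; Virtual document
Degree year: 2009
Advisors: Qian Xuezhong (钱雪忠) ; Qian Weikun (钱炜坤)
Discipline code: 081203
Degree-granting institution: Jiangnan University (江南大学)
Thesis submission date: 2009-06-01
Chair of the defense committee: Wu Xisheng (吴锡生)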

Abstract

As hardware storage capacity and software environments have improved, the volume of data on the World Wide Web has expanded rapidly, and the Web's structure has grown more complex. The massive, heterogeneous, and distributed nature of Web data poses challenges for research in this area. In recent years Web mining has attracted considerable attention in the information industry, mainly because the Web's vast data can be exploited and needs to be transformed into useful information and knowledge. The underlying goals of users' online activities are diverse. Understanding those goals and intentions helps providers deliver personalized services and improve user satisfaction. For example, an e-commerce site can decide where to place entertainment products according to whether a user browsing a page shows an intention to take part in entertainment activities.
     Web 2.0 has recently been widely discussed, and applications built around it are emerging on the Web. These include user-centric publishing and knowledge management platforms such as wikis and blogs, and social bookmarking sites such as Del.icio.us and Flickr. Social annotation services not only give users a friendly interface for annotating Web resources but also let them share those annotations online. Social annotations reflect how users understand the content of Web resources and provide rich metadata for Web page classification. This thesis combines page content with tags to build a virtual document for each page and classifies pages on that basis, obtaining satisfactory results and laying a foundation for further Web mining tasks.
     The main contributions are as follows.
     1. User entertainment intention mining. Understanding the goals and intentions behind users' online activities is of great help to information providers. The thesis defines Entertainment Intention (EI) and proposes a framework for building machine learning models that learn EI from Web page content. Based on this framework, classification algorithms are used to build models that detect users' EI from Web pages. Experiments show that a larger proportion of high-frequency feature words carry entertainment intention, and that EI detection on Web pages achieves satisfactory results.
     2. Characteristics and representation of social annotations. Tags are freely assigned keywords that describe a Web page; they reflect how users understand the page content and provide rich metadata for the page. The thesis analyzes the characteristics and regularities of social tagging systems and the tag distribution of popular Web sites, builds a tripartite graph over the heterogeneous, multi-relational objects user, tag, and Web page (URL), and defines a tag-based representation of Web pages.
     3. Web page classification based on social annotations. In a social annotation environment, pages that users with common interests annotate with tags of the same category usually belong to the same category; correspondingly, the tags users assign to pages of the same category belong to the same category. The thesis therefore proposes a representation that constructs a virtual document for a Web page by integrating its annotation metadata with its content, and builds classification models over three inputs: local page text, page tags, and the virtual document. Experiments confirm that social bookmarks help Web page classification and that the virtual-document-based classifier achieves satisfactory results.
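
The user-tag-page relations and the virtual document described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the whitespace tokenizer, and the choice to weight each tag by the number of users who assigned it are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization stands in for
    # whatever preprocessing the thesis actually uses.
    return text.lower().split()


def build_tripartite(annotations):
    """Index (user, tag, url) triples, assumed unique, by page and by tag.

    Returns three mappings:
      tags_by_url:  url -> Counter of tags (count = number of annotating users)
      urls_by_tag:  tag -> set of urls carrying that tag
      users_by_url: url -> set of users who annotated the page
    """
    tags_by_url = defaultdict(Counter)
    urls_by_tag = defaultdict(set)
    users_by_url = defaultdict(set)
    for user, tag, url in annotations:
        tags_by_url[url][tag] += 1
        urls_by_tag[tag].add(url)
        users_by_url[url].add(user)
    return tags_by_url, urls_by_tag, users_by_url


def virtual_document(page_text, tag_counts):
    """Merge page tokens and tag counts into one bag of words.

    Hypothetical weighting: each tag contributes as many occurrences as
    users who assigned it.
    """
    doc = Counter(tokenize(page_text))
    doc.update(tag_counts)  # Counter.update adds counts rather than replacing
    return doc


annotations = [
    ("u1", "python", "a"),
    ("u2", "python", "a"),
    ("u1", "tutorial", "a"),
    ("u3", "music", "b"),
]
tags_by_url, urls_by_tag, users_by_url = build_tripartite(annotations)
vdoc = virtual_document("Learn Python fast", tags_by_url["a"])
```

The resulting bag of words could then be passed to any text classifier. Keeping page text and tags in a single term-count vector is one simple way to let a classifier use both sources at once.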

