Research and Implementation of Text Clustering Based on the DK-Means Algorithm
Abstract
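As information technology spreads through every field, the volume of data generated daily by applications is growing exponentially. How to process this data effectively and extract useful knowledge from it is a problem that urgently needs solving. Data mining is an emerging technology that developed to meet the need to fully understand and effectively apply the information and knowledge contained in data. Cluster analysis partitions a data set into several classes, or clusters, according to the similarity between data objects, and is a good way to discover the internal structure of data and the knowledge in it.
     Cluster analysis is an unsupervised process of grouping samples according to some distance between them. Clustering can divide a large number of texts into clusters that users can understand quickly, so that users can grasp the content of a large document collection faster, speed up analysis, and support decision-making. Cluster analysis has been applied in many fields, such as pattern recognition, image processing, and information retrieval. The type of data to be clustered varies with the need: ordinal, scalar, text, mixed, and so on. This thesis mainly studies the clustering of text data.
     This thesis studies the text dimensionality reduction methods and clustering algorithms involved in text clustering. First, for text preprocessing, it proposes a word segmentation method that incorporates word frequency, which improves segmentation accuracy and prepares for the later construction of the text model and for dimensionality reduction. Second, it proposes a dimensionality reduction method based on text similarity: by computing the similarity between a text and other texts, and the contribution of each feature word to the text's class attribute, the method extracts the words highly relevant to the text, which reduces dimensionality and improves the efficiency and precision of text clustering. Finally, it proposes a text clustering algorithm based on DK-Means, which improves clustering accuracy and speed compared with the original method.
     The thesis first introduces cluster analysis as a data mining technique, then describes the techniques related to text clustering, including text preprocessing, text representation models, dimensionality reduction, and text clustering algorithms (K-Means, BIRCH, CURE, OPTICS, and others). It then studies new dimensionality reduction and clustering methods; for feature dimensionality reduction, it proposes a new method based on text similarity. Finally, text clustering is designed and implemented according to the proposed algorithms. Tests show that the proposed methods improve not only the accuracy and purity of clustering but also its speed.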
With the spread of information technology across many fields, the data generated by applications is growing exponentially. Processing this data effectively and extracting useful knowledge from it is an urgent problem. Data mining is a new technology that emerged to meet the need to fully understand and effectively apply the information and knowledge contained in data. Cluster analysis divides data into several categories, or clusters, according to the similarity between data objects, and is a good way to discover the structure and knowledge in the data. It is also a useful preprocessing step for collected data before statistical analysis.
     Cluster analysis is an unsupervised process of grouping data by similarity. It divides documents into clusters that users can readily understand, so users can quickly grasp the content of a large number of texts, speed up analysis, and support decision-making. Cluster analysis has been used in many fields, such as pattern recognition, image processing, and information retrieval (IR). The type of data set varies with the need: ordinal, scalar, text, and other types. This paper mainly studies the clustering of text.
     This paper studies the dimensionality reduction methods and clustering algorithms involved in text clustering. First, for text preprocessing, a word segmentation method that incorporates word frequency is proposed; it improves segmentation accuracy and prepares for the construction of the text model and for dimensionality reduction. Second, a dimensionality reduction method based on text similarity is proposed; it extracts the words highly relevant to a text by calculating each word's contribution to the text category, which improves the efficiency and precision of text clustering. Finally, the paper proposes a text clustering algorithm based on DK-Means that improves clustering accuracy and speed.
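For orientation, the pipeline described here (a text model, a similarity measure, then a K-Means-style algorithm) can be sketched with a plain K-Means baseline. This is not the DK-Means algorithm, whose details the abstract does not give, and it does not reproduce the proposed segmentation or dimensionality reduction methods. The sketch assumes documents are already segmented into word lists, uses TF-IDF weights with cosine similarity, and takes the first `k` documents as initial centroids to stay deterministic. The names `tfidf_vectors`, `cosine`, and `kmeans` are illustrative only.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Turn pre-segmented documents (lists of words) into unit TF-IDF vectors."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, max_iter=20):
    """Plain K-Means with cosine similarity; returns one cluster label per vector."""
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    # Assumption for this sketch: the first k vectors seed the centroids.
    centroids = [dict(v) for v in vectors[:k]]
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each vector goes to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] += w
            if total:  # an empty cluster keeps its previous centroid
                centroids[j] = normalize(total)
    return labels
```

Calling `kmeans(tfidf_vectors(docs), k)` on a list of word lists returns one integer cluster label per document. Seeding from the first `k` documents keeps the sketch reproducible; a practical run would usually choose initial centroids more carefully.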
     The paper first introduces cluster analysis, a technique from the field of data mining. It then introduces the techniques related to text clustering, including text preprocessing, the text model, feature dimensionality reduction, and text clustering algorithms, and proposes a new feature dimensionality reduction method based on text similarity and a new text clustering algorithm. Finally, the paper designs and implements text clustering using the new dimensionality reduction method and clustering algorithm. Experiments show that the approach improves not only the accuracy and purity of the clustering results but also the speed of text clustering.
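Purity, one of the quality measures mentioned above, is commonly defined as the fraction of documents that fall in the majority class of their assigned cluster. The sketch below implements that common definition; the abstract does not state which formula the thesis itself uses, so treat the definition and the function name `purity` as assumptions.

```python
from collections import Counter


def purity(cluster_labels, true_classes):
    """Fraction of items that belong to the majority true class of their cluster."""
    if len(cluster_labels) != len(true_classes):
        raise ValueError("both label lists must have the same length")
    if not cluster_labels:
        raise ValueError("at least one item is required")
    # Count the true classes that occur inside each cluster.
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, true_classes):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    # Sum the size of the largest class in every cluster, then normalize.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_labels)
```

A value of 1.0 means every cluster contains documents of a single class. Purity rises as the number of clusters grows, so it is usually compared only across results with the same number of clusters.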
