Research on the Theory and Techniques of Semi-supervised Clustering Ensembles
Abstract
Clustering analysis is an important technique in data mining and machine learning and is widely applied in many fields, especially in the processing and analysis of big data. Given a similarity measure, clustering partitions the data objects into clusters such that intra-cluster similarity is maximized and inter-cluster similarity is minimized. In practical problems, unsupervised clustering methods cannot exploit even a small amount of prior knowledge, and a single clustering algorithm can hardly cope with data sets whose structure and distribution are complex and variable. Semi-supervised clustering ensembles make up for these deficiencies: by bringing semi-supervised learning and ensemble learning into clustering analysis, they can effectively improve clustering performance. However, because research on semi-supervised clustering ensembles has only recently emerged, much of its theoretical foundation is still immature, and theoretical study can provide solid support for the further development of the technique.
     A semi-supervised clustering ensemble makes full use of prior knowledge to guide the clustering process and thus improve clustering performance, while the ensemble-learning idea combines multiple base clustering results into a better partition. Inspired by research on semi-supervised learning and clustering ensembles, and drawing on probability and statistics, this thesis presents a mathematical analysis and discussion of the theory of semi-supervised clustering ensembles. Under certain assumptions about the ensemble model and its parameters, the convergence of the semi-supervised clustering ensemble is proved and analysed, and the concept of a robustness radius is introduced to quantify the range over which the ensemble remains robust. The thesis then proposes a relabeling method based on the contingency matrix to align the class labels of the base clusterings (partitions), and incorporates prior knowledge, in the form of pairwise constraints, into a semi-supervised clustering ensemble model based on majority voting. Experimental results show that prior knowledge improves the performance of both the base clusterings and the ensemble, that the semi-supervised clustering ensemble exhibits convergence and robustness, and that the improved majority-voting method achieves good clustering results.
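As a rough illustration of the relabeling-and-voting idea described above (not the thesis's exact algorithm), the sketch below aligns each base partition to a reference partition through their contingency matrix, using the Hungarian method from SciPy, and then takes a per-object majority vote. It assumes every base partition labels the same n objects with integers 0..k-1; the function names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def align_labels(reference, partition, k):
    """Relabel `partition` so that its cluster ids agree with `reference`
    as much as possible, via the k x k contingency matrix."""
    contingency = np.zeros((k, k), dtype=int)
    for r, p in zip(reference, partition):
        contingency[r, p] += 1
    # Maximise the matched counts (minimise their negation).
    row_ind, col_ind = linear_sum_assignment(-contingency)
    mapping = {old: new for new, old in zip(row_ind, col_ind)}
    return np.array([mapping[p] for p in partition])

def majority_vote(partitions, k):
    """Combine label-aligned base partitions by per-object majority voting."""
    partitions = np.asarray(partitions)               # shape (m, n)
    aligned = [partitions[0]]
    for p in partitions[1:]:
        aligned.append(align_labels(partitions[0], p, k))
    aligned = np.vstack(aligned)
    return np.array([np.bincount(aligned[:, j], minlength=k).argmax()
                     for j in range(aligned.shape[1])])
```

With base partitions produced, for example, by repeated k-means runs under different initializations, `majority_vote(partitions, k)` returns a consensus labeling of the n objects.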
     A semi-supervised clustering ensemble can effectively use prior knowledge to guide the clustering and ensemble processes, and by fusing base partitions that exhibit a certain degree of diversity it can effectively improve clustering performance. Based on statistical theory, this thesis proves that the semi-supervised clustering ensemble method converges, analyses its robustness and proposes a robustness measure, and presents a semi-supervised clustering ensemble model based on majority voting. Experimental results show that the ensemble result converges as the number of diverse base partitions increases and that its robustness is good; when prior knowledge is fully exploited, the majority-voting semi-supervised clustering ensemble method improves clustering performance and outperforms other clustering ensemble algorithms.
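One common way to let pairwise prior knowledge guide a base clustering is a COP-k-means style assignment rule, in which each object is assigned to the nearest centre that does not violate any must-link or cannot-link constraint. The sketch below shows a single such assignment pass; it is a generic illustration of how pairwise constraints can steer clustering, not the specific constrained majority-voting model proposed in the thesis, and the helper names are hypothetical.

```python
import numpy as np

def violates(idx, cluster, labels, must_link, cannot_link):
    """Would assigning object `idx` to `cluster` break a constraint,
    given the labels assigned so far (-1 means unassigned)?"""
    for a, b in must_link:
        other = b if a == idx else a if b == idx else None
        if other is not None and labels[other] not in (-1, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == idx else a if b == idx else None
        if other is not None and labels[other] == cluster:
            return True
    return False

def constrained_assign(X, centers, must_link, cannot_link):
    """One constrained assignment pass: each object goes to the nearest
    centre whose choice does not violate the pairwise constraints."""
    labels = np.full(X.shape[0], -1, dtype=int)
    for i in range(X.shape[0]):
        order = np.argsort(((X[i] - centers) ** 2).sum(axis=1))
        for c in order:
            if not violates(i, c, labels, must_link, cannot_link):
                labels[i] = c
                break
        else:
            # COP-k-means would report failure here; this sketch simply
            # falls back to the nearest centre.
            labels[i] = order[0]
    return labels
```

Repeating the assignment and centre-update steps until the labels stabilise yields one constraint-respecting base partition; an ensemble of such partitions can then be combined by the voting scheme sketched earlier.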