Abstract
With the explosive growth of data volumes, using clustering algorithms such as K-means to mine the latent value of big data has become an important research direction. Combining the Canopy algorithm with K-means addresses the problem of choosing the K initial center points. However, Canopy-Kmeans still selects its initial center points at random and is sensitive to noise points. To address these problems, this paper proposes M-Canopy-Kmeans, an algorithm improved with density peaks, and parallelizes it on the Spark framework. Experimental results show that the improved algorithm avoids the blind selection of Canopy center points and effectively excludes noise points from the samples, giving clear gains in accuracy and noise robustness, and that it achieves good speedup and scalability on the Spark parallel framework.
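The abstract does not give the details of M-Canopy-Kmeans, so the following is only a minimal single-machine sketch of the general idea: choose initial centers by density peaks instead of at random, set aside low-density points as noise, then run ordinary K-means. It assumes the density-peak quantities of Rodriguez and Laio (local density rho and distance delta to the nearest denser point). The function names, the cutoff `d_c`, the `min_density` threshold, and the tie-breaking rule are all hypothetical choices made for illustration, not the authors' method, and the Spark parallelization is not shown.

```python
import math  # math.dist requires Python 3.8+


def density_peak_centers(points, k, d_c, min_density=1):
    """Pick k seeds by density peaks; return (seeds, indices of noise points).

    Assumed rule: rho = number of neighbours closer than d_c, delta = distance
    to the nearest point of higher density, score = rho * delta.
    """
    n = len(points)
    rho = [
        sum(1 for j in range(n) if j != i and math.dist(points[i], points[j]) < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        # Ties in rho are broken by index so that only one point per dense
        # region ends up with a large delta.
        higher = [
            math.dist(points[i], points[j])
            for j in range(n)
            if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)
        ]
        if higher:
            delta.append(min(higher))
        else:
            # Densest point overall: use its largest distance to any point.
            delta.append(max(math.dist(points[i], points[j]) for j in range(n)))
    # Points with too few neighbours are treated as noise (an assumption).
    noise = {i for i in range(n) if rho[i] < min_density}
    ranked = sorted(
        (i for i in range(n) if i not in noise),
        key=lambda i: rho[i] * delta[i],
        reverse=True,
    )
    return [points[i] for i in ranked[:k]], noise


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    return centers, clusters


def cluster_with_density_peaks(points, k, d_c, min_density=1):
    """Seed K-means with density peaks after dropping noise points."""
    seeds, noise = density_peak_centers(points, k, d_c, min_density)
    kept = [p for i, p in enumerate(points) if i not in noise]
    return kmeans(kept, seeds), [points[i] for i in sorted(noise)]
```

Calling `cluster_with_density_peaks(points, k=2, d_c=0.5)` on a list of coordinate tuples returns the final centers and clusters together with the points set aside as noise. The pairwise distance computation here is quadratic in the number of points, which is one reason a distributed implementation would be needed at scale.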