Research on Clustering Methods for High Dimensional Data and Their Applications (高维数据的聚类方法研究与应用)

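English title: Research on Clustering Methods for High Dimensional Data and Their Applications
Author: Chen Lifei (陈黎飞)
Degree level: Doctoral
Discipline: Foundations of Artificial Intelligence
Chinese keywords: 高维数据 ; 子空间聚类 ; 聚类有效性 ; 信息安全
English keywords: High dimensional data ; Subspace clustering ; Cluster validity ; Information security
Degree year: 2008
Advisor: Jiang Qingshan (姜青山)
Discipline code: 081101
Degree-granting institution: Xiamen University
Submission date: 2008-05-01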

Abstract

Clustering analysis is an important research topic in data mining and is widely used in fields such as information filtering, automatic document categorization, and bioinformatics. With advances in technology, the data in many of these applications have become very high dimensional: document collections and gene expression data, for example, can have hundreds or thousands of attributes (dimensions), or even more. Because such data are so common, clustering high dimensional data is of great importance.
     Data behave very differently in high dimensional space than in low dimensional space. In many cases, because of the inherent sparsity of the data, similarity measures commonly used for low dimensional clustering, such as the L_p distance, become much less effective. In addition, clusters in high dimensional space often exist only in certain low dimensional subspaces, and different clusters may lie in different subspaces. Because of this "curse of dimensionality", many clustering methods that work well on low dimensional data perform poorly on high dimensional data, so special methods are needed.
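The loss of contrast in L_p distances can be seen in a small experiment. The sketch below is only an illustration and is not taken from the thesis: it assumes points drawn uniformly from the unit hypercube, and the function name `relative_contrast` and its parameters are hypothetical. It measures how far apart the nearest and farthest neighbours of a query point are, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, p=2.0, seed=0):
    """Return (d_max - d_min) / d_min for the L_p distances from one random
    query point to n_points uniform random points in [0, 1]^dim.

    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # L_p (Minkowski) distance between the query and this point.
        total = sum(abs(a - b) ** p for a, b in zip(query, point))
        dists.append(total ** (1.0 / p))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# Print the contrast for a few dimensionalities.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the printed contrast should fall as the dimension grows, which means the nearest and farthest neighbours become hard to tell apart by distance alone.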
     To address these problems, this thesis starts from a probability model of subspace clusters in high dimensional data and, on that basis, proposes new subspace clustering algorithms and studies cluster validity for high dimensional data. The methods are also applied to text categorization, network intrusion detection, and malware detection. The work has both theoretical significance and practical value.
     The main contributions are as follows:
     1. A probability model describing subspace clusters in high dimensional space is proposed, together with its learning algorithm, and the objective function optimized by subspace clustering algorithms is analyzed.
     2. The relationship between existing soft subspace clustering algorithms and the probability model is established, and two representative algorithms are improved in terms of stability and clustering accuracy. A method for detecting local outliers is also proposed, which makes the subspace clustering algorithms more robust.
     3. A new definition of fuzzy membership is derived from the probability model, and a fuzzy clustering algorithm for high dimensional data is proposed. Three traditional cluster validity indices are adapted to subspace clustering and, combined with the fuzzy algorithm, are used to estimate the number of subspace clusters in a high dimensional dataset.
     4. A method based on hierarchical partitioning is presented for determining the best number of clusters. It avoids the inefficiency of the traditional approach, which has to cluster a large, high dimensional dataset repeatedly.
     5. Subspace clustering is applied to supervised text categorization, yielding a new classification algorithm with linear time complexity. The proposed clustering methods are also applied to key feature selection for a network intrusion detection system and to computer-aided malware identification in a practical project.
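The idea behind soft subspace clustering, in which each cluster gives its own weight to every dimension, can be sketched as follows. This is a generic illustration in the style of entropy-regularized feature weighting and is not the algorithm proposed in the thesis. The function names and the smoothing parameter `gamma` are assumptions introduced for the example.

```python
import math


def cluster_feature_weights(points, gamma=0.5):
    """Compute a centre and per-dimension weights for ONE cluster.

    points: list of equal-length lists, the members of a single cluster.
    gamma:  hypothetical smoothing parameter; larger values flatten the weights.
    """
    n, dim = len(points), len(points[0])
    center = [sum(p[j] for p in points) / n for j in range(dim)]
    # Within-cluster dispersion along each dimension.
    dispersion = [sum((p[j] - center[j]) ** 2 for p in points)
                  for j in range(dim)]
    # Softmax of negative dispersion: compact dimensions get large weights.
    smallest = min(dispersion)  # subtracted only for numerical stability
    scores = [math.exp(-(d - smallest) / gamma) for d in dispersion]
    total = sum(scores)
    return center, [s / total for s in scores]


def weighted_distance(x, center, weights):
    """Squared distance in which each dimension counts by its cluster weight."""
    return sum(w * (a - c) ** 2 for a, c, w in zip(x, center, weights))


# A cluster that is tight in dimension 0 and spread out in dimension 1.
cluster = [[1.0, 0.0], [1.1, 4.0], [0.9, 8.0], [1.0, 2.0], [1.0, 6.0]]
center, weights = cluster_feature_weights(cluster)
print(weights)
print(weighted_distance([1.0, 100.0], center, weights))
print(weighted_distance([3.0, 4.0], center, weights))
```

In this sketch nearly all of the weight goes to dimension 0, so a point that matches the cluster there stays close even when it is far away in the irrelevant dimension. A complete algorithm would alternate this weight update with reassigning points and recomputing centres.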

