Research on Data Stream Classification Algorithms Adapting to Concept Drift
Abstract
In recent years, a new research topic has emerged in the data mining community: data stream mining. It has been widely applied in practice, for example in stock analysis, network fault monitoring, and credit card fraud detection. Classification mining is one of the important branches of data mining research. Mature data stream classification algorithms include VFDT, which is based on Hoeffding trees; CVFDT, which adapts to concept drift; ensemble classifiers; and VFDTc. Among these, the ensemble approach is the most widely used in data stream classification. Concept drift, one of the defining characteristics of data streams, is the greatest challenge that every data stream classification algorithm must face, and an algorithm's performance largely depends on its ability to adapt to such drift. Most of today's best-performing classification algorithms therefore adopt ensemble methods.
     This thesis first reviews the relevant theory of data mining, describes the classic data stream classification algorithm EC4.5 in detail, and introduces the notion of concept drift. Compared with EC4.5, the CEEPCE algorithm improves classification accuracy, but its ability to adapt to concept drift remains insufficient. The data stream classification algorithm proposed here optimizes classification performance through the construction, elimination, and updating of an ensemble of classifiers, together with diversity strengthening. First, the construction of the base classifiers is described: the properties of eEPs (essential emerging patterns) are exploited to build base classifiers with high discriminative power. Second, an elimination criterion is given, which removes base classifiers according to their classification error rates. In addition, two improvements to CEEPCE are proposed. The first is a diversity-strengthening method based on classification error weights, which raises the accuracy of the ensemble: while preserving the performance of the individual base classifiers, the final set of classifiers is selected through diversity strengthening. The second improvement optimizes the update strategy of the ensemble: when updating, each classifier's average error rate is compared with the error rate of random classification to decide whether to add the opposite classifier. Although this check reduces time efficiency, its advantage in adapting to the current concept becomes evident when large volumes of data are classified.
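To make the ensemble maintenance steps described above concrete, the following is a minimal Python sketch, not the thesis implementation: base classifiers are trained on successive data chunks, the worst-performing member is eliminated once the ensemble is full, and a member whose error on the current chunk exceeds the random-guessing error rate is replaced by its vote-inverting "opposite" for binary labels. The names `ChunkEnsemble` and `max_members`, and the use of scikit-learn decision trees as base classifiers, are illustrative assumptions rather than details taken from the thesis; eEP-based base classifiers, error-weighted voting, and diversity strengthening are omitted.

```python
# A minimal sketch of error-rate-driven ensemble maintenance for a binary
# data stream with concept drift. Names (ChunkEnsemble, max_members) and the
# choice of decision trees are illustrative assumptions, not the thesis' code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class ChunkEnsemble:
    def __init__(self, max_members=10, n_classes=2):
        self.max_members = max_members
        self.n_classes = n_classes
        self.members = []            # list of (classifier, invert_votes)

    def _error_rate(self, clf, X, y, invert):
        pred = clf.predict(X)
        if invert:                   # "opposite classifier": flip binary votes
            pred = 1 - pred
        return float(np.mean(pred != y))

    def update(self, X_chunk, y_chunk):
        """Train a new base classifier on the latest chunk, then adjust and
        prune the ensemble using error rates measured on that chunk."""
        new_clf = DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk)
        self.members.append((new_clf, False))

        random_error = 1.0 - 1.0 / self.n_classes   # error of random guessing

        # Update step: a member that does worse than random guessing on the
        # current chunk is replaced by its vote-inverted "opposite" classifier.
        updated = []
        for clf, invert in self.members:
            err = self._error_rate(clf, X_chunk, y_chunk, invert)
            if err > random_error and self.n_classes == 2:
                invert = not invert
            updated.append((clf, invert))
        self.members = updated

        # Elimination step: drop the member with the highest error on the
        # current chunk once the ensemble exceeds its size budget.
        if len(self.members) > self.max_members:
            errors = [self._error_rate(c, X_chunk, y_chunk, inv)
                      for c, inv in self.members]
            self.members.pop(int(np.argmax(errors)))

    def predict(self, X):
        """Unweighted majority vote over all current ensemble members."""
        votes = np.zeros((len(X), self.n_classes))
        for clf, invert in self.members:
            pred = clf.predict(X)
            if invert:
                pred = 1 - pred
            votes[np.arange(len(X)), pred.astype(int)] += 1
        return votes.argmax(axis=1)
```

A caller would repeatedly feed successive labelled chunks of the stream to `update` and call `predict` on incoming records between updates; the thesis's error-weighted voting and diversity-based selection would replace the plain majority vote shown here.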
