Research on Classification Algorithms for Imbalanced Mixed-Type Data and Their Applications

Author: Chen Yuzhou (陈宇宙)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: k-nearest neighbours by counting; imbalanced data; global density; k-local density; boundary point detection
Degree year: 2008
Advisor: Liao Zhifang (廖志芳)
Discipline code: 081203
Degree-granting institution: Central South University
Thesis submission date: 2008-05-01

Abstract

Classifying imbalanced mixed-type data is a very common task in real-world applications. Such data are unevenly distributed across classes and have attributes of diverse types. Traditional classification methods are not very effective on this kind of data, and when the minority-class samples are sufficiently important, they can even lead to considerable losses. Methods for handling imbalanced mixed-type data have therefore become one of the focal points of current data mining research in China and abroad.
     This thesis takes traditional classification methods as its basis and improves them to handle imbalanced mixed-type data. Analysis shows that the k-nearest neighbours by counting algorithm (CwkNN) can classify mixed-type data effectively, but that it performs poorly on imbalanced data. Building on CwkNN and taking the imbalance of the data into account, the thesis proposes three improved classification methods:
     (1) Global density classification algorithm: because CwkNN cannot handle imbalanced data, a global density is introduced to rebalance the influence of the data on classification. Experiments show that this raises the classification accuracy of minority-class samples but lowers that of majority-class samples.
     (2) k-local density classification algorithm: because the global density classification algorithm lowers the classification accuracy of majority-class samples, a k-local density is introduced so that the accuracy on minority-class samples improves without reducing the accuracy on majority-class samples. Experiments show that this method effectively improves classification accuracy on imbalanced data.
     (3) Density-based boundary point detection and classification algorithm: for the boundary points in the data, a density-based boundary point detection method is proposed, and the detected boundary points are classified using three boundary-point classification methods. Experiments show that these methods can correctly classify imbalanced data that contain boundary points.
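
The abstract does not give the formulas behind these methods, so the following is only a rough sketch of the rebalancing idea in method (1), under stated assumptions. The similarity function is a simple stand-in for mixed numerical and categorical attributes, not the neighbourhood-counting measure of CwkNN, and dividing each class's vote by its class size is an assumed, illustrative reading of "global density". All names (`mixed_similarity`, `classify`, `rebalance`) and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def mixed_similarity(x, y, ranges):
    """Average per-attribute similarity for records mixing numbers and categories.

    ranges[i] is the (min, max) of numerical attribute i, or None when
    attribute i is categorical. Assumed stand-in, not the CwkNN measure.
    """
    total = 0.0
    for xi, yi, rng in zip(x, y, ranges):
        if rng is None:
            # Categorical attribute: 1 for an exact match, 0 otherwise.
            total += 1.0 if xi == yi else 0.0
        else:
            # Numerical attribute: closeness normalised by the attribute range.
            lo, hi = rng
            span = (hi - lo) or 1.0
            total += 1.0 - abs(xi - yi) / span
    return total / len(ranges)


def classify(query, data, labels, ranges, k=None, rebalance=False):
    """Similarity-weighted vote over the k most similar records (all if k is None).

    With rebalance=True each class's vote is divided by the size of that
    class, an assumed illustration of rebalancing by a global density.
    """
    scored = sorted(
        ((mixed_similarity(query, x, ranges), y) for x, y in zip(data, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if k is not None:
        scored = scored[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim
    if rebalance:
        sizes = Counter(labels)
        for label in votes:
            votes[label] /= sizes[label]
    return max(votes, key=votes.get)


# Hypothetical toy data: six majority-class and two minority-class records,
# each with one numerical and one categorical attribute.
ranges = [(0.0, 10.0), None]
data = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (2.0, "b"),
        (4.0, "a"), (3.0, "b"), (8.0, "b"), (9.0, "b")]
labels = ["neg"] * 6 + ["pos"] * 2
query = (8.5, "b")

plain = classify(query, data, labels, ranges)
balanced = classify(query, data, labels, ranges, rebalance=True)
```

On this toy data the query sits next to the two minority records, yet the unweighted vote goes to the majority class simply because it has more members. Dividing by class size reverses that. The same correction also shrinks every majority vote, which is consistent with the loss of majority-class accuracy that the abstract reports for method (1) and that method (2) is meant to avoid.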

