Abstract
Aiming at the problem that feature selection on the large-scale unsupervised data sets commonly encountered in big data is too slow to meet practical requirements, a fast attribute selection algorithm was proposed on the basis of the classical incremental absolute-reduction algorithm in rough set theory. Firstly, the large-scale data set was regarded as a randomly arriving sequence of objects, and the candidate reduct was initialized to the empty set. Then, objects were drawn one by one from the data set at random without replacement; for each drawn object, it was checked whether the current candidate reduct could distinguish the object from every object in the current object set that should be distinguished, after which the object was merged into the current object set; if the object could not be distinguished, a suitable attribute was added to the candidate reduct. Finally, if no indistinguishable object was found for I consecutive draws, the candidate reduct was taken as the reduct of the large-scale data set. Experiments on five unsupervised large-scale data sets demonstrate that the obtained reduct distinguishes more than 95% of the object pairs, and that computing it takes less than 1% of the time required by the discernibility-matrix-based algorithm and the incremental absolute-reduction algorithm. In a text topic mining experiment, the topics mined from the reduced data set were essentially consistent with those mined from the original data set. Both experiments show that the proposed method can perform attribute selection on large-scale data sets effectively and quickly.
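The draw-check-add procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the object representation (attribute-value tuples), the rule that two objects "should be distinguished" if and only if they differ on some attribute of the full set, and the parameter name `max_consecutive` (the paper's I) are all simplifying assumptions.

```python
import random

def incremental_reduct(objects, attributes, max_consecutive=100):
    """Sketch of the incremental absolute-reduction idea: draw objects at
    random without replacement, grow the candidate reduct whenever the
    new object cannot be distinguished from a previously seen object,
    and stop early after max_consecutive trouble-free draws."""
    pool = list(objects)
    random.shuffle(pool)      # random arrival order, drawn without replacement
    reduct = set()            # candidate reduct starts empty
    seen = []                 # current object set
    consecutive_ok = 0
    for x in pool:
        ok = True
        for y in seen:
            # should x and y be distinguished (differ on the full attribute set)?
            if any(x[a] != y[a] for a in attributes):
                # can the current candidate reduct already distinguish them?
                if not any(x[a] != y[a] for a in reduct):
                    # add one attribute on which x and y differ
                    reduct.add(next(a for a in attributes if x[a] != y[a]))
                    ok = False
        seen.append(x)        # merge x into the current object set
        consecutive_ok = consecutive_ok + 1 if ok else 0
        if consecutive_ok >= max_consecutive:
            break             # I consecutive distinguishable objects: stop
    return reduct
```

Because the candidate reduct only ever grows, every pair of objects checked earlier remains distinguishable after later attributes are added, which is why the early-stopping criterion on I consecutive successes is a reasonable termination heuristic rather than an exact guarantee over all object pairs.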
References
[1]THEODORIDIS S,KOUTROUMBAS K.Pattern recognition[M].4th ed.New York:Academic Press,2009:261-322.
[2]SUN J.Modern pattern recognition[M].2nd ed.Beijing:Higher Education Press,2008:272-337.(孙即祥.现代模式识别[M].2版.北京:高等教育出版社,2008:272-337.)
[3]KOHAVI R,JOHN G H.Wrappers for feature subset selection[J].Artificial Intelligence,1997,97(1/2):273-324.
[4]PAWLAK Z.Rough sets[J].International Journal of Computer&Information Sciences,1982,11(5):341-356.
[5]KOMOROWSKI J,PAWLAK Z,POLKOWSKI L,et al.Rough sets:a tutorial[M]//Rough fuzzy hybridization:a new trend in decision making.Berlin:Springer,1999:3-98.
[6]GUO N,SUN X,LIN H,et al.Malware detection based on attributes order reduction[J].Journal of Computer Applications,2011,31(4):1006-1009.(郭宁,孙晓妍,林和,等.基于属性序约简的恶意代码检测[J].计算机应用,2011,31(4):1006-1009.)
[7]WROBLEWSKI J.Finding minimal reducts using genetic algorithms,ICS Research report 16/95[R].Warsaw:Warsaw University of Technology,1995:186-189.
[8]MA X,WANG G,YU H.Heuristic method to attribute reduction for decision region distribution preservation[J].Journal of Software,2014,25(8):1761-1780.(马希骜,王国胤,于洪.决策域分布保持的启发式属性约简方法[J].软件学报,2014,25(8):1761-1780.)
[9]GAO C,MIAO D,ZHANG Z,et al.Rough set based attribute reduction with consistent confidence[J].Journal of Computer Applications,2012,32(4):1067-1069.(高灿,苗夺谦,张志飞,等.粗糙集信度一致属性约简[J].计算机应用,2012,32(4):1067-1069.)
[10]HU X,CERCONE N.Learning in relational databases:a rough set approach[J].Computational Intelligence,1995,11(2):323-338.
[11]WANG X,YANG J,TENG X,et al.Feature selection based on rough sets and particle swarm optimization[J].Pattern Recognition Letters,2007,28(4):459-471.
[12]XIAO D,WANG G,HU F.Fast parallel attribute reduction algorithm based on rough set theory[J].Computer Science,2009,36(3):208-211.(肖大伟,王国胤,胡峰.一种基于粗糙集理论的快速并行属性约简算法[J].计算机科学,2009,36(3):208-211.)
[13]HU F,WANG G,HUANG H,et al.Incremental attribute reduction based on elementary sets[C]//RSFDGrC 2005:Proceedings of the 10th International Conference on Rough Sets,Fuzzy Sets,Data Mining,and Granular Computing,LNCS 3641.Berlin:Springer,2005:185-193.
[14]LIANG J,WANG F,DANG C,et al.A group incremental approach to feature selection applying rough set technique[J].IEEE Transactions on Knowledge and Data Engineering,2014,26(2):294-308.
[15]LIANG J,WANG F,DANG C,et al.An efficient rough feature selection algorithm with a multi-granulation view[J].International Journal of Approximate Reasoning,2012,53(6):912-926.
[16]XU Z,ZHANG C,ZHANG S,et al.Efficient attribute reduction based on discernibility matrix[C]//RSKT 2007:Proceedings of the Second International Conference on Rough Sets and Knowledge Technology,LNCS 4481.Berlin:Springer,2007:13-21.
[17]KIRA K,RENDELL L.The feature selection problem:traditional methods and a new algorithm[C]//AAAI 1992:Proceedings of the Tenth National Conference on Artificial Intelligence.Menlo Park:AAAI Press,1992:129-134.
[18]University of California,Irvine.Center for Machine Learning and Intelligent Systems,UCI Machine Learning Repository[DB/OL].[2015-01-12].http://archive.ics.uci.edu/ml/datasets.html.
[19]ROWEIS S.Data for Matlab hackers[DB/OL].[2015-01-12].http://www.cs.nyu.edu/~roweis/data.html.