Abstract
Aiming at the problem that feature selection on the large-scale unsupervised data sets commonly encountered in big data is too slow to meet practical requirements, a fast attribute selection algorithm was proposed on the basis of the classical incremental absolute-reduction algorithm in rough set theory. Firstly, the large-scale data set was regarded as a randomly arriving sequence of objects, and the candidate reduct was initialized to the empty set. Then, objects were drawn one by one from the data set at random without replacement; for each drawn object, it was checked whether the current candidate reduct could distinguish the object from every object in the current object set that should be distinguished, after which the object was merged into the current object set; if the object could not be distinguished, a suitable attribute was added to the candidate reduct. Finally, if no indistinguishable object was found for I consecutive draws, the candidate reduct was taken as the reduct of the large-scale data set. Experiments on five unsupervised large-scale data sets demonstrate that the obtained reduct distinguishes more than 95% of the object pairs, and that computing it takes less than 1% of the time required by the discernibility-matrix-based algorithm and the incremental absolute-reduction algorithm. In a text topic mining experiment, the topics mined from the reduced data set were essentially consistent with those mined from the original data set. Both experiments show that the proposed method can perform attribute selection on large-scale data sets effectively and quickly.
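The draw-check-add procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the object representation (attribute-value tuples), the rule that two objects "should be distinguished" if and only if they differ on some attribute of the full set, and the parameter name `max_consecutive` (the paper's I) are all simplifying assumptions.

```python
import random

def incremental_reduct(objects, attributes, max_consecutive=100):
    """Sketch of the incremental absolute-reduction idea: draw objects at
    random without replacement, grow the candidate reduct whenever the
    new object cannot be distinguished from a previously seen object,
    and stop early after max_consecutive trouble-free draws."""
    pool = list(objects)
    random.shuffle(pool)      # random arrival order, drawn without replacement
    reduct = set()            # candidate reduct starts empty
    seen = []                 # current object set
    consecutive_ok = 0
    for x in pool:
        ok = True
        for y in seen:
            # should x and y be distinguished (differ on the full attribute set)?
            if any(x[a] != y[a] for a in attributes):
                # can the current candidate reduct already distinguish them?
                if not any(x[a] != y[a] for a in reduct):
                    # add one attribute on which x and y differ
                    reduct.add(next(a for a in attributes if x[a] != y[a]))
                    ok = False
        seen.append(x)        # merge x into the current object set
        consecutive_ok = consecutive_ok + 1 if ok else 0
        if consecutive_ok >= max_consecutive:
            break             # I consecutive distinguishable objects: stop
    return reduct
```

Because the candidate reduct only ever grows, every pair of objects checked earlier remains distinguishable after later attributes are added, which is why the early-stopping criterion on I consecutive successes is a reasonable termination heuristic rather than an exact guarantee over all object pairs.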
References
[1]THEODORIDIS S,KOUTROUMBAS K.Pattern recognition[M].4th ed.New York:Academic Press,2009:261-322.
[2]SUN J.Modern pattern recognition[M].2nd ed.Beijing:Higher Education Press,2008:272-337.(孙即祥.现代模式识别[M].2版.北京:高等教育出版社,2008:272-337.)
[3]KOHAVI R,JOHN G H.Wrappers for feature subset selection[J].Artificial Intelligence,1997,97(1/2):273-324.
[4]PAWLAK Z.Rough sets[J].International Journal of Computer&Information Sciences,1982,11(5):341-356.
[5]KOMOROWSKI J,PAWLAK Z,POLKOWSKI L,et al.Rough sets:a tutorial[M]//Rough fuzzy hybridization:a new trend in decision making.Berlin:Springer,1999:3-98.
[6]GUO N,SUN X,LIN H,et al.Malware detection based on attributes order reduction[J].Journal of Computer Applications,2011,31(4):1006-1009.(郭宁,孙晓妍,林和,等.基于属性序约简的恶意代码检测[J].计算机应用,2011,31(4):1006-1009.)
[7]WROBLEWSKI J.Finding minimal reducts using genetic algorithms,ICS Research report 16/95[R].Warsaw:Warsaw University of Technology,1995:186-189.
[8]MA X,WANG G,YU H.Heuristic method to attribute reduction for decision region distribution preservation[J].Journal of Software,2014,25(8):1761-1780.(马希骜,王国胤,于洪.决策域分布保持的启发式属性约简方法[J].软件学报,2014,25(8):1761-1780.)
[9]GAO C,MIAO D,ZHANG Z,et al.Rough set based attribute reduction with consistent confidence[J].Journal of Computer Applications,2012,32(4):1067-1069.(高灿,苗夺谦,张志飞,等.粗糙集信度一致属性约简[J].计算机应用,2012,32(4):1067-1069.)
[10]HU X,CERCONE N.Learning in relational databases:a rough set approach[J].Computational Intelligence,1995,11(2):323-338.
[11]WANG X,YANG J,TENG X,et al.Feature selection based on rough sets and particle swarm optimization[J].Pattern Recognition Letters,2007,28(4):459-471.
[12]XIAO D,WANG G,HU F.Fast parallel attribute reduction algorithm based on rough set theory[J].Computer Science,2009,36(3):208-211.(肖大伟,王国胤,胡峰.一种基于粗糙集理论的快速并行属性约简算法[J].计算机科学,2009,36(3):208-211.)
[13]HU F,WANG G,HUANG H,et al.Incremental attribute reduction based on elementary sets[C]//RSFDGrC 2005:Proceedings of the 10th International Conference on Rough Sets,Fuzzy Sets,Data Mining,and Granular Computing,LNCS 3641.Berlin:Springer,2005:185-193.
[14]LIANG J,WANG F,DANG C,et al.A group incremental approach to feature selection applying rough set technique[J].IEEE Transactions on Knowledge and Data Engineering,2014,26(2):294-308.
[15]LIANG J,WANG F,DANG C,et al.An efficient rough feature selection algorithm with a multi-granulation view[J].International Journal of Approximate Reasoning,2012,53(6):912-926.
[16]XU Z,ZHANG C,ZHANG S,et al.Efficient attribute reduction based on discernibility matrix[C]//RSKT 2007:Proceedings of the Second International Conference on Rough Sets and Knowledge Technology,LNCS 4481.Berlin:Springer,2007:13-21.
[17]KIRA K,RENDELL L.The feature selection problem:traditional methods and a new algorithm[C]//AAAI 1992:Proceedings of the Tenth National Conference on Artificial Intelligence.Menlo Park:AAAI Press,1992:129-134.
[18]University of California,Irvine.Center for Machine Learning and Intelligent Systems,UCI Machine Learning Repository[DB/OL].[2015-01-12].http://archive.ics.uci.edu/ml/datasets.html.
[19]ROWEIS S.Data for Matlab hackers[DB/OL].[2015-01-12].http://www.cs.nyu.edu/~roweis/data.html.