A Spark-based Frequent Patterns Mining Algorithm for Uncertain Datasets (一种基于Spark的不确定数据集频繁模式挖掘算法)


English title: A Spark-based Frequent Patterns Mining Algorithm for Uncertain Datasets
Authors: YANG Yang (杨阳); DING Jiaman (丁家满); LI Haibin (李海滨); JIA Lianyin (贾连印); YOU Jinguo (游进国); JIANG Ying (姜瑛)
Keywords: uncertain data; data mining; frequent patterns; Spark
Journal code: XXYK
Journal (English title): Information and Control
Affiliation: Faculty of Information Engineering and Automation, Kunming University of Science and Technology (昆明理工大学信息工程与自动化学院)
Publication date: 2019-06-15
Publisher: Information and Control (信息与控制)
Year: 2019
Volume: 48
Funding: National Natural Science Foundation of China (51467007, 61562054, 61462050)
Language: Chinese
Article ID: XXYK201903004
Page count: 8
Issue: 03
CN: 21-1138/TP
Pages: 10-17

Abstract

Improving the performance of frequent pattern mining on massive uncertain datasets is an active research topic. Most traditional algorithms use a single measure (expectation, probability, or weight) as the support of an itemset, and algorithms that consider both probability and weight struggle to remain efficient at big-data scale. To address this, we propose UWEFP, a Spark-based frequent pattern mining algorithm for uncertain datasets. First, to account for both the probability and the weight of each item, UWEFP computes the maximum probability-weight value of every 1-itemset and prunes on that value. Second, to reduce the number of dataset scans, it exploits the strengths of the Spark framework with a novel UWEFP-tree structure, which shares the characteristics of the FP-tree and is used to build and mine the pattern tree. Finally, the algorithm is evaluated on UCI datasets in a Spark environment. The experimental results show that the proposed method improves efficiency while preserving the mining results.
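
The abstract does not give the formula for the maximum probability-weight value, so the following is only a minimal sketch of what the 1-itemset pruning step might look like, not the authors' method. It assumes an upper bound of a kind often used in weighted uncertain itemset mining: for each transaction, multiply the largest item probability by the largest item weight, then sum that product over every transaction containing the item. The function names, the `min_ratio` threshold, and the toy data are all hypothetical.

```python
from collections import defaultdict

def upper_bounds(transactions, weights):
    """Assumed per-item upper bound (not taken from the paper):
    sum, over transactions containing the item, of
    (max probability in the transaction) * (max weight in the transaction)."""
    bounds = defaultdict(float)
    for tx in transactions:
        if not tx:
            continue  # an empty transaction contributes nothing
        tx_bound = max(tx.values()) * max(weights[item] for item in tx)
        for item in tx:
            bounds[item] += tx_bound
    return dict(bounds)

def prune_items(transactions, weights, min_ratio):
    """Keep items whose upper bound reaches min_ratio * number of transactions."""
    threshold = min_ratio * len(transactions)
    bounds = upper_bounds(transactions, weights)
    return {item for item, bound in bounds.items() if bound >= threshold}

# Toy uncertain dataset: each transaction maps item -> existence probability.
transactions = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.8, "c": 0.4},
    {"b": 0.3, "c": 0.2},
]
# Hypothetical item weights in (0, 1].
weights = {"a": 1.0, "b": 0.5, "c": 0.2}

kept = prune_items(transactions, weights, min_ratio=0.5)
print(kept)
```

Because the bound never underestimates an item's expected weighted support, items that fall below the threshold can be discarded before any tree is built. On Spark, the per-transaction loop would plausibly become a `flatMap` that emits `(item, tx_bound)` pairs followed by `reduceByKey`, but that mapping is likewise an assumption rather than something the abstract states.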

