用户名: 密码: 验证码:
关联规则扩展模型的研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
近年来,数据库中的知识发现(Knowledge Discovery in Databases,KDD)已成为涉及人工智能和数据库等学科的一门非常活跃的研究领域。而关联规则是KDD要发现的一类重要的模式,它的挖掘问题是KDD的一个重要研究方向,本文着重对关联规则的扩展模型和挖掘算法进行了研究。
     传统关联规则挖掘假定数据库中每个项目具有相同的重要性,然而在很多实际应用中,事实并非如此:用户可能对某些特定的项目兴趣更大,而且同一项目在不同时段的重要性也可能是不同的。考虑到这一点,目前文献中有两类关联规则的扩展模型:加权关联规则和多支持度关联规则。加权关联规则的思想是为每个项目追加一个表示其重要性的权值,传统的关于支持的定义被扩展为加权支持,这样就可以发现较多用户感兴趣的关于重要项目的关联。多支持度关联规则放弃了传统模型中的单一最小支持阈值,取而代之的是为每个项目设置一个最小支持阈值,对比较重要的项目可以把阈值设的小一些,这样同样可以达到加权关联规则模型的效果。
     本文的主要贡献提出并研究了混合关联规则模型。
     负关联规则作为传统关联规则的对立面,也包含了有用的信息。然而,文献中对负关联规则的研究较少且并没有完全形式化。为此,通过在项目集中引入负项目,本文提出了一种关联规则的扩展模型:混合关联规则。它是传统关联规则和负规则的超集。
     本文提出并详细讨论了混合关联规则的三种挖掘算法:直接的算法,基于hash树的算法,和基于树的算法。第一种算法是从混合规则的定义出发采用直接的方法生成大项集,算法的复杂度较大,效率不高;第二种算法是对Apriori算法稍加修改得到的,也不够令人满意;为进一步提高执行效率,我们设计了基于一种树结构的算法。此外,我们还给出了一个冗余负关联规则的剪枝算法。所有算法均用C实现。为验证算法的有效性,我们根据理论分布,使用伪随机函数生成了若干数据集,进行了关联规则挖掘实验,并对实验结果进行了分析。
     然而,混合关联规则模型也有一些难题,如传统的“规则过多”问题这里表现得尤为突出。为此,本文探讨了在混合关联规则模型中引入权值或多支持度的将来研究方向。
In recent years, Know1edge Discovery in Databases (KDD) has become
    an active research field, which can fifld its relevance to artificia1
    inte11igence, database system and other discip1ines. 0ne of the major
    patterns to be discovered in KDD endeavors is association ru1es, whose
    di scovery prob1 em has been an important research direct ion in KDD fi e1d.
    This thesis is chief1y devoted to expanded mode1s and mining a1gorithms
    of as soc i at ion ru l es.
    0ne of the basic assumptions in tradi tiona1 associat ion ru1es is that
    a11 items in the database are equa11y important. Thi s is, however, rare1y
    true in rea1--1 if e app1ications t users' interests may vary from item to
    item, users' interests toward a certain item may vary from time to time.
    A11owing for this fact, two c1asses of expanded associat ion ru1e mode1 s
    have been proposed: weighted associat ion ru1es, and assoc iat ion ru1 es
    with mu1tip1e minimum supports. In the first mode1, each item is 1abe1ed
    with a weight va1ue denoting the significance of the item, and
    traditiona1 support definition is then extended to we1ghted support. As
    a resu1t, the generated association ru1es are skewed toward ru1es
    concerning items with greater weights (to users' greater interest).
    The second mode1, on the other hand, takes a different route. lt abandons
    the traditiona1 s ing1e minimuIn support thresho1d. In i ts p1ace, each item
    ls assoc1ated with a pre--specified minimum item support (MIS). By
    1owering the MIS thresho1ds for important items, the effect of the first
    mode1 can a1so be achieved.
    The ma1n contribution of the thesis is that, on the bas is of negative
    association ru1es, we deve1op the mode1 of hybrid associat ion rules.
    It has been observed that negative association ru1es, the opposite
    of (tradit iona1) association ru1es, a1so revea1 readi 1y app1 icab1e
    informat ion. In KDD commun ity, however, negat ive assoc iat ion ru1es are
    1ess covered and not comp1ete1y forma1ized. In an effort to formu1ate
    negative ru1es e1egant1y, we introduce negat ive items into itemsets, and
    thereby defifle an expanded mode1 of association rules f hybri d
    association rules. This expansion makes hybrid association ru1es a
    
    
    superset of both traditional and negative association rules.
    The thesis proposes and elaborates three approaches to hybrid association rule discovery: the direct approach; hash-tree-based approach; and tree-based approach. The first approach derives directly from the new problem setting to generate large itemsets. However, It suffers from great complexity and low efficiency. The hash-tree-based approach is a slight modification of the Apriori algorithm, which, again, we find not satisfying. To further improve the efficiency, we devise another approach based on a different yet similar tree structure. In addition, we also give a pruning algorithm to weed out redundant negative rules. All the algorithms are implemented in C. To assess the performances of the algorithms, we first apply theoretical distribution and use pseudo-random function to generate some synthetic datasets, on which experiments are then performed to find hybrid association rules. However, the model of hybrid association rules has its difficulties too. For instance, the problem of "too many rules" may be especially notorious in our model. As a future research direction, we may apply different thresholds (e.g. weights or multiple minimum supports) on items/itemsets of different nature.
引文
[1]裴健,柴玮,赵畅,等.联机分析处理数据立方体代数.《软件学报》,1999,10(6):561~569
    [2]U. Fayyad, G. Piatetsky-Shapiro, P. Smith. From Data Mining to Knowledge Discovery, AAAI/MIT Press, Menlo Park, CA, 1996, 1~34
    [3]U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. Knowledge Discovery and Data Mining: Towards A Unifying Framework. http://www.cs.bham.ac.uk/~anp
    [4]郑宏珍,柳明欣.数据挖掘及其工具的选择.《计算机应用》,1999,19(10):109~110
    [5]R. Agrawal, T. Imielinski, A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD Conference on Management of Data, Washington, DC, May 1993,207~216
    [6]M. Craven, J. Shavlik. Using Neural Networks for Data Mining. Future Generation Computer Systems, 13, 1997, 211~229
    [7]U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From Data Mining to Knowledge Discovery: An Overview. Advances in Knowledge Discovery and Data Mining,AAAI/MIT Press, 1996
    [8]R. Wille. Restructuring Lattice Theory: An Approach Based on Hierarchies of Concepts. Rival I (ed.). Ordered Sets. Dordrecht: D Reidel Publishing Company,1982, 445~470
    [9]R. Godin, R. Missaoui. An Incremental Concept Formation Approach for Learning from Databases. Theoretical Computer Science, Special Issue on Formal Methods in Databases and Software Engineering, October 1994, 133,387~419
    [10]王志海等.概念格上规则提取的一般算法与渐进式算法.《计算机学报》,1999,1
    [11]R. Agrawal, R. Srikant. Mining Generalized Association Rules. Proceedings of 21st International Conference on Very Large Databases, Zurich, Switzerland,Sept. 1995, 407~419
    [12]J. Han, Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. Proceedings of 21st Intl Conference on Very Large Databases, Zurich,Switzerland, Sept. 1995
    
    
    [13]J. Han. Mining Knowledge at Multiple Concept Level. Proceedings of 4th Intl Conference on Information and Knowledge Management (CIKM95), November 1995
    [14]Y. Fu. Discovery of Multiple-level Rules from Large Databases. Doctor thesis,Simon Fraser University, Canada, July 1996
    [15]孟志青.时态关联规则采掘的若干性质.《计算机工程与应用》,2001,10:86~87
    [16]丁祥武.挖掘时态关联规则.《武汉交通科技大学学报》,1999,23(4):365~367
    [17]R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules.Proceedings of 20th Intl Conference on Very Large Databases, Santiago, Chile,1994, 487~499
    [18]M. Houtsma, A. Swami. Set-oriented Mining for Association Rules in Relational Databases. Proceedings of 1995 Intl Conference on Data Engineering,Tapei, Taiwan, March 1995
    [19]J. Park, M. Chen, P. Yu. An Effective Hash-based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD Conference on Management of Data, San Jose, CA, May 1995, 175~186
    [20]A. Savasere, E. Omiecinski, S. Navathe. An Efficient Algorithm for Mining Association Rules. Proceedings of 21st Intl Conference on Very Large Databases,Zurich, Switzerland, Sept. 1995,432~444
    [21]H. Toivonen. Sampling Large Databases for Association Rules. Proceedings of 22nd Intl Conference on Very Large Databases, Bombay, India, 1996, 134~145
    [22]D. Cheung, et al. Maintenance of Discovered Association Rules in Large Databases: an Incremental Updating Technique. Proceedings of 12th Intl Conference on Data Engineering, New Orleans, LA, 1995, 106~114
    [23]冯玉才,冯剑林.关联规则的增量是更新算法.《软件学报》,1998,9(4):301~306
    [24]R. Agrawal, et al. Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6), 962~969
    [25]J. Park, et al. Efficient Parallel Data Mining for Association Rules. Proceedings of 4th Intl Conference on Information and Knowledge Management, Baltimore, MD,Nov. 1995
    
    
    [26]D. Cheung, et al. Efficient Mining of Association Rules in Distributed Databases. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6),910~921
    [27]D. Cheung, et al. A Fast Distributed Algorithm for Mining Association Rules.Proceedings of 1996 Intl Conference on Parallel and Distributed Information Systems (PDIS'96), Miami Beach, FL, Dec. 1996
    [28]G. Piatetsky-Shapiro. Discovery, Analysis, and Presentation of Strong Rules.Knowledge Discovery in Database, G. Piatetsky-Shapiro and W. Frawley eds.AAAI/MIT Press, 1991,229~248
    [29]R. Srikant, R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. Proceedings of ACM-SIGMOD Intl Conference on Management of Data, Montreal, Canada, June 1996, 1~12
    [30]张朝晖,陆玉昌,张钹.发掘多值属性的关联规则.《软件学报》,1998,9(11),801~805
    [31]R. Srikant, R. Agrawal. Mining Association Rules with Item Constraints. Proceedings of 3rd Intl Conference on Knowledge Discovery in Databases and Data Mining. Newport Beach, CA, Aug. 1997, 67~73
    [32]R. Ng, V. Lakshmanan, J. Han, et al. Exploratory Mining and Pruning Optimizations of Constrained Association Rules. Proceedings of ACM-SIGMOD Conference on Management of Data, Seattle, WA, June 1998, 13~24
    [33]M. Kamber, J. Han, J. Chiang. Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes. Proceedings of 3rd Intl Conference on Knowledge Discovery in Databases and Data Mining. Newport Beach, CA, Aug.1997, 207~210
    [34]Y. Fu, J. Han. Meta-Rule-Guided Mining of Association Rules in Relational Databases, Proceedings of 1995 Intl Workshop on Knowledge Discovery and Deductive and Object-Oriented Databases (KDDOOD'95), Singapore, Dec. 1995,39~46
    [35]R. Meo, G. Psaila, S. Ceri. A New SQL-Like Operator for Mining Association Rules. Proceedings of 22nd Intl Conf on Very Large Databases, Bombay, India, 1996,122~133
    [36]王翔.数据库中关联规则发现的研究.合肥工业大学硕士论文,2000
    [37]铁治欣,陈奇,俞瑞钊.关联规则采掘综述.《计算机应用研究》,2000,1:1~5
    [38]朱绍文,王泉德,黄浩等.关联规则挖掘技术及发展动向.《计算机工程》,
    
    2000,9:4~6
    [39]路松峰,胡和平.加权关联规则的开采.《小型微型计算机系统》,2001,3:347~350
    [40]欧阳为民,郑诚,蔡庆生.数据库中加权关联规则的发现.《软件学报》,2001,12(4):612~619
    [41]H. Mannila. Database Methods for Data Mining. KDD-98 tutorial, 1998
    [42]B. Liu, W. Hsu, Y. Ma. Mining Association Rules with Multiple Minimum Supports. KDD-99, San Diego, CA, 1999, 337~341
    [43]B. Liu, W. Hsu, Y. Ma. Mining Association Rules with Multiple Minimum Supports. SoC technical report, 1999
    [44]A. Savasere, E. Omiecinski, S. Navathe. Mining for Strong Negative Associations in a Large Database of Customer Transactions. Proceedings of IEEE 14th Intl Conference on Data Engineering, Orlando, FL, 1998.
    [45]左万利.含有类别属性数据库中联系性规则的挖掘.《吉林大学自然科学学报》,1999,1:33~37.
    [46]A. Silberschatz, A. Tuzhilin. On Subjective Measures of Interestingness in Knowledge Discovery. Proceedings of 1st Intl Conference on Knowledge Discovery and Data Mining, Montreal, Canada, Aug. 1995.
    [47]王翔,袁兆山.关联规则的扩展模型.《小型微型计算机系统》2001,1:73~74

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700