Research on Text Association Rule Mining Techniques for Disciplinary Correlation Analysis
Abstract
Text mining is an active research topic in modern information processing. Text data preprocessing and text association rule analysis are its two most important steps. To address problems in the preprocessing stage, and taking into account the characteristics of project applications in a typical review system, this paper proposes a text association rule mining method based on an improved maximum keyword feature weight. On that basis, it examines in depth methods for post-processing the mined association rules.
     Based on an analysis of the characteristics of science and technology project applications, this paper studies and implements text association rule extraction for such applications and represents each application effectively with a text feature vector. After analyzing traditional text feature selection based on term frequency-inverse document frequency (TF-IDF), it adopts an improved method for computing keyword feature weights. The method uses a weight computed from the information domain of domain keywords, which enables effective text feature selection for the disciplines relevant to a project application. Experiments verify the effectiveness of the method.
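The abstract does not give the improved weighting formula, so the following is only a minimal sketch of the general idea: a TF-IDF weight in which occurrences of a term count more when they appear in a more informative part of the document. The field names, the values in `FIELD_WEIGHTS`, and the helper names `weighted_tf_idf` and `top_features` are all illustrative assumptions, not the method described in the paper.

```python
import math
from collections import Counter

# Hypothetical field weights: NOT taken from the paper, illustration only.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}


def weighted_tf_idf(docs):
    """docs: list of dicts mapping a field name to a list of tokens.

    Returns one {term: weight} dict per document, where the term
    frequency is the sum of field weights over all occurrences and the
    inverse document frequency is log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing the term anywhere.
    df = Counter()
    for doc in docs:
        df.update({t for tokens in doc.values() for t in tokens})

    results = []
    for doc in docs:
        tf = Counter()
        for field, tokens in doc.items():
            w = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields count as body text
            for t in tokens:
                tf[t] += w
        results.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return results


def top_features(weights, k):
    """Keep the k highest-weighted terms; ties are broken alphabetically."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]
```

Calling `top_features(weighted_tf_idf(docs)[i], k)` would then give a reduced feature set for document `i`. A term that occurs in every document receives weight zero under this definition, which is the usual TF-IDF behaviour.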
     To address the very high dimensionality of text feature vectors and the complexity of frequent itemset generation in traditional association rule mining, this paper proposes a maximum feature weight association rule mining scheme based on the XML format. The scheme makes storing and computing on text data considerably more convenient throughout the text mining process. For disciplinary correlation analysis, it also designs an association rule post-processing method based on the co-occurrence of keywords and discipline-field terms. By computing the degree of co-occurrence between keywords and discipline-field terms, the method helps identify new hot topics or unexplored gaps in related disciplines. Experiments verify the effectiveness of this method.
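The co-occurrence measure itself is not defined in the abstract. As a rough sketch under that caveat, the post-processing idea can be illustrated with the standard support and confidence of rules of the form keyword -> discipline term, computed over documents represented as sets of terms. The function name `cooccurrence_rules` and its parameters are assumptions made for this example only.

```python
from collections import Counter
from itertools import product


def cooccurrence_rules(docs, discipline_terms, min_support=0.0, min_confidence=0.0):
    """docs: list of sets of terms, one set per document.
    discipline_terms: set of terms that name a discipline field.

    Returns (keyword, discipline_term, support, confidence) tuples for
    rules keyword -> discipline_term, strongest first.
    """
    n = len(docs)
    keyword_count = Counter()
    pair_count = Counter()
    for terms in docs:
        keywords = terms - discipline_terms
        fields = terms & discipline_terms
        keyword_count.update(keywords)
        # Every keyword co-occurs with every discipline term of the same document.
        pair_count.update(product(keywords, fields))

    rules = []
    for (kw, field), c in pair_count.items():
        support = c / n                     # share of documents containing both
        confidence = c / keyword_count[kw]  # share of the keyword's documents
        if support >= min_support and confidence >= min_confidence:
            rules.append((kw, field, support, confidence))

    # Sort by confidence, then support; alphabetical order keeps ties stable.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))
```

In this simplified reading, a keyword that co-occurs strongly with a discipline term it is not usually associated with might point to an emerging cross-disciplinary topic, while pairs that never co-occur might point to a gap. How the paper actually scores such cases is not stated in the abstract.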
