Clustering of Noisy Text and Its Application to Anti-Spam Filtering
Abstract
With the rapid development of Internet technology, text data is growing exponentially. Text mining emerged to uncover the intrinsic relationships and implicit information hidden in such data. Cluster analysis, an important function of data mining, plays a central role in text mining; this thesis studies methods for clustering text that contains interference information (noise).
     Traditional text mining first represents documents in the vector space model: each document is converted into a vector using TF-IDF weights, and text similarity is then computed in that space. Because the classic vector space model ignores conceptual similarity between words, clustering accuracy suffers; for Chinese text, a similarity measure based on the HowNet model and a semantic inner product has therefore been proposed.
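As a point of reference, the TF-IDF weighting and cosine similarity described above can be sketched in a few lines of Python. This is a minimal illustration on pre-tokenized documents, not the thesis's implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two documents that share half of their (equally weighted) terms score 0.5 under this measure, which is exactly the behavior the obfuscation tricks described below exploit: a replaced keyword simply drops out of the overlap.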
     This approach, however, does not suit the clustering of spam. After composing a message, spammers use find-and-replace-style tricks to swap standard sensitive keywords for obfuscated variants (inserting symbols, reordering characters, or substituting pinyin) that readers can still understand but that evade mail-filtering programs. Traditional preprocessing would strip out exactly this interference information, lowering the accuracy of spam similarity computation and ultimately degrading clustering quality.
     To address the poor similarity measurement caused by the heavy interference information in spam, this thesis adopts a non-metric approach, applying the Needleman-Wunsch algorithm to text similarity computation. Based on this similarity measure, a Needleman-Wunsch-based clustering algorithm is then proposed to complete the text clustering.
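The Needleman-Wunsch algorithm scores a global alignment of two sequences using match, mismatch, and gap penalties; obfuscated variants of a word still align well character by character. A minimal dynamic-programming sketch follows (the scoring parameters and the normalization are illustrative assumptions, not the values used in the thesis):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two sequences."""
    m, n = len(a), len(b)
    prev = [j * gap for j in range(n + 1)]  # row 0: all gaps
    for i in range(1, m + 1):
        cur = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[n]

def nw_similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Normalize the alignment score into [0, 1] by the best possible score."""
    best = match * max(len(a), len(b))
    return nw_score(a, b, match, mismatch, gap) / best if best else 1.0
```

Only the previous DP row is kept, so memory is linear in the shorter sequence even though the score table is conceptually quadratic.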
     Compared with the vector space model, computing text similarity with the Needleman-Wunsch algorithm avoids word segmentation, reduces semantic loss, and retains all of the text's information, which safeguards clustering quality. In addition, a preprocessing step splits document content into Chinese characters, English strings, and symbol strings, which alleviates the data-sparseness problem, reduces the number of character comparisons, and speeds up processing.
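That preprocessing step can be sketched with a regular expression that emits one token per Chinese character while keeping runs of English characters and runs of symbols whole. The exact character classes below are assumptions for illustration; the thesis does not specify them:

```python
import re

# One CJK character per token; runs of ASCII letters/digits and runs of
# other non-space symbols each stay together as a single token.
TOKEN_RE = re.compile(
    r'[\u4e00-\u9fff]'                 # a single Chinese character
    r'|[A-Za-z0-9]+'                   # an English/number string
    r'|[^\u4e00-\u9fffA-Za-z0-9\s]+'   # a run of symbols
)

def tokenize(text):
    """Split text into the alignment units used for similarity computation."""
    return TOKEN_RE.findall(text)
```

Aligning over these coarser units instead of raw characters is what reduces the number of comparisons the alignment has to make.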
     Simulation experiments comparing the method with traditional clustering algorithms show clear improvements in both clustering quality and efficiency. This indicates that the proposed algorithm is well suited to spam clustering and thus provides an effective spam-filtering technique: spam and legitimate e-mail are clustered with the proposed method, documents are grouped into classes according to their similarity values, and each class is then judged to be spam or legitimate.
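The abstract does not spell out the clustering procedure itself; one minimal possibility is a single-pass scheme over any pairwise similarity function, such as the normalized Needleman-Wunsch score. The threshold and the choice of the first member as cluster representative are illustrative assumptions, not the thesis's algorithm:

```python
def cluster(texts, sim, threshold=0.6):
    """Single-pass clustering: put each text into the first cluster whose
    representative (first member) is similar enough, else open a new cluster.
    Returns a list of clusters, each a list of indices into `texts`."""
    clusters = []
    for i, t in enumerate(texts):
        for c in clusters:
            if sim(texts[c[0]], t) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

With such a scheme, filtering reduces to labeling each resulting cluster as spam or legitimate, exactly as the abstract describes.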
