Research on SVM-Based Authorship Mining Techniques for Chinese E-mail
Abstract
With the rapid development of computer technology and informatization, and especially the growing popularity of the Internet, e-mail has become an indispensable, economical, and practical means of exchanging information. Unfortunately, e-mail misuse is common on the Internet: junk mail, fraudulent mail, threatening mail, antisocial mail, and so on. In such messages the sender always tries to hide his or her true identity in order to avoid detection. The sender's address can be altered or forged by routing the message through an anonymous mail server, and the sender's name can be changed, so it is difficult to determine the real identity of the author from the message itself. A method that identifies the true author of an original e-mail, provides evidence for computer forensics, and supports holding the authors of illegal e-mail criminally liable would therefore be an effective way to curb illegal e-mail. Building on an analysis of various data mining techniques, this paper proposes a method for automatically identifying or classifying the authors of anonymous e-mail. It uses a support vector machine (SVM) as the classification algorithm and extracts several kinds of features from each message, including linguistic features, header information, and structural features, in order to assign messages automatically to a predefined set of author classes. Considerable progress was made on the classification algorithm and the feature extraction strategy, and experiments on a limited data set gave satisfactory results, which shows that authorship identification is feasible. However, the classification accuracy does not yet reach the level required for computer forensics, and further research is needed.
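The three feature families named in the abstract (header information, structural features, linguistic features) can be illustrated with a minimal sketch. The abstract does not list the actual features used, so the function name `extract_features`, the feature names, and the punctuation set below are all hypothetical choices made for illustration. The sketch uses only the Python standard library and stops at feature extraction; the resulting numeric values would then be passed to an SVM classifier of one's choosing.

```python
import re
from email import message_from_string

# Assumed punctuation set (full-width Chinese and ASCII marks); illustrative only.
PUNCTUATION = set(",。!?;:、,.!?;:")


def extract_features(raw_email: str) -> dict:
    """Map one raw plain-text e-mail to a small dictionary of numeric features."""
    msg = message_from_string(raw_email)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages are out of scope for this sketch.
        body = ""

    non_blank_lines = [ln for ln in body.splitlines() if ln.strip()]
    # Paragraphs are runs of text separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    # Sentences end at Chinese or Western terminal punctuation.
    sentences = [s for s in re.split(r"[。!?!?.]+", body) if s.strip()]
    chars = [c for c in body if not c.isspace()]
    punct = [c for c in chars if c in PUNCTUATION]
    subject = str(msg.get("Subject", ""))

    return {
        # Header information
        "has_subject": 1.0 if subject else 0.0,
        "subject_length": float(len(subject)),
        # Structural features
        "num_lines": float(len(non_blank_lines)),
        "num_paragraphs": float(len(paragraphs)),
        # Linguistic features
        "avg_sentence_length": len(chars) / max(len(sentences), 1),
        "punctuation_ratio": len(punct) / max(len(chars), 1),
    }


if __name__ == "__main__":
    sample = (
        "From: a@example.com\n"
        "Subject: Meeting notice\n"
        "\n"
        "你好,\n"
        "\n"
        "明天开会。请准时!\n"
    )
    for name, value in extract_features(sample).items():
        print(name, value)
```

Each message becomes a fixed-length vector once the dictionary values are read out in a fixed key order, which is the input form an SVM expects. A realistic feature set for Chinese text would likely also need word segmentation and function-word frequencies, which this sketch omits.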
