Chinese Spam Filtering Based on the Covering Algorithm
Abstract
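The development of the Internet has given people an entirely new online experience, and e-mail has become a fast and economical means of modern communication. While e-mail makes communication convenient, it has also increasingly become a major carrier of advertisements, viruses, malicious programs, and objectionable content, which inconveniences users and seriously harms network security. Solving the spam problem is therefore of real practical importance.
     Among the many anti-spam techniques, spam filtering has become a focus of anti-spam research because it is convenient and can draw on a wide range of techniques. Existing filtering techniques are mainly based on IP addresses, on mail keywords, or on mail content, but each of these considers only part of the information in a message and ignores the other useful parts. After analyzing these methods, this thesis combines their strengths and proposes filtering spam by considering the mail address, keywords, and mail content together.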
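     The main work and contributions of this thesis are as follows:
     1. The format of e-mail is analyzed in some detail, and on that basis the thesis discusses how to receive and access mail in the VC environment;
     2. Earlier content-based spam filtering methods are improved: mail elements other than the content, such as the source address, the subject, whether the message has attachments, and the attachment types, are all given to the classifier as feature attributes of the message. Experimental results show that these attributes have an important influence on determining a message's class;
     3. To reduce the dimensionality of the feature vectors when processing mail content, several feature reduction methods common in text classification (document frequency, the χ² statistic, mutual information, information gain, expected cross entropy, and weight of evidence for text) were each tested on spam samples. The experiments show that the χ² statistic and expected cross entropy are the most effective for mail classification, document frequency and weight of evidence for text are somewhat worse, and mutual information and information gain perform worst;
     4. Once effective mail feature vectors have been obtained, a suitable classifier must be used. This thesis is the first to apply to spam filtering the covering algorithm that Professors Zhang Ling and Zhang Bo proposed on the basis of feedforward artificial neural networks. The covering algorithm is compared with the support vector machine as the classifier, and the experiments show that the covering algorithm is a good classifier that filters spam effectively with high accuracy;
     5. Spam filtering carries a certain risk: in general, recipients would rather receive some extra spam than have legitimate mail misjudged as spam. This thesis analyzes, from the standpoint of risk, how the covering algorithm classifies test samples, and on that basis proposes an improved treatment of its "rejected" samples: changing the range of influence of the covers that belong to non-spam mail lowers the risk of filtering;
     6. Because every pattern recognition method has its own strengths and weaknesses, the thesis explores, following the simple idea of majority rule, the feasibility of voting-based spam filtering built on several pattern recognition methods.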
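     As an illustration of contribution 2, the sketch below collects the non-content attributes named there (source address, subject, attachment presence, attachment types) from one raw message. It is only a minimal sketch: the thesis works in the VC environment, whereas this uses Python's standard `email` package, and the function name `extract_attributes`, the dictionary keys, and the sample message are assumptions made for the example.

```python
from email import message_from_string
from email.utils import parseaddr


def extract_attributes(raw_message: str) -> dict:
    """Collect non-content attributes of one message (illustrative feature set)."""
    msg = message_from_string(raw_message)
    _, sender = parseaddr(msg.get("From", ""))
    # Treat every MIME part that declares a filename as an attachment.
    attachment_types = sorted(
        part.get_content_type()
        for part in msg.walk()
        if part.get_filename() is not None
    )
    return {
        "sender_domain": sender.rpartition("@")[2].lower(),
        "subject": msg.get("Subject", ""),
        "has_attachment": bool(attachment_types),
        "attachment_types": attachment_types,
    }


# Hypothetical two-part message: a text body plus one zip attachment.
SAMPLE = (
    "From: Promo <promo@example.com>\n"
    "Subject: Special offer\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello\n"
    "--XYZ\n"
    "Content-Type: application/zip\n"
    'Content-Disposition: attachment; filename="offer.zip"\n'
    "\n"
    "UEsDBAo=\n"
    "--XYZ--\n"
)

if __name__ == "__main__":
    print(extract_attributes(SAMPLE))
```

     Subjects of Chinese mail are usually RFC 2047 encoded; the sketch returns the header unchanged, so a real pipeline would decode it before word segmentation.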
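     This thesis has completed a certain amount of work on spam filtering, but shortcomings remain. Future research can continue in the following directions:
     1. This thesis targets Chinese spam; future work can bring non-Chinese spam into the scope of the study;
     2. The mail studied here is text mail in the traditional sense, but as e-mail technology develops, other forms of mail already exist and more will appear. How to obtain the information needed for filtering from multiple forms of mail is the next topic to study;
     3. Spam filtering based on multiple pattern recognition methods can be studied in more depth.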
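     The voting idea behind filtering with multiple pattern recognition methods can be sketched as follows. This is a minimal sketch rather than the thesis's design: the three toy classifiers and their rules are hypothetical stand-ins for real trained models, and resolving ties in favor of "ham" is an assumption chosen to match the stated preference for letting spam through over losing legitimate mail.

```python
from collections import Counter
from typing import Callable, Sequence

# A classifier maps a message to the label "spam" or "ham".
Classifier = Callable[[str], str]


def majority_vote(classifiers: Sequence[Classifier], message: str,
                  tie_label: str = "ham") -> str:
    """Return the label chosen by most classifiers; ties go to tie_label."""
    votes = Counter(clf(message) for clf in classifiers)
    if votes["spam"] > votes["ham"]:
        return "spam"
    if votes["spam"] == votes["ham"]:
        return tie_label
    return "ham"


# Hypothetical stand-ins for trained classifiers.
def keyword_rule(message: str) -> str:
    return "spam" if "free" in message.lower() else "ham"


def length_rule(message: str) -> str:
    return "spam" if len(message) < 20 else "ham"


def exclamation_rule(message: str) -> str:
    return "spam" if message.count("!") >= 2 else "ham"


if __name__ == "__main__":
    voters = [keyword_rule, length_rule, exclamation_rule]
    print(majority_vote(voters, "FREE prize!!"))
    print(majority_vote(voters, "Minutes of the project meeting are attached."))
```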
The development of the Internet has brought an entirely new network experience, and e-mail has become a quick, economical means of communication. Although e-mail is convenient, it is also becoming an important carrier of advertisements, viruses, malicious programs, and objectionable information. This inconveniences users and has an extremely bad impact on network security. Solving the spam problem is therefore urgent.
     Many methods can be used against spam. Spam filtering is one of the mainstream methods so far; it includes IP-based spam filtering, keyword-based spam filtering, and content-based spam filtering. Each of these methods considers only part of the information in an e-mail. The dissertation analyzes the methods mentioned above, summarizes their advantages, and proposes to consider mail address, keywords, and content simultaneously when filtering spam.
     The major contributions of this dissertation can be summed up in six points:
     1) After carefully analyzing the format of e-mail, the dissertation discusses in detail how to receive and access e-mail in the VC environment.
     2) We improve on traditional content-based spam filtering. We consider mail address, subject, attachments, and content at the same time and give these factors to the classifier as attributes. Experiments indicate that these attributes have a great impact on the result of spam filtering.
     3) To reduce the dimensionality of the attribute vectors, we test several feature reduction methods commonly used in text categorization (Document Frequency, the χ² statistic, Mutual Information, Information Gain, Expected Cross Entropy, Weight of Evidence for Text) separately. According to the results, the χ² statistic and Expected Cross Entropy are the most useful methods for reducing dimensionality. Document Frequency and Weight of Evidence for Text are less effective, while Mutual Information and Information Gain are the least effective of all.
     4) After obtaining the attributes of the e-mail, we need an appropriate means of classification. This dissertation is the first to apply the cross cover algorithm, proposed by Ling Zhang and Bo Zhang, to spam filtering. In the experiments, we compare the cross cover algorithm with SVM as the classifier. The experiments show that the cross cover algorithm is an excellent classifier, which can filter spam effectively with high accuracy.
     5) Spam filtering involves risk, in that the recipient prefers receiving more spam to missing normal mail. We discuss the classification process of the cross cover algorithm from the perspective of this risk. Based on the analysis, we propose an improvement to the way the cross cover algorithm handles "rejected" samples: changing the area influenced by the covers of normal mail reduces the risk.
     6) Different pattern recognition methods have different advantages and disadvantages. Guided by the principle that the minority yields to the majority, we discuss the feasibility of a voting-based spam filtering model built on multiple pattern recognition methods.
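     To make point 3 concrete, the sketch below scores terms with the χ² statistic in the form commonly used for text categorization, χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)), where A and B count the spam and normal messages containing term t, and C and D count the spam and normal messages without it. The abstract does not state which variant was used, so the formula, the function names, and the toy corpus are assumptions for illustration only.

```python
from collections import Counter


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term from its 2x2 contingency counts.

    a: spam with term, b: ham with term, c: spam without, d: ham without.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class never occurs: no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs, labels, top_k):
    """Return the top_k terms by chi-square score.

    docs: list of token lists (already word-segmented); labels: 1 spam, 0 ham.
    """
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    spam_df, ham_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Document frequency: count each term once per message.
        (spam_df if label else ham_df).update(set(tokens))
    scores = {}
    for term in set(spam_df) | set(ham_df):
        a, b = spam_df[term], ham_df[term]
        scores[term] = chi_square(a, b, n_spam - a, n_ham - b)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    docs = [["free", "prize"], ["free", "offer"],
            ["meeting", "report"], ["meeting", "offer"]]
    labels = [1, 1, 0, 0]
    print(rank_terms(docs, labels, top_k=2))
```

     Keeping only the highest-scoring terms is what shrinks the attribute vector. Note that the score is symmetric, so a term typical of normal mail ranks as high as one typical of spam.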
     Issues for further study are as follows:
     1) This dissertation focuses on spam written in Chinese; the research can be extended to spam written in other languages.
     2) The mail discussed here is plain text mail, but e-mail technology is developing, and other types of mail exist and more will appear. How to extract the information needed for filtering from many types of mail requires more attention in future work.
     3) Spam filtering based on multiple pattern recognition methods can be studied further.
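     The risk-aware handling of "rejected" samples in point 5 can be illustrated with a small sketch. The abstract gives no details of the modification, so everything here is one possible reading and not the dissertation's procedure: covers are modeled as plain Euclidean balls instead of the spherical covers of the original algorithm, a sample outside every cover is assigned to the cover with the nearest boundary, and the hypothetical parameter `ham_scale` enlarges the reach of normal-mail covers only for such samples.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Cover:
    center: Tuple[float, ...]
    radius: float
    label: str  # "spam" or "ham"


def classify(x: Sequence[float], covers: Sequence[Cover],
             ham_scale: float = 1.0) -> str:
    """Label x by its cover; bias rejected samples toward ham via ham_scale."""
    # A sample inside some cover takes that cover's label.
    for cover in covers:
        if math.dist(x, cover.center) <= cover.radius:
            return cover.label

    # Rejected sample: pick the cover whose (possibly enlarged) boundary
    # is closest. Only ham covers are enlarged (assumption of this sketch).
    def gap(cover: Cover) -> float:
        scale = ham_scale if cover.label == "ham" else 1.0
        return math.dist(x, cover.center) - cover.radius * scale

    return min(covers, key=gap).label


if __name__ == "__main__":
    covers = [Cover((0.0, 0.0), 1.0, "ham"), Cover((4.0, 0.0), 1.0, "spam")]
    sample = (2.2, 0.0)  # lies in no cover, slightly nearer the spam cover
    print(classify(sample, covers))                 # unbiased handling
    print(classify(sample, covers, ham_scale=1.5))  # ham covers reach further
```

     With `ham_scale` above 1, borderline samples are more likely to be kept as normal mail, which trades some extra spam for fewer lost legitimate messages.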
