Research on Automatic Orientation Classification of Chinese Web Documents
Abstract
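How to obtain useful information as quickly as possible from a vast and disorderly body of text has always been both a major goal and a major difficulty of information processing. Automatic text classification, an important research direction in information processing, aims to determine the category of a text automatically from its content. Research on English text classification is by now fairly mature internationally, whereas Chinese text classification, shaped by the Chinese language environment and its semantics, brings its own particular conflicts and difficulties and has become a research topic in its own right.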
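    Within this field, orientation analysis of Chinese text is a brand-new and challenging research area. To maintain the robustness of network security, we proposed the Experimental Chinese Web Document Orientation Classifier project. Earlier discrimination relied on simple keyword matching and manual processing and was inefficient; this project therefore aims to make the discrimination of Chinese Web documents real-time and efficient.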
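    In the course of the research we systematically examined each stage of automatic classification of Chinese Web documents and the concrete implementation techniques: from building the corpus, word segmentation of Chinese Web documents, the choice of index terms, the weighting scheme, and the construction of the segmentation system SMCW, through a discussion of feature selection methods and of various classification methods, to the proposed architecture of the Chinese Web document orientation classification system (SCUSCTC, SCU Smart Chinese Text Classifier) and its implementation in Java. We then ran careful experiments on the final classification results and the intermediate segmentation results. The distinguishing features of the system are: 1) intelligent and accurate classification: a method that combines domain knowledge with linguistic knowledge gives much higher classification precision than earlier mechanical matching; 2) fast and timely classification: careful algorithm design together with efficient implementation techniques lets the classifier maintain both quality and throughput; 3) a standard, general-purpose output format: the system uses XML as its output format, which eases the exchange and further processing of information and helps further integration with different databases and application systems.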
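    The abstract does not give the XML schema that SCUSCTC emits. As a minimal sketch of what XML output of a classification result could look like, the Python example below serializes one result with the standard library. The element names (`classification`, `document`, `orientation`, `score`), the `url` attribute, and the function `result_to_xml` are all assumptions made for illustration, not the system's actual format.

```python
import xml.etree.ElementTree as ET


def result_to_xml(doc_url, label, score):
    """Serialize one classification result as an XML string.

    Element and attribute names are hypothetical; the thesis abstract
    does not specify the schema used by SCUSCTC.
    """
    root = ET.Element("classification")
    doc = ET.SubElement(root, "document", attrib={"url": doc_url})
    ET.SubElement(doc, "orientation").text = label
    ET.SubElement(doc, "score").text = f"{score:.3f}"
    return ET.tostring(root, encoding="unicode")


# Round-trip the result to show that any XML-aware consumer can read it back.
xml_text = result_to_xml("http://example.com/a.html", "negative", 0.8731)
parsed = ET.fromstring(xml_text)
```

    Because the result is ordinary XML, a database loader or another application can parse it without knowing anything about the classifier itself, which is the integration benefit the abstract describes.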
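    Finally, the results of this thesis and of the system are: 1) a study of methods and techniques for orientation classification of Chinese Web documents under present-day network conditions, together with a prototype system that can support research and has some practical value; 2) related papers and development documentation, which should be of considerable help to later research; 3) a practical study of a Chinese Web document classifier deployed on a gateway; 4) a specification of the performance requirements for Chinese Web document orientation classification, with test evaluation of the related parameters; 5) a real-time implementation of Chinese Web document orientation classification that meets certain speed and accuracy requirements.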
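    Future work should consider the following problems: 1) standardizing the data set; 2) improving the accuracy of the word segmentation system, including its handling of ambiguity and its recognition of out-of-vocabulary words; 3) performing reasonable semantic analysis; 4) using user feedback to update the training set dynamically; 5) quantitatively analyzing how the different components of a classifier affect system performance, and using a suitable model to compare and evaluate classification systems; 6) natural language understanding problems, such as the "quotation" problem; 7) recognizing disguised sensitive words.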
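    The thesis is organized as follows. Part 1 is the introduction. Part 2 describes the problem that text classification solves and introduces its performance evaluation methods and the principles for choosing thresholds. Part 3 describes models for representing text, the associated methods, and a comparison of them. Part 4 introduces feature extraction methods. Part 5 discusses different text classification methods (Naive Bayes, kNN, and decision trees) and the key techniques of SVM-based automatic classification systems. Part 6 presents the test data and experimental results for the system. Part 7 is the conclusion.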
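    The abstract names performance evaluation and threshold selection without giving formulas. As a rough illustration, the Python sketch below computes precision, recall, and F1 for a single category and then picks the score threshold that maximizes F1 on held-out data, which is one common per-category strategy. The function names and the toy scores are assumptions for illustration; the abstract does not say which measures or thresholding strategy the thesis adopts.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 for one category from boolean lists."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(scores, actual):
    """Return (threshold, f1) for the score cut-off that maximizes F1."""
    best = (0.0, -1.0)
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        f1 = precision_recall_f1(predicted, actual)[2]
        if f1 > best[1]:
            best = (t, f1)
    return best


# Toy validation data (made up): classifier scores and true membership.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
actual = [True, True, False, True, False]
threshold, f1 = best_threshold(scores, actual)
```

    On this toy data the cut-off lands at 0.3: accepting the fourth document recovers a true member at the cost of one false positive, which gives the best balance of precision and recall.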
Since the 1990s, as the volume of information available on the Internet has continued to grow, there has been a growing demand for tools that help people find, filter, and manage these resources more efficiently. Text categorization, the assignment of free-text documents to one or more predefined categories based on their content, is an important component of many information management tasks. Because Chinese text classification is shaped by the Chinese language context and its semantics, it has become a research field of its own, with particular difficulties and points of contention; within it, orientation analysis of Chinese text is an especially new and challenging area.
    With the development of modern network technology, the network has become an essential tool for communication. To maintain the robustness of network security, we started our project, the Experimental Chinese Web Document Orientation Classifier. Earlier classifiers relied on simple keyword matching and manual processing, which made the process time-consuming and costly and limited its applicability. Our classifier is intended to meet the requirements of real-time operation and high accuracy.
    In this thesis, we survey the state of the art in Chinese text categorization, from building the corpus, word segmentation of Chinese Web documents, the selection of index terms, and the design of term weights to the architecture of SCUSCTC (SCU Smart Chinese Text Classifier) and its implementation in Java. Finally, we give a thorough analysis of the experimental results and identify the main advantages and features of SCUSCTC as follows: 1) intelligent and accurate classification; 2) high speed and real-time operation; 3) XML as a standard, general-purpose output format.
    The main contributions of this thesis are: 1) a study of the methodology and technology of Web text classification in the modern network environment, together with a practicable system prototype; 2) related papers and development documents for further research; 3) a practical study of Web text classification on a gateway; 4) a specification of the performance requirements for Web text classification and an evaluation of the related parameters; 5) an implementation of a real-time Web text classification system (SCUSCTC) that meets certain speed and accuracy requirements.
    In further research, the following issues must be considered: 1) standardizing the corpus; 2) improving the accuracy of the Chinese word segmentation system, including its handling of ambiguity and its recognition of words that do not appear in the dictionary; 3) performing semantic analysis; 4) dynamically updating the training set using feedback from users; 5) quantitatively analyzing how different factors influence system performance, and using an appropriate model to compare and evaluate Web text classification systems; 6) natural language understanding problems, such as quotation; 7) recognizing disguised sensitive words.
    This thesis is divided into seven chapters, with Chapter 1 as the introduction. In Chapter 2 we formally define text categorization (TC) and introduce performance measures and thresholding strategies for TC. Chapter 3 describes the steps needed to transform raw text into a representation suitable for the classification task. Feature selection methods are surveyed in Chapter 4. In Chapter 5 we describe four methods that have been successfully applied to text categorization: kNN, Naive Bayes, decision trees, and SVMs. In Chapter 6 we describe our own work using the "Korean and World Cup Corpus", while Chapter 7 concludes the thesis and discusses open issues and possible avenues of further research for TC.
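    Of the four methods named for Chapter 5, kNN is the simplest to show in a few lines. The Python sketch below classifies an already segmented document by a similarity-weighted vote among its k nearest training documents, using cosine similarity over raw term counts. It is an illustration under stated assumptions, not the thesis's implementation: the function names, the use of raw counts instead of a weighting scheme, and the toy training set are all made up here, and word segmentation is assumed to have been done beforehand.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(tokens, training, k=3):
    """Label a tokenized document by a similarity-weighted vote of its k nearest neighbours."""
    query = Counter(tokens)
    scored = sorted(
        ((cosine(query, Counter(doc)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Toy training set (made up): pre-segmented documents with category labels.
training = [
    (["match", "goal", "team"], "sports"),
    (["goal", "cup", "team"], "sports"),
    (["market", "stock", "price"], "finance"),
    (["stock", "bank", "price"], "finance"),
]
label = knn_classify(["team", "cup", "goal"], training, k=3)
```

    Weighting each vote by similarity keeps a distant neighbour with no shared terms from outvoting close ones, which matters when k is larger than the number of truly similar documents.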
    [1] Application of Association Rules aided Genetic Algorithm in Text Classification. Computer Science, 2002, Vol. 29, No. 8, pp. 66-68. First author.
    [2] Structure Guided Compression for XML Document in DB. Computer Science, 2002, Vol. 29, No. 8, pp. 59-61. Second author.
    [3] The Discrimination of E-mails Based on Natural Language Understanding and Sender Clustering. Computer Science, 2001, Vol. 28, Supp. No. 8, October 2001, pp. 184-188. Fourth author.
    [4] Research and Implementation of a Bank Receipt Cabinet System. Journal of Sichuan University, 2001, 38(03). Fourth author.
