Research on the Application of Two Classes of Bionic Algorithms in Text Classification
Abstract
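As information technology develops, the amount of information available to users keeps growing, and most of it is text. Text mining, a technology for efficiently managing and effectively using such unordered data, has gradually become an active research area over the past few decades, and text classification is an important research direction within it. Since the 1990s, statistical and machine learning methods have been introduced into text classification and have gradually replaced the earlier knowledge-engineering approaches to automatic text classification. A large body of literature has also appeared that studies the key techniques of text classification in depth, mainly text preprocessing, feature selection, text representation models, classification methods, and classification performance evaluation. Faced with the massive data brought by the growth of the Internet, all of these text processing methods run into difficulties: large data volumes, high-dimensional feature spaces in the vector space model, long preprocessing and computation times, noisy data sets, and low classification accuracy. This paper studies feature selection methods and classification algorithms for text classification.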
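     The good point set genetic algorithm uses the theory of good point sets from number theory to redesign the crossover operator of the genetic algorithm, giving a random search directed toward the "family" whose ancestors are high-fitness schemata. Compared with the standard genetic algorithm, it improves accuracy and speed and avoids premature convergence. The covering algorithm takes a geometric view: it maps the input sample vectors onto a sphere in a high-dimensional space and, through training, covers each class with as few neighborhoods as possible to form a classification network model. Particle swarm optimization is an evolutionary algorithm that simulates the migration of bird flocks. Like the genetic algorithm, it starts from random initial solutions, searches iteratively for the optimum, and uses fitness to evaluate solution quality, but its iterations involve no crossover or mutation. It is easy to implement, accurate, and fast to converge.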
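As a rough illustration of what a good point set looks like, the sketch below generates points in the unit cube with the cyclotomic-field construction from number theory. The choice of this particular construction and the names `smallest_prime_at_least` and `good_point_set` are assumptions made for illustration; the abstract does not state which construction is used or how the points are turned into crossover offspring.

```python
import math


def smallest_prime_at_least(n):
    """Return the smallest prime p with p >= n (trial division)."""
    def is_prime(m):
        if m < 2:
            return False
        return all(m % d for d in range(2, math.isqrt(m) + 1))

    while not is_prime(n):
        n += 1
    return n


def good_point_set(n_points, dim):
    """Generate n_points good points in the dim-dimensional unit cube.

    Assumed construction: r_k = 2*cos(2*pi*k/p) for k = 1..dim, where p is
    the smallest prime with (p - 3) / 2 >= dim. Point i is the vector of
    fractional parts of r_k * i.
    """
    p = smallest_prime_at_least(2 * dim + 3)
    r = [2.0 * math.cos(2.0 * math.pi * k / p) for k in range(1, dim + 1)]
    # Python's % on floats returns a value in [0, 1) for modulus 1.0,
    # which is the fractional part x - floor(x), also for negative x.
    return [[(r_k * i) % 1.0 for r_k in r] for i in range(1, n_points + 1)]


if __name__ == "__main__":
    for point in good_point_set(5, 3):
        print([round(v, 4) for v in point])
```

Points built this way fill the cube more evenly than independent random draws, which is the property a crossover operator based on good point sets would rely on.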
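     Combining the principle of the good point set genetic algorithm, which searches for better samples among the ancestors of high-fitness schemata, with the simplicity and effectiveness of the K-nearest-neighbor algorithm, this paper proposes a feature selection method based on the good point set genetic algorithm. The covering algorithm handles high-dimensional data well but suffers from a tension between classification accuracy and generalization ability, so this paper combines it with particle swarm optimization and proposes an improved particle-swarm-optimized covering algorithm. Finally, a text classification system is built. Comparative experiments on three data sets, with performance evaluated by the F1 measure, show that the proposed algorithms effectively improve classification accuracy and efficiency.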
With the development of information technology, users have access to an ever-increasing amount of information, most of which is text data. Text mining, a technology for efficiently managing and effectively using such unordered data, has become an active research field over the past few decades, and text classification is an important research direction in that field. Since the 1990s, statistical and machine learning methods have been introduced into text categorization, replacing the earlier knowledge-engineering-based classification methods, and a large number of studies on the key technologies of text categorization have emerged. These studies cover text preprocessing, feature selection, text representation models, classification algorithms, and classification performance evaluation. When processing the massive data brought by the development of the Internet, text processing methods of all kinds encounter difficulties, such as large data volumes, the high dimensionality of the resulting vector space model, long preprocessing and computation times, noisy data sets, and the low accuracy of classification algorithms. This paper studies feature selection and classification algorithms in text categorization.
     The good point set genetic algorithm is a random search algorithm that redesigns the crossover operator using the theory of good point sets from number theory, so that the search is guided toward the "family" whose ancestors are high-fitness schemata. Compared with the genetic algorithm, it improves accuracy and speed and avoids premature convergence. The covering algorithm starts from a geometric point of view: it maps the input sample vectors onto a sphere in a high-dimensional space and, through training, covers each class of samples with as few neighborhoods as possible to form a classification network model. Particle swarm optimization is an evolutionary algorithm that simulates the migration of birds. Like the genetic algorithm, it starts from random initial solutions, searches iteratively for the best solution, and evaluates solution quality by fitness, but it has no crossover or mutation operations in the iteration process. It is easy to implement, accurate, and fast to converge.
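The particle swarm procedure described above (random initial solutions, fitness evaluation, iterative updates without crossover or mutation) can be sketched as follows. This is a minimal generic version with an inertia weight, not the improved covering variant proposed in the paper; the function name `pso_minimize`, the parameter values, the bounds, and the `sphere` test objective are all assumptions chosen for illustration. Here a lower objective value means a fitter solution.

```python
import random


def pso_minimize(objective, dim, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0, seed=0):
    """Minimize `objective` over [lo, hi]^dim with a basic particle swarm."""
    rng = random.Random(seed)
    # Random initial positions, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its own best position; the swarm shares one global best.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    best, value = pso_minimize(sphere, dim=2)
    print(best, value)
```

In a covering-algorithm setting the position vector would encode whatever the classifier needs tuned and the objective would be a classification error; that mapping is not specified in the abstract and is left out here.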
     This paper combines the principle of the good point set genetic algorithm, which searches for better samples among the ancestors of high-fitness schemata, with the simplicity and effectiveness of the K-nearest-neighbor algorithm, and proposes a feature selection method based on the good point set genetic algorithm. The covering algorithm handles high-dimensional data well, but there is a conflict between its classification accuracy and its generalization ability. This paper therefore combines the covering algorithm with particle swarm optimization and gives an improved particle swarm optimization covering algorithm. Finally, a text classification system is constructed. Comparative experiments on three data sets, with performance evaluated by the F1 measure, show that the proposed algorithms effectively improve classification accuracy and efficiency.
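For reference, the F1 measure used in the evaluation is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). The sketch below computes it per class and averages the classes with equal weight (macro averaging). The abstract does not say whether macro or micro averaging is used, so the averaging choice, the function names, and the toy labels are assumptions.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each class label to its F1 score."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[c] = 2 * precision * recall / (precision + recall)
        else:
            scores[c] = 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels for four documents.
    y_true = ["sports", "sports", "finance", "finance"]
    y_pred = ["sports", "finance", "finance", "finance"]
    print(f1_per_class(y_true, y_pred))
    print(macro_f1(y_true, y_pred))
```

In this toy case "sports" has precision 1 and recall 0.5, and "finance" has precision 2/3 and recall 1, so a single misclassified document lowers the score of both classes.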
