用户名: 密码: 验证码:
基于HMM模型的Web信息抽取方法的研究与改进
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着因特网技术的迅速发展,网上信息成几何级数增长,如何在海量联机文本中获取所需的信息成为目前重要的研究课题,因此,通用搜索引擎和垂直搜索引擎技术也日益成为人们研究的重点。相对于通用搜索引擎,垂直搜索引擎在信息抽取技术的支撑下,为用户提供更有针对性、更加直观的结构化信息。信息抽取是指从一段文本中抽取指定的信息(例如事件、事实),并将其形成结构化的数据填入数据库中供用户查询使用的过程。目前,信息抽取技术已经获得了长足的发展,然而在垂直搜索引擎中,基于网页模板的信息抽取仍然是最常使用的信息抽取方法。这种方法虽然有准确率和回召率高的优点,但在抽取网页格式多、变化频率高时,会降低抽取系统的灵活性,增加维护成本。
     本文研究基于隐马尔可夫模型的Web信息抽取方法,并对隐马尔可夫模型在Web信息抽取中的应用提出了改进的方法。基于隐马尔可夫模型的Web信息抽取方法是基于机器学习的抽取方法,可以有效提高抽取模型的灵活度,降低维护成本。
     本文阐述了Web信息抽取出现的背景和发展历史,剖析了Web信息抽取的典型系统所采用的方法,分析了信息抽取发展过程中有代表意义的利用机器学习算法学习文本特征的抽取技术和抽取系统。研究了隐马尔科夫模型与二阶隐马尔科夫模型的原理以及主要算法。如评估中的向前算法和向后算法;学习中用于完全标记训练样本的Maximum-Likelihood算法和用于部分标记训练样本的Baum-Welch算法;解码中的Viterbi算法。并着重探讨了隐马尔科夫模型在文本信息抽取中应该如何应用,对隐马尔科夫模型在文本信息抽取中的应用提出了改进的方法。并建立了基于HMM的Web信息抽取模型。
     通过对信息抽取后的数据进行对比和分析,验证了对HMM模型的改进是行之有效的,达到了在垂直搜索引擎中的应用标准。
With the development of the internet techniques, the information on the internet increases exponentially. One of important research focuses on how to deal with these great capacities of online documents. As above, General search engine and vertical search engine become the key points of the research. Different from general search engine, vertical search engine provides more focused and structured information. Information extraction is a natural language processing task that involves automatically extracting specific types of information from text, such as events and facts, forms structured data, and then populates database slots for queries. Presently, information extraction technology develops a lot, but the information extraction method based on html template is already the usual method for information extraction in vertical search engine. This method forms a high precision and recall percent, but reduces the flexibility of the extraction system and increase the cost of maintenance.
     This thesis mainly researches on relative algorithms on Web information extraction based on hidden Markov model (HMM). Web information extraction based on HMM is one of the methods based on machine learning, which can increases the flexibility of the extraction system and reduces the cost of maintenance.
     This thesis expands the background and history of information extraction, and analyzes the typical technology and system of web information extraction which use machine learning to study characteristics of text. The principle and main algorithm of HMM and second order HMM are also expatiated, such as forward algorithm and backward algorithm for evaluation; Maximum-Likelihood algorithm and Baum-Welch algorithm to mark training samples in the study of the model; Viterbi algorithm for decoding. How to use HMM and how to mark data in text information extraction are discussed. And several methods to improve the hidden Markov model in information extraction are offered. Then, the web information extraction model based on HMM is established.
     After the comparison and analysis of the output of information extraction model, it is proved that the improvement of the HMM model is effective and achieves the standard of application in vertical search engine.
引文
[1]播显民.数据挖掘技术及其应用[J].湘潭师范学院学报(自然科学版), 2007,29(1):38-40.
    [2]蔡献花,张岐山.数据挖掘技术及其应用[J].管理科学文摘,2003, 11(7):23-25.
    [3]杜世平,李海.二阶隐马尔可夫模型及其在计算语言学中的应用[J].四川大学学报(自然科学版). 2004, 41 (2):284-289.
    [4]杜世平.对经典隐马尔可夫模型学习算法的改进[J].高等数学研究,2006 , 9 (4) : 58-60.
    [5]韩家炜,孟小峰,王静等.Web挖掘研究[J].计算机研究与发展,2001,12 (25):56-75.
    [6]郝杰,李星.对经典隐马尔可夫模型的经验性改进[J].计算机工程与应用,2001 , 37 (11) : 24-26 .
    [7]贺前华,陆以勤,一种新的琳加训练方法[J].电子学报,2000 ,25(9):56-58.
    [8]胡迎松,宁海霞一种新型的Web挖掘数据采集模型[J].计算机工程与科学,2007 , 29 (2) : 36-39.
    [9]姜吉发.一种跨语句汉语事件信息抽取方法[J].计算机工程, 2005, 31( 2) : 27-29.
    [10]李效东,顾毓清.基于DOM的Web信息抽取[J].计算机学报,2002;25(5):526-533.
    [11]史笑兴,王太君,何振亚.二阶隐马尔可夫模型的学习算法及其与一阶隐马尔可夫模型的关系[J].应用科学学报,2001,19(1):29-32 .
    [12]涂承胜,鲁明羽,陆玉昌.Web挖掘研究综述[J].计算机工程与应用,2003 , 39 (23) : 129-13
    [13]王庆一,王继成,周源远等.多信息块Web页面的信息抽取[J].计算机应用研究,2002,19(10):23-26.
    [14]王新民,一种改进的隐马尔可夫模型训练算法[J].孝感学院学报,2004,24 (3):74-77 .
    [15]邢永康,马少平一种基于Markov链模型的动态聚类方法[J] ,计算机研究与发展,2003,40 (2):129-135.
    [16]薛为民,石志国,王志良.基于隐马尔可夫模型的复杂数据挖掘实现[J].计算机工程,2003 , 29(9):3-5 .
    [17]易高翔,胡和平.基于软计算的Web挖掘研究进展与前景[J].计算机工程与设计,2006, 27(10):1805-1507.
    [18]赵妍妍,王啸吟,秦兵等.中文事件抽取中事件类别的自动识别[C].第三届学生计算语言学研讨会论文集,2006: 240-245.
    [19]钟敏娟,郝谦,刘云中.基于隐马尔可夫模型的文本信息抽取算法[J].计算机工程,2006 , 32(2):203-205.
    [20]朱永盛,武港山.基于Web的新闻信息抽取[J].计算机工程,2006,32(10):74-76 .
    [21] Applet D E, Israel D J, Introduction to Information Extraction Technology[J].A Tutorial forIJCAI-99, 1999.
    [22] CALIFF M E, MOONEY R J. Relational learning of pattern-match rules for information extraction[Z]. Working Papers of the ACL-97 Workshop in Natural Language Learning, Madrid, Spain, 1997.
    [23] Chen H H, Ding Y W, Tsai S C, et al, Description of the NTU System Used for MET2[M].In Proceedings of the Seventh Message Understanding Conference, 1998.
    [24] Chinchor N, Marsh E, MUC-7 Information Extraction Task Definition (version 5.1)[M].In Proceedings of the Seventh Message Understanding Conference, 1998.
    [25] Chinchor N, Overview of MUC-7/MET-2[M].In Proceedings of the Seventh Message Understanding Conference, 1998.
    [26] Cowie J, Lehnert W, Information Extraction. Communications of the ACM, Vol. 39 No. 1, 1996.
    [27] Dayne Frietag, Andrew McCallum. Information extraction with HMMs and shrinkage [A]. Proceedings of the AAAI’99 Workshop on Machine Learning for Information Extraction[C]. Orlando: AAAI Press, 1999. 31- 36.
    [28] Dejong G, an Overview of the FRUMP System[M]. Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
    [29] Douthat A, The Message Understanding Conference Scoring Software User's Manual[C].In Proceedings of the Seventh Message Understanding Conference, 1998.
    [30] Freitag D, McCallum A. Information extraction with HMM structures learned by stochastic optimization[A]. Proceedings of the Eighteenth Conference on Artificial Intelligence [C]. Edmonton: AAAI Press, 2002. 584 - 589.
    [31] FREITAG D, KUSHMERICK N. Boosted wrapper induction[A]. Proceedings of the Seventeenth National Conference on Artificial Intelligence[C]. California:AAAI Press, 2000.577-583.
    [32] Gaizauskas R, Wilks Y, Information Extraction: Beyond Document Retrieval[J]. Journal of Documentation, 1997.
    [33] Grishman R, Adaptive information extraction and sublanguage analysis, In Proceedings of IJCAI-2001 Workshop on Adaptive Text Extraction and Mining[Z], 2001.
    [34] Grishman R, Information Extraction: Techniques and Challenges[J]. In M-T. Pazienza, editor, Information Extraction: a Multidisciplinary Approach to an Emerging Information Technology, Springer, Berlin, 1997.
    [35] Grishman R, Sundheim B, Message Understanding Conference-6: A Brief History, In Proceedings of the 16h International Conference on Computational Linguistics(COLING-96)[M], August 1996.
    [36] Hobbs J, Appelt D, Bear J, FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text[J]. Finite State Devices for Natural Language Processing, MIT Press, Cambridge MA, 1996.
    [37] Hobbs J, the Generic Information Extraction System. In Proceedings of the Fifth Message Understanding Conference (MUC-5)[M], pages 87-91. Morgan Kaufman, 1993.
    [38] Kristie Seymore , Andrew McCallum and Ronal Rosenfel. Learning Hidden Markov Model Structure and Marking data for Information Extraction. In: AAAI’99 Workshop on Machine Learning for Information Extraction[J]. Orlando , Florida:AAAI Press/MIT Press, 1999, 37-42.
    [39] Line Eikvil. Information Extraction from World Wide Web A Survey[M]. 1999.
    [40] PESHKIN L, PFEFFER A. Bayesian information extraction network [A]. Proceedings of the Eighteenth International Joint Conference Artificial Intelligence[C]. CA,USA:Morgan Kaufmann, 2003.421-426.
    [41] Sager N, Natural Language Information Processing[M].Reading, Massachusetts: Addison Wesley, 1981.
    [42] SODERLAND S. Learning information extraction rules for semi-structured and free text[J]. Machine Learning,1999,34(1-3):233-272.
    [43] Souyma Ray, Mark Craven. Representing sentence structure in hidden Markov models for information extraction [A]. Proceedings of the Sev2 enteenth International Joint Conference On Artificial Intelligence [C] .Washington:Morgan Kaufmann ,2001. 1273-1279.
    [44] SUN A, NAING M M, LIM E P. Using support vector machines for terrorism information extraction[A]. Intelligence and Security Informatics:First NSF/ NIJ Symposium[C]. Heidelberg:Springer-Verlag, 2003.1-12.
    [45] T Scheffer,C Decomain ,S Wrobel . Active hidden Markov models for information extraction [A]. Proceedings of the Fourth International Sym2 posium on Intelligent Data Analysis .Lisbon: Springer, 2001. 301 -109.
    [46] The ACE 2002 Evaluation Plan[Z].ftp://jaguar.ncsl.nist.gov/ace/doc/ACE-EvalPlan-2002-v0 6.pdf , Site visited on August 30th , 2002.
    [47] Yangarber R, Scenario Customization for Information Extraction[M].Ph.D. Thesis, New York University, January 2001.
    [48] Yangarher R, Grishman R, NYU: Description of the Proteus/PET System as Used for MUC-7[M]. In Proceedings of the Seventh Message Understanding Conference, 1998.
    [49] Yu S H, Bai S H, Wu P, Description of the Kent Ridge Digital Labs System Used for MUC-7[M].In Proceedings of the Seventh Message Understanding Conference, 1998.
    [50] ZHANG H P, LIU Q, CHENG X Q,et al. Chinese lexical analysis using hierarchical hidden markov mode[Z]. 2nd SigHan Workshop, Sapporo, Japan, 2003.
    [51] ZHANG H P, YU HK, XIONG D Y,et al. HHMM-based Chinese lexical analyzer ICTCLAS[Z]. 2nd SigHan Workshop, Sapporo, Japan, 2003.
    [52] Zhang Y M, Zhou J F, A Trainable Method for Extracting Chinese Entity Names and Their Relations[M], In Proceedings of the Second Chinese Language Processing Workshop, Hong Kong, Oct. 2000.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700