Research on Domain-Specific People Information Extraction from Heterogeneous Web Sources (异构信息源的领域人物信息抽取研究)


English title: Research on Domain-Specific People Information Extraction from Heterogeneous Web Sources
Author: Zhou Ting (周婷)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: Heterogeneous web sources; Information extraction; Text categorization; Feature selection
Degree year: 2010
Advisor: Liu Bingquan (刘秉权)
Discipline code: 081202
Degree-granting institution: Harbin Institute of Technology
Thesis submission date: 2010-06-01

Abstract

As the Internet permeates every aspect of society, more and more personal information appears online. People search engines have emerged only in recent years, and people search restricted to a specific domain is newer still; research on it remains immature. The teaching and research quality of university faculty is drawing growing attention, and the demand for searching information about faculty members is growing with it. Taking information extraction for university computer science faculty as the application background, this thesis studies domain-specific people information extraction from heterogeneous information sources and implements a faculty profile system for university computer science teachers. The thesis focuses on the following problems.
     First, the data are obtained in two ways: a topic-focused crawler collects pages containing people information, and pages containing people information are identified among the result pages returned by search engines. The identification task is treated as a web page classification problem: features are extracted from both the structure and the content of each page, and an SVM model performs the classification. To improve the time efficiency of classification, two feature selection methods are proposed, one based on the contribution of a feature term to a class and the other based on SVM training weights.
     Second, pages containing people information are themselves classified according to their characteristics. The classification combines rules with machine learning and draws on both structural and content features of the pages. Multi-record pages are classified using HTML tag density together with a content-based method. Single-record pages are classified by an SVM classifier whose features are extracted from the page structure. Experimental results show that the rule-based classifier and the classifier based on page-structure features perform fairly well.
     Third, building on this classification, a rule-based method for extracting person attributes is proposed. A trigger-word lexicon for domain people information extraction is constructed first. A rule base for attribute extraction is then constructed according to the characteristics of domain people information extraction and of the structure-based page categories. Attribute extraction relies on the page category, the trigger-word lexicon, the rule base, and the characteristics of each attribute itself. Experiments show that attribute extraction achieves fairly good results.
     Finally, the method is applied to extracting information about university computer science faculty, and a faculty profile system is implemented. The system holds information on 4,134 teachers from 120 universities across China and supports several ways of querying it.
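
The rule-based attribute extraction described in the third point can be illustrated with a minimal sketch. The trigger words, the single "trigger word, colon, value" rule, and the value patterns below are assumptions made for illustration only; the abstract does not specify the actual lexicon or rule base used in the thesis, and the sketch ignores the page category that the thesis also takes into account.

```python
import re

# Hypothetical trigger-word lexicon: attribute name -> trigger words.
# The real lexicon from the thesis is not given in the abstract.
TRIGGERS = {
    "email": ["Email", "E-mail", "邮箱"],
    "phone": ["Phone", "Tel", "电话"],
    "title": ["Title", "职称"],
}

# Hypothetical value patterns for attributes with a recognizable shape.
VALUE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s()-]{5,}\d"),
}


def extract_attributes(lines):
    """Apply one illustrative rule to each text line of a page.

    Rule: a line that starts with a trigger word followed by a colon
    yields that attribute. If the attribute has a value pattern, only
    the part of the line matching the pattern is kept.
    """
    result = {}
    for line in lines:
        for attr, words in TRIGGERS.items():
            if attr in result:
                continue  # keep the first value found for each attribute
            for word in words:
                rule = rf"\s*{re.escape(word)}\s*[:：]\s*(.+)"
                m = re.match(rule, line, re.IGNORECASE)
                if not m:
                    continue
                value = m.group(1).strip()
                pattern = VALUE_PATTERNS.get(attr)
                if pattern is not None:
                    found = pattern.search(value)
                    if found is None:
                        continue  # trigger matched but value looks wrong
                    value = found.group(0)
                result[attr] = value
                break
    return result
```

Calling `extract_attributes` on the text lines of a single-record page returns a dictionary with one entry per attribute whose trigger word was found. A fuller system in the spirit of the abstract would select a different rule set for each page category.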

