A Study on Extracting Location Entities from Event Reports (事件报道中地点实体的提取研究)
Abstract
Entities are the basic units of event information and play an important role in events. In natural language processing, entity recognition is a fundamental tool for applications such as information extraction, syntactic analysis, machine translation, and metadata annotation for the Semantic Web. In event-report texts, the place where an event occurs, the people involved, and the objects involved are the three key elements of the event, and they are the three main roles that entities play in such texts. Extracting these three elements automatically would lay a foundation for acquiring event information and for studying the frame structure of texts, and it is also significant for syntactic analysis and discourse analysis.
     This thesis starts from event-report texts and studies the location entities in them. A location entity here differs from a place name in named entity recognition: it is the entity that fills the role of the place where the event occurred, and in form it is the maximal-length noun phrase that expresses that place in the text. The nesting of syntactic constituents that is characteristic of Chinese makes such entities complex to express. Specifically, the length of an entity is unbounded, and the longest location entity found in the texts so far is 35 characters long; an entity may contain punctuation such as commas, enumeration commas, double quotation marks, and parentheses; and location entities take many forms, so the same place can recur in a text under different expressions. These properties make location entities harder to recognize, and existing methods developed for named entity recognition do not perform well on them.
     Based on the characteristics of the task and a comparison of several extraction methods, this thesis adopts a rule-based method for extracting location entities. The main work is as follows:
     ① 325 news reports (about 160,000 characters) in five categories, namely earthquakes, fires, poisonings, terrorist attacks, and traffic accidents, were annotated manually.
     ② On the basis of the manually annotated corpus, the internal structure of location entities, the distribution of their boundaries, and the contexts in which they occur were analyzed statistically, and a rule model was built from these features. The model has four parts: core rules, classification rules, preceding- and following-context rules, and correction rules.
     ③ Location entity extraction experiments with the rule model gave good results: a precision of 85.8% in the closed test and 79.7% in the open test.
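The rule-based idea and the precision measure mentioned above can be illustrated with a toy sketch. The abstract does not list the thesis's actual rules, so every cue word, head word, and test sentence below is a hypothetical placeholder, and the classification rules are not modeled at all. The sketch only shows how a preceding-context cue, a required final head word, a following-context cue, and a simple correction step might combine, and how precision could be computed if exact matching against the manual annotation is assumed.

```python
import re

# All rule contents are hypothetical placeholders, not the thesis's rules.
LEFT_TRIGGERS = ["位于", "在", "于"]    # preceding-context cue words
RIGHT_TRIGGERS = ["发生"]               # following-context cue words
HEAD_WORDS = ["路段", "境内", "附近", "市", "县", "区", "镇", "村"]  # allowed final words
# Correction rule: a leading time expression (plus a cue word) is not part of the place.
LEADING_TIME = re.compile(r"^(?:昨日|昨天|今日|今天)(?:凌晨|上午|下午|晚间)?(?:在|于)?")


def _alt(words):
    """Build a regex alternation, longest word first so longer cues win."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


CANDIDATE = re.compile(
    rf"(?:{_alt(LEFT_TRIGGERS)})"        # left boundary cue, not captured
    rf"(?P<loc>[^。!?;]{{1,40}}?"      # body: commas, brackets, quotes are allowed
    rf"(?:{_alt(HEAD_WORDS)}))"          # candidate must end in a head word
    rf"(?={_alt(RIGHT_TRIGGERS)})"       # right boundary cue, checked by lookahead
)


def correct(candidate):
    """Apply the toy correction rule; return None if too little text remains."""
    trimmed = LEADING_TIME.sub("", candidate)
    return trimmed if len(trimmed) >= 2 else None


def extract_locations(text):
    """Return the location-entity candidates found in one piece of text."""
    results = []
    for match in CANDIDATE.finditer(text):
        loc = correct(match.group("loc"))
        if loc:
            results.append(loc)
    return results


def precision(predicted, gold):
    """Share of extracted entities that exactly match a gold annotation."""
    if not predicted:
        return 0.0
    gold_set = set(gold)
    return sum(1 for p in predicted if p in gold_set) / len(predicted)


if __name__ == "__main__":
    # Invented sentences, used only to exercise the toy rules.
    for sentence in [
        "昨日凌晨,在四川省汶川县映秀镇发生7.8级地震。",
        "一辆客车在107国道(京珠线)衡阳路段发生侧翻。",
        "火灾于昨日在北京市朝阳区发生。",
    ]:
        print(extract_locations(sentence))
```

The lazy body combined with the lookahead lets a candidate grow past intermediate head words (for example a county name inside a longer phrase) until a following-context cue is reached, which loosely mirrors the "maximal-length noun phrase" notion in the abstract. A real system would need far richer rules than this.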
