Research on the Application of Text Clustering in Topic Detection and Person Name Disambiguation

English title: Text Clustering Method in Topic Detection and Person Name Disambiguation
Author: Dai Xiangying (戴祥鹰)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: Vertical Search; Topic Detection and Tracking; Agglomerative Hierarchical Clustering; Person Name Disambiguation
Degree year: 2010
Advisor: Wang Xiaolong (王晓龙)
Discipline code: 081202
Degree-granting institution: Harbin Institute of Technology
Thesis submission date: 2010-12-01

Abstract

Users of financial information services want a complete picture of the major events affecting a company or stock, together with the causes and consequences of those events. Financial portals fall short in two ways: their coverage is limited, and news about many different companies is mixed together. At the same time, different news outlets publish large numbers of similar and follow-up reports on the same event, and reprinting fills the web with duplicate stories, so browsing financial news sites for information about the companies behind one's holdings is slow and laborious. Vertical search engines for the financial domain, such as Google Finance, let users browse news by company or stock, but their results are not organized by time and topic, which makes it hard to find the first report of an event or to follow its causes and consequences. The next problem for financial vertical search is therefore how to identify and track the major events of a stock or company in the search results and present them to users as topics arranged along a timeline.
     This thesis addresses that problem with topic detection and tracking (TDT), a technique that organizes the stories in a news stream into topics. Each topic consists of an initial story and the follow-up stories related to it. We apply TDT to a financial vertical search system: search results are grouped by stock or company into topics, and the topics are presented in chronological order so that users can readily review major events and their causes and consequences. We combine two existing clustering methods, affinity propagation and agglomerative hierarchical clustering (HAC), with modifications, into a new clustering method, AP-HAC, and apply it to topic detection in financial news. Experiments on a text classification corpus, a standard TDT corpus, and a manually annotated corpus show that AP-HAC outperforms both classic K-Means and conventional agglomerative hierarchical clustering and detects topics effectively. The work has been deployed in the Hai Tian Yuan (海天园) financial news topic detection and tracking system.
     The second application of clustering studied here is person name disambiguation. Many people share the same name, which introduces ambiguity into text, especially for common names; this hampers many information retrieval and natural language processing tasks, and the problem is more severe in Chinese text. Besides using HAC as one component (the first layer) of AP-HAC for financial news topic detection and tracking, we therefore explore a further application of HAC: resolving Chinese person name ambiguity. We combine HAC with information extraction for this task, and experiments show that the approach resolves Chinese person name ambiguity effectively.
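To make the clustering step concrete, the following is a minimal sketch of plain average-link agglomerative hierarchical clustering applied to a handful of news stories. It is not the thesis's AP-HAC method: the affinity propagation stage is omitted, and the whitespace tokenizer, the `hac_topics` name, the `threshold` value, and the sample stories are all assumptions made for illustration (Chinese text would need a word segmenter first).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac_topics(stories, threshold=0.3):
    """Group (date, text) stories into topics with average-link HAC.

    Merging stops once the most similar pair of clusters falls below
    `threshold` (a hypothetical cut-off, not a value from the thesis).
    """
    # Assumption: whitespace tokenization stands in for real segmentation.
    vectors = [Counter(text.lower().split()) for _, text in stories]
    clusters = [[i] for i in range(len(stories))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_link(clusters[x], clusters[y], vectors)
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    # Order stories inside each topic by date, then topics by first report.
    topics = [sorted(stories[i] for i in c) for c in clusters]
    return sorted(topics, key=lambda topic: topic[0][0])


# Invented sample stories for a single (fictional) company.
stories = [
    ("2010-03-02", "acme shares fall after profit warning"),
    ("2010-03-01", "acme issues profit warning for first quarter"),
    ("2010-03-05", "acme names new chief executive"),
    ("2010-03-06", "new chief executive outlines acme strategy"),
]
for topic in hac_topics(stories):
    print([date for date, _ in topic])
```

With these four stories the sketch yields two topics, each listed from its first report to its follow-up, which mirrors the timeline presentation the abstract describes. The same routine could in principle cluster documents that mention an ambiguous person name, with extracted attributes replacing raw word counts, though the thesis's actual features are not given here.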
