Research on Text Information Metrics
Abstract
A metric is a quantitative description that characterizes the relationships between objects. In text information processing, research on information metrics at different linguistic granularities has important theoretical value and broad application prospects. In recent years, the rapid growth of Web 2.0 has posed new challenges to text information metrics. Complex and diverse web data, together with informally written web text, make many traditional natural language information metrics unsuitable for the internet environment. For example, dictionary-based lexical similarity metrics cannot handle rapidly emerging new words well, and syntax-tree-based sentence similarity metrics cannot handle informally written user queries and web document titles well. The irregularity of Chinese web language poses an even more pronounced challenge to Chinese natural language processing. Moreover, traditional relatedness metrics based on web link analysis do not exploit the structural characteristics of social collaborative encyclopedias, and therefore cannot explain the relatedness between concepts.
     To address the characteristics of text data in this new setting, this thesis proposes and implements new information metrics at four levels of information objects, as follows.
     At the phrase level, we propose a phrase non-compositionality metric. Grounded in information distance theory, it has a solid theoretical foundation and can be used to judge whether a given word sequence is compositional (in a particular context). Because the required statistics are drawn from the whole web, the metric is highly applicable and robust, and can be used for post-processing in question answering systems and for complex named entity recognition.
     At the concept level, we propose a new method for measuring the relatedness of concepts in web encyclopedias (e.g., Wikipedia). Unlike previous methods based on web link analysis, it fully exploits the structural characteristics of Wikipedia, so it can not only measure concept relatedness but also explain the relationship between concepts using the encyclopedia's categories.
     At the sentence level, we propose a pattern-set-based metric for computing the similarity between natural language questions. Exploiting the different characteristics of function words and content words in questions, we handle them with hard patterns and soft patterns respectively. This metric can capture long-range dependencies between words without resorting to syntax trees, and can be effectively applied to question classification.
     At the sentence relation level, we propose a kernel-based analogical similarity metric for sentence pairs. It maps sentence relations into a space of rewriting rules and represents their similarity by the inner product in that space. This method can structurally capture the analogical similarity of sentence relations without syntax trees, and achieves state-of-the-art accuracy on paraphrase identification and textual entailment recognition.
A metric is used to characterize the relationship between objects. In natural language processing (NLP), research on information metrics over different linguistic units is of essential theoretical value and has wide application backgrounds.
     Recently, the rapid development of Web 2.0 has posed great challenges to natural language processing. Classical NLP information metrics cannot handle complex and dynamic internet data or informally written web text. For example, lexical similarity metrics based on local dictionaries are not suitable for processing new words that emerge on the internet, and sentence-level similarity metrics based on syntactic trees are unreliable for measuring the similarity between user queries and document titles, especially in Chinese. Moreover, classical metrics based on link analysis cannot make full use of the structural features of social collaborative data.
     To deal with these challenges, we propose new information metrics on four different kinds of information objects, as follows.
     At the phrase level, we propose a non-compositionality metric for n-grams, which is based on information distance and thus has a solid theoretical foundation. It can be used to measure the non-compositionality of a given n-gram (in certain contexts). Since this metric is approximately computed from frequency counts on the internet, it is robust and widely applicable; it can be used for post-processing in question answering and for complex named entity recognition.
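The thesis does not spell out the metric's formula here, so the following is only an illustrative sketch in the same spirit: a web-frequency approximation of information distance (the normalized Google distance of Cilibrasi and Vitányi), with all page counts hard-coded as made-up stand-ins for search-engine hit counts. An n-gram whose parts co-occur far more often than their individual frequencies predict scores as more non-compositional.

```python
import math

# Hypothetical web page counts; in practice these would come from a
# search engine, so every number below is purely illustrative.
TOTAL_PAGES = 1e10
COUNTS = {
    ("hot",): 4.2e8, ("dog",): 3.1e8, ("hot", "dog"): 8.0e7,   # idiomatic
    ("red",): 5.0e8, ("car",): 6.0e8, ("red", "car"): 9.0e5,   # compositional
}

def ngd(x, y):
    """Normalized Google Distance between two terms (Cilibrasi & Vitanyi)."""
    fx, fy = math.log(COUNTS[(x,)]), math.log(COUNTS[(y,)])
    fxy = math.log(COUNTS[(x, y)])
    log_n = math.log(TOTAL_PAGES)
    return (max(fx, fy) - fxy) / (log_n - min(fx, fy))

def non_compositionality(x, y):
    """Score a bigram as non-compositional when its parts are unusually
    close on the web, i.e. their NGD is small."""
    return 1.0 - ngd(x, y)

print(non_compositionality("hot", "dog") > non_compositionality("red", "car"))
```

With these toy counts, "hot dog" scores higher than "red car", matching the intuition that the former is idiomatic while the latter is fully compositional.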
     At the concept level, we propose a new algorithm for measuring the semantic relatedness between concepts in a social collaborative encyclopedia (e.g., Wikipedia). Unlike classical metrics based on link analysis, our method takes full advantage of the structural features of the encyclopedia. It can not only measure relatedness, but also interpret that relatedness using categories.
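A minimal sketch of the category-based idea, not the thesis's actual algorithm: score relatedness by the overlap of two concepts' category sets, and reuse the shared categories as the explanation. The concepts and category names below are made up for illustration; real data would come from Wikipedia's category graph.

```python
# Toy category assignments (illustrative, not real Wikipedia data).
CATEGORIES = {
    "Beijing":  {"Capitals in Asia", "Cities in China", "Olympic host cities"},
    "Shanghai": {"Cities in China", "Port cities"},
    "Paris":    {"Capitals in Europe", "Olympic host cities"},
}

def relatedness(a, b):
    """Jaccard overlap of category sets: a simple stand-in for a
    structure-aware relatedness measure."""
    ca, cb = CATEGORIES[a], CATEGORIES[b]
    return len(ca & cb) / len(ca | cb)

def explain(a, b):
    """The shared categories double as a human-readable explanation."""
    return sorted(CATEGORIES[a] & CATEGORIES[b])

print(relatedness("Beijing", "Shanghai"))   # 1/4 = 0.25
print(explain("Beijing", "Shanghai"))       # ['Cities in China']
```

The key design point this toy preserves is interpretability: unlike a bare link-analysis score, the measure can always name the categories through which two concepts are related.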
     At the sentence level, we propose a question similarity metric based on pattern sets. To exploit the function words and content words in questions, we build hard patterns and soft patterns over them respectively. The metric can model long-range dependencies between words without using syntactic trees, and can be applied to question classification.
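The hard/soft split can be sketched as follows. This is an assumed, simplified pattern language (the tiny function-word list, the "*" slot notation, and the equal weighting are all illustrative choices, not the thesis's): function words stay in place in a hard pattern, which captures word-order dependencies without parsing, while content words are matched loosely via a soft pattern.

```python
# Illustrative function-word list; a real system would use a fuller one.
FUNCTION_WORDS = {"what", "which", "is", "the", "of", "a", "an", "in", "how"}

def hard_pattern(question):
    """Keep function words in place, replace each content word with a slot.
    This preserves long-range order among function words without parsing."""
    return tuple(w if w in FUNCTION_WORDS else "*" for w in question.lower().split())

def soft_pattern(question):
    """Bag of content words, matched loosely (here: set overlap)."""
    return {w for w in question.lower().split() if w not in FUNCTION_WORDS}

def similarity(q1, q2):
    """Combine exact hard-pattern match with soft-pattern Jaccard overlap."""
    hard = 1.0 if hard_pattern(q1) == hard_pattern(q2) else 0.0
    s1, s2 = soft_pattern(q1), soft_pattern(q2)
    soft = len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0
    return 0.5 * hard + 0.5 * soft

print(similarity("what is the capital of France", "what is the capital of Japan"))
```

Here the two questions share the hard pattern `('what', 'is', 'the', '*', 'of', '*')`, so they score high even though their content words differ.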
     At the sentence relation level, we propose a sentence relation similarity metric based on kernel methods, which maps sentence pairs into a space of rewriting rules and uses the inner product in that space to represent similarity. The method can capture structural similarity between sentence pairs without syntactic analysis tools, and still achieves state-of-the-art accuracy on paraphrase identification and recognizing textual entailment.
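A toy version of the rewriting-rule kernel idea: map each sentence pair to the set of word-level rewrite rules that turn the first sentence into the second, and let the kernel between two pairs be the inner product (here, the shared-rule count) of their rule sets. The positional alignment below is an illustrative simplification, not the thesis's actual construction, which would need a proper alignment.

```python
def rewrite_rules(s1, s2):
    """Extract word-level rewrite rules from a sentence pair, aligning
    words by position (a simplifying assumption for this sketch)."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    return {(a, b) for a, b in zip(w1, w2) if a != b}

def kernel(pair_x, pair_y):
    """Inner product in rewrite-rule space: |rules(x) & rules(y)|.
    Two sentence pairs are similar when the same rewrites relate them."""
    return len(rewrite_rules(*pair_x) & rewrite_rules(*pair_y))

x = ("the cat ate the fish", "the cat devoured the fish")
y = ("a dog ate a bone", "a dog devoured a bone")
print(kernel(x, y))  # 1: both pairs share the rule ('ate', 'devoured')
```

The point of kernelizing is that the relation between sentences, not the sentences themselves, becomes the object being compared: two pairs about entirely different topics still match if the same rewrite connects them.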
