Research on Literature Retrieval Technology Based on Citation Context Analysis
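Abstract
With the arrival of the big data era, a growing share of scientific literature exists on the web as electronic documents. This promotes the dissemination of literature and advances the level of scientific research, letting researchers "stand on the shoulders of giants". However, the large volume of electronic academic literature is uneven in quality and poses new challenges for literature management. How to represent, filter, and apply literature effectively has become one of the hot topics in knowledge management research.
     This thesis therefore applies text mining, information retrieval, and related methods to the study of literature retrieval technology. Building on citation analysis and using the semantic information in citation contexts, it combines topic models, ranking algorithms, language models, and network graph theory to study the visualization of literature knowledge domains, literature ranking algorithms, and the construction of literature retrieval models, and it validates each component experimentally on relevant academic paper data. The main research content is as follows:
     1. Based on citation analysis, propose a method for computing the citation probability distribution distance and apply it to the visualization of literature knowledge domains.
     2. Extract the text of citation contexts and use the Labeled-LDA topic model to obtain two prior probabilities in the directed, weighted citation network, namely the vertex weights and the edge weights. Use them to improve the traditional PageRank algorithm, yielding a citation-context-based literature ranking method (Context-Based Ranking Algorithm, CBRA).
     3. Apply the citation-context-based ranking method to experiments on author authority analysis, build an author authority ranking for each topic, and use the author authority information to improve the literature ranking results, so that the ranking is based not only on network links but also on author authority.
     4. Use the citation-context-based ranking method to improve the traditional information retrieval model based on language modeling, and build a topic-oriented literature retrieval system following a system development approach.
     5. Apply the citation-context-based ranking method to passage retrieval and build a topic-based passage retrieval model, thereby improving the accuracy and effectiveness of traditional literature retrieval.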
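Item 2 describes a PageRank variant in which vertex weights and edge weights act as priors on a directed, weighted citation network. The sketch below is one possible reading of that idea, not the thesis's actual CBRA: it assumes the vertex weights serve as the teleportation distribution and the edge weights scale the transition probabilities. The function name `weighted_pagerank`, its parameters, and the toy weights are hypothetical; in the thesis the weights would be derived from Labeled-LDA over citation contexts, which is not reproduced here.

```python
def weighted_pagerank(edges, vertex_prior, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank with a vertex prior and weighted edges.

    edges: dict mapping (citing, cited) -> non-negative edge weight (assumed input).
    vertex_prior: dict mapping every node -> positive prior weight (assumed input).
    """
    nodes = sorted(vertex_prior)
    total_prior = sum(vertex_prior.values())
    # Normalised vertex prior, used as the teleportation distribution.
    teleport = {n: vertex_prior[n] / total_prior for n in nodes}

    # Total outgoing edge weight per citing paper, used to normalise transitions.
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by papers with no weighted outgoing citations is
        # redistributed according to the vertex prior.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        new_rank = {n: (1.0 - damping + damping * dangling) * teleport[n] for n in nodes}
        for (src, dst), w in edges.items():
            if w > 0.0:
                new_rank[dst] += damping * rank[src] * w / out_weight[src]
        change = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy citation network: A cites B and C, B cites C (weights are made up).
toy_edges = {("A", "C"): 0.9, ("A", "B"): 0.1, ("B", "C"): 0.5}
toy_prior = {"A": 1.0, "B": 1.0, "C": 1.0}
toy_ranks = weighted_pagerank(toy_edges, toy_prior)
print(sorted(toy_ranks, key=toy_ranks.get, reverse=True))
```

With a uniform vertex prior and unit edge weights this reduces to ordinary PageRank, so the two priors are the only places where citation-context information would enter the computation.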
With the arrival of the Big Data era, more and more scientific literature is available on the Internet as electronic documents. This not only promotes the dissemination of literature but also accelerates the development of scientific research and helps researchers "stand on the shoulders of giants". However, good and bad literature are intermingled in this large body of electronic academic documents, and the problem is becoming more conspicuous. We therefore face new challenges in literature visualization, retrieval, management, and application, which have become a hot topic in bibliometrics and knowledge management research.
     This thesis focuses on methodologies for scientific literature retrieval based on citation analysis, text mining, and information retrieval. It draws on topic models, ranking algorithms, language models, and graph theory. First, a method for domain knowledge visualization is presented. Next, a ranking algorithm for scientific literature is developed by analyzing the semantic knowledge in citation contexts. Finally, a scientific literature retrieval model is implemented. Each of these methods is validated experimentally. The main research content includes:
     1. Propose a new method for computing the citation probability distribution distance based on citation analysis, and apply it to the visualization of literature knowledge domains.
     2. Extract the text of citation contexts, and use the Labeled-LDA topic model to generate two prior probabilities (vertex weight and edge weight) in the directed, weighted citation network. On this basis, propose a Context-Based Ranking Algorithm (CBRA) that improves the traditional PageRank algorithm.
     3. Apply the CBRA to experiments on author authority ranking. An author authority ranking is built for each topic and used to improve the literature ranking, so that the literature ranking is based not only on network links but also takes author authority into account.
     4. Use the CBRA to improve the traditional information retrieval model based on language modeling, and build a topic-based literature retrieval system using system development methods.
     5. Apply the CBRA to passage retrieval and build a topic-based passage retrieval model, which improves the accuracy and effectiveness of literature retrieval.
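Item 4 combines a ranking score with a retrieval model based on language modeling. One common way to do this, assumed here rather than taken from the thesis, is to use the ranking score as the document prior P(D) in a query-likelihood model with Dirichlet smoothing, scoring each document by log P(D) + log P(Q|D). The function name `score_documents`, the parameter `mu`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def score_documents(query, docs, doc_prior, mu=2000.0):
    """Score documents by log P(D) + log P(Q|D) with Dirichlet smoothing.

    query: list of query terms.
    docs: dict mapping doc id -> list of tokens.
    doc_prior: dict mapping doc id -> positive prior (assumed to be a
        ranking score such as a normalised PageRank-style value).
    """
    doc_counts = {d: Counter(tokens) for d, tokens in docs.items()}
    # Collection-wide term counts provide the background language model.
    collection = Counter()
    for counts in doc_counts.values():
        collection.update(counts)
    collection_len = sum(collection.values())

    scores = {}
    for d, counts in doc_counts.items():
        doc_len = sum(counts.values())
        score = math.log(doc_prior[d])
        for term in query:
            p_coll = collection[term] / collection_len
            if p_coll == 0.0:
                # Term absent from the whole collection: skipped for every
                # document, so it does not affect the relative ordering.
                continue
            score += math.log((counts[term] + mu * p_coll) / (doc_len + mu))
        scores[d] = score
    return scores


# Toy example: d1 and d2 have identical text but different priors.
toy_docs = {
    "d1": ["citation", "context", "ranking"],
    "d2": ["citation", "context", "ranking"],
    "d3": ["topic", "model"],
}
toy_prior = {"d1": 0.6, "d2": 0.3, "d3": 0.1}
toy_scores = score_documents(["citation", "ranking"], toy_docs, toy_prior, mu=10.0)
print(sorted(toy_scores, key=toy_scores.get, reverse=True))
```

In this formulation the prior only breaks ties or shifts the order between documents whose text matches the query about equally well, which is one way link-based evidence can complement term-based evidence.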