用户名: 密码: 验证码:
信息检索中迁移Markov网络模型的研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
Web信息的快速增长,给信息检索系统带来了巨大挑战。传统的检索模型需要在一个相对固定的数据集上通过训练得到,不具备开放的学习功能,而Web上的信息[0]是实时更新的,检索模型由于无法对新的数据进行学习,检索性能必将随着Web信息的更新而逐渐下降,现有的解决办法之一是将新数据加入到原有的数据集中对模型重新进行训练,由于模型对原有数据集学习所获得的知识并没有被保留,这势必浪费大量的时间进行重复学习;另一种方法则是在新数据集上对模型进行训练,若新的数据集数据量不够则会影响模型的学习效果。因而,如何使得检索模型能较好学习新的数据并实现在目标数据集上的准确检索[2,4,,8,11],成为了信息检索模型研究的热点之一。
     本文中我们将检索模型在发生变化的数据上的训练可以看作是一个学习迁移的过程[2,14],检索模型在原有数据集上的训练势必对其在新数据上的学习有所影响,即在原有数据集上建立模型中所获得的知识对该模型在新数据上的训练有所帮助,从而减少由于数据发生改变而导致模型需要重新进行学习所耗费的时间。
     因此,为解决上述问题,我们结合迁移学习理论[1]和Markov网络理论[9],提出了一种新的方法,思想是迁移Markov网络用于实现检索模型的学习,通过获取先验知识,实现在目标数据集上高效检索目的。具体的思路是:首先建立一个数据集上的初始检索模型,其次,衡量用于建模的先验数据集和目标数据集中数据分布的差异性,在本文中我们使用Kullback–Leibler divergence[3,7](KL偏离度)的测量公式来度量这一分布的不同。且所得的KL偏离值也可用于确定检索公式中的平衡参数(也就是确定迁移先验数据中知识的“量”),然后,通过迁移旧数据集上的先验知识到目标数据集上,我们对基于Markov网络的检索模型进行修正,并在目标数据集上进行检索。这样就可以将以将以往被摒弃的先验数据通过有效的迁移学习利用到目标数据集的学习中,使检索模型能够快速学习并实现目标文档的高效检索。多组实验结果验证了我们的新方法在性能上要优于BM25算法[6,21,26],T-检验[16,20]的结果也显示模型的性能提高水平是显著的。
     本文的工作和创新点在于:
     1.首次将迁移学习理论用于信息检索领域,将知识的迁移和有指导的迁移学习等思想成功应用在基于Markov网络的信息检索模型[5,6,63,68]中,从构造的Markov网络中成功地学习先验数据。
     2.本文通过迁移先验数据集的知识,并利用部分新数据集,来修正所提出的检索模型,并最终实现基于Markov网络的检索模型能在新的数据集上的高效检索。这样可以大大减少重新建立一个适用于新数据集的检索模型所耗费的时间。
     3.通过实验验证并分析了进行迁移学习后的基于Markov网络的信息检索模型的性能,并将常用的BM25算法与经过迁移学习的Markov网络信息检索模型的性能进行比较。实验表明本文提出迁移基于Markov网络模型的方法表现比较优异,在很大程度上提高了检索效率。
Along with the development of internet, a lot of new data appears in the web every day. It brings a great challenge to retrieval system. Information is in changing and our retrieval model is gotten by training on amount of dataset and need much time to rebuild. To construct a retrieval model to adapt the new data quickly and to retrieval the new documents accurately is becoming an important research topic.
     Traditional retrieval model is impossible in learning new data, retrieval effect become lower when the documents increase. And retrieval model need cost a lot of time to modify for adapting new data set. Because information is in the exponential growth, and the training of retrieval model is described as a process of transfer learning process. And the training of old data has influence on the training in new data. So in the process of achieving retrieval in the target data sets, we need not re-training the prior data and new data, but only transferring priori knowledge of old data and according to the characteristics of new data to revise the retrieval model. By this way, we can save time and man power.
     In this paper, we put forward boldly a new retrieval model by incorporating the theory of transfer learning with Markov Network. Firstly, comparing term spaces network of old dataset and new (target) dataset, and the distance between data sets is measured using the Kullback-Leibler divergence. Moreover, KL-divergence is used to decide the trade-off parameter in retrieval formula. Then we transfer the useful prior knowledge of old dataset to the new (target) dataset, and finally implement the retrieval process on the target dataset. Experiments on multiple datasets indicate that our new approach outperforms other methods. Furthermore, we perform several T-tests to demonstrate the improvements are statistically significant.
     The innovations in this paper are following:
     1. It is first time we apply transfer learning theory in information retrieval. In this paper, the knowledge of migration, supervised transfer learning successfully used in our retrieval model based on Markov Network. And by construct the Markov term-space, the model can learn the prior data quickly.
     2. Knowledge in prior data is transferred to target data, with the help of the new data, we can modify rapidly our retrieval model, and retrieval model based Markov network gets high retrieval performance on the target document sets. By this way, we need not rebuild a new retrieval model and save time and effort.
     3. Our model is tested by experiments, and after analyzing the results we find that our model which transferred learning become more efficient on search process. And compare with the basic retrieval model (BM25), our methods demonstrated statistically significant improvements.
引文
[1] Sinno Jialin Pan and Qiang Yang, A Survey on Transfer Learning, HKUST-CS08-08,2008.11
    [2] Ben-David, S., and Schuller, R. Exploiting task relatedness for multiple task learning. In Proceedings of the Sixteenth Annual Conference on Learning Theory. 2003.
    [3] Wenyuan Dai, Gui- Rong Xue. Transferring Naive Bayes Classifiers for Text Classification. In Proceedings of the Twenty-Second National Conference on Artificial Intelligence (AAAI 2007)
    [4] Dikan Xing, Wenyuan Dai, Gui-Rong Xue. Bridged Refinement for Transfer Learning. In Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2007)
    [5] Jiali Zuo, MingWen Wang, Xi Wang. Extended Information Retrieval Model Based on Markov Network. Tsinghua Science and Technology, 2005, 45(s1), 1847-1852.
    [6] Lixin Gan, MingWen Wang, Huawei Zhang. A Markov Network Information Retrieval Model Based on the Cliques. The 4th National Symposium of Search Engine and Web Mining page, 2006.
    [7]丁国栋.基于统计语言建模的信息检索及相关研究[D].北京.中国科学院研究生院(计算技术研究所).2006年.
    [8] Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen. Optimizing Web Search Using Web Click-through Data. In Proc. of the 13th Conference on Information and Knowledg Management 2004, Washington DC, USA.
    [9] Donald Metzler, W. Bruce Croft. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 472--479. Salvador, Brazil, August 15-19, 2005.
    [10] Xiaoxiao Shi, Wei Fan, and Jiangtao Ren. Actively transfer domain knowledge. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Lecture Notes in Computer Science, pages 342–357,Antwerp, Belgium, September 2008.Springer.
    [11] Rich Caruana. Multi-task Learning, in Machine Learning (28). 1997
    [12] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with cotraining. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.
    [13] Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning. 2004.
    [14] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the Conference on Empirical Methods in Natural Language, pages 120–128, Sydney, Australia, July 2006.
    [15] Duda R., Hart P.E. and Stock D.G.著.李宏东,姚天翔等译.模式分类(原书第二版) [M].北京:机械工业出版社, 2003.
    [16] Hastie T., Tibshirani R. and Friedman J.著.范明,柴玉梅,昝红英等译.统计学习基础-数据挖掘、推理与预测[M].北京.电子工业出版社, 2004.
    [17] Mitchell T.M.著.曾华军,张银奎等译.机器学习[M].北京:机械工业出版社, 2003.
    [18] Vapnic V. The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
    [19]何盈捷,刘唯一.由Markov网到Bayesian网[J].计算机研究与发展. 2002, Vol 39.1:87-99.
    [20]白鸿钧.统计学原理[M].厦门.厦门大学出版社. 2003.
    [21]左家莉.基于Markov网络的信息检索模型: [硕士学位论文].南昌:江西师范大学计算机信息工程学院, 2005.
    [22] Wenyuan Dai, Qiang Yang, Guirong Xue, and Yong Yu. Self-taught clustering. In Proceedings of the 25th International Conference of Machine Learning, pages 200–207. ACM, July 2008.
    [23] S. Deerwester, S. T. Dumais, G. W. Furnas, et al. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 1990, 41(6): 391-407.
    [24] Hal Daum′eIII and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.
    [25] Daum′eIII, H., and Marcu, D. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research 26:101–126.
    [26]盛俊,王明文,余俊英.一种基于潜在语义的Markov网络信息检索模型.见:第二届全国信息检索与内容安全学术会议.大规模信息检索与内容安全论文集.北京:清华大学出版社, 2005, 1-10.
    [27]盛俊.潜在语义的Markov网络检索模型的研究: [硕士学位论文].南昌:江西师范大学计算机信息工程学院, 2006
    [28] ECML/PKDD-2006. http://www.ecmlpkdd2006.org/
    [29] Rie Kubota Ando and Tong Zhang. A high-performance semi-supervised learning method for text chunking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 1–9, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
    [30] R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval (English Edition)[M].北京:机械工业出版社,2004.
    [31] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Lecture Notes in Computer Science, pages 71–85, Antwerp, Belgium, September 2008. Springer.
    [32] Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. A spectral regularization framework for multi-task structure learning. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pages 25–32. MIT Press, Cambridge, MA, 2008.
    [33] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for transductive transfer learning. In Proceedings of the 7th IEEE International Conference on Data Mining Workshops, pages 77–82, Washington, DC, USA, 2007. IEEE Computer Society.
    [34] Bart Bakker and Tom Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Reserch, 4:83–99, 2003.
    [35] Elena Baralis, Silvia Chiusano, and Paolo Garza. A lazy approach to associative classification. IEEE Transactions on Knowledge and Data Engineering, 20(2):156–171, 2008.
    [36]丁江伟,刘挺,卢志茂,李生.隐马尔科夫模型和贝叶斯模型语义消歧对比研究.见:全国第七届计算语言学联合学术会议, 2003, 8.
    [37] Xiaoyong Liu, W. Bruce Croft. Statistical language modeling forinformation retrieval. In the Annual Review of information science and technology, 2004, vol. 39: 3-31.
    [38] Zhou, X., Hu, X., Lin, X., Han, H., and Zhang, X.,“Relation based Document Retrieval for Biomedical Literature Databases”, The 11th International Conference on Database Systems for Advanced Applications (DASFAA 2006), 12– 15 April, 2006, Singapore, pp. 689-701
    [39]张俊林.基于语言模型的信息检索系统研究[D]: [博士学位论文].北京:中国科学院软件研究所, 2004.
    [40] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Proceedings of the 19th Annual Conference on Neural Information Processing Systems, pages 41–48. Vancouver, British Columbia, Canada, December 2007.
    [41] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Lecture Notes in Computer Science, pages 71–85, Antwerp, Belgium, September 2008. Springer
    [42] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for transductive transfer learning. In Proceedings of the 7th IEEE International Conference on Data Mining Workshops, pages 77–82, Washington, DC, USA, 2007. IEEE Computer Society.
    [43] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pages 137–144, Cambridge, MA, 2007. MIT Press.
    [44]曹瑛,王明文,陶红亮.基于Markov网络的信息检索模型.第四届全国搜索引擎和网上信息挖掘学术研讨会山东2006.9.
    [45]左孝凌,李为鑑等,《离散数学》[M]北京科学出版社。
    [46]贺宏朝,何丕廉,陈霞.利用人工和自动生成的资源进行中文信息检索查询扩展[J].计算机工程与应用. 2002, 21:18-20.
    [47]张敏,宋睿华,马少平.基于语义关系查询扩展的文档重构方法[J].计算机学报. 2004, 27(10): 1395-1401.
    [48] Jing Bai, Dawei Song, Peter Bruza, Jian-Yun Nie, Guihong Cao (2005)Query Expansion Using Term Relationships in Language Models for Information Retrieval. ACM Fourteenth Conference on Informationand Knowledge Management (CIKM) CIKM and Workshops 2005,pp 688-695
    [49] Bruno M. Fonseca, Paulo Golgher, Bruno P?ssas. Concept Based Interactive Query Expansion..Yuen-Hsien, Da-Wei Juang, Shiu-Han Chen. Glodbal and Local Term Expansion for Text Retrieval. Proceedings of NTCIR-4, Tokyo,April 2003-June 2004.
    [50] J.J. Rocchio, Relevance feedback in information retrieval. In: G. Salton (ed.), The Smart retrieval system: experiments in automatic document processing, Prentice Hall, 1971, 313-323.
    [51] Rajat Raina, Andrew Y. Ng, and Daphne Koller. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 713–720, Pittsburgh, Pennsylvania, USA, June 2006.
    [52]梅立军,周强,臧路等.知网与同义词词林的信息融合研究[J].中文信息学报. 2005, 19(1): 63-70.
    [53] Voorees E. M. Using WordNet to disambiguate word senses for text retrieval. Research and Development on Information Retrieval-ACM-SIGIR, Pittsburgh, 1993:171-180.
    [54] M. W. Wang, J. Y. Nie. A Dempster-Shafer Model for Query Expansion. Journal of Jiangxi Normal University (Nature Science). No. 3, 2004.
    [55] Ji-Rong wen, Lao li, Wei-Ying Ma. Probabilistic model for contextual retrieval. SIGIR’04, Sheffield, South Yorkshire, UK, 2004:57-63.
    [56] Shuang Liu, Fang Liu, Clement Yu. An effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. SIGIR’04, Sheffield, United Kingdom, 2004:266-272.
    [57] Sang-Bum Kim, Hee-Cheol Seo and Hae-Chang Rim. Information Retrieval using Word Senses: Root Sense Tagging Approach. SIGIR’04, Sheffield, United Kingdom, 2004:258-265.
    [58] Vassilis Plachours, Iadh Ounis. Usefulness of Hyperlink Structure for Query-Biased Topic Distillation. SIGIR’04 July 25-29,2004,Sheffield, South Yorkshire,UK.
    [59] Jaime Teevan,Susan T.Dumais,Eric Horvitz. Personaizing Search via Automated Analysis of Interests and Activities. SIGIR’05. August 15-19,2005,Salvador, Brazil.
    [60]赵需要,张文德.网络信息检索模式及未来发展[J].情报探索.2006.02 .1005-8095
    [61]崔航,文继荣,李敏强.基于用户日志的查询扩展统计模型[J].软件学报.2003,14, No.9: 1593~1599.
    [62] Boris Gelfand, Marilyn Wulfekuhler, William F.Punch III. Automated Concept Extraction From Plain Text. http://garage.cps.msu.edu/papers/GARAGe98-07-02.ps.
    [63]甘丽新,王明文,张华伟.基于团的Markov网络信息检索模型[J].2005年第四届全国搜索引擎和网上信息挖掘学术研讨会.山东大学学报(理学版)。2006,Vol.41,Supp.2.
    [64] Makoto Iwayama, Astushi Fujii, Noriko Kando,et. An Empirical Study on Retrieval Model for different Document Genres:Patents and Newspaper Article[Z]. Canada: ACM, 2003.
    [65] H. R. Turtle. Inference networks for document retrieval: [Ph.D thesis]. USA: Computer and information science department, University of Massachusetts, Amherst, Oct, 1990.
    [66] F. H. Long, H. G. Zhang and D. D. Feng, Fundamentals of Information Retrieval. http://www.microsoft.com/, 2003.
    [67] Mingwen WANG, Lixin GAN, Jiali ZUO, Huawei ZHANG. An Information Retrieval Model Based on Markov Concept Graph. Journal of Computational Information Systems3:2 (2007) 203-213.
    [68] Meihua YU, Mingwen WANG, Jiali ZUO, Xiaofang ZOU. Transferring Markov Network for Information Retrieval.JCAI 2009, Hainan Island, China

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700