Research on Several Association Mining Problems in Web Information Retrieval
Abstract
Information explosion is a defining feature of today's information society. Information retrieval technology faces two serious challenges: information on the Internet is updated ever faster, and users demand ever more precise search results. Helping users find the information they need effectively has therefore become a key problem. On the one hand, retrieving pages purely by query words is not always the most effective approach. By mining the association relationships among web pages, a user who already knows one page containing the needed information can more easily reach other pages related to it. On the other hand, most users of web information retrieval systems are ordinary users who find it hard to turn a complex search goal into a few simple query words. Language also contains many synonyms, abbreviations, and associated words, and this inherent ambiguity means that the same query word can stand for different search needs, while the same need can be expressed in many different ways. Mining the association relationships among query words therefore helps users construct better queries and retrieve more useful information. Because Chinese web information retrieval still falls well short of the ideal, this dissertation studies in detail the association relationships among web pages and among Chinese words. Its main contributions are as follows:
     1. Taking the link relationships between web pages as the starting point, this dissertation proposes a new algorithm for mining associations among web pages. The algorithm introduces page segmentation into the mining of related pages for the first time, and it combines anchor text similarity with filtering of page template blocks to improve the precision of related page identification. Given the size of the page collections the algorithm must handle in practical engineering use, the dissertation also gives the concrete steps of a parallel implementation.
     2. Chinese text freely mixes words with their abbreviated forms, so identifying Chinese abbreviations and their synonymous full forms is an important problem in Chinese information retrieval. This dissertation proposes a novel algorithm that mines the correspondence between Chinese abbreviations and their full forms from the anchor text of web links. It first uses the longest common subsequence algorithm to obtain candidate abbreviation and full-form pairs from anchor text, and then uses a support vector machine to filter the candidates. Experiments show that the algorithm effectively mines the Chinese abbreviations and corresponding full forms hidden in anchor text, with relatively high precision.
     3. Mining the association relationships among Chinese words, and obtaining clusters of Chinese words that belong to the same topic, is valuable both for providing diverse search results and for constructing related queries in a Chinese web information retrieval system. Starting from the punctuation characteristics of Chinese, this dissertation proposes a novel algorithm that uses the parallel phrases within Chinese sentences to mine associations among Chinese words and cluster them. The algorithm relies on an approximation algorithm for mining dense subgraphs of a bipartite graph, which lets it cluster the parallel phrases of a massive Chinese corpus efficiently. To refine the clustering results further, the dissertation proposes two additional algorithms that improve the precision and recall of the word clusters and can mine a large number of associated Chinese words belonging to the same topic. Experiments show that the proposed algorithms achieve a high clustering success rate and clustering precision and have strong prospects for engineering application.
     4. Helping users construct queries that accurately express their search intent is also an important research direction in information retrieval. This dissertation proposes a composite algorithmic framework that recommends related queries based on the query entered by the user. On the one hand, it recommends queries according to their relevance, popularity, and effectiveness, helping users narrow their search intent in order to obtain more accurate results. On the other hand, it uses click information from query logs, the mined Chinese abbreviation and full-form pairs, Chinese same-topic word clusters, Chinese synonym pairs, and a Chinese language model to make reasonable modifications to the user's query, in order to obtain more results that satisfy the user's search intent. Experiments show that the framework effectively recommends related queries to users and helps improve the query performance of Chinese web information retrieval systems.
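The candidate extraction step of contribution 2 can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract states only that a longest common subsequence algorithm produces candidate pairs from anchor text and that a support vector machine then filters them. The function names, the rule that every character of the abbreviation must appear in order in the full form, and the assumption that the input anchor texts all point to the same target page are hypothetical choices made for illustration. The SVM filtering step is not shown.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_candidate_pair(short: str, full: str) -> bool:
    """Hypothetical candidate rule (an assumption, not from the source):
    `short` is strictly shorter than `full`, has at least two characters,
    and is entirely a subsequence of `full`."""
    return 1 < len(short) < len(full) and lcs_length(short, full) == len(short)


def candidate_pairs(anchor_texts):
    """Return (abbreviation, full form) candidates among anchor texts.

    Assumes the anchor texts were collected from links that point to the
    same target page; how anchors are grouped is not stated in the abstract.
    """
    texts = sorted(set(anchor_texts))
    return [(s, f) for s in texts for f in texts if is_candidate_pair(s, f)]


if __name__ == "__main__":
    anchors = ["北大", "北京大学", "清华大学"]
    print(candidate_pairs(anchors))
```

A rule this permissive over-generates on purpose: candidates are cheap to produce, and a later classifier, such as the support vector machine mentioned in the abstract, would be responsible for rejecting false pairs.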
