Visual Language Analysis: From Low Level Feature Representation to Semantic Distance Learning (视觉语言分析：从底层视觉特征表达到语义距离学习)

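English title: Visual Language Analysis: From Low Level Feature Representation to Semantic Distance Learning
Author: Lei Wu (吴磊)
Degree level: Doctoral
Discipline: Signal and Information Processing
Chinese keywords: 视觉语言模型 ; 视觉分析 ; 概念表达 ; 语义建模 ; 距离度量 ; 图像标注 ; 标签推荐
English keywords: Visual language model ; visual analysis ; concept representation ; semantic modeling ; distance metric learning ; image annotation ; tag recommendation
Degree year: 2010
Advisors: Nenghai Yu (俞能海) ; Mingjing Li (李明镜)
Discipline code: 081002
Degree-granting institution: University of Science and Technology of China (中国科学技术大学)
Thesis submission date: 2010-01-01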

Abstract

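With the growth of the Internet, web image resources are increasing rapidly, and with them a large number of research topics centered on web images, such as image annotation, image retrieval, clustering of image search results, duplicate image detection, image tag recommendation, image indexing, image classification, and object detection. All of these require solving one underlying problem: the representation and measurement of visual semantics. This has therefore become a basic, central problem in web image research and an important research direction in both academia and industry.
     At present, the representation and measurement of visual semantics comprises four basic problems: image representation, image similarity measurement, concept representation, and concept correlation measurement. Image representation refers to the features of an image and the way those features are organized. Image features can be used to generate visual words and visual phrases with a certain discriminative power. Image representations also take many forms: some ignore the relations between features, while others take into account spatial relations or co-occurrence frequencies within an image. An image distance measure is obtained by machine learning on top of a particular image representation; different choices of image features and representation may yield different distance measures. Likewise, concept representation refers to the features of a concept and the way they are organized. The features of a concept are a model built from the visual features extracted from a collection of images that contain the concept, usually in the form of the frequency, co-occurrence frequency, conditional distribution, or transition probability of visual words or phrases. Many concept modeling methods exist, such as the two-dimensional hidden Markov model and the conditional random field model. The later chapters of this thesis discuss in detail the visual language model and the semantic-preserving bag-of-words model that we propose, and compare them with other models. A concept distance measure is a measure of the correlation between concepts, built on a particular concept representation model or on textual semantic associations. Commonly used measures include the WordNet distance, the Google distance, and the Flickr distance that we propose.
     This thesis proposes a series of models and methods for the representation and measurement of visual semantics, with innovations both in low-level features and in high-level semantic models and distance measures. The published work covers all four of the challenges mentioned above and offers a useful exploration for research on the representation and measurement of visual semantics. Specifically, the results and contributions of this thesis are as follows: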
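     1. This thesis proposes the visual language model, which narrows the gap between semantic analysis in the visual domain and in the text domain. We assume that the local features of an image, like the words in a text, follow a certain grammatical order. By computing the conditional distribution of these local features over spatial positions to express that grammatical order, the visual semantics of an image can be approximately estimated. Because the model is similar in form to the natural language models used in text analysis, many text analysis methods extend to it easily. Experimental results show that the model performs comparably to many complex models while running far faster, so it is well suited to large-scale data.
     2. This thesis proposes the semantic-preserving bag-of-words model to address the semantic gap problem. We propose a measure of the semantic gap and minimize it by selecting the mapping space from visual features to visual words, so that the resulting dictionary has better discriminative power. Experiments also show that the dictionary generated by minimizing the semantic gap clearly outperforms other methods in image annotation.
     3. This thesis proposes probabilistic relevance component analysis (pRCA) to improve image similarity measurement. It expresses the side information between images in probabilistic form rather than the traditional binary (0 or 1) form, which improves the accuracy of image distance learning. Its application to web image annotation shows that the method is more efficient and more accurate than traditional distance learning methods.
     4. This thesis proposes a concept correlation measure based on visual features: the Flickr distance. It measures the degree of unrelatedness between two concepts. We assume that related concepts are more likely to appear together in the same image. Therefore, the difference between the visual language models of the images associated with each of the two concepts gives an effective measure of the unrelatedness of the concepts. Unlike text-based concept distance measures, the Flickr distance uses the image information associated with concepts and measures concept correlation from a visual perspective. It can significantly improve performance in multimedia-related applications. Compared with the manually built WordNet distance, the Flickr distance can be updated automatically to cover more new concepts; compared with the traditional Google distance, it exploits visual information, and experiments show that it agrees better with human cognition.
     5. This thesis extends traditional distance learning in linear space to distances in non-linear space by proposing a Bregman distance function learning method. Traditional Mahalanobis distance learning learns a distance matrix, and the resulting metric is uniform over the whole space, whereas the density of the sample distribution may vary across the space. Bregman distance learning can yield a metric that depends on the samples and takes local distribution characteristics into account, so it may be more accurate. Experiments show that the method handles distance learning in high-dimensional spaces better than other methods.
     6. This thesis extends traditional static distance to dynamic distance by proposing the QOSS subspace selection method. We assume that the viewpoint (the metric space) strongly affects the measured distance between samples. Therefore, when judging whether two samples are close, measuring in multiple subspaces is more accurate than measuring in a single subspace. We propose a strategy that automatically selects multiple subspaces according to the characteristics of the samples in order to measure sample similarity. In near-duplicate detection of web images, we found that detection precision improves significantly within no more than 5 iterations.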
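
The idea behind contribution 1 can be illustrated with a toy example. The sketch below is not the thesis's implementation. It assumes each image has already been quantized into a grid of visual-word ids, conditions each word only on its left neighbour (a bigram), and uses additive smoothing. The names `train_bigram_vlm` and `log_likelihood`, the smoothing constant, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def train_bigram_vlm(grids, vocab_size, alpha=1.0):
    """Estimate P(word | left neighbour) from grids of visual-word ids.

    grids: list of 2-D lists; each cell is a visual-word id in [0, vocab_size).
    alpha: additive smoothing constant (an assumption of this sketch).
    """
    pair_counts = Counter()
    context_counts = Counter()
    for grid in grids:
        for row in grid:
            for left, word in zip(row, row[1:]):
                pair_counts[(left, word)] += 1
                context_counts[left] += 1

    def prob(word, left):
        # Smoothed conditional probability of `word` given its left neighbour.
        return (pair_counts[(left, word)] + alpha) / (
            context_counts[left] + alpha * vocab_size
        )

    return prob


def log_likelihood(grid, prob):
    """Log-probability of a grid under a bigram model (row-wise pairs only)."""
    return sum(
        math.log(prob(word, left))
        for row in grid
        for left, word in zip(row, row[1:])
    )


# Toy training data: one "concept" alternates two visual words, the other is uniform.
stripes = [[[0, 1, 0, 1], [0, 1, 0, 1]]]
flat = [[[0, 0, 0, 0], [0, 0, 0, 0]]]
p_stripes = train_bigram_vlm(stripes, vocab_size=2)
p_flat = train_bigram_vlm(flat, vocab_size=2)

# Classify a query grid by the concept model that gives it the higher likelihood.
query = [[1, 0, 1, 0]]
scores = {
    "stripes": log_likelihood(query, p_stripes),
    "flat": log_likelihood(query, p_flat),
}
best = max(scores, key=scores.get)
print(best)  # prints "stripes"
```

Training only counts neighbouring pairs, which is consistent with the abstract's claim that the model is fast enough for large-scale data. One natural extension, borrowed from text n-gram models, would be to condition on the neighbour above as well and to back off to lower-order statistics.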

Abstract (original English version)

With the development of the Internet, Web image resources have proliferated. Many research problems have arisen around Web images, such as image annotation, image retrieval, search result clustering, near-duplicate detection, image tag recommendation, image indexing, image classification, and object detection. All of these research topics must deal with one intrinsic and fundamental problem: visual semantic representation and measurement. This problem has therefore become an active research direction in both academia and industry.
     Currently, visual semantic representation and measurement can be divided into four elements: image representation, image similarity measurement, concept representation, and concept correlation measurement. Image representation refers to the image features and their arrangement; both the feature types and the arrangements can vary widely. Image similarity measurement is built on the image representation and can be obtained by machine learning; it may differ considerably depending on the image features and representation chosen. Concept representation refers to the features of a concept and their arrangement. The features of a concept can be generated from a collection of images related to the concept. There are many successful concept modeling methods, such as the 2D hidden Markov model and conditional random fields. Compared with these complex models, the visual language model we propose is simple and effective. We also propose the semantic-preserving bag-of-words model to help solve the semantic gap problem. The correlation between concepts is based on the concept models. To our knowledge, existing concept distance measures include the WordNet distance and the Google distance. These distances are based on human labor or text information, whereas our proposed Flickr distance measures concept distance from visual information.
     This thesis proposes a series of models and methods for visual semantic representation and measurement. The contributions concern not only low-level features and representation but also high-level models and measurements. The published papers cover all four elements of the research problem and offer an exploration of visual semantic representation and measurement. The concrete contributions are as follows:
     1. We propose the visual language model (VLM) to bridge the gap between text analysis and visual analysis. We believe that local visual features follow a certain grammar, similar to the words in text documents. By analyzing the local features, we can estimate the semantics of an image. Since this model is similar to the language model in text analysis, many techniques from text analysis can be extended to it. Experimental results show that the model is effective and much faster than other, more complex models.
     2. We propose the semantic-preserving bag-of-words model to handle the semantic gap problem. We propose a novel measure of the semantic gap and look for the mapping space from visual features to visual words that minimizes it. In this way, we can learn a dictionary with better discrimination. Experiments show that the optimal dictionary can significantly improve the performance of image annotation.
     3. We propose the probabilistic relevance component analysis method (pRCA) to help improve image similarity measurement. pRCA represents the side information between images in a probabilistic form rather than a binary form, which improves distance learning. Experiments on Web image annotation show that the method is more accurate and more efficient than other distance metric learning methods.
     4. We propose a concept distance measure based on visual information, called the Flickr distance. We believe that correlated concepts have a better chance of appearing in the same image. We can therefore measure the distance between concepts by the difference between their visual language models. Unlike text-based concept distance measures, the Flickr distance uses the visual information related to the concept. It can be effective for multimedia-related tasks and is more consistent with human cognition.
     5. We extend traditional linear distance metric learning to non-linear distance function learning by proposing Bregman distance function learning. Traditional Mahalanobis distance learning aims to learn a distance matrix, which is static over the whole sample space. Because the density of samples varies across the space, it makes sense to take local information about the sample distribution into account by adopting a non-linear distance function. Experiments show that the proposed Bregman distance can better handle distance learning problems in high-dimensional spaces.
     6. We extend static distance measurement to dynamic distance measurement by proposing the QOSS subspace shifting method. We believe that distances can differ considerably between metric spaces. To judge whether two samples are similar, it is therefore better to measure their distance in multiple spaces rather than in a single space. The proposed method automatically chooses subspaces for distance measurement. Experiments on Web image near-duplicate detection show that our method can converge in fewer than 5 iterations and that detection precision improves significantly.
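
Contribution 5 contrasts a single global Mahalanobis matrix with a Bregman distance function. As a reminder of the standard definition (not the thesis's learning algorithm), the Bregman divergence generated by a strictly convex function phi is d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>. The sketch below only evaluates this definition for two textbook choices of phi, to show how the second one makes the distance depend on where the points lie. All function names are hypothetical.

```python
import math


def bregman(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


# phi(x) = ||x||^2 recovers the squared Euclidean distance, i.e. a
# Mahalanobis distance with the identity matrix: uniform over the space.
def sq_norm(x):
    return sum(v * v for v in x)


def sq_norm_grad(x):
    return [2.0 * v for v in x]


# phi(x) = sum_i x_i log x_i (negative entropy) yields the generalized
# KL divergence, whose value depends on the location of the points.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


d_euclid = bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 5.0])

# The same offset (+0.1, -0.1) applied at two different locations.
d_kl_centre = bregman(neg_entropy, neg_entropy_grad, [0.6, 0.4], [0.5, 0.5])
d_kl_edge = bregman(neg_entropy, neg_entropy_grad, [0.2, 0.8], [0.1, 0.9])

print(d_euclid)                 # squared Euclidean distance between the two points
print(d_kl_centre < d_kl_edge)  # the same offset costs more near the boundary
```

Under the squared-norm generator both offsets would have the same length, whereas the entropy generator assigns them different distances. Learning the generator from side information is the part the thesis addresses and is not attempted here.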
