Research on Web Page Filtering Technology Based on Concept Fusion
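Abstract
With the rapid growth of the Internet, the web plays an increasingly important role in everyday information exchange. The growing wealth of web pages makes information easier to obtain, but it also brings sensitive content that is harmful to society. To give users healthy and safe information, sensitive web pages have to be filtered. Web content security filtering analyzes page content intelligently and relies on efficient classification techniques to filter pages accurately. The mainstream techniques are text-based web page filtering (TBIF) and image-based web page filtering (IBIF). Web pages are usually multimodal: they contain images as well as the text that describes them. To filter pages accurately and completely, a design therefore has to fuse both modalities. In this fusion, the effectiveness of the page feature representation, the heterogeneity of the multimodal data, and the real-time requirement on decisions at filtering time all strongly affect filtering accuracy and speed. To improve both, this work studies three key problems in web content security filtering: web page representation, heterogeneous feature fusion, and high-performance content filtering. The main contributions are as follows:
     1) A web page filtering framework based on text and image concept fusion
     A web page usually contains two modalities, text and images. Representing a page with a single modality can only filter part of the sensitive information, so fusing text and images is one of the key techniques for improving filtering accuracy on multimodal pages. To address the heterogeneity between text and images during fusion, a web page filtering framework based on text and image concept fusion is proposed.
     2) A meaningful string extraction algorithm for the text and image concept space
     Accurate feature description is the foundation of web content filtering. Meaningful strings are new words and phrases that are used frequently on the web and carry a specific meaning; they can be used to improve the text description model. Current extraction methods generally score each candidate string on its own and ignore the correlation between strings. The heterogeneity between text and images in the fusion framework is another important factor when extracting meaningful strings. This work proposes a meaningful string extraction algorithm for the text and image concept space (Concept-based Meaningful Extraction, CME). It uses a clustering algorithm to extract sets of meaningful strings from the text and images of a page; by setting the same clustering parameter k for both, it forms a web page concept space in which text and images share a unified description. Experiments show that representing pages with concepts formed from the extracted meaningful string sets substantially improves the vector space model and yields high classification performance.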
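The shared-k idea can be illustrated with a minimal sketch. This is not the CME algorithm itself: it assumes each candidate item (a text string or an image patch) already has a numeric feature vector, clusters each modality with a plain k-means using the same k, and represents a page by its concept counts. The function names, the toy feature vectors, and the choice of k-means are assumptions made for illustration; the source only states that a clustering algorithm with a shared parameter k is used.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step).
    Centroids start at the first k points, so the result is deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def concept_histogram(item_ids, labels, k):
    """Count how many of a page's items fall into each of the k concepts."""
    hist = [0] * k
    for i in item_ids:
        hist[labels[i]] += 1
    return hist

K = 2  # same k for both modalities, as in the shared concept space
text_feats = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]  # hypothetical string features
image_feats = [(1.0,), (9.0,), (1.2,)]                         # hypothetical visual features
text_labels = kmeans(text_feats, K)
image_labels = kmeans(image_feats, K)

# A page that contains text items 0, 2, 3 and image item 1.
page_vector = (concept_histogram([0, 2, 3], text_labels, K)
               + concept_histogram([1], image_labels, K))
print(page_vector)
```

Because both modalities use the same k, the text part and the image part of `page_vector` have the same length, which is what makes a unified description possible in this sketch.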
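     3) A multiple feature concept fusion algorithm based on a Gaussian local multiple kernel weight model
     Feature fusion is essential to accurate and complete web content filtering, yet traditional fusion methods consider neither the latent correlation within features nor the heterogeneity between them. Building on general multiple kernel theory, a multiple feature concept fusion algorithm based on a Gaussian local multiple kernel weight model is proposed (Multiple Feature Concept fusion based on Gaussian Local Multiple Kernel, MLMKL). On top of the unified concept space description of text and images, it takes the local information of multiple features into account, uses a Gaussian model to approximate the data distribution and form a local weight model, and assigns different weights to the local features in each kernel space. MLMKL addresses both the heterogeneity problem in feature fusion and the lack of an effective local weight model in general multiple kernel methods. Compared with existing methods, MLMKL improves both the accuracy and the test speed of web page filtering.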
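A rough sketch of sample-dependent ("local") kernel weights, in the style of localized multiple kernel learning, may help to picture this. It is not the MLMKL formulation, which the abstract does not spell out. The sketch assumes one RBF kernel per feature view, a Gaussian gate per view centred on a hypothetical per-view mean, and a combined kernel of the form sum over m of g_m(x) k_m(x, z) g_m(z). All names, centres, and bandwidths below are illustrative assumptions.

```python
from math import exp

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rbf(a, b, gamma=1.0):
    """Standard RBF base kernel for one feature view."""
    return exp(-gamma * sqdist(a, b))

def gaussian_gates(views, centers, sigmas):
    """Local weight of each view for one sample (assumed Gaussian gating).
    A view whose features lie close to its centre gets a larger weight.
    Weights are normalised to sum to 1; samples extremely far from every
    centre would underflow to a zero total and are not handled here."""
    raw = [exp(-sqdist(v, c) / (2.0 * s * s)) for v, c, s in zip(views, centers, sigmas)]
    total = sum(raw)
    return [r / total for r in raw]

def local_kernel(x, z, centers, sigmas):
    """Combined kernel: sum over views of gate(x) * base_kernel(x, z) * gate(z)."""
    gx = gaussian_gates(x, centers, sigmas)
    gz = gaussian_gates(z, centers, sigmas)
    return sum(a * rbf(xv, zv) * b for a, xv, zv, b in zip(gx, x, z, gz))

# Two views per page: a text concept vector and an image concept vector (toy values).
centers = [(0.0, 0.0), (1.0,)]   # hypothetical per-view Gaussian means
sigmas = [1.0, 1.0]              # hypothetical per-view bandwidths
x = [(0.0, 0.0), (1.0,)]
z = [(3.0, 0.0), (1.0,)]

gates_x = gaussian_gates(x, centers, sigmas)
gates_z = gaussian_gates(z, centers, sigmas)
k_xz = local_kernel(x, z, centers, sigmas)
print(gates_x, gates_z, k_xz)
```

Each term has the form g(x) k(x, z) g(z), so the sum stays a valid positive semidefinite kernel and can be passed to any kernel classifier that accepts a precomputed Gram matrix.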
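     4) An index filtering algorithm based on minimum enclosing circle region partitioning
     Statistical pattern classification is an effective approach to content filtering. It classifies small data sets accurately but cannot process massive data in real time. Indexing techniques were introduced to address this: they partition the data to build an index and so speed up queries. Existing methods do not consider the actual data distribution in content security filtering, and the index structures they build cannot meet the real-time requirement of filtering. Real network data is imbalanced, with many positive examples (normal information) and few negative examples (sensitive information). Taking this into account, an index filtering algorithm based on minimum enclosing circle region partitioning is proposed (Minimum Enclosing Circle Index Filtering, MECI). It borrows minimum enclosing circle theory from graphics to partition the data regions, generates the largest negative decision region, and builds F-tree, a high-performance index structure suited to content security filtering. F-tree makes the positive examples to be judged fall into the negative region with maximum probability, which speeds up decisions during content filtering.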
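The geometric idea can be sketched in two dimensions. The code below is not the MECI algorithm or the F-tree structure, neither of which is specified in the abstract. It computes the minimum enclosing circle of a few known sensitive samples with the standard incremental construction, and then, as an assumption made for this illustration, treats everything outside that circle as a region where a query can be decided as normal without running a classifier. Function names and sample points are hypothetical.

```python
from math import hypot

EPS = 1e-9  # tolerance for floating-point boundary checks

def inside(circle, p):
    """True if point p lies in or on the circle (cx, cy, r)."""
    cx, cy, r = circle
    return hypot(p[0] - cx, p[1] - cy) <= r + EPS

def from_two(a, b):
    """Circle with segment ab as its diameter."""
    cx, cy = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
    return (cx, cy, hypot(a[0] - cx, a[1] - cy))

def from_three(a, b, c):
    """Circumcircle of a, b, c; collinear input falls back to the widest pair."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < EPS:
        return max((from_two(a, b), from_two(a, c), from_two(b, c)), key=lambda t: t[2])
    sa, sb, sc = (a[0] ** 2 + a[1] ** 2, b[0] ** 2 + b[1] ** 2, c[0] ** 2 + c[1] ** 2)
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy, hypot(a[0] - ux, a[1] - uy))

def min_enclosing_circle(points):
    """Incremental construction: whenever a point falls outside the current
    circle, rebuild the circle with that point on its boundary."""
    circle = None
    for i, p in enumerate(points):
        if circle is None or not inside(circle, p):
            circle = (p[0], p[1], 0.0)
            for j, q in enumerate(points[:i]):
                if not inside(circle, q):
                    circle = from_two(p, q)
                    for r in points[:j]:
                        if not inside(circle, r):
                            circle = from_three(p, q, r)
    return circle

def fast_reject(circle, x):
    """Assumed shortcut: a query outside the circle around the sensitive
    samples is decided as normal without further classification."""
    return not inside(circle, x)

sensitive = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]  # toy 2-D sensitive samples
circle = min_enclosing_circle(sensitive)
print(circle, fast_reject(circle, (10.0, 10.0)), fast_reject(circle, (2.0, 1.0)))
```

With many normal pages and few sensitive ones, most queries would take the cheap `fast_reject` path in this sketch; real feature vectors are high-dimensional, so an enclosing-ball computation would replace the 2-D circle.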
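     Starting from an in-depth analysis of the shortcomings of existing web page fusion filtering techniques, this work proposes a web page filtering framework based on text and image concept fusion. By studying several key techniques, namely web page feature representation, feature fusion of multimodal information, and high-performance content filtering, it designs effective solutions that improve the accuracy and speed of web content filtering. This provides a sound technical basis for managing and controlling multimodal web pages and has broad application prospects.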
With the rapid development of information technology, the Internet plays an increasingly important role in daily communication. The growth of the web has been paralleled by a proliferation of harmful content, which makes filtering systems necessary to secure access to the Internet. Web filtering aims to prevent access to harmful web pages, and it generally depends on effective classification techniques that analyze page content intelligently. At present there are two mainstream technologies for web page content filtering: text-based (TBIF) and image-based (IBIF) filtering. In a real network environment, however, web pages usually contain both images and text, and techniques based on image filtering or text filtering alone cannot filter content well. We therefore focus on fusing text and image information and show that fusion improves filtering performance. Three factors are especially important for performance: the effectiveness of the features, the heterogeneity of the multimodal data, and the real-time requirement of filtering. This work accordingly addresses web page representation, heterogeneous feature fusion, and filtering speed in order to improve the performance of fusion-based web page filtering. The main contributions are as follows:
     1) Web page filtering framework based on text and image concept fusion
     Web pages on the Internet contain both text and images. Pages are usually represented with a single modality, but this can only filter part of the harmful information. The fusion of text and images is therefore a key technology for improving the accuracy of web page filtering. We propose a web page filtering framework based on text and image concept fusion to address the heterogeneity between text and images during fusion.
     2) Meaningful string extraction algorithm for the text and image concept space
     Accurate feature representation is the foundation of web page content filtering. Meaningful strings are specific new words and phrases that are used frequently on the Internet, and they can be applied to optimize the text description model. Existing meaningful string extraction methods do not consider the correlation between strings. The heterogeneity between text and images in the fusion framework is also an important factor when extracting meaningful strings. We therefore apply clustering to extract sets of meaningful strings and propose a meaningful string extraction algorithm for the text and image concept space. Our results show that representing web page content with concepts optimizes the vector space model and that the proposed method improves classification performance.
     3) Multiple feature concept fusion based on Gaussian local multiple kernel weights
     Feature fusion is a key step for accurate web page content filtering. Traditional feature fusion methods consider neither the potential correlation within features nor the heterogeneity between them. Building on multiple kernel learning theory, we propose a multiple feature concept fusion method based on Gaussian local multiple kernel weights (MLMKL). To take the local information of multiple features in the unified concept space into account, we use a Gaussian model to approximate the data distribution and obtain a local weight model, which then gives a different kernel weight to each local feature. MLMKL addresses the heterogeneity of feature fusion and, at the same time, the lack of an effective local weight model in general multiple kernel learning. The results show that MLMKL achieves better accuracy and test speed than existing methods.
     4) Index filtering algorithm based on the minimum enclosing circle
     Web filtering currently relies mostly on statistical pattern classification. These methods can achieve high classification accuracy, but they become inefficient on very large-scale data. Index technology has been proposed to solve this problem: it speeds up data queries by partitioning the data space efficiently. Traditional index construction methods do not take the imbalanced data distribution of the real Internet into account. We therefore propose a new index filtering algorithm based on minimum enclosing circle region partitioning (MECI). Given that real networks contain much more normal information than harmful information, we use the minimum enclosing circle to partition the data region and obtain the largest negative region. On this basis we build F-tree, a high-performance index for content security filtering. Because F-tree directs most queries for normal data into the negative region with the highest probability, it improves overall filtering performance.
     In this work, we proposed a concept fusion framework based on an analysis of existing feature fusion algorithms. We studied three topics: efficient representation of web page content, efficient fusion of multimodal information, and high-performance filtering. For these topics we proposed effective solutions that improve the accuracy and speed of web page filtering. Our research also provides a sound technical basis for managing and monitoring multimodal web page content.
