The Research on a Few Key Issues in High Dimensional Data Mining (高维数据挖掘中若干关键问题的研究)

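English title: The Research on a Few Key Issues in High Dimensional Data Mining
Author: Yang Fengzhao (杨风召)
Thesis level: Doctoral
Discipline: Computer Software and Theory
Chinese keywords: data mining; high-dimensional data; similarity measure function; similarity search; collaborative filtering; cluster analysis; outlier detection
English keywords: data mining; high dimensional data; proximity measurement function; collaborative filtering; cluster analysis; outlier detection
Degree year: 2003
Advisor: Zhu Yangyong (朱扬勇)
Discipline code: 081202
Degree-granting institution: Fudan University (复旦大学)
Submission date: 2003-04-25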

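Abstract

Data mining is the technology of extracting implicit, previously unknown, and potentially useful knowledge from large volumes of data, and it is currently one of the most active international research directions in databases and information-based decision making. High-dimensional data arise frequently in practice: transaction data, document term-frequency data, user rating data, Web usage data, multimedia data, and so on. Because such data are so widespread, research on high-dimensional data mining is of great importance. The "curse of dimensionality", however, makes mining such data exceptionally difficult, and special techniques are required.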
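     As dimensionality grows, the performance of high-dimensional index structures degrades rapidly. In low-dimensional spaces we usually take the L_p distance as the measure of similarity between data points, but in high-dimensional spaces this notion of similarity often ceases to exist. This poses a severe challenge for high-dimensional data mining: on the one hand, mining algorithms built on index structures lose performance; on the other, many mining methods based on full-space distance functions stop working altogether. There are several remedies. The data can be reduced from high to low dimensionality and then handled with low-dimensional techniques. Loss of efficiency can be countered with more effective index structures, incremental algorithms, and parallel algorithms. Methods that fail outright can be revived by redefining the concepts they rely on.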
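     This thesis studies similarity search, high-dimensional clustering, high-dimensional outlier detection, and collaborative filtering in e-commerce. It identifies the effects of high dimensionality on each of these areas and proposes methods to address them, which are of both theoretical and practical value.
     The main contributions are as follows: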
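     (1) Based on an analysis of the characteristics of high-dimensional data, a new similarity measure function, Hsim(), is proposed. It avoids the loss of discriminating power in high-dimensional spaces and brings the similarity computation for numerical and binary data into a single unified framework. Hsim() is also compared with other similarity functions.
     (2) Drawing on the characteristics of quantitative transaction data, a new similarity search method for such data is proposed. The algorithm is built on a structure called a signature table, prunes a large share of the data, and can greatly speed up similarity search.
     (3) A new collaborative filtering algorithm based on user rating data is proposed; experiments show that it improves recommendation efficiency and also yields some gain in recommendation accuracy.
     (4) Algorithms for clustering high-dimensional data are analyzed, and a framework for high-dimensional clustering based on object similarity is proposed.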
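     (5) The effect of high dimensionality on outlier detection algorithms is analyzed, and the concept of projected outlier detection is introduced. IncLOF, an incremental algorithm for mining local outliers in dynamic environments, is proposed and compared experimentally with the LOF algorithm. The results show that in dynamic, high-dimensional settings where high-dimensional index structures fail, IncLOF greatly improves the efficiency of local outlier mining.
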
English abstract

Data mining refers to extracting implicit, previously unknown, and potentially useful knowledge from large amounts of data. It is one of the frontier research areas in databases and decision support systems (DSS). High dimensional data are frequently encountered in data mining applications, for example transaction data, term-frequency data, rating data, Web usage data, and multimedia data. Because such data are so common, research on high dimensional data mining is very important. However, the curse of dimensionality makes mining high dimensional data extraordinarily difficult, so special techniques are needed.
    The performance of similarity indexing structures degrades rapidly as dimensionality increases. In low dimensional spaces we often use the Lp-norm to measure the proximity between two points, but in high dimensional spaces this notion of proximity is in many cases no longer meaningful. This poses two challenges for high dimensional data mining: the performance of index-based mining algorithms degrades, and many distance-based and density-based algorithms may become ineffective. These problems can be addressed in the following ways: 1) Map the data from the high dimensional space to a lower dimensional space by dimensionality reduction, then process them as low dimensional data. 2) Improve the performance of mining algorithms by designing more effective indexing structures and by adopting incremental and parallel algorithms. 3) Redefine the affected concepts so that they are meaningful in high dimensional domains.
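    The loss of meaning of Lp proximity can be illustrated with a small experiment. The sketch below is not taken from the thesis: it assumes points drawn uniformly from the unit hypercube, and the function name `relative_contrast` and its parameters are chosen here for illustration only. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance; under these assumptions the ratio shrinks as the dimension grows.

```python
import random


def relative_contrast(dim, n_points=200, p=2.0, seed=0):
    """Return (d_max - d_min) / d_min for Lp distances from one random query.

    Assumption for illustration: the query and all data points are drawn
    independently and uniformly from the unit hypercube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Lp distance between the query and this point.
        total = sum(abs(a - b) ** p for a, b in zip(query, point))
        dists.append(total ** (1.0 / p))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the contrast for a few dimensions to see the trend.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

    When the contrast is close to zero, the nearest and farthest neighbors are almost equally distant, which is why nearest-neighbor queries and distance-based mining methods lose discriminating power.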
    This thesis studies similarity search, cluster analysis, and outlier detection in high dimensional data mining, as well as collaborative filtering in e-commerce. We point out the effects of high dimensionality on these domains and present methods to address them. The work has both theoretical and practical significance.
    Our main contributions are summarized as follows:
    (1) By analyzing the characteristics of high dimensional data, we present a new function, Hsim(), to measure the proximity of objects in high dimensional spaces. The function avoids the loss of distance contrast that the Lp-norm suffers in high dimensional space, and it handles both binary and numerical data. We also compare Hsim() with other similarity functions.
    (2) Based on the characteristics of quantitative transaction data, we present a new similarity search method for such data that is built on a signature table. Experiments demonstrate that the method prunes very effectively during similarity search on quantitative transaction data and can therefore greatly speed up the search.
    (3) We put forward a new algorithm for ratings-based collaborative filtering. Preliminary experiments on a number of real data sets show that the new method improves both the scalability and the quality of collaborative filtering.
    (4) We analyze algorithms for clustering high dimensional data and present a framework for similarity-based cluster analysis of high dimensional data.
    (5) We analyze the effects of high dimensionality on outlier detection algorithms and introduce the concept of projected outlier detection. The first incremental outlier detection algorithm, IncLOF, is presented and compared with the LOF algorithm. A study on synthetic data sets demonstrates that the runtime of IncLOF is much lower than that of LOF in dynamic, high dimensional environments where high dimensional indexing structures are ineffective.
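
    Item (5) uses LOF as its baseline. For reference, the published LOF definition (Breunig, Kriegel, Ng, and Sander, 2000) can be sketched as below. This is not IncLOF, whose incremental update rules are not described in this abstract. The function name `lof_scores` and the brute-force distance matrix are assumptions made for illustration, and the sketch assumes the data contain fewer than k exact duplicates of any point so that no reachability sum is zero.

```python
import math


def lof_scores(points, k):
    """Brute-force LOF scores for a small list of equal-length numeric tuples."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")
    # Full pairwise Euclidean distance matrix (quadratic; illustration only).
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []     # distance from each point to its k-th nearest neighbor
    neighbors = []  # indices within that distance (ties are included)
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[k - 1]
        k_dist.append(kd)
        neighbors.append([j for j in range(n) if j != i and dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of each neighbor's density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]
```

    Scores near 1 indicate a point whose density matches that of its neighbors; markedly larger scores indicate local outliers. Every call recomputes all neighborhoods from scratch, which is the cost an incremental method aims to avoid when the data set changes.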

