History, Present, and Future of Big Data Management Systems

English title: History, Present, and Future of Big Data Management Systems
Authors: DU Xiao-Yong; LU Wei; ZHANG Feng
Affiliations: Key Laboratory of Data Engineering and Knowledge Engineering, MOE (Renmin University of China); School of Information, Renmin University of China
Keywords: big data management system; data storage; data model; modularity; loose coupling
Journal code: RJXB
Journal: Journal of Software
Publication date: 2018-11-21 09:52
Publisher: Journal of Software
Year: 2019
Issue: v.30
Funding: National Key Research and Development Program of China (2018YFB1004401); National Natural Science Foundation of China (61732014, 61502504, 61802412); Beijing Municipal Science and Technology Project (Z171100005117002)
Language: Chinese
Page: RJXB201901008
Page count: 15
CN: 01
ISSN: 11-2560/TP
Classification number: 130-144

Abstract

Big data management technology is shifting from software-centric to data-centric computing platforms. Traditional relational database management systems (RDBMS) cannot meet the needs of data-centric big data management, so designing a new kind of big data management system is urgent. This paper first reviews the history of data management technology. It then analyzes the current state of big data management systems in terms of storage, data models, computation modes, and query engines, and points out that current systems are modular and loosely coupled. Next, it introduces the data, system, and application characteristics that a big data management system should have, and notes that the technology is still evolving rapidly and has not yet matured. Finally, it predicts that future big data management systems should feature the coexistence of multiple data models, the fusion of multiple computation modes, elastic scalability, new-hardware-driven design, and self-adaptive tuning.
