A Spark on Yarn Based Telecommunication Big Data Processing Platform (Spark on Yarn模式的电信大数据处理平台)

Title: A Spark on Yarn Based Telecommunication Big Data Processing Platform
Authors: YANG Yu (杨玉); ZHANG Yuanxia (张远夏)
Affiliations: School of Physical Education and Health, Yulin Normal University; School of Computer Science and Engineering, Yulin Normal University
Keywords: cloud computing; telecommunication big data; MapReduce; Yarn framework; Spark in-memory computing
Journal: Journal of Fujian Computer (福建电脑), journal code FJDN
Publication date: 2019-03-25
Publisher: Fujian Computer (福建电脑)
Year: 2019
Volume: 35
Issue: 03
Pages: 38-42
Page count: 5
CN: 35-1115/TP
Record ID: FJDN201903009
Language: Chinese

Abstract

To improve the performance of telecommunication big data processing, this paper presents SY-TPP (Spark on Yarn Telecommunication Big Data Processing Platform), a telecommunication big data processing platform that runs Spark on Yarn. SY-TPP is built on the Yarn specification of Hadoop 2.0 and uses the Spark distributed in-memory computing framework, so that datasets are processed in memory as far as possible. The programming steps for SY-TPP are illustrated with a hierarchical clustering algorithm as a case study. Test results show that GB-scale user data from a telecommunication operator can be processed within half a working day, and that on 32 physical nodes the speedup rises from 9.5 for an identically configured MapReduce platform to 10.25 for SY-TPP.
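
The speedup figures in the abstract can be put in perspective with a small calculation. The sketch below assumes the conventional definition of speedup (run time on one node divided by run time on p nodes) and derives parallel efficiency as speedup divided by node count. The abstract does not state which baseline the authors measured against, so the definition and the function names here are illustrative assumptions rather than the paper's method.

```python
# Illustrative only: the abstract reports speedups of 9.5 (MapReduce) and
# 10.25 (SY-TPP) on 32 physical nodes. The speedup definition used here
# (T_1 / T_p) is an assumption; the abstract does not spell it out.

NODES = 32
MAPREDUCE_SPEEDUP = 9.5
SY_TPP_SPEEDUP = 10.25


def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Return speedup per node, i.e. the fraction of ideal linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


def relative_gain(baseline: float, improved: float) -> float:
    """Return the fractional improvement of `improved` over `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


if __name__ == "__main__":
    mr_eff = parallel_efficiency(MAPREDUCE_SPEEDUP, NODES)
    sy_eff = parallel_efficiency(SY_TPP_SPEEDUP, NODES)
    gain = relative_gain(MAPREDUCE_SPEEDUP, SY_TPP_SPEEDUP)
    print(f"MapReduce efficiency: {mr_eff:.1%}")
    print(f"SY-TPP efficiency:    {sy_eff:.1%}")
    print(f"Relative speedup gain: {gain:.1%}")
```

Under this assumption, the reported figures correspond to roughly 30% and 32% parallel efficiency on 32 nodes, and a relative speedup gain of about 7.9% for SY-TPP over MapReduce.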
