Task scheduling strategy based on data stream classification in Heron



English title: Task scheduling strategy based on data stream classification in Heron
Authors: ZHANG Yitian; YU Jiong; LU Liang; LI Ziyang
Affiliations: Software College, Xinjiang University; College of Information Science and Engineering, Xinjiang University; School of Computer Science and Technology, Civil Aviation University of China
Keywords: big data; stream computing; Apache Heron; task scheduling; data stream classification; communication overhead
Journal (Chinese abbreviation): JSJY
Journal (English): Journal of Computer Applications
Publication date: 2018-11-27 16:48
Publisher: Journal of Computer Applications
Year: 2019
Volume and issue: v.39; No.344
Funding: National Natural Science Foundation of China (61462079, 61562078, 61562086); National Key Technology R&D Program of China (2015BAH02F01); Natural Science Foundation of Xinjiang Uygur Autonomous Region (2017D01A20); Scientific Research Program of Higher Education Institutions of Xinjiang Uygur Autonomous Region (XJEDU2016S106)
Language: Chinese
Record ID: JSJY201904028
Page count: 11
Issue: 04
CN: 51-1307/TP
Pages: 178-188

Abstract

Apache Heron, a recent big-data stream processing framework, schedules tasks with a round-robin algorithm by default, which ignores both the runtime state of the topology and the effect that different communication modes between task instances have on system performance. To address this, a task scheduling strategy based on data stream classification in Heron (DSC-Heron) is proposed, comprising a data stream classification algorithm, a data stream cluster allocation algorithm, and a data stream classification scheduling algorithm. First, a Heron job model is established to make explicit how communication overhead differs across the communication modes between task instances. Second, based on the data stream classification model, data streams are classified by the real-time size of the stream between task instances. Finally, interrelated high-frequency data streams are treated as a whole as the basic scheduling unit when constructing the packing plan, so that, subject to resource constraints, as much inter-node communication as possible is converted into intra-node communication and the system's communication overhead is minimized. SentenceWordCount, WordCount, and FileWordCount topologies were run on a 9-node Heron cluster. Compared with Heron's default scheduling strategy, DSC-Heron improved system complete latency, inter-node communication overhead, and system throughput by 8.35%, 7.07%, and 6.83% on average, respectively; for load balancing, the standard deviations of worker-node CPU usage and memory usage fell by 41.44% and 41.23% on average, respectively. The results indicate that DSC-Heron improves the runtime performance of the tested topologies, with the most pronounced gain on FileWordCount, the topology closest to a real application scenario.
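
The abstract only outlines the three stages of DSC-Heron (classify streams by rate, group related high-frequency streams, place each group on one node under resource limits); this record does not include the paper's actual algorithms. The following is a minimal Python sketch of that general idea, not the authors' implementation. The function names, the fixed rate threshold, the per-node instance capacity, and the toy topology are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_streams(rates, threshold):
    """Split streams into high- and low-frequency sets (threshold is assumed)."""
    high = {pair for pair, rate in rates.items() if rate >= threshold}
    low = set(rates) - high
    return high, low


def build_clusters(instances, high_streams):
    """Group instances linked by high-frequency streams using union-find."""
    parent = {i: i for i in instances}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in high_streams:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in instances:
        groups[find(i)].append(i)
    # Largest clusters first, so they are placed while nodes are still empty.
    return sorted(groups.values(), key=len, reverse=True)


def assign_clusters(clusters, nodes, capacity):
    """Place each cluster whole on the least-loaded node that can hold it.

    A cluster that fits nowhere is split and placed one instance at a time.
    """
    load = {n: 0 for n in nodes}
    placement = {}
    for cluster in clusters:
        fits = [n for n in nodes if capacity - load[n] >= len(cluster)]
        chunks = [cluster] if fits else [[i] for i in cluster]
        for members in chunks:
            candidates = [n for n in nodes if capacity - load[n] >= len(members)]
            if not candidates:
                raise ValueError("not enough capacity for all instances")
            node = min(candidates, key=lambda n: load[n])
            for inst in members:
                placement[inst] = node
            load[node] += len(members)
    return placement


def round_robin(instances, nodes):
    """Baseline: assign instances to nodes in turn, ignoring traffic."""
    return {inst: nodes[i % len(nodes)] for i, inst in enumerate(instances)}


def inter_node_traffic(rates, placement):
    """Sum the rates of streams whose endpoints sit on different nodes."""
    return sum(r for (a, b), r in rates.items() if placement[a] != placement[b])


# Toy topology (hypothetical): two spout -> split -> count pipelines.
instances = ["s1", "b1", "c1", "s2", "b2", "c2"]
rates = {
    ("s1", "b1"): 100, ("s2", "b2"): 100,
    ("b1", "c1"): 80, ("b2", "c2"): 80,
    ("b1", "c2"): 5, ("b2", "c1"): 5,
}
nodes = ["n1", "n2"]

high, _low = classify_streams(rates, threshold=50)
clustered = assign_clusters(build_clusters(instances, high), nodes, capacity=3)
baseline = round_robin(instances, nodes)

print("round-robin inter-node traffic:", inter_node_traffic(rates, baseline))
print("clustered inter-node traffic:  ", inter_node_traffic(rates, clustered))
```

In this toy case the clustered placement keeps each pipeline on a single node, so only the two low-rate cross streams travel between nodes. A real scheduler would also need to account for CPU and memory demands per instance and for rates that change at runtime, which this sketch leaves out.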

