Abstract
In recent years, a number of big data processing platforms, including Hadoop, Spark, and Storm, have emerged to meet the demands of the big data era, and Spark has become the most popular of them owing to its distinctive advantages. Although Spark has been widely adopted, it still suffers from serious performance problems, and many researchers are working to find effective ways to improve its performance. To address this problem, this paper takes the perspective of tuning the relevant configuration parameters: it analyzes and summarizes the significant impact of parameter optimization on the performance of the Spark platform, and surveys current Spark parameter optimization techniques in China and abroad. Finally, it summarizes the main open problems in Spark parameter optimization and proposes directions for future research.