Performance analysis of four methods for handling small files in Hadoop
  • English title: Performance analysis of four methods for handling small files in Hadoop
  • Authors: 李三淼; 李龙澍
  • Authors (English): LI Sanmiao; LI Longshu; School of Computer Science and Technology, Anhui University
  • Keywords: Hadoop; small file handling; Hadoop Distributed File System (HDFS); MapReduce; big data
  • Journal code: JSGG
  • Journal (English): Computer Engineering and Applications
  • Affiliation: School of Computer Science and Technology, Anhui University
  • Publication date: 2014-12-30 16:56
  • Publisher: Computer Engineering and Applications
  • Year: 2016
  • Issue: v.52; No.856
  • Funding: Natural Science Foundation of Anhui Province (No.1308085QF114); Provincial Natural Science Research Project of Anhui Higher Education Institutions (No.KJ2013A015)
  • Language: Chinese
  • Pages: JSGG201609009
  • Page count: 6
  • CN: 09
  • ISSN: 11-2127/TP
  • Classification number: 48-53
Abstract
Hadoop was designed to store and analyze big data, and it is most effective on large data sets. In practice, however, workloads often contain large numbers of small files. Four approaches are commonly used to process massive numbers of small files: the default input format TextInputFormat; CombineFileInputFormat, an input format designed for small files; SequenceFile; and Harballing (packing files into Hadoop Archive, HAR, files). To compare the performance of these four techniques on large numbers of small files in the same Hadoop distributed environment, the authors ran a word-count program on typical data sets. The experiments show that choosing the method suited to the requirements at hand can substantially improve the efficiency of processing large numbers of small files.
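As a rough illustration of why the input format matters for small files, the sketch below contrasts one input split per file (what the default input format produces when every file is smaller than a block) with greedy packing of files into splits up to a size limit, in the spirit of CombineFileInputFormat. It is a simplified model, not the paper's experiment or Hadoop's actual algorithm: the function names, file sizes, and split limits are hypothetical, and node and rack locality, which Hadoop also takes into account, are ignored.

```python
def splits_per_file(file_sizes):
    """Model the default behaviour: every small file becomes its own split."""
    return [[size] for size in file_sizes]


def combined_splits(file_sizes, max_split_bytes):
    """Greedily pack files into splits of at most max_split_bytes each.

    Simplified model of split combining; data locality is not considered.
    """
    splits, current, current_bytes = [], [], 0
    for size in file_sizes:
        # Start a new split when adding this file would exceed the limit.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    sizes = [64 * 1024] * 1000  # 1000 hypothetical files of 64 KiB each
    print("splits, one per file:", len(splits_per_file(sizes)))
    print("splits, combined up to 1 MiB:", len(combined_splits(sizes, 1024 * 1024)))
```

Because each split is handled by one map task, fewer splits means fewer tasks to schedule and start, which is the overhead that split combining aims to reduce.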
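References
[1] Mackey G, Sehrish S, Wang Jun. Improving metadata management for small files in HDFS[C]//Proc IEEE International Conference on Cluster Computing and Workshops, 2009.
[2] White T. Hadoop: the definitive guide[M]. 周敏奇, 王晓玲, 金澈清, et al, trans. 2nd ed. Beijing: Tsinghua University Press, 2011. (in Chinese)
[3] Shvachko K, Kuang H, Radia S, et al. The Hadoop distributed file system[EB/OL]. [2014-04-12]. http://storageconference.us/2010/Papers/MSST/Shvachko.pdf.
[4] Borthakur D. The Hadoop distributed file system: architecture and design[EB/OL]. [2014-04-12]. https://svn.apache.org/repos/asf/hadoop/common/tags/release-0.10.0/docs/hdfs_design.pdf.
[5] Shvachko K V. HDFS scalability: the limits to growth[EB/OL]. [2014-04-12]. https://www.usenix.org/legacy/publications/login/2010-04/openpdfs/shvachko.pdf.
[6] White T. The small files problem[EB/OL]. [2014-04-12]. http://www.cloudera.com/blog/2009/02/the-small-files-problem/.
[7] 高泽栋. Research and implementation of an optimized storage strategy for small files in HDFS[D]. Wuhan: Huazhong University of Science and Technology, 2013. (in Chinese)
[8] 李铁, 燕彩蓉, 黄永锋, et al. An optimization method for small file access in the Hadoop distributed file system[J]. Journal of Computer Applications, 2014(11). (in Chinese)
[9] 袁玉, 崔超远, 乌云, et al. Performance analysis of Hadoop small file processing on a single machine[J]. Computer Engineering and Applications, 2013, 49(3): 57-60. (in Chinese)
[10] 陆嘉恒. Hadoop in practice[M]. 2nd ed. Beijing: China Machine Press, 2012. (in Chinese)
[11] 龚高晟. Research and improvement of a general-purpose distributed file system[D]. Guangzhou: South China University of Technology, 2010. (in Chinese)
[12] Korat V G, Pamu K S. Reduction of data at namenode in HDFS using harballing technique[J]. International Journal of Advanced Research in Computer Engineering & Technology, 2012.
[13] Yahoo. Apache Hadoop module 4: MapReduce[EB/OL]. [2014-04-12]. https://developer.yahoo.com/hadoop/tutorial/module4.html.
[14] 张春明, 芮建武, 何婷婷. A method for storing and reading small files in Hadoop[J]. Computer Applications and Software, 2012(11). (in Chinese)
[15] 樊超, 凌捷. Research on techniques for improving file processing efficiency in Hadoop[J]. Microelectronics & Computer, 2014(7). (in Chinese)
[16] 余思, 桂小林, 黄汝维, et al. A scheme for improving the storage efficiency of small files in cloud storage[J]. Journal of Xi'an Jiaotong University, 2011(6). (in Chinese)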
