Research on access optimization of small files in massive sample data sets (海量样本数据集中小文件的存取优化研究)

English title: Research on access optimization of small files in massive sample data sets
Authors: MA Zhen (马振); HALIDAN Abudureyimu (哈力旦·阿布都热依木); LI Xitong (李希彤)
Affiliation: School of Electrical Engineering, Xinjiang University
Keywords: Hadoop Distributed File System (HDFS); small file; sample data set; cache prefetch; distributed database; HBase
Journal code: JSGG
Journal: Computer Engineering and Applications (计算机工程与应用)
Publication date: 2018-11-15
Publisher: Computer Engineering and Applications
Year: 2018
Issue: v.54; No.917
Funding: Natural Science Foundation of Xinjiang Uygur Autonomous Region (No.2016D01C048)
Language: Chinese
Pages: JSGG201822013
Page count: 6
CN: 22
Classification number: 85-89+103

Abstract

The Hadoop Distributed File System (HDFS) suffers from high memory usage and low read efficiency when storing massive sample data sets, and the distributed database HBase develops access hotspots when the stored file names are highly repetitive and similar. Based on the characteristics and types of sample data sets, the authors propose an access optimization scheme that improves the write, read, add, delete, and replace strategies for the small files in such data sets. The scheme measures the boundary between large and small files according to the hardware configuration, then uses a variable-scale stack algorithm to merge small files and store them in HDFS following the directory structure of the sample data set. It stores the file index in an HBase table using a row-key optimization strategy, and builds a prefetching mechanism on the Ehcache caching framework. Experimental results show that the scheme reduces memory consumption on the master node, improves file read efficiency, and achieves efficient access to small files in massive sample data sets.
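The abstract does not give the details of the merging algorithm or the row-key strategy, so the following is only a minimal Python sketch of the general idea, not the authors' method. It packs small files into one blob, keeps an (offset, length) index per file, and prefixes each index key with a hash-derived bucket so that similar file names spread across key ranges. The names `SMALL_FILE_LIMIT`, `salted_row_key`, `pack_small_files`, and `read_small_file` are hypothetical, the threshold value is arbitrary, and plain in-memory objects stand in for HDFS and HBase.

```python
import hashlib

# Hypothetical threshold in bytes; the paper measures this boundary
# from the hardware configuration rather than fixing it.
SMALL_FILE_LIMIT = 1024


def salted_row_key(path, buckets=16):
    """Prefix the path with a hash-derived bucket so similar names spread out."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}|{path}"


def pack_small_files(files):
    """Concatenate small files into one blob and index them by salted key.

    `files` maps a path to its bytes. Returns (blob, index), where index
    maps a salted row key to an (offset, length) pair inside the blob.
    """
    blob = bytearray()
    index = {}
    for path in sorted(files):  # sorted paths keep one directory's files adjacent
        data = files[path]
        if len(data) > SMALL_FILE_LIMIT:
            continue  # large files would be stored on their own
        index[salted_row_key(path)] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def read_small_file(blob, index, path):
    """Look up a file through the index and slice it out of the blob."""
    offset, length = index[salted_row_key(path)]
    return blob[offset:offset + length]
```

In this sketch a read costs one index lookup plus one slice of the merged blob, which is the property that lets a merged store keep a single large object instead of one metadata entry per small file.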

