Research on Key Technologies of Data Deduplication for Cloud Environments
Abstract
With the arrival of the Big Data era, the volume of data in the information world is growing at an explosive rate, and the datasets that data centers must store and manage have reached petabyte and even exabyte scale. Studies show that massive datasets contain large amounts of redundant data, both in backup and archival storage tiers and in primary storage. Traditional data backup techniques and virtual machine image storage management further accelerate this duplication by storing redundant data over and over again. To restrain excessive data growth, improve IT resource utilization, and reduce system power consumption and management cost, data deduplication, an emerging data reduction technology, has become a research hotspot in both academia and industry.
As a key supporting technology for Big Data, cloud computing optimizes resource utilization through network computing and virtualization, providing users with inexpensive, efficient, and reliable computing and storage services. In cloud backup and virtual desktop cloud environments, which contain large amounts of redundant data, deduplication can greatly reduce storage space requirements and improve network bandwidth utilization, but it also raises challenges for system performance. This thesis discusses how to apply deduplication to optimize cloud backup services for the personal computing environment, distributed cloud backup storage systems in the data center, and cluster storage systems for the virtual desktop cloud, so as to improve IT resource utilization and system scalability while keeping the impact of deduplication on I/O performance low. Building on a thorough survey of current cloud computing technology, the thesis analyzes deduplication-based cloud backup, Big Data backup, and virtual desktop cloud applications, and proposes new system designs and algorithms. The main contributions and innovations are as follows:
(1) ALG-Dedupe, an application-aware local and global source deduplication scheme for cloud backup services in the personal computing environment. A statistical analysis of a large volume of personal application data shows, for the first time, that the amount of data shared among different types of applications is negligible. Guided by file semantics, application data are classified and an application-aware index structure is designed, so that deduplication proceeds independently and in parallel within each application class, with the data chunking strategy and fingerprinting function chosen adaptively according to the characteristics of each class. Because client-side local redundancy detection and cloud-side global redundancy detection are complementary in response latency and system overhead, the application-aware source deduplication is split into local deduplication on the client and global deduplication in the cloud to further improve the data reduction ratio and shorten deduplication time. Experiments show that ALG-Dedupe greatly improves deduplication efficiency while shortening the backup window, reducing cloud storage cost, and lowering the energy consumption and system overhead of personal computing devices.
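To make the two-tier, application-aware flow concrete, the minimal Python sketch below classifies a file by extension, picks a per-class chunking strategy and fingerprint function, and checks a client-side index before consulting a cloud-side index. The file-type groups, the hash choices, and the fixed-size stand-in for content-defined chunking are illustrative assumptions, not ALG-Dedupe's actual policy or code.

```python
# Minimal sketch of a two-tier (client-local + cloud-global), application-aware
# source deduplication flow in the spirit of ALG-Dedupe. The application
# classes, chunking choices and hash functions are illustrative assumptions.
import hashlib
import os

# Per-class policy: compressed media dedupe at whole-file granularity with a
# cheaper hash; editable documents use chunk-level dedup with SHA-1.
APP_CLASSES = {
    "compressed_media": {"exts": {".jpg", ".mp3", ".mp4", ".zip"},
                         "chunking": "whole-file", "hash": hashlib.md5},
    "documents":        {"exts": {".doc", ".pdf", ".txt", ".pptx"},
                         "chunking": "chunked", "hash": hashlib.sha1},
}
DEFAULT_CLASS = {"chunking": "chunked", "hash": hashlib.sha1}

def classify(path):
    """Map a file to an application class by its extension (file semantics)."""
    ext = os.path.splitext(path)[1].lower()
    for name, cls in APP_CLASSES.items():
        if ext in cls["exts"]:
            return name, cls
    return "other", DEFAULT_CLASS

def chunk(data, strategy, avg_size=8 * 1024):
    """Fixed-size split stands in for content-defined chunking in this sketch."""
    if strategy == "whole-file":
        return [data]
    return [data[i:i + avg_size] for i in range(0, len(data), avg_size)]

class TwoTierDedupe:
    def __init__(self, cloud_index):
        self.local = {}           # client side: app class -> fingerprint set
        self.cloud = cloud_index  # cloud side:  app class -> fingerprint set

    def backup_file(self, path):
        """Return the number of bytes that actually need to be uploaded."""
        app, cls = classify(path)
        with open(path, "rb") as f:
            data = f.read()
        sent = 0
        for piece in chunk(data, cls["chunking"]):
            fp = cls["hash"](piece).hexdigest()
            local = self.local.setdefault(app, set())
            if fp in local:
                continue                      # duplicate detected on the client
            local.add(fp)
            remote = self.cloud.setdefault(app, set())
            if fp not in remote:              # global check in the data center
                remote.add(fp)
                sent += len(piece)            # only unique chunks are uploaded
        return sent
```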
(2) E-Dedupe, a scalable inline cluster deduplication method for Big Data backup in cloud data centers. Its novelty lies in exploiting both data locality and similarity in backup streams to optimize cluster deduplication. E-Dedupe combines inter-node super-chunk-level data routing with intra-node chunk-level deduplication, improving the data reduction ratio while preserving data locality within each node. By extending Broder's min-wise independent permutation theory, it applies handprinting for the first time to cluster deduplication to strengthen super-chunk similarity detection. Weighting similarity by the storage utilization of each node, a handprint-based stateful data routing algorithm assigns data from backup clients to the deduplication server nodes at super-chunk granularity. A similarity index built from the representative chunk fingerprints in super-chunk handprints, combined with a container management mechanism and a chunk fingerprint cache, optimizes fingerprint lookup performance. With source-side inline deduplication, backup clients avoid transferring the duplicate chunks of a super-chunk to the target routing node. Extensive experiments show that E-Dedupe achieves a high cluster-wide data reduction ratio while effectively reducing communication and memory overhead and keeping the load balanced across nodes.
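The routing step can be pictured with the short sketch below: a super-chunk's handprint is taken as its k smallest chunk fingerprints, resemblance to each node is estimated from the overlap with that node's similarity index, and the score is discounted by the node's storage utilization. The handprint size and the exact weighting formula are illustrative assumptions rather than E-Dedupe's published parameters.

```python
# Minimal sketch of handprint-based stateful super-chunk routing. The handprint
# size and the utilization-weighted score are illustrative assumptions.
import hashlib

HANDPRINT_SIZE = 8  # number of representative fingerprints per super-chunk

def handprint(chunks):
    """Handprint = the k numerically smallest chunk fingerprints (min-wise sampling)."""
    fps = sorted(hashlib.sha1(c).hexdigest() for c in chunks)
    return set(fps[:HANDPRINT_SIZE])

class DedupeNode:
    def __init__(self, capacity_bytes):
        self.similarity_index = set()   # representative fingerprints stored on this node
        self.used_bytes = 0
        self.capacity_bytes = capacity_bytes

    def utilization(self):
        return self.used_bytes / self.capacity_bytes

def route_super_chunk(chunks, nodes):
    """Send the super-chunk to the node with the best utilization-weighted resemblance."""
    hp = handprint(chunks)
    best, best_score = None, -1.0
    for node in nodes:
        overlap = len(hp & node.similarity_index)     # estimated resemblance
        score = overlap * (1.0 - node.utilization())  # discount nearly full nodes
        if score > best_score:
            best, best_score = node, score
    best.similarity_index |= hp   # the target node learns the new handprint
    return best
```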
(3) A cluster-deduplication-based storage optimization technique for the virtual desktop cloud. To support a scalable virtual desktop cloud service, the virtual desktop server cluster must manage a large number of desktop virtual machines. By exploiting the semantic information of virtual machine image files, this thesis proposes, for the first time, a semantics-aware virtual machine scheduling algorithm to support a deduplication-based virtual desktop cluster storage system. In addition, a deduplication-based virtual desktop storage I/O optimization strategy is designed that combines server-side chunk caching with a local hybrid storage cache. Experimental analysis shows that the optimization effectively improves the space efficiency of virtual desktop storage, reduces the number of I/O operations issued to the storage system, and speeds up virtual desktop boot.
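The scheduling idea can be sketched as placing each desktop virtual machine on the server that already hosts the most images with the same semantic tag (for example, the same OS and software template), so that their image chunks deduplicate against a shared server-side chunk cache. The tagging scheme and capacity model below are illustrative assumptions, not the thesis's algorithm.

```python
# Minimal sketch of semantics-aware desktop VM placement: co-locate virtual
# machines whose images share an OS/software template so their image chunks
# deduplicate against the same server-side chunk cache.
from collections import Counter

class DesktopServer:
    def __init__(self, name, slots):
        self.name = name
        self.slots = slots        # maximum number of hosted desktop VMs
        self.tags = Counter()     # semantic tag -> number of hosted VMs

    def load(self):
        return sum(self.tags.values())

def place_vm(image_tag, servers):
    """Prefer the server already hosting the most VMs with the same image tag,
    breaking ties by the lighter overall load."""
    candidates = [s for s in servers if s.load() < s.slots]
    best = max(candidates, key=lambda s: (s.tags[image_tag], -s.load()))
    best.tags[image_tag] += 1
    return best

servers = [DesktopServer("s1", slots=50), DesktopServer("s2", slots=50)]
place_vm("win7-office", servers)   # lands on s1
place_vm("win7-office", servers)   # co-located with the first image on s1
place_vm("ubuntu-dev", servers)    # different template, balanced onto s2
```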
Through the above research on key deduplication techniques for cloud environments, this thesis provides solid technical support for future research on cloud storage and cloud computing.
