In the paper, “The Effectiveness of Deduplication on Virtual Machine Disk Images”, the authors perform an in-depth analysis of several factors that may or may not impact the level of deduplication of virtual machine images.
So, what exactly is deduplication?
The main idea is to leverage data commonality in a storage system by identifying duplicate “chunks” of data across multiple files and storing only one copy of each chunk.
How do you do that?
The idea is to compute a digest (such as SHA-1) of each data chunk composing a file and check whether a chunk with that digest is already present in the chunk store. The chunk is stored only if it is not already there. Either way, a pointer to the stored chunk is added to the metadata that describes how to reconstruct the original file.
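To make the digest-and-lookup step concrete, here is a minimal sketch of a chunk store. The `ChunkStore` class, its method names, and the in-memory dictionary are all hypothetical, and the 4 KB fixed chunk size is only assumed so that there is something to hash. A real system would persist the chunks and the per-file metadata.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, for illustration only


class ChunkStore:
    """Toy in-memory chunk store keyed by SHA-1 digest (hypothetical)."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def put_file(self, data):
        """Store a file and return its recipe: the ordered list of chunk digests."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha1(chunk).hexdigest()
            # Store the chunk only if it is not already present.
            if digest not in self.chunks:
                self.chunks[digest] = chunk
            # The recipe entry is the "pointer" to the stored chunk.
            recipe.append(digest)
        return recipe

    def get_file(self, recipe):
        """Rebuild a file by concatenating the chunks its recipe points to."""
        return b"".join(self.chunks[digest] for digest in recipe)


store = ChunkStore()
image_a = b"A" * 4096 + b"B" * 4096
image_b = b"A" * 4096 + b"C" * 4096  # shares its first chunk with image_a
recipe_a = store.put_file(image_a)
recipe_b = store.put_file(image_b)
print(len(store.chunks))  # 3 chunks stored for 4 chunks written
```

The two files together contain four chunks, but the shared first chunk is stored once, and each file can still be rebuilt from its own recipe.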
How do you divide a file into chunks?
There are two main techniques for chunking: variable-size chunking and fixed-size chunking. Fixed-size chunking is straightforward: you define a chunk size (such as 4 KB) and divide the file into equal chunks of that size. However, if data is inserted into or removed from the file, every chunk boundary after the modification shifts, so those chunks no longer match the copies already in the store. Variable-size chunking is resistant to such modifications because chunk boundaries are derived from the content itself, which means the chunks can have different sizes. A well-known technique for variable-size chunking is to compute a Rabin fingerprint over a sliding window of the file stream and place a chunk boundary wherever the fingerprint matches a chosen pattern.
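The difference between the two techniques shows up when a single byte is inserted at the front of a file. The sketch below is only an illustration: the function names are made up, the sizes are tiny, and a rolling sum of the last few bytes stands in for a real Rabin fingerprint, which is a polynomial hash and considerably more robust.

```python
import hashlib
import random


def fixed_chunks(data, size=64):
    """Cut data into equal-sized chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0x3F):
    """Cut a chunk wherever the sum of the last `window` bytes has its low bits zero.

    The rolling byte sum is a simplified stand-in for a Rabin fingerprint.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        if i >= window - 1 and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared(before, after):
    """Count chunks of `before` whose digest also appears in `after`."""
    seen = {hashlib.sha1(c).digest() for c in after}
    return sum(hashlib.sha1(c).digest() in seen for c in before)


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = b"X" + original  # one byte inserted at the front

print("fixed-size:", shared(fixed_chunks(original), fixed_chunks(modified)))
print("content-defined:", shared(content_defined_chunks(original),
                                 content_defined_chunks(modified)))
```

After the insertion every fixed-size boundary sits one byte later in the content, so on random data few if any fixed-size chunks should be found again. The content-defined boundaries depend only on the bytes inside the window, so every chunk after the first one is cut at the same place and deduplicates against the original.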
Why use deduplication on virtual machine images?
Virtualization technology is widely adopted in data centers and by cloud computing providers in order to better utilize physical resources and to provide isolation between different applications and users. A problem that arises is the amount of storage needed to keep multi-gigabyte VM disk images. Several studies have found that different VM images share a considerable amount of data between them, which suggests that deduplication may reduce the total amount of storage needed in VM hosting facilities.
Below are some interesting findings of the aforementioned paper on deduplication in the context of VM images:
- Deduplication can save 80% or more of storage space when the stored VM images are from the same operating system “lineage”, such as Ubuntu or Fedora.
- For mixed operating systems, the deduplication ratio is about 40%, which is still quite a considerable amount of space saved.
- Fixed-size chunking outperforms variable-size chunking for VM images, which is good news, since typically that’s easier to implement.
- Compression of chunks can further increase storage savings.
- Factors that have a major impact on deduplication effectiveness:
  - Base operating system (the more homogeneous the images, the higher the level of deduplication)
  - Chunk size (the smaller the chunk, the higher the deduplication level, but also the higher the overhead to reconstruct the original file)
- Factors that have little impact on deduplication effectiveness:
  - Package installation or language localization within the same operating system
  - Surprisingly, consecutive releases of a single OS deduplicate against each other about as well as releases that are further apart (the level is normally high in both cases)
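The chunk-size trade-off from the list above can be measured with a few lines of code. This is a rough sketch under stated assumptions: `dedup_stats` is a hypothetical helper, it uses fixed-size chunking only, and the two "images" are tiny synthetic byte strings built for the example, not data from the paper.

```python
import hashlib


def dedup_stats(images, chunk_size):
    """Return (space_savings, recipe_entries) for fixed-size chunking.

    space_savings is the fraction of bytes that need not be stored;
    recipe_entries is the number of chunk pointers kept in metadata.
    """
    unique = {}  # digest -> length of the stored chunk
    total_bytes = 0
    recipe_entries = 0
    for image in images:
        for offset in range(0, len(image), chunk_size):
            chunk = image[offset:offset + chunk_size]
            unique.setdefault(hashlib.sha1(chunk).digest(), len(chunk))
            total_bytes += len(chunk)
            recipe_entries += 1
    stored_bytes = sum(unique.values())
    return 1 - stored_bytes / total_bytes, recipe_entries


def block(value):
    """A 4 KB block filled with a single byte value (synthetic data)."""
    return bytes([value]) * 4096


# Two synthetic images that share three blocks and differ in the last one.
images = [
    block(1) + block(2) + block(3) + block(10),
    block(1) + block(2) + block(3) + block(11),
]

print(dedup_stats(images, 4096))  # (0.375, 8)
print(dedup_stats(images, 1024))  # (0.84375, 32)
```

On this artificial data the smaller chunk size stores fewer bytes, but it needs four times as many recipe entries, which is the reconstruction overhead mentioned in the list. Real images will behave far less neatly.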
What are the implications of this for my work?
Deduplication in the context of VM images is a great use case for content-addressable storage (CAS), which can serve as the storage backend for the chunk store needed to deduplicate VM images. Current CAS solutions are either based on costly hardware (such as disk arrays) or centralized. However, a centralized CAS architecture has limited capacity and will not scale as the amount of stored data grows.
Public and private cloud providers spend a massive amount of storage space to keep users’ VMs. Using a distributed content-addressable storage to store VMs has the following advantages (among others):
- Obvious scalability and elasticity
- Reduce storage demands for multi-gigabyte VM hosting
- Use the saved storage space for replicating data chunks, increasing availability and durability
- Parallel transfer of chunks from multiple servers, possibly in a BitTorrent fashion, which may speed up the transfer of a VM image to hosts
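As a rough sketch of what a distributed chunk store with parallel transfer could look like, the toy below spreads chunks over several in-memory "nodes" and fetches the chunks of an image concurrently. Everything here is an assumption made for illustration: the `DistributedCAS` class, the modulo placement rule, and the thread pool standing in for network transfers. Replication and node failures are left out.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class DistributedCAS:
    """Toy CAS that spreads chunks over in-memory 'nodes' (hypothetical)."""

    def __init__(self, node_count=4):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, digest):
        # Assumed placement rule: digest value modulo the number of nodes.
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, chunk):
        """Store a chunk on its node (once) and return its digest."""
        digest = hashlib.sha1(chunk).hexdigest()
        self._node_for(digest).setdefault(digest, chunk)
        return digest

    def get(self, digest):
        return self._node_for(digest)[digest]

    def fetch_image(self, recipe, workers=4):
        """Fetch all chunks of an image concurrently and reassemble them."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map yields results in the order of the recipe.
            return b"".join(pool.map(self.get, recipe))


cas = DistributedCAS(node_count=3)
image = b"".join(bytes([n]) * 1024 for n in (1, 2, 1, 3))
recipe = [cas.put(image[i:i + 1024]) for i in range(0, len(image), 1024)]
restored = cas.fetch_image(recipe)
print(restored == image)  # True
```

Because the digest alone decides where a chunk lives, any host holding the recipe can locate and pull the chunks without asking a central server.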