In this post I will describe the initial architecture of a distributed object storage system that supports deduplication at both the object level and the block level. This architecture is one of the main contributions of my master's thesis on distributed content-addressable storage over a consistent hashing ring.
I have chosen OpenStack Object Storage (Swift) as the base storage system for my prototype because its architecture is based on consistent hashing and its code base is available as open source. Moreover, it's a prominent cloud infrastructure project used by many organizations, and I would like to contribute the outcome of this work to Swift by enabling deduplication functionality.
However, this work is intended to be generic and the techniques used in it may be applied to any other object storage system based on consistent hashing (Dynamo-like, DHTs, etc).
Object: By object I refer to any arbitrary-sized binary object (blob), such as text files, images, virtual machine images, etc.
Block: A binary chunk of data of fixed or variable size that is usually part of a larger object. In traditional filesystems, block sizes are typically small (e.g. 4KB) and optimized for the underlying physical storage (such as a hard disk).
Consistent hashing: Technique to partition a keyspace among a distributed set of nodes, which enables efficient location of keys and simplifies fault tolerance (read more: http://www.martinbroadhurst.com/Consistent-Hash-Ring.html).
Deduplication: Mechanism to identify duplicate data in a storage system and store only one (or few) copies of the duplicated data, thus, saving storage space.
Content-addressable storage (CAS): Technique to address a stored object by its contents, rather than by its location (physical or logical). This is typically done by hashing the contents of an object to produce a unique "fingerprint" that is used to identify/locate the object.
Background on Openstack Swift
Swift is a decentralized object storage system where clients can add, retrieve, update and delete objects indexed by a key. Swift offers an interface similar to Amazon Simple Storage Service (S3), and its architecture builds on ideas from the well-known Dynamo distributed key-value data store.
Swift uses a consistent hashing ring to distribute data among a set of nodes, so that each node is responsible for a range of the storage namespace. To locate an object by its key, the key is hashed and the request is routed to the node responsible for the range that contains this hashed value. To increase fault tolerance, availability and durability, each object is replicated across multiple nodes in the ring.
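The routing step can be illustrated with a toy consistent-hash ring. This is a minimal sketch of the general technique, not Swift's actual ring implementation (the class and method names are mine):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: nodes are placed on the ring at hashed
    positions, and a key is routed to the first node at or after hash(key),
    wrapping around."""

    def __init__(self, nodes, vnodes=4):
        # vnodes: virtual points per node, to smooth the key distribution
        self._ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position >= hash(key), wrapping to the start if needed
        idx = bisect.bisect(self._positions, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node1", "node2", "node3"])
ring.node_for("xpto")  # the same key always routes to the same node
```

Note that adding or removing a node only moves the keys in its ranges, which is the property that makes rebalancing cheap.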
For more details on Swift architecture, please check here: http://swift.openstack.org/
Background on content-addressable storage/deduplication
For a detailed motivation and background for content-addressable storage and deduplication in distributed storage systems, please read my previous posts: here, here and here.
The main idea is to leverage commonality between stored objects by storing only one copy of duplicated objects. This type of deduplication is suitable for use cases where the same objects are stored multiple times (but with different identifiers or containers), such as in software repositories, archiving, backup, etc.
In order to enable object-level deduplication on Swift, an additional "content-addressable storage ring" (CAS-ring) is created in addition to the usual object ring. Both rings are distinct overlays on top of the same physical nodes. In the CAS-ring, objects are indexed by the hash of their contents. Objects stored in the CAS-ring are unique and immutable, since the key of an object is a cryptographic digest of its contents. Thus, if an object's contents change, its key in the CAS-ring also changes.
In order to deduplicate an object, a secure cryptographic hash function, such as SHA-1, is used to calculate the "fingerprint" of the object. Then the system checks whether the object is already present in the CAS-ring (duplicated objects map to the same key). If not, the object is stored in the CAS-ring. The object ring will not store the contents of the object, but instead a metadata reference to the content-addressed object in the CAS-ring.
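The procedure above can be sketched as follows, using plain dicts as stand-ins for the two distributed rings (the function name and dict layout are mine, for illustration only):

```python
import hashlib

def deduplicate_object(data, object_ring, cas_ring, key):
    """Toy object-level dedup: store contents at most once in the CAS-ring,
    keyed by their fingerprint; the object ring keeps only a metadata
    reference to that fingerprint."""
    fingerprint = hashlib.sha1(data).hexdigest()
    if fingerprint not in cas_ring:       # duplicate contents map to the same key
        cas_ring[fingerprint] = data
    object_ring[key] = {"fingerprint": fingerprint}
    return fingerprint
```

Two objects with identical contents but different keys end up sharing a single entry in the CAS-ring.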
The deduplication can be done inline (when the object is being inserted/updated) or offline (background post-processing). If it's done inline, there may be higher latency on the insert operation, since the whole object needs to be hashed after being transferred to the node responsible for it. If the deduplication is done offline, the insert operation remains unchanged, and a background process deduplicates recently added objects. Another benefit of offline deduplication is that its resource consumption can be limited so as not to interfere with client operations on the object storage. However, with this latter approach, additional temporary storage space is needed to hold objects before they are deduplicated.
I have chosen the post-processing deduplication method since it allows better control over resources and does not add overhead to the insert operation.
When retrieving a deduplicated object, an additional level of indirection is added: instead of directly fetching objects from the object ring, a client will first need to get the content-address of that object from the object ring, and then fetch the object’s contents from the CAS-ring. For example:
- Client wants to retrieve object “xpto”:
- Get metadata for object “xpto” in the object ring
- Object ring returns fingerprint of object "xpto": "6b4b0d3255bfef95"
- Get object contents for object "6b4b0d3255bfef95" in the CAS-ring
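The two-step retrieval can be sketched like this, again with dicts standing in for the rings:

```python
def get_object(key, object_ring, cas_ring):
    """Toy two-step get: resolve the key to a fingerprint in the object
    ring, then fetch the actual contents from the CAS-ring."""
    fingerprint = object_ring[key]["fingerprint"]   # step 1: metadata lookup
    return cas_ring[fingerprint]                    # step 2: content fetch

# Hypothetical example state, matching the walkthrough above:
object_ring = {"xpto": {"fingerprint": "6b4b0d3255bfef95"}}
cas_ring = {"6b4b0d3255bfef95": b"...object contents..."}
get_object("xpto", object_ring, cas_ring)
```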
However, I do not believe this additional level of indirection will significantly affect the performance of the get operation, as the object transfer itself accounts for the major part of the operation latency. I plan to evaluate this overhead for different object sizes.
In addition to an object’s contents, the CAS-ring will also store a reference count for the content-addressed object. When an object is removed from the object ring, its reference count is decremented in the CAS-ring. When the reference count for a particular object in the CAS-ring reaches zero, the object is deleted from the content-addressable storage.
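The reference-counted delete can be sketched as follows (a toy version under the assumptions above; `refcounts` is a plain dict standing in for the per-object counts kept in the CAS-ring):

```python
def remove_object(key, object_ring, cas_ring, refcounts):
    """Toy delete: drop the metadata entry, decrement the CAS reference
    count, and reclaim the contents once the last reference is gone."""
    fingerprint = object_ring.pop(key)["fingerprint"]
    refcounts[fingerprint] -= 1
    if refcounts[fingerprint] == 0:
        del cas_ring[fingerprint]    # last reference: reclaim storage
        del refcounts[fingerprint]
```

Deleting one of two objects that share contents leaves the shared copy in place; deleting the second reclaims it.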
Swift’s replication mechanism remains the same in both the object ring (metadata) and in the CAS-ring (object’s contents). Thus, it should be ensured that when an object is deduplicated, at least N replicas of the deduplicated object exist in the CAS-ring (where N is the configured replication level).
In order to achieve even higher levels of deduplication, it is possible to divide objects into chunks of data (blocks) and apply the deduplication procedure to each block. Many studies have shown that deduplication can save up to 80% of storage space when applied to virtual machine images, which share a large number of data blocks among them. If object-level deduplication were applied to two slightly different Ubuntu images, they would have different fingerprints and each would be stored in full. However, if block-level deduplication is applied, the blocks common to the two images are stored only once.
Objects can be divided into variable- or fixed-size blocks. For simplicity, I am assuming fixed-size blocks; however, the proposed architecture can also be applied to variable-sized blocks. Object-level deduplication can be seen as a special case of block-level deduplication where the block size is infinite.
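Fixed-size chunking is simple to sketch; the function below (a name of my own choosing) splits an object into blocks and fingerprints each one. Note how object-level deduplication falls out as the special case where the block size is at least the object size:

```python
import hashlib

def chunk_fingerprints(data, block_size=4096):
    """Split an object into fixed-size blocks and fingerprint each block.
    The resulting list of digests is the object's manifest/recipe."""
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

# With block_size >= len(data) there is a single block, i.e. the whole
# object is fingerprinted at once -- object-level dedup as a special case.
```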
Several challenges arise when deduplicating objects at the block level in a distributed object storage. Some of the challenges are:
- The large quantity of blocks makes it prohibitive to store each block as a separate file. For instance, with a 4KB block size, a 2GB object has 524,288 blocks. As the number of stored objects grows, the number of stored blocks will exceed the maximum number of files supported by the underlying file system. Traditional CAS solutions solve this by storing multiple blocks in a single file and keeping an index of the blocks in each stored file.
- With small block sizes, even modest-sized objects will have more blocks than there are partitions in the distributed object storage. For instance, for a 20MB object with a 4KB block size and 100 nodes in the storage system, assuming block digests are uniformly distributed, each node will store more than 50 blocks of this object. This means that all nodes of the storage system will need to be contacted to rebuild the original object, which may become a bottleneck when a large number of objects is accessed simultaneously.
- Block access locality may be lost when blocks are distributed among many nodes (fragmentation). A locality strategy may be used to ensure that blocks that are nearby in the original object are also stored nearby on the storage nodes.
- Access to disk may become a bottleneck in the storage nodes if many requests are done concurrently, since many distinct blocks would need to be fetched from different locations on disk. Caching and pre-fetching of blocks (exploiting locality) can mitigate this overhead.
I will try to address these and other challenges throughout my master thesis work. Moreover, I plan to evaluate the impact of each of these issues during the evaluation phase.
Deduplicating objects (at the block-level)
The idea is to use the same "content-addressable storage ring" (CAS-ring), but instead of storing whole objects, the ring stores content-addressable blocks in a distributed manner. Since inline deduplication of thousands of blocks within an object would add a lot of overhead to the insert operation, the post-processing method was chosen for block-level deduplication as well.
When an object is inserted into the object storage, it is marked to be deduplicated. The deduplication routine divides an object into blocks (of a specified size), and computes the secure digest of each block, checking if it’s already stored in the CAS-ring. In case it is not already stored, the block is stored in the CAS-ring. After all blocks are stored, the original object is finally replaced in the object-ring by a manifest that describes how the original object should be reconstructed from content-addressable blocks.
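The deduplication routine described above can be sketched like this (dicts stand in for the CAS-ring; names are mine):

```python
import hashlib

def deduplicate_blocks(data, cas_ring, block_size=4096):
    """Toy post-processing block dedup: store each block in the CAS-ring
    keyed by its digest (only if not already present), and return the
    manifest that replaces the original object in the object ring."""
    manifest = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in cas_ring:    # store only blocks not yet present
            cas_ring[digest] = block
        manifest.append(digest)
    return manifest
```

An object containing a repeated block yields a manifest with repeated digests, but the block itself is stored only once.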
The object ring will store for each object a piece of metadata (called manifest or recipe) containing fingerprints of the blocks that compose the object. These block addresses will be used in the CAS-ring to retrieve the blocks and reconstruct the original object. For example:
- Client wants to retrieve object “xpto”:
- Get manifest/recipe of object "xpto" in the object ring
- Object ring returns block addresses for object "xpto": "6b4b0d3255bfef95", "890afd80709", "da39a3ee5e"
- Get blocks "6b4b0d3255bfef95", "890afd80709" and "da39a3ee5e" in the CAS-ring
- Reconstruct original object, by concatenating retrieved blocks.
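The reconstruction step is just a lookup-and-concatenate over the manifest; a minimal sketch, with a dict standing in for the CAS-ring:

```python
def reconstruct(manifest, cas_ring):
    """Rebuild an object by fetching each block address from the CAS-ring
    and concatenating the blocks in manifest order."""
    return b"".join(cas_ring[digest] for digest in manifest)

# Hypothetical example: two blocks stored under their digests
cas_ring = {"d1": b"first block ", "d2": b"second block"}
reconstruct(["d1", "d2"], cas_ring)  # -> b"first block second block"
```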
One idea here is to use techniques employed in BitTorrent to speed-up retrieval of large objects from multiple storage nodes.
The initial approach is to keep reference counts for each stored block in the CAS-ring. These counts are updated as objects are added, removed or updated in the object ring. However, maintaining reference counts for each stored block may add a considerable amount of metadata overhead, depending on the block size. For instance, with a 4KB block size and a 32-bit counter, the overhead is 32 (counter size in bits) / (4 * 1024 * 8) (block size in bits) ≈ 0.00098, i.e. about 0.098% metadata per block. For 1TB of block storage, the reference counts would consume about 1GB of space. There is a tradeoff between metadata overhead, block size and deduplication level, which I plan to analyze during this work.
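The overhead calculation is easy to reproduce (function name mine, for illustration):

```python
def refcount_overhead(block_size_bytes, counter_bits=32):
    """Fraction of stored data consumed by per-block reference counters:
    counter size divided by block size, both in bits."""
    return counter_bits / (block_size_bytes * 8)

refcount_overhead(4096)   # ~0.00098, i.e. ~0.098%: about 1GB per TB stored
refcount_overhead(65536)  # larger blocks shrink the metadata overhead
```

This makes the tradeoff concrete: growing the block size shrinks the counter overhead proportionally, but also reduces the chance of finding duplicate blocks.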
An alternative approach would be not to keep reference counts and instead to have a distributed garbage collection mechanism that would go through all objects in the object ring and delete unreferenced blocks.
So far I haven't thought much about block replication in the CAS-ring. The initial approach is to replicate blocks the same way objects are currently replicated in Swift. I will investigate this further.
In this post I presented an architecture for object-level and block-level deduplication in a distributed object store (OpenStack Swift). This is just an initial draft of the prototype I will be implementing and evaluating in the coming months. Feel free to comment and give suggestions on this design: feedback will be much appreciated. I will be posting the progress of this work on this blog.