Dynamo (storage system)

Amazon Dynamo is a distributed hash table used internally by Amazon. Like the Google File System, Dynamo is optimized for a specific application: it is tailored to the requirements of certain Amazon web services that demand high reliability.

Requirements

Amazon's applications expect the storage system to be highly available and extremely fail-safe. In particular, it must be possible to store data in every situation:

" [ ... ] Even if disks are failing, network routes are flapping, or data centers are being destroyed by tornadoes. "

" [ ... ] Even if disks fail, play network connections crazy or data centers of tornadoes destroyed. "

The system must be incrementally scalable at all times so that it can cover peak loads, for example during the Christmas season. Complex database accesses are avoided; data is accessed directly via a key. Furthermore, security need not be considered at this point, because the system operates in a "friendly" environment within the Amazon services that is sealed off from the outside.

Construction

Dynamo is built on a network of fully equal computers, i.e. there is no central control or management; every node can perform any task. This architecture was chosen to ensure the scalability of the system.

Services such as the Shopping Cart service (the service that manages the shopping cart) expect that the storage system can always be accessed for writing, is highly available, and has low latency. Since, as defined by the ACID properties, the criteria of consistency and high availability are in opposition, consistency was weakened to eventual consistency, in contrast to traditional database systems. Another characteristic is that mostly small files (less than 1 MB) are stored, in the form of key-value pairs. Complex database queries need not be supported.

To achieve the desired properties, Dynamo combines a number of methods already known from other contexts:

  • Consistent hashing
  • Sloppy Quorum and hinted handoff
  • Vector Clocks
  • Anti-entropy by Merkle Trees
  • Gossip-based protocol

Consistent hashing

All computers are arranged in a ring (at least logically; the physical network topology may differ). For each key, a hash value is calculated using MD5. Each node is assigned a specific region of the hash function's range of values; that node stores the corresponding file, and a certain number of the following nodes in the ring also store a copy. The total number of storing nodes is configurable. To maximize reliability, these nodes are distributed not only across different computers and racks, but even across different data centers.

An example for the case of six nodes with redundant storage on three nodes (N = 3) is shown in the adjacent figure.

Since this is a heterogeneous system landscape in which the capacities of the node computers may differ (and some files are requested more frequently than others), Dynamo uses so-called virtual nodes. This concept allows several virtual nodes on the same ring to be located on one physical node. It enables better utilization of the physical storage capacities at the different nodes: thanks to the uniform distribution of the hash function, memory usage is roughly equal across all virtual nodes, and a physical node with higher storage capacity can simply be assigned more virtual nodes.
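The placement logic described above can be sketched in a few lines of Python. This is a minimal illustration, assuming MD5 hashing as in Dynamo; the class name, the choice of four tokens per physical node, and the example key are illustrative, not taken from the paper:

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Minimal sketch of consistent hashing with virtual nodes."""

    def __init__(self, replicas=3):
        self.replicas = replicas      # N: number of copies per key
        self.ring = []                # sorted list of (token, physical node)

    def _hash(self, key):
        # Dynamo positions keys on the ring via MD5 of the key.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node, vnodes=4):
        # Each physical node owns several ring positions (virtual nodes).
        for i in range(vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def preference_list(self, key):
        # Walk clockwise from the key's position and collect the first
        # N *distinct* physical nodes (so copies land on different machines).
        idx = bisect_right(self.ring, (self._hash(key),))
        nodes = []
        for step in range(len(self.ring)):
            _, node = self.ring[(idx + step) % len(self.ring)]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == self.replicas:
                break
        return nodes

ring = ConsistentHashRing(replicas=3)
for n in ["A", "B", "C", "D", "E", "F"]:
    ring.add_node(n)
print(ring.preference_list("cart:12345"))  # three distinct physical nodes
```

Assigning more virtual nodes to a machine with more capacity gives it a proportionally larger share of the key space.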

Sloppy Quorum and hinted handoff

To ensure the reliability of the system, in addition to the parameter N (the number of nodes to which data is replicated), the parameters R (read) and W (write) were introduced, which are also configurable. Similar parameters are known from quorum systems; however, they were modified here to the extent that one speaks of a "sloppy" quorum. They determine how many nodes have to report a read or write operation as successful for the whole operation to be considered successful. In the standard configuration, the tuple (N, R, W) is set to the values (3, 2, 2). This means that

  • each file is stored three times,
  • a read access is considered successful when at least two nodes return the file, and
  • a write access is considered successful when at least two nodes report it as successful.

These parameters also allow an application to tune the system precisely to its needs. For example, a configuration of (3, 1, 3) realizes a kind of read buffer (only one node must respond to a read access, while all copies must always be written successfully, since N = W), whereas a system with W = 1 is optimized for the fastest possible write accesses. The configuration (1, 1, 1) in turn realizes just an ordinary (but not highly available) file system (correspondingly, with replication, (2, 2, 2), (3, 3, 3), etc.).
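The effect of the (N, R, W) parameters can be illustrated with a small sketch; the `perform` helper and the replica availability map are hypothetical, not part of Dynamo's interface:

```python
def perform(replicas, needed):
    """Send an operation to all replicas; it succeeds once `needed`
    of them acknowledge. `replicas` maps node name -> up/down."""
    acks = sum(1 for up in replicas.values() if up)
    return acks >= needed

N, R, W = 3, 2, 2   # Dynamo's default (N, R, W) configuration
replicas = {"A": True, "B": True, "C": False}  # one replica is down

print(perform(replicas, W))  # True: a write succeeds with 2 of 3 acks
print(perform(replicas, N))  # False: a strict (3, 3, 3) setup would block

# R + W > N guarantees that the read and write quorums overlap in at
# least one node, so every read sees at least one up-to-date copy.
assert R + W > N
```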

If the coordinator node (the node into whose range the hash value actually falls) is not available, so-called hinted handoff is used: if, in the example of the figure above, the hash value were 3 and node A were not available, the copy would instead be passed on to node D (handoff) with the note (hint) that this file actually belongs to node A. D therefore stores such copies in a separate local database and asks A from time to time whether it is available again. Once this is the case, all hinted copies are transmitted to A. After a successful transfer, D can delete the hinted objects.
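The handoff-and-return cycle might be sketched roughly as follows; the `Node` class, its method names, and the global `registry` are illustrative inventions, not Dynamo's actual interfaces:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}        # regular key -> value storage
        self.hinted = {}       # (intended owner, key) -> value, kept separately
        self.available = True

    def write(self, key, value, intended_owner=None):
        if intended_owner is not None and intended_owner is not self:
            # The real owner is down: keep the copy with a hint.
            self.hinted[(intended_owner.name, key)] = value
        else:
            self.store[key] = value

    def retry_handoff(self):
        # Periodically probe the intended owners; once one is back,
        # transfer its hinted copies and delete them locally.
        for (owner_name, key), value in list(self.hinted.items()):
            owner = registry[owner_name]
            if owner.available:
                owner.write(key, value)
                del self.hinted[(owner_name, key)]

a, d = Node("A"), Node("D")
registry = {"A": a, "D": d}

a.available = False
d.write("k3", "cart-data", intended_owner=a)  # A is down: D keeps the hint
a.available = True
d.retry_handoff()                             # A is back: copy handed over
print(a.store)                                # {'k3': 'cart-data'}
```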

Vector Clocks

With the sloppy quorum configuration (3, 2, 2), different versions of a file can arise under certain circumstances. Since updates may still be running in the background (for example, to the third node), a successful write (which reached only two nodes) directly followed by a read access might return two different versions. To resolve such conflicts, so-called vector clocks are used, which are in principle simply version counters. Each file carries a vector of tuples of the form (node ID, version number); each update at a node increases that node's version number in the file by one. In the case described, the coordinator would, for example, receive version 14 and version 15 for the same node and can see from these version numbers which version is the latest. Accordingly, the requesting client would only be returned the latest version, with version number 15.

The problem only really arises if the actual coordinator has failed for some reason and parallel accesses occur at the same time. For example, the following sequence could result:

In this case it is not decidable whether the version of B or that of C is the newer one, and conflict resolution is moved to the application layer: the client receives both versions. In the example of the Shopping Cart service, both versions would be combined and a new version ([A, 3], [B, 1], [C, 1]) written via node A. However, this depends on the particular application. If an application prefers not to deal with conflict resolution itself, pre-implemented, simple last-write-wins strategies are available.
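The version bookkeeping behind this example can be sketched as follows; the helper names (`bump`, `descends`, `concurrent`) are hypothetical, but the final merged vector matches the ([A, 3], [B, 1], [C, 1]) example above:

```python
def bump(clock, node):
    """Return a copy of vector clock `clock` with `node`'s counter +1."""
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def descends(a, b):
    """True if version a already contains all updates of version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    # Neither version descends from the other: a real conflict that
    # the application must resolve (e.g. by merging shopping carts).
    return not descends(a, b) and not descends(b, a)

v0 = bump(bump({}, "A"), "A")   # two writes coordinated by A: {A: 2}
v_b = bump(v0, "B")             # update handled by B: {A: 2, B: 1}
v_c = bump(v0, "C")             # parallel update via C: {A: 2, C: 1}
print(concurrent(v_b, v_c))     # True: the client receives both versions

# The application merges both versions, then A coordinates the write:
keys = set(v_b) | set(v_c)
merged = bump({n: max(v_b.get(n, 0), v_c.get(n, 0)) for n in keys}, "A")
print(sorted(merged.items()))   # [('A', 3), ('B', 1), ('C', 1)]
```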

Anti-entropy by Merkle Trees

Hinted handoff can give rise to further problems. For example, the following sequence is possible:

Problem: A does not even notice that it holds an old version, and at that point only N − 1 copies exist. To work around this, A compares its copies with those of B and C when it restarts. In order to keep the network traffic and computational load as low as possible, so-called Merkle trees are used for this comparison. Merkle trees are trees that hold the hash values of the files in their leaves, a hash over all these hashes in the root, and, in the inner nodes, the hash of the corresponding subtree. Thus A and B first only have to exchange the root hash and can then determine whether their copies are all identical. If not, the tree is traversed until the offending leaf is found. The corresponding vector clocks then show which version is the newer one, and it is copied accordingly.
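The comparison can be sketched like this; the bottom-up construction (duplicating an odd last leaf) is one common Merkle tree variant, not necessarily the exact scheme Dynamo uses:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def merkle_root(leaves):
    """Build a Merkle tree bottom-up and return the root hash.
    `leaves` are the hashes of the stored files, in key-range order."""
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])       # duplicate last hash if odd
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

files_a = [h(b"v1"), h(b"v2"), h(b"v3"), h(b"v4")]   # replica on A
files_b = list(files_a)                               # replica on B

# One root-hash exchange suffices to confirm the replicas are in sync:
print(merkle_root(files_a) == merkle_root(files_b))   # True

files_b[2] = h(b"v3-updated")   # one file differs on B
# Roots now differ; the nodes descend the tree to find the stale leaf:
print(merkle_root(files_a) == merkle_root(files_b))   # False
```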

In the event that (as in the example) the network connection to A is cut and A does not even notice, A will determine that it holds an old version either during the next read, with the help of the vector clocks, or during the regular gossip messages, since the root hashes of the Merkle trees are sent there.

Gossip-based protocol

So that the entire ring structure does not have to be rebuilt every time a node fails temporarily, there are the hinted handoffs. However, it must also be possible to permanently remove nodes from the network or to add new ones. To make this easy, an administrator logs in on any node and makes an entry in a corresponding configuration file via a command-line tool or browser. This change is then communicated to all other nodes of the ring via a gossip-based protocol. Using this protocol, both the assignment of virtual nodes to the machines and the membership list are constantly kept up to date.

A simple example of explicitly adding a node X to the network ABCD would look as follows:

The order of communication (who exchanges with whom) is random, and not every exchange need result in a change (step 4 in the example).

If a node is added or removed, the distribution of virtual nodes across the physical machines inevitably has to change; there are several methods for this, which are explained in detail in the paper. The simplest variant, for a homogeneous system landscape, is that the same number of virtual nodes should reside on each physical machine. When a node is removed, its virtual nodes are copied to randomly selected physical nodes that have fewer virtual nodes than the rest of the ring. Conversely, a newly added node takes over virtual nodes from fully loaded nodes, also selected at random.
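The spreading of a membership change can be sketched as follows; the view structure, the version counter, and the round function are illustrative simplifications of the actual protocol, not taken from the paper:

```python
import random

def gossip_round(views):
    """One round: every node exchanges its membership view with one
    random peer; the view with the higher version number wins."""
    nodes = list(views)
    random.shuffle(nodes)
    for name in nodes:
        peer = random.choice([n for n in nodes if n != name])
        a, b = views[name], views[peer]
        newest = a if a["version"] >= b["version"] else b
        views[name] = dict(newest)
        views[peer] = dict(newest)

views = {n: {"version": 1, "members": ["A", "B", "C", "D"]}
         for n in ["A", "B", "C", "D"]}

# An administrator adds node X via node A; at first only A knows.
views["A"] = {"version": 2, "members": ["A", "B", "C", "D", "X"]}

rounds = 0
while any(v["version"] < 2 for v in views.values()):
    gossip_round(views)
    rounds += 1
print(f"all nodes learned about X after {rounds} round(s)")
```

Because every exchange can only move a node to a newer view, the change reaches all nodes after a few random rounds with high probability.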

DynamoDB

Since 2012, Dynamo has been offered by Amazon Web Services as a storage service under the name DynamoDB. However, the service differs in some respects from the original Dynamo implementation. For example, DynamoDB offers a Bigtable-like interface in which multidimensional keys are mapped to a value. This makes it possible to represent a table structure similar to that of a relational database.
