Dynamo: Amazon's Highly Available Key-value Store

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels

What is the Problem?

Amazon runs a world-wide e-commerce platform using tens of thousands of servers located in many data centers around the world. Reliability is one of the most important requirements because even the slightest outage has significant financial consequences and impacts customer trust. Besides, the platform has to be highly scalable in order to support continuous growth.

Summary

This paper describes the design and implementation of Dynamo, a highly available and scalable distributed data store used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness, and performance. Dynamo allows read and write operations to continue even during network partitions and resolves the resulting update conflicts using one of several conflict resolution mechanisms. Compared to traditional database systems, Dynamo sacrifices some features, such as rich schema representation and strong consistency, in favor of higher performance and availability at massive scale.

Key Insights

  • Reads are not guaranteed to return the most recent data, but every update is guaranteed to propagate to all replicas eventually (eventual consistency).
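
The behavior described above can be illustrated with a toy model. This is a minimal sketch, not Dynamo's actual propagation protocol: the `Replica` class, the `sync_all` helper, and the scalar version numbers are all assumptions made for illustration (Dynamo itself tracks versions with vector clocks).

```python
class Replica:
    """A toy replica holding key -> (version, value). Hypothetical, for illustration."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, version, value):
        # Keep the incoming entry only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)


def sync_all(replicas):
    """Stand-in for background propagation: every replica pushes its entries to the others."""
    for src in replicas:
        for dst in replicas:
            if src is not dst:
                for key, (version, value) in list(src.data.items()):
                    dst.apply(key, version, value)


a, b, c = Replica("a"), Replica("b"), Replica("c")
for r in (a, b, c):
    r.apply("cart", 1, ["book"])

# A write lands on replica "a" only; "b" and "c" have not seen it yet.
a.apply("cart", 2, ["book", "pen"])
stale_read = b.data["cart"]   # still version 1

sync_all([a, b, c])
fresh_read = b.data["cart"]   # now version 2 on every replica
```

A read served by `b` before `sync_all` runs returns the old cart, which is exactly the stale-read window that eventual consistency permits.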

Notable Design Details/Strengths

  • Dynamo dynamically partitions the data over the set of nodes in the system (using consistent hashing), which makes incremental scaling possible.
  • To further achieve high availability and durability, Dynamo replicates the data on multiple hosts.
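
The two points above, partitioning and replication, can be sketched together as a consistent-hash ring in which each key is stored on the first `n` distinct nodes found clockwise from its position. This is a simplified sketch under stated assumptions: the `HashRing` class, the `vnodes_per_node` parameter, and the node names are hypothetical, and details of a production ring (failure handling, data transfer on membership change) are omitted.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a 128-bit position on the ring using MD5."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self._positions = []   # sorted ring positions of all virtual nodes
        self._owner = {}       # ring position -> physical node name

    def add_node(self, node: str) -> None:
        # Each physical node is placed at several positions on the ring.
        for i in range(self.vnodes_per_node):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def preference_list(self, key: str, n: int = 3) -> list:
        """Walk clockwise from the key's position, collecting n distinct nodes."""
        if not self._positions:
            return []
        start = bisect.bisect_right(self._positions, ring_position(key))
        nodes = []
        for step in range(len(self._positions)):
            pos = self._positions[(start + step) % len(self._positions)]
            node = self._owner[pos]
            if node not in nodes:
                nodes.append(node)
                if len(nodes) == n:
                    break
        return nodes


ring = HashRing()
for name in ("node-a", "node-b", "node-c", "node-d"):
    ring.add_node(name)

replicas = ring.preference_list("cart:12345", n=3)  # three distinct hosts for this key
```

Adding a fifth node only moves the keys that the new node now owns; every other key keeps its coordinator, which is what makes scaling incremental.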

Limitations/Weaknesses

  • Dynamo exposes data consistency and reconciliation logic to developers. Any new application that wants to use Dynamo needs some analysis during the initial stages of development to pick a conflict resolution mechanism that fits the business case, which costs extra effort.
  • A write operation in Dynamo also requires a read to be performed in order to manage the vector timestamps (vector clocks). This can be very limiting for systems that need to handle very high write throughput.
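
Both limitations above stem from how versions are tracked. The following is a minimal sketch of vector clocks, assuming clocks are plain dictionaries mapping a node name to a counter; the function names (`increment`, `descends`, `merge`, `reconcile`) and the node names `Sx`, `Sy`, `Sz` are illustrative choices, not Dynamo's API.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def merge(clocks: list) -> dict:
    """Element-wise maximum of several clocks."""
    merged = {}
    for clock in clocks:
        for node, count in clock.items():
            merged[node] = max(merged.get(node, 0), count)
    return merged


def reconcile(versions: list) -> list:
    """Drop versions that are ancestors of another; survivors are concurrent siblings."""
    survivors = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (other, _) in enumerate(versions)
        )
        if not dominated and (clock, value) not in survivors:
            survivors.append((clock, value))
    return survivors


d1 = increment({}, "Sx")   # first write, handled by Sx
d2 = increment(d1, "Sx")   # second write, also handled by Sx
d3 = increment(d2, "Sy")   # one client updates d2 via Sy
d4 = increment(d2, "Sz")   # another client updates d2 via Sz, concurrently

# d2 is an ancestor of both d3 and d4, so only the two conflicting siblings remain.
siblings = reconcile([(d2, "v2"), (d3, "v3"), (d4, "v4")])

# The application must read both siblings, merge their values, and write back
# with a clock that descends from both: this is the read-before-write cost.
d5 = increment(merge([clock for clock, _ in siblings]), "Sx")
# d5 == {"Sx": 3, "Sy": 1, "Sz": 1}
```

The system can discard `d2` on its own, but it cannot choose between `d3` and `d4`; that decision is the reconciliation logic left to the developer.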

Summary of Key Results

  • Dynamo was built and evaluated under realistic conditions, which supports the claim that it suits Amazon's large-scale environment well.
  • The object buffer maintained by each storage node lowers the 99.9th percentile latency by a factor of 5 during peak traffic, even for a very small buffer of a thousand objects.
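
The object buffer in the last result can be pictured as an in-memory write buffer in front of the storage engine. This is a rough sketch under stated assumptions: the `BufferedStore` class is hypothetical, a plain dictionary stands in for the durable store, and the buffer is flushed when it overflows rather than by a background writer.

```python
from collections import OrderedDict


class BufferedStore:
    """Toy write buffer in front of a slower backing store (illustrative only)."""

    def __init__(self, backing: dict, capacity: int = 1000):
        self.backing = backing        # stand-in for the durable storage engine
        self.capacity = capacity      # maximum number of buffered objects
        self.buffer = OrderedDict()   # recent writes held in memory

    def put(self, key, value):
        # Writes return as soon as the object is in memory.
        self.buffer[key] = value
        self.buffer.move_to_end(key)
        if len(self.buffer) > self.capacity:
            self.flush()

    def get(self, key):
        # Reads check the buffer first and fall back to the backing store.
        if key in self.buffer:
            return self.buffer[key]
        return self.backing.get(key)

    def flush(self):
        # Simplification: flush everything at once when the buffer overflows.
        self.backing.update(self.buffer)
        self.buffer.clear()


disk = {}
store = BufferedStore(disk, capacity=2)
store.put("a", 1)
store.put("b", 2)
buffered_only = dict(disk)   # nothing has reached the backing store yet
store.put("c", 3)            # exceeds capacity, so the buffer is flushed
```

The latency gain comes at a durability cost: anything still in `buffer` is lost if the node crashes before a flush.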

Open Questions

How can we balance the cost of network communication against the performance that large-scale clusters promise? The principal bottleneck in large-scale clusters is often inter-node communication bandwidth. Many applications must exchange information with remote nodes before they can proceed with their local computation. Internet services increasingly employ service-oriented architectures (e.g., the Amazon platform that Dynamo is part of), where retrieving a single web page can require coordination and communication with literally hundreds of individual sub-services running on remote nodes.