Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Author: Martin Kleppmann

Publisher: O'Reilly Media

Genres: Computers & Technology

Publish Date: April 11, 2017

ISBN-10: 1449373321

Pages: 616

File Type: PDF, EPUB

Language: English

Book Preface

If you have worked in software engineering in recent years, especially in server-side and backend systems, you have probably been bombarded with a plethora of buzzwords relating to storage and processing of data. NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time!

In the last decade we have seen many interesting developments in databases, in distributed systems, and in the ways we build applications on top of them. There are various driving forces for these developments:

Internet companies such as Google, Yahoo!, Amazon, Facebook, LinkedIn, Microsoft, and Twitter are handling huge volumes of data and traffic, forcing them to create new tools that enable them to efficiently handle such scale.
Businesses need to be agile, test hypotheses cheaply, and respond quickly to new market insights by keeping development cycles short and data models flexible.
Free and open source software has become very successful and is now preferred to commercial or bespoke in-house software in many environments.
CPU clock speeds are barely increasing, but multi-core processors are standard, and networks are getting faster. This means parallelism is only going to increase.
Even if you work on a small team, you can now build systems that are distributed across many machines and even multiple geographic regions, thanks to infrastructure as a service (IaaS) such as Amazon Web Services.
Many services are now expected to be highly available; extended downtime due to outages or maintenance is becoming increasingly unacceptable.

Data-intensive applications are pushing the boundaries of what is possible by making use of these technological developments. We call an application data-intensive if data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to compute-intensive, where CPU cycles are the bottleneck.

The tools and technologies that help data-intensive applications store and process data have been rapidly adapting to these changes. New types of database systems (“NoSQL”) have been getting lots of attention, but message queues, caches, search indexes, frameworks for batch and stream processing, and related technologies are very important too. Many applications use some combination of these.

The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing. However, as software engineers and architects, we also need to have a technically accurate and precise understanding of the various technologies and their trade-offs if we want to build good applications. For that understanding, we have to dig deeper than buzzwords.

Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using. If you understand those principles, you’re in a position to see where each tool fits in, how to make good use of it, and how to avoid its pitfalls. That’s where this book comes in.

The goal of this book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data. This book is not a tutorial for one particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples of successful data systems: technologies that form the foundation of many popular applications and that have to meet scalability, performance, and reliability requirements in production every day.

We will dig into the internals of those systems, tease apart their key algorithms, and discuss their principles and the trade-offs they have to make. On this journey, we will try to find useful ways of thinking about data systems: not just how they work, but also why they work that way, and what questions we need to ask.

After reading this book, you will be in a great position to decide which kind of technology is appropriate for which purpose, and understand how tools can be combined to form the foundation of a good application architecture. You won’t be ready to build your own database storage engine from scratch, but fortunately that is rarely necessary. You will, however, develop a good intuition for what your systems are doing under the hood so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.

Who Should Read This Book?

If you develop applications that have some kind of server/backend for storing or processing data, and your applications use the internet (e.g., web applications, mobile apps, or internet-connected sensors), then this book is for you.

This book is for software engineers, software architects, and technical managers who love to code. It is especially relevant if you need to make decisions about the architecture of the systems you work on—for example, if you need to choose tools for solving a given problem and figure out how best to apply them. But even if you have no choice over your tools, this book will help you better understand their strengths and weaknesses.

You should have some experience building web-based applications or network services, and you should be familiar with relational databases and SQL. Any non-relational databases and other data-related tools you know are a bonus, but not required. A general understanding of common network protocols like TCP and HTTP is helpful. Your choice of programming language or framework makes no difference for this book.

If any of the following are true for you, you’ll find this book valuable:

You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.
You need to make applications highly available (minimizing downtime) and operationally robust.
You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.
You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it’s great fun to explore the bright thinking that went into their design.

Sometimes, when discussing scalable data systems, people make comments along the lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a relational database.” There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization. However, it’s also important to choose the right tool for the job, and different technologies each have their own strengths and weaknesses. As we shall see, relational databases are important but not the final word on dealing with data.

Scope of This Book

This book does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation for those things. Instead we discuss the various principles and trade-offs that are fundamental to data systems, and we explore the different design decisions taken by different products.

In the ebook editions we have included links to the full text of online resources. All links were verified at the time of publication, but unfortunately links tend to break frequently due to the nature of the web. If you come across a broken link, or if you are reading a print copy of this book, you can look up references using a search engine. For academic papers, you can search for the title in Google Scholar to find open-access PDF files. Alternatively, you can find all of the references at https://github.com/ept/ddia-references, where we maintain up-to-date links.

We look primarily at the architecture of data systems and the ways they are integrated into data-intensive applications. This book doesn’t have space to cover deployment, operations, security, management, and other areas—those are complex and important topics, and we wouldn’t do them justice by making them superficial side notes in this book. They deserve books of their own.

Many of the technologies described in this book fall within the realm of the Big Data buzzword. However, the term “Big Data” is so overused and underdefined that it is not useful in a serious engineering discussion. This book uses less ambiguous terms, such as single-node versus distributed systems, or online/interactive versus offline/batch processing systems.

This book has a bias toward free and open source software (FOSS), because reading, modifying, and executing source code is a great way to understand how something works in detail. Open platforms also reduce the risk of vendor lock-in. However, where appropriate, we also discuss proprietary software (closed-source software, software as a service, or companies’ in-house software that is only described in literature but not released publicly).

Outline of This Book

This book is arranged into three parts:

In Part I, we discuss the fundamental ideas that underpin the design of data-intensive applications. We start in Chapter 1 by discussing what we’re actually trying to achieve: reliability, scalability, and maintainability; how we need to think about them; and how we can achieve them. In Chapter 2 we compare several different data models and query languages, and see how they are appropriate to different situations. In Chapter 3 we talk about storage engines: how databases arrange data on disk so that we can find it again efficiently. Chapter 4 turns to formats for data encoding (serialization) and evolution of schemas over time.
In Part II, we move from data stored on one machine to data that is distributed across multiple machines. This is often necessary for scalability, but brings with it a variety of unique challenges. We first discuss replication (Chapter 5), partitioning/sharding (Chapter 6), and transactions (Chapter 7). We then go into more detail on the problems with distributed systems (Chapter 8) and what it means to achieve consistency and consensus in a distributed system (Chapter 9).
In Part III, we discuss systems that derive some datasets from other datasets. Derived data often occurs in heterogeneous systems: when there is no one database that can do everything well, applications need to integrate several different databases, caches, indexes, and so on. In Chapter 10 we start with a batch processing approach to derived data, and we build upon it with stream processing in Chapter 11. Finally, in Chapter 12 we put everything together and discuss approaches for building reliable, scalable, and maintainable applications in the future.

References and Further Reading

Most of what we discuss in this book has already been said elsewhere in some form or another—in conference presentations, research papers, blog posts, code, bug trackers, mailing lists, and engineering folklore. This book summarizes the most important ideas from many different sources, and it includes pointers to the original literature throughout the text. The references at the end of each chapter are a great resource if you want to explore an area in more depth, and most of them are freely available online.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/designing-data-intensive-apps.

To comment or ask technical questions about this book, send email to [email protected].

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

This book is an amalgamation and systematization of a large number of other people’s ideas and knowledge, combining experience from both academic research and industrial practice. In computing we tend to be attracted to things that are new and shiny, but I think we have a huge amount to learn from things that have been done before. This book has over 800 references to articles, blog posts, talks, documentation, and more, and they have been an invaluable learning resource for me. I am very grateful to the authors of this material for sharing their knowledge.

I have also learned a lot from personal conversations, thanks to a large number of people who have taken the time to discuss ideas or patiently explain things to me. In particular, I would like to thank Joe Adler, Ross Anderson, Peter Bailis, Márton Balassi, Alastair Beresford, Mark Callaghan, Mat Clayton, Patrick Collison, Sean Cribbs, Shirshanka Das, Niklas Ekström, Stephan Ewen, Alan Fekete, Gyula Fóra, Camille Fournier, Andres Freund, John Garbutt, Seth Gilbert, Tom Haggett, Pat Helland, Joe Hellerstein, Jakob Homan, Heidi Howard, John Hugg, Julian Hyde, Conrad Irwin, Evan Jones, Flavio Junqueira, Jessica Kerr, Kyle Kingsbury, Jay Kreps, Carl Lerche, Nicolas Liochon, Steve Loughran, Lee Mallabone, Nathan Marz, Caitie McCaffrey, Josie McLellan, Christopher Meiklejohn, Ian Meyers, Neha Narkhede, Neha Narula, Cathy O’Neil, Onora O’Neill, Ludovic Orban, Zoran Perkov, Julia Powles, Chris Riccomini, Henry Robinson, David Rosenthal, Jennifer Rullmann, Matthew Sackman, Martin Scholl, Amit Sela, Gwen Shapira, Greg Spurrier, Sam Stokes, Ben Stopford, Tom Stuart, Diana Vasile, Rahul Vohra, Pete Warden, and Brett Wooldridge.

Several more people have been invaluable to the writing of this book by reviewing drafts and providing feedback. For these contributions I am particularly indebted to Raul Agepati, Tyler Akidau, Mattias Andersson, Sasha Baranov, Veena Basavaraj, David Beyer, Jim Brikman, Paul Carey, Raul Castro Fernandez, Joseph Chow, Derek Elkins, Sam Elliott, Alexander Gallego, Mark Grover, Stu Halloway, Heidi Howard, Nicola Kleppmann, Stefan Kruppa, Bjorn Madsen, Sander Mak, Stefan Podkowinski, Phil Potter, Hamid Ramazani, Sam Stokes, and Ben Summers. Of course, I take all responsibility for any remaining errors or unpalatable opinions in this book.

For helping this book become real, and for their patience with my slow writing and unusual requests, I am grateful to my editors Marie Beaugureau, Mike Loukides, Ann Spencer, and all the team at O’Reilly. For helping find the right words, I thank Rachel Head. For giving me the time and freedom to write in spite of other work commitments, I thank Alastair Beresford, Susan Goodhue, Neha Narkhede, and Kevin Scott.

Very special thanks are due to Shabbir Diwan and Edie Freedman, who illustrated with great care the maps that accompany the chapters. It’s wonderful that they took on the unconventional idea of creating maps, and made them so beautiful and compelling.

Finally, my love goes to my family and friends, without whom I would not have been able to get through this writing process that has taken almost four years. You’re the best.

Preface
  1. Who Should Read This Book?
  2. Scope of This Book
  3. Outline of This Book
  4. References and Further Reading
  5. O’Reilly Safari
  6. How to Contact Us
  7. Acknowledgments
I. Foundations of Data Systems
  1. Reliable, Scalable, and Maintainable Applications
    1. Thinking About Data Systems
    2. Reliability
      1. Hardware Faults
      2. Software Errors
      3. Human Errors
      4. How Important Is Reliability?
    3. Scalability
      1. Describing Load
      2. Describing Performance
      3. Approaches for Coping with Load
    4. Maintainability
      1. Operability: Making Life Easy for Operations
      2. Simplicity: Managing Complexity
      3. Evolvability: Making Change Easy
    5. Summary
  2. Data Models and Query Languages
    1. Relational Model Versus Document Model
      1. The Birth of NoSQL
      2. The Object-Relational Mismatch
      3. Many-to-One and Many-to-Many Relationships
      4. Are Document Databases Repeating History?
      5. Relational Versus Document Databases Today
    2. Query Languages for Data
      1. Declarative Queries on the Web
      2. MapReduce Querying
    3. Graph-Like Data Models
      1. Property Graphs
      2. The Cypher Query Language
      3. Graph Queries in SQL
      4. Triple-Stores and SPARQL
      5. The Foundation: Datalog
    4. Summary
  3. Storage and Retrieval
    1. Data Structures That Power Your Database
      1. Hash Indexes
      2. SSTables and LSM-Trees
      3. B-Trees
      4. Comparing B-Trees and LSM-Trees
      5. Other Indexing Structures
    2. Transaction Processing or Analytics?
      1. Data Warehousing
      2. Stars and Snowflakes: Schemas for Analytics
    3. Column-Oriented Storage
      1. Column Compression
      2. Sort Order in Column Storage
      3. Writing to Column-Oriented Storage
      4. Aggregation: Data Cubes and Materialized Views
    4. Summary
  4. Encoding and Evolution
    1. Formats for Encoding Data
      1. Language-Specific Formats
      2. JSON, XML, and Binary Variants
      3. Thrift and Protocol Buffers
      4. Avro
      5. The Merits of Schemas
    2. Modes of Dataflow
      1. Dataflow Through Databases
      2. Dataflow Through Services: REST and RPC
      3. Message-Passing Dataflow
    3. Summary
II. Distributed Data
  5. Replication
    1. Leaders and Followers
      1. Synchronous Versus Asynchronous Replication
      2. Setting Up New Followers
      3. Handling Node Outages
      4. Implementation of Replication Logs
    2. Problems with Replication Lag
      1. Reading Your Own Writes
      2. Monotonic Reads
      3. Consistent Prefix Reads
      4. Solutions for Replication Lag
    3. Multi-Leader Replication
      1. Use Cases for Multi-Leader Replication
      2. Handling Write Conflicts
      3. Multi-Leader Replication Topologies
    4. Leaderless Replication
      1. Writing to the Database When a Node Is Down
      2. Limitations of Quorum Consistency
      3. Sloppy Quorums and Hinted Handoff
      4. Detecting Concurrent Writes
    5. Summary
  6. Partitioning
    1. Partitioning and Replication
    2. Partitioning of Key-Value Data
      1. Partitioning by Key Range
      2. Partitioning by Hash of Key
      3. Skewed Workloads and Relieving Hot Spots
    3. Partitioning and Secondary Indexes
      1. Partitioning Secondary Indexes by Document
      2. Partitioning Secondary Indexes by Term
    4. Rebalancing Partitions
      1. Strategies for Rebalancing
      2. Operations: Automatic or Manual Rebalancing
    5. Request Routing
      1. Parallel Query Execution
    6. Summary
  7. Transactions
    1. The Slippery Concept of a Transaction
      1. The Meaning of ACID
      2. Single-Object and Multi-Object Operations
    2. Weak Isolation Levels
      1. Read Committed
      2. Snapshot Isolation and Repeatable Read
      3. Preventing Lost Updates
      4. Write Skew and Phantoms
    3. Serializability
      1. Actual Serial Execution
      2. Two-Phase Locking (2PL)
      3. Serializable Snapshot Isolation (SSI)
    4. Summary
  8. The Trouble with Distributed Systems
    1. Faults and Partial Failures
      1. Cloud Computing and Supercomputing
    2. Unreliable Networks
      1. Network Faults in Practice
      2. Detecting Faults
      3. Timeouts and Unbounded Delays
      4. Synchronous Versus Asynchronous Networks
    3. Unreliable Clocks
      1. Monotonic Versus Time-of-Day Clocks
      2. Clock Synchronization and Accuracy
      3. Relying on Synchronized Clocks
      4. Process Pauses
    4. Knowledge, Truth, and Lies
      1. The Truth Is Defined by the Majority
      2. Byzantine Faults
      3. System Model and Reality
    5. Summary
  9. Consistency and Consensus
    1. Consistency Guarantees
    2. Linearizability
      1. What Makes a System Linearizable?
      2. Relying on Linearizability
      3. Implementing Linearizable Systems
      4. The Cost of Linearizability
    3. Ordering Guarantees
      1. Ordering and Causality
      2. Sequence Number Ordering
      3. Total Order Broadcast
    4. Distributed Transactions and Consensus
      1. Atomic Commit and Two-Phase Commit (2PC)
      2. Distributed Transactions in Practice
      3. Fault-Tolerant Consensus
      4. Membership and Coordination Services
    5. Summary
III. Derived Data
  10. Batch Processing
    1. Batch Processing with Unix Tools
      1. Simple Log Analysis
      2. The Unix Philosophy
    2. MapReduce and Distributed Filesystems
      1. MapReduce Job Execution
      2. Reduce-Side Joins and Grouping
      3. Map-Side Joins
      4. The Output of Batch Workflows
      5. Comparing Hadoop to Distributed Databases
    3. Beyond MapReduce
      1. Materialization of Intermediate State
      2. Graphs and Iterative Processing
      3. High-Level APIs and Languages
    4. Summary
  11. Stream Processing
    1. Transmitting Event Streams
      1. Messaging Systems
      2. Partitioned Logs
    2. Databases and Streams
      1. Keeping Systems in Sync
      2. Change Data Capture
      3. Event Sourcing
      4. State, Streams, and Immutability
    3. Processing Streams
      1. Uses of Stream Processing
      2. Reasoning About Time
      3. Stream Joins
      4. Fault Tolerance
    4. Summary
  12. The Future of Data Systems
    1. Data Integration
      1. Combining Specialized Tools by Deriving Data
      2. Batch and Stream Processing
    2. Unbundling Databases
      1. Composing Data Storage Technologies
      2. Designing Applications Around Dataflow
      3. Observing Derived State
    3. Aiming for Correctness
      1. The End-to-End Argument for Databases
      2. Enforcing Constraints
      3. Timeliness and Integrity
      4. Trust, but Verify
    4. Doing the Right Thing
      1. Predictive Analytics
      2. Privacy and Tracking
    5. Summary
Glossary
Index