Successful software design is all about trade-offs. In the typical (if there is such a thing) distributed system, recognizing the importance of trade-offs within the design of your architecture is integral to the success of your system. Despite this reality, I see time and time again, developers choosing a particular solution based on an ill-placed belief in their solution as a “silver bullet”, or a solution that conquers all, despite the inevitable occurrence of changing requirements. Regardless of the reasons behind this phenomenon, I’d like to outline a few of the methods I use to ensure that I’m making good scalable decisions without losing sight of the trade-offs that accompany them. I’d also like to compile (pun intended) the issues at hand, by formulating a simple theorem that we can use to describe this oft occurring situation.
First though, let me be clear about what I’m trying to accomplish by coining a theorem to describe the above actuality. My thoughts on this are hardly original. In fact, I would say that these issues are practically standardized as issues that are expected to be successfully overcome by the average system architect. What I am proposing, however, is that we can take a number of descriptive core issues and sum them up into a single, easy to remember, theorem.
Now, let me explain why trade-offs should be taken into careful consideration when designing your distributed system. For every problem that a system is designed to solve, there is a limitation added to the system. By this I mean that if your system solves one issue, the code and/or data put in place to do so, may require us to write additional code and/or add additional data to the system in order to remove its unintended effects. However, my adding the extra code and/or data, we’ve effectively required ourselves to write further code or add further data to then remove the consequences of writing our previous code or adding our previous data; and on and on, until as you can see, we’ve hit an infinite loop of adding to remove.
To better explain this, take for example a system that is designed to solve a real-time messaging based problem. The architecture that has been decided on is an XMPP based architecture. As such, we’ve basically decided to abstract away the database in such a way as to nearly render it unavailable for useful reporting purposes. So, by choosing our XMPP based architecture, we’ve made it more difficult to report on the events occurring within our system, even though we’ve developed a system that very specifically solves the initial problem.
Now let’s assume that this system grows large and the inevitable happens; we need to monitor our system and report on its usage. We must now add code to handle monitoring every event and dump it into a database. This database does what we need, but now the inevitable happens again; we need to add a feature to the system that involves the addition of a separate component that runs as a web service and connects to the system via XMPP. So we write code for this feature, and we write code for the monitoring portion of the system so that it can successfully monitor our newly added web service. As you can see, the monitoring portion of our system is a portion of the system that must also scale, both to the system’s usage and the system’s addition of features. That is our limitation, we now need to code additionally for the monitoring system.
The above is a simplified example, but I hope it helps you understand the issues involved in developing a system – there are trade-offs involved in every aspect of the design and development process. Now this is not to say that every problem we solve in code will have an equal impact on the usage of our system for our purposes. Sometimes we’ll add a feature that produces a limitation that we indeed never come into contact with. That is ideal, and we attempt to write code that produces these kinds of limitations all the time. What I’m trying to do is discuss some of the broader design issues that aren’t of that ilk.
Breaking this concept down into its core issues is crucial to the complete understanding of necessarily recognizing the need for trade-offs within your system. The core issues are as follows:
- Horizontally scaling a system complicates the system’s continued maintenance as more components are necessarily added to the system, therefore increasing the overall system’s size.
- As a result of a system’s increase in size, monitoring becomes more difficult, and eventually becomes a sub-system in and of itself.
- As a result of a system’s increase in size, the addition of features becomes a more time intensive and difficult task. This is due to the necessity of having to take into account the instantaneous volume of additional data, as a consequence of rolling out a feature amongst a system’s existing components. This is especially the case for a system with a large number of concurrent clients or users.
- As a result of a system’s increase in size, refactoring or reorganizing a system’s components and/or its component’s data involves the removal and subsequent re-launch of a system’s affected components and/or its affected component’s data.
We’ll define the above used terms – Data, Horizontal Scalability, and Flexibility – as follows:
- Data – In the context of the above points, data means a system’s code, running application instances, and file data (flat files, database data, etc).
- Horizontal Scalability – To horizontally scale a system is to partition a system’s usage along its topographical dimensions, among individual system components.
- Flexibility – A system’s flexibility is a function of the ease by which a system’s maintenance, monitoring, feature additions, and reorganization is made.
Now that we’ve properly defined the terms above as they are used within the context of the subject at hand, we can define our theorem as follows:
I’ve coined the above theorem the I.H.S.D.F Theorem (pronounced “is-deaf”). It’s hardly creative but I think it’s certainly appropriate and easy to pronounce.
I’d like to hammer home that the decrease in overall system flexibility will likely be the result of multiple issues resulting from horizontally scaling your system, and that this theorem is meant to easily state that reality. This theorem also encompasses what is in fact the basis on which most other distributed system issues are based.
I hope that this brief brain-dump has helped to define an easy to remember theorem that you can use to keep the realities of developing distributed systems in mind. I’ve been using it for a while now and I’ve found that it’s helped to keep my head planted squarely on a few separate occasions.
Great material here. Any type of ethic that points to silver-bullets being fiction is good, IMHO.
Having been through a fairly large architectural change from a monolithic to federated database change, I'm not totally in agreement that the decrease in flexibility has to be roughly proportional. On an individual node basis, monitoring doesn't have to necessarily change. I'd say that more systemic monitoring becomes more valuable, but as time goes on, the dynamics inherent within the system can better inform what you should be monitoring.
For example: before, you have nagios (or whatever) watching your MySQL instances on your individual nodes. Then you do the horizontal thing and go with your preferred method of doing that. Your original monitoring should obviously stay in place, but other patterns emerge (objects stored per shard, what thresholds of objects dictate what a 'hot' shard would be, etc.) that gives cluster-wide or application-level monitoring more value.
Either way, you're right: monitoring becomes more complex, but I'd also say that difficulties that come from that complexity would have been just the same if you hadn't horizontally scaled.
Posted by: John Allspaw | December 22, 2008 at 10:35 AM
Hi John,
You've made some good points.
My reference to monitoring was meant to be more specific to a system's feature set (web service that performs function X) than a system's components (MySQL instances, etc). Or as you put it: application-level monitoring. Basically, custom built monitoring becomes more difficult in my experience, regardless of how automated the mechanism is, as the size of the system increases. For example, monitoring a system of 10,000 servers is just plain complicated! I probably should have explicitly noted the difference there.
It's funny, after writing this up last night, I started mulling over adding an addendum to this theorem describing in further detail how I arrived at the proportional measure of increased horizontal scalability to decreased flexibility. I kind of gave myself an "out" in that I said "approximately proportional", but that's hardly scientific.
You're right, monitoring becomes more complex regardless of the reasons why a system grows larger, but I thought it important to differentiate growth by horizontal versus vertical scaling. In my experience, vertically scaling a system actually has little impact on the flexibility of the system from a developers perspective. That's the extent of my reasoning behind explicitly stating that this theorem is specific to Horizontal Scalability.
Posted by: Max Indelicato | December 22, 2008 at 11:10 AM
"In my experience, vertically scaling a system actually has little impact on the flexibility of the system from a developers perspective."
Indeed! I might even say that it's easy with 'diagonal scaling' as well:
http://clarification.wordpress.com/2008/06/05/diagonal-scaling-and-the-law-of-diminishing-returns/
Regardless, I've got you in my RSS sights. :) Not many people post with such insight about these topics.
Posted by: John Allspaw | December 22, 2008 at 02:06 PM
Agreed, diagonal scaling has the added benefit of easing a developer's life in regards to flexibility too. Interesting post you linked to, looks like a great resource on diagonally scaling.
And thanks for the kind words. I'm certainly hoping to post regularly on these kinds of topics.
Posted by: Max Indelicato | December 22, 2008 at 03:05 PM
I must say I love your blog. I've just recently begun scaling applications, and I've found your entries to be a very good resource on theoretical (as well as hands-on) scaling.
Thank you!
Posted by: Niklas | December 29, 2008 at 07:46 AM
Another article I wish management would read and understand.
Posted by: roppert | December 29, 2008 at 10:01 AM
I recently wrote a summary of a pattern for addressing the monitoring bottleneck on one of my recent posts: Data aggregation pattern for effective monitoring (http://natishalom.typepad.com/nati_shaloms_blog/2008/11/managing-application-on-the-cloud-using-a-jmx-fabric-1.html)
The general idea is that the data for monitoring will be stored into in-memory cloud and offloaded asynchronously to the database.
BTW are you familiar with Space Based Architecture? (http://en.wikipedia.org/wiki/Space_based_architecture)
Its a pattern for achieving linear scalability in stateful transactional applications. You may find it relevant from some of the scenarios you described in your blog.
Nati S.
Posted by: Nati Shalom | December 30, 2008 at 05:14 PM
Hi Nati,
I'm a huge fan of Space-based Architectures. In fact, a while back I started coding a web-service version that I intended to release to the dev community as an IaaS offering. GigaSpaces was a HUGE inspiration and much of my implementation was based on the GigaSpaces feature set. It ended up morphing into more of a simple tuplespace service after I realized just how much work was involved in developing a truly robust and scalable SBA web-service. Alas, I dropped it after Amazon came out with SimpleDB. The big gray cloud of discouragement got the better of me.
Anyway, I have a list of write-ups that I'd like to do, and there's one about SBA on there not too far from the top. Thanks for commenting, I'm actually a frequent reader of your blog - lots of good stuff on there.
Posted by: Max Indelicato | December 30, 2008 at 05:50 PM