« Simple MVC PHP Framework | Main | The I.H.S.D.F. Theorem: A Proposed Theorem for the Trade-offs in Horizontally Scalable Systems »

December 16, 2008

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00e54fa5b0b98833010536805225970c

Listed below are links to weblogs that reference Scalability Strategies Primer: Database Sharding:

Comments

SFPUG

Postgres == no need for sharding?

Max Indelicato

Hi SFPUG,

Most of the above sharding strategies are just as applicable to Postgres as MySQL (or SQL Server, Oracle, etc). Database sharding is really an RDBMS independent scalability solution.

Nikhil Gupte

Why are you storing the password in the Index Shard?

How would you query the system if it receives a URL to fetch a particular Blog Entry, for instance, if someone enters the following URL: http://theblog.com/BLOG-ID ? Would we then need to always have urls that contain the user's ID (http://theblog.com/USER-ID/BLOG-ID) ?

What if we have a page that lists all the top blogs (which may be on different shards)? Would we then need to combine the results of multiple queries within the application? If yes, would it not be a problem to do *sorting* and *pagination* of huge datasets within the application layer (as opposed to the database layer)?

Is it possible (performance-wise) to use joins (and UNIONs) with *remote tables* (if the db supports it like PostgreSQL) within the database layer itself? Would that be better than doing it on the application layer?

Niklas

Great article! Covers the base of scaling, and very much needed.

Thank you

Bogdan

Good article.
You haven't talked about using limit or order by clauses in the user shards.
What if you want to display a list of users, alphabetically ordered?

Max Indelicato

Hi Nikhil,

I've numbered my responses to each of your questions for clarity:

1.) I stored the password in the Index Shard as an authentication optimization. Really, my intent was to show that you can use the Index Shard to create indexes on data beyond just the data's key. Having the password in the Index Shard means that upon authentication, we don't need to make another trip to a user's Domain Shard because all we're doing is checking the username and password anyway. Again though, this is an optimization and you could just keep the password in a user's Domain Shard solely if you wished.

2.) To obtain a user's blog entries you would create another index table in the Index Shard. An index table with the following columns would probably do what you need:

* userId
* blogId

Using the small table above, you would simply query for the blogId and retrieve the userId. If you joined that table with the original user index table, then you could find the shardId and retrieve the blog entries from the Domain Shard as specified by the shardId.

3.) Listing the top blogs would likely require some data duplication. You're correct that pagination and sorting is difficult in the application layer, and as such, it should be avoided if possible. I'd recommend creating specific Domain Shard just for the blog list using one of the other partitioning methods like Row Count Partitioning for example. A chron job or even a small application could compile the top blogs in the background on a separate server and insert them within the blog Domain Shard. Its important to remember that each partitioning strategy has its place and you may use them in conjunction with each other within the same system.

4.) In a sharded approach, I would recommend that you avoid joins across shards altogether. They can get tricky and almost always result in an overuse of resources. The solution here would be to use another partitioning method like in question 3.

Hope that answers your questions!

Max Indelicato

Hi Bogdan,

See the answer to Nikhil's question 3 that I provided above. Use each partitioning strategy for what it's intended for and combine them on the same set of data as necessary.

JAlexoid

You kids!...
This stuff we call Database Federation. Ever heard of that?
While you need to mock about in you application code, we can just add new servers and do a rebalance of data, and that is it. No application or application configuration changes required...

Max Indelicato

Database federation traditionally results in a SPOF - the database that is acting as the view. ***[Database federation also doesn't provide parallelized queries upon a single whole dataset spread amongst the federated servers.]*** - EDIT: I actually believe I'm wrong on this point.

Beyond that, the issue is lack of control. If in fact you do hit the limits of scability afforded to you by a specific vendor's implementation of database federation, you're out of luck. You can't easily pick apart a vendor's software to optimize as necessary, and a vendor rarely has your application's requirements in mind when they were designing their federation feature-set.

With database sharding as I've described in this article you're allowed all of the above, in addition to everything that traditional RDBMS database federation provides. You can think of database sharding as a custom approach to traditional database federation. I'll admit that canned database federation is a far easier solution, it's just not the right solution for every problem.

These are some of the reasons that many of the larger organizations (Facebook, Myspace, etc) that see obscene amounts of load on their databases, go with the custom database sharding strategy.

rhnet

Awesome job, thanks for the article.

Yuriy

Hi, thanks for the article. I have a question. It will sound silly for somebody who's been practicing db sharding, but please bear with me.

Let's say I have shards that store proverbial invoices. Since each shard is a stand-alone db instance, it will have "invoices" table with auto_incremental PK. Each shard auto increments the PK value on its own and so values are not unique among shards. In such a case I cannot expose an invoice by its id. www.example.com/invoices/123 could return multiple invoices from multiple shards. Am I missing anything here?

Thank you,
Yuriy

Max Indelicato

Hi Yuriy,

You've asked a really good question. In fact, I'm going to write a blog post on it tonight so that I can get into more depth than I can on a response to your comment. Keep a look out for it later on and feel free to comment on that post if your question isn't answered sufficiently in the post.

Yuriy

Sounds good! Thanks Max.

The comments to this entry are closed.

About

  • Max Indelicato. Chief Software Architect and technology enthusiast.

Other Services I Use

January 2009

Sun Mon Tue Wed Thu Fri Sat
        1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31