
Amazon SimpleDB Performance

The n+1 problem you can't avoid

Amazon offers a few massively scalable web services like Elastic Compute Cloud (EC2) and Simple Storage Service (S3). EC2 delivers virtual Linux boxes and S3 offers fail-safe disk space. Those services can be used as building blocks for custom applications. They are well thought out and allow you to start small and grow fast while concentrating on your own business (application) and leaving the operational details to a third party.

A rather new service, still in limited beta, is Amazon SimpleDB. It is "a web service for running queries on structured data in real time." While S3 stores unstructured data like files, SimpleDB stores items with attributes organized in domains. The concept is similar to other databases: an item could roughly be described as a row, an attribute as a column and a domain as a table. There are no connections between the domains (i.e. no foreign keys or referential integrity) but you can run arbitrary queries on the attributes of the items in a domain.
Like all Amazon web services, SimpleDB is accessible through HTTP using either a SOAP or a REST-style API. You can GET, PUT or DELETE items in a domain along with their attributes and you can QUERY a domain with a simple set of operators: =, !=, <, >, <=, >=, STARTS-WITH, AND, OR, NOT, INTERSECTION and UNION. A QUERY returns a list of item identifiers and is quite fast. If you also need the attributes of these items, however, you must execute an additional GET request for each item that has been returned.

What does that mean in practice?

Say you store a set of blog entries in SimpleDB and want to show a list of the ten least recently added entries with title, date and author. You would issue a QUERY request that returns the identifiers of the ten entries and then issue ten GET requests to retrieve the title, date and author attributes for each of them. That makes a total of 11 requests that have to be executed. In general you need n+1 requests for a query that returns n items.
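To make the request count concrete, here is a minimal sketch in Python. It does not talk to SimpleDB at all: FakeSimpleDB, query, get and list_entries are hypothetical names for an in-memory stand-in that only counts how many requests a real client would have to send, and the selection criteria of the query are left out.

```python
from dataclasses import dataclass


@dataclass
class FakeSimpleDB:
    """Hypothetical in-memory stand-in for one SimpleDB domain.

    It is not the real API; it only counts simulated HTTP requests.
    """
    items: dict
    requests: int = 0

    def query(self, limit):
        # One request: returns item identifiers only, never attributes.
        self.requests += 1
        return list(self.items)[:limit]

    def get(self, item_id):
        # One request per item to fetch its attributes.
        self.requests += 1
        return dict(self.items[item_id])


def list_entries(db, n):
    """Fetch n entries the only way possible: 1 QUERY plus n GETs."""
    ids = db.query(limit=n)
    return [db.get(item_id) for item_id in ids]


# 30 made-up blog entries with the three attributes from the example.
db = FakeSimpleDB({
    f"entry{i}": {"title": f"Post {i}", "date": f"2008-01-{i:02d}", "author": "me"}
    for i in range(1, 31)
})

entries = list_entries(db, 10)
print(len(entries), "entries,", db.requests, "requests")  # 10 entries, 11 requests
```

The counter ends at 11 for ten entries, and it grows linearly with the size of the result set.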

Amazon explains that you can run these requests in parallel to improve the overall response time for a query, but that means you are running 50 or 100 parallel requests for just 10 visitors looking at your list. Sure, you can also cache them or do some other fancy tricks, but the problem remains: any query to SimpleDB requires n+1 HTTP requests. The bigger your result sets, the worse your performance. Interestingly enough, the n+1 problem is well known (and dreaded) from O/R mapping, where it can hit you if related objects are loaded in an improper way. The difference, however, is that with SimpleDB you can't escape it and that an HTTP request to SimpleDB is more expensive than an additional SELECT to a conventional database.
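The parallel approach can be sketched in the same spirit. Again this is only an illustration under assumptions: CountingStore and list_entries_parallel are hypothetical names, the store is an in-memory stand-in rather than the real service, and a thread pool stands in for whatever parallel mechanism your platform offers. Parallelism shortens the wait for one page, but the number of requests stays the same.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingStore:
    """Hypothetical stand-in for a SimpleDB domain that counts requests."""

    def __init__(self, items):
        self._items = items
        self._lock = threading.Lock()
        self.requests = 0

    def _count(self):
        # The counter is shared between worker threads, so guard it.
        with self._lock:
            self.requests += 1

    def query(self, limit):
        self._count()
        return list(self._items)[:limit]

    def get(self, item_id):
        self._count()
        return dict(self._items[item_id])


def list_entries_parallel(store, n):
    """One QUERY, then n GETs issued concurrently."""
    ids = store.query(n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # map() returns results in the same order as the identifiers.
        return list(pool.map(store.get, ids))


store = CountingStore({f"entry{i}": {"title": f"Post {i}"} for i in range(30)})

# Ten visitors each look at a list of ten entries.
pages = [list_entries_parallel(store, 10) for _visitor in range(10)]
print(store.requests)  # 110
```

Ten visitors cost 110 requests in total, 100 of them GETs, which is where the figure of up to 100 parallel requests comes from.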

So what's the point here?

In my opinion SimpleDB in its current shape is only suitable for a few use cases, namely when at least one of the following conditions is met:

  • Your application only uses PUT, GET and DELETE requests
  • Your application only uses queries that return very few items (10 or fewer)
  • Your application only uses queries very infrequently, for example if serving all data from a distributed cache and using SimpleDB only as a backup store for structured data

Maybe that's one of the reasons for Amazon to partner with Sun and provide support for MySQL on EC2.

Update 2008-08-27

Amazon has updated their API to include a Query Sort feature, which sorts result sets based on a single attribute, and a QueryWithAttributes feature, which returns all information associated with the items of a particular query:

The QueryWithAttributes feature provides the ability for developers to retrieve all the information associated with items returned as a response to a particular query. This highly requested feature simplifies the application development process; instead of issuing a query followed by a series of read requests, application developers can now use a single API call to retrieve all information about items stored in Amazon SimpleDB. This is useful for developers who are not using parallel programming or who utilize programming languages that do not support parallel programming.

I didn't have a chance to test these two new features yet but from the spec it seems like they could make SDB an interesting alternative to traditional databases in a few more scenarios.
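For comparison, this is roughly what a QueryWithAttributes-style call changes in terms of request count. As said, I have not tested the real feature, so the sketch below is only an in-memory model: CountingDomain, query_with_attributes and the other names are hypothetical and are not the actual API.

```python
class CountingDomain:
    """Hypothetical in-memory model of a domain; counts simulated requests."""

    def __init__(self, items):
        self._items = items
        self.requests = 0

    def query(self, limit):
        # Old style: identifiers only.
        self.requests += 1
        return list(self._items)[:limit]

    def get(self, item_id):
        self.requests += 1
        return dict(self._items[item_id])

    def query_with_attributes(self, limit):
        # New style: identifiers and attributes in a single request.
        self.requests += 1
        return [(key, dict(self._items[key])) for key in list(self._items)[:limit]]


items = {f"entry{i}": {"title": f"Post {i}", "author": "me"} for i in range(30)}

old = CountingDomain(items)
old_result = [(key, old.get(key)) for key in old.query(10)]

new = CountingDomain(items)
new_result = new.query_with_attributes(10)

print(old.requests, new.requests)  # 11 1
```

Both variants return the same data in this model; only the number of round trips differs.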



