Dear Sir, Please Advise!
Such emails to mailing lists come even from well-known consulting companies like Accenture: three emails with almost the same question within a few hours, each one including the full 25-line company disclaimer.
I wonder whether these companies know that their cheap resources are effectively subsidized by well-educated, expensive employees, often employed by the very same companies, who answer those questions during office hours.
Amazon SimpleDB Performance
The n+1 problem you can't avoid
Amazon offers a few massively scalable web services like Elastic Compute Cloud (EC2) and Simple Storage Service (S3). EC2 delivers virtual Linux boxes and S3 offers fail-safe disk space. Those services can be used as building blocks for custom applications. They are well thought out and allow you to start small and grow fast while concentrating on your own business (application) and leaving the operational details to a third party.
A rather new service, still in limited beta, is Amazon SimpleDB. It is "a web service for running queries on structured data in real time." While S3 stores unstructured data like files, SimpleDB stores items with attributes organized in domains. The concept is similar to other databases: an item could roughly be described as a row, an attribute as a column and a domain as a table. There are no connections between the domains (i.e. no foreign keys or referential integrity) but you can run arbitrary queries on the attributes of the items in a domain.
Like all Amazon web services, SimpleDB is accessible through HTTP using either a SOAP or a REST-style API. You can GET, PUT or DELETE items in a domain along with their attributes, and you can QUERY a domain with a simple set of operators: =, !=, <, >, <=, >=, STARTS-WITH, AND, OR, NOT, INTERSECTION and UNION. A QUERY returns a list of item identifiers and is quite fast. If you also need the attributes of these items, however, you must execute an additional GET request for each item that has been returned.
What does that mean in practice?
Say you store a set of blog entries in SimpleDB and want to show a list of the ten most recently added entries with title, date and author. You would issue a QUERY request that returns the identifiers of the ten entries and then issue ten GET requests to retrieve the title, date and author attributes for each of them. That makes a total of 11 requests that have to be executed. In general you need n+1 requests for a query that returns n items.
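The request count can be made concrete with a small simulation. This is only a sketch: FakeDomain is a hypothetical in-memory stand-in, not the SimpleDB API, and it merely counts the round trips a real client would make (loading the test data is not counted).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NPlusOneSketch {

    /** Hypothetical in-memory stand-in for a SimpleDB domain that counts simulated requests. */
    static class FakeDomain {
        private final Map<String, Map<String, String>> items = new LinkedHashMap<>();
        int requestCount = 0;

        /** Test-data setup only; deliberately not counted as a request. */
        void put(String id, Map<String, String> attributes) {
            items.put(id, attributes);
        }

        /** Simulates QUERY: one request, returns item identifiers only. */
        List<String> query(int limit) {
            requestCount++;
            List<String> ids = new ArrayList<>(items.keySet());
            return ids.subList(0, Math.min(limit, ids.size()));
        }

        /** Simulates GET: one request per item, returns that item's attributes. */
        Map<String, String> get(String id) {
            requestCount++;
            return items.get(id);
        }
    }

    /** One QUERY followed by one GET per returned identifier: n+1 requests. */
    static List<Map<String, String>> listEntries(FakeDomain domain, int limit) {
        List<Map<String, String>> entries = new ArrayList<>();
        for (String id : domain.query(limit)) {
            entries.add(domain.get(id));
        }
        return entries;
    }

    public static void main(String[] args) {
        FakeDomain blog = new FakeDomain();
        for (int i = 1; i <= 25; i++) {
            blog.put("entry-" + i, Map.of("title", "Post " + i, "author", "someone"));
        }
        List<Map<String, String>> entries = listEntries(blog, 10);
        // prints "10 entries took 11 requests"
        System.out.println(entries.size() + " entries took " + blog.requestCount + " requests");
    }
}
```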
Amazon explains that you can run these requests in parallel to improve the overall response time for a query, but that means you are running 50 or 100 parallel requests for just 10 visitors looking at your list. Sure, you can also cache them or do some other fancy tricks, but the problem remains: any query to SimpleDB requires n+1 HTTP requests. The bigger your result sets, the worse your performance. Interestingly enough, the n+1 problem is well known (and dreaded) from O/R mapping, where it can hit you if related objects are loaded in an improper way. The difference, however, is that with SimpleDB you can't escape it, and that an HTTP request to SimpleDB is more expensive than an additional SELECT to a conventional database.
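A rough sketch of the parallel variant, using a thread pool from java.util.concurrent. The fetchTitle method is a hypothetical placeholder for one GET round trip; a real client would make an HTTP call there. The point it illustrates is that parallelism can shorten the wait but leaves the number of requests unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelGetSketch {

    /** Counts simulated GET requests across all threads. */
    static final AtomicInteger requests = new AtomicInteger();

    /** Hypothetical placeholder for one GET round trip to the data store. */
    static String fetchTitle(String itemId) {
        requests.incrementAndGet();
        return "Title of " + itemId;
    }

    /** Issues one GET per identifier on a fixed-size pool; results keep the input order. */
    static List<String> fetchAll(List<String> itemIds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String id : itemIds) {
                tasks.add(() -> fetchTitle(id));
            }
            List<String> titles = new ArrayList<>();
            // invokeAll blocks until every task has completed
            for (Future<String> future : pool.invokeAll(tasks)) {
                titles.add(future.get());
            }
            return titles;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            ids.add("entry-" + i);
        }
        List<String> titles = fetchAll(ids, 10);
        System.out.println(titles.size() + " titles fetched with " + requests.get() + " GET requests");
    }
}
```

Ten visitors hitting this code at the same moment would still produce ten times as many requests, only compressed into a shorter window.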
So what's the point here?
In my opinion SimpleDB in its current shape is suitable only for a few use cases, namely when at least one of the following conditions is met:
- Your application only uses PUT, GET and DELETE requests
- Your application only uses queries that return very few items (10 or fewer)
- Your application only uses queries very infrequently, for example if serving all data from a distributed cache and using SimpleDB only as a backup store for structured data
Maybe that's one of the reasons for Amazon to partner with Sun and provide support for MySQL on EC2.
Update 2008-08-27
Amazon has updated its API with two features: Query Sort, which sorts result sets by a single attribute, and QueryWithAttributes, which returns all information associated with the items matched by a query:
The QueryWithAttributes feature provides the ability for developers to retrieve all the information associated with items returned as a response to a particular query. This highly requested feature simplifies the application development process; instead of issuing a query followed by a series of read requests, application developers can now use a single API call to retrieve all information about items stored in Amazon SimpleDB. This is useful for developers who are not using parallel programming or who utilize programming languages that do not support parallel programming.
I haven't had a chance to test these two new features yet, but from the spec it seems they could make SDB an interesting alternative to traditional databases in a few more scenarios.
Alienating Users By Changing The License
Java Service Wrapper and Ext JS move to GPL
In the past few days I've noticed two Open Source projects that changed their license from a commercial-friendly license (BSD/MIT, LGPL) to GPL. The motivation of Java Service Wrapper and Ext JS seems to be to offer commercial licenses for closed-source projects that cannot use GPLed libraries.
While I understand that these projects need resources to enhance their products, this step will have an impact on the trust companies place in Open Source projects. It seems both projects used a commercial-friendly license to attract people. Now that they have an established user base, they try to monetize this asset. It remains to be seen if this will work, as various groups are already discussing options to fork these projects; nevertheless the overall damage remains. "A foundation you can build on" - the claim of Ext JS - leaves a bad taste. Not only will companies developing closed-source products now have to pay (or switch, or stick with the old, soon unsupported version); Open Source projects with a more commercial-friendly license will also have to look for alternatives.
Another interesting aspect is the current use of the GPL: Once designed to grant a maximum of freedom to the users of software it has become a means to restrict usage. Many projects that use the GPL choose to dual license their software and sell commercial licenses to those who can't use GPLed libraries in their products.
Openfire Enterprise is becoming Open Source
The most complete Java-based XMPP server
The enterprise edition of the popular Java-based XMPP server Openfire is becoming Open Source. This also includes the Flex-based IM client Sparkweb.
As Matt points out, the clustering functionality in the enterprise edition will not be made Open Source: "Part of the reason for this is that it uses a third-party commercial library for clustering."
Nevertheless, due to the modular design of Openfire, the use of Coherence can quite easily be substituted by a free alternative like Terracotta.
This is really great news. Thanks to Jive for their commitment to the community.
Update 2008-04-07
Gaston Dombiak has followed up with a roadmap describing the two phases of the transition. The first phase, with the majority of features, is scheduled to be finished by April 27th.
Help Vampires
Sucking the very life and energy out of people
Just received a great link from Martin Smith on the Asterisk-Java dev mailing list:
Help Vampires.
The page includes everything you need to know:
- How to identify them
- What to do if you are a Help Vampire
- How you can reform them
A lot of help vampires originate from countries that provide cheap outsourcing capabilities for companies in Europe and the US. Interestingly enough, these vampires usually receive support from employees and contractors working for (and paid by) the very companies that enjoy the cheap prices of offshore development.
Advertising on WHOIS?
Crazy ideas by Network Solutions LLC
WHOIS is a simple TCP-based protocol that is widely used to provide human-readable information about domain names, registered networks, NIC handles and more. It dates from the early days of the Internet and is described in RFC 3912.
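To illustrate how simple the protocol is, here is a minimal client sketch. Per RFC 3912 the client connects to TCP port 43, sends the query terminated by CRLF, and reads until the server closes the connection. The class and method names are made up for this example, the server host is taken from the command line rather than assumed, and decoding the reply as UTF-8 is an assumption since the RFC does not define a character set.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class WhoisSketch {

    /** Well-known WHOIS port according to RFC 3912. */
    static final int WHOIS_PORT = 43;

    /** A WHOIS request is just the query text followed by CRLF. */
    static byte[] buildQuery(String name) {
        return (name + "\r\n").getBytes(StandardCharsets.US_ASCII);
    }

    /** Sends one query and returns the raw reply; the server ends the reply by closing the connection. */
    static String lookup(String server, String name) throws IOException {
        try (Socket socket = new Socket(server, WHOIS_PORT)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildQuery(name));
            out.flush();
            // Assumption: reply decoded as UTF-8; RFC 3912 leaves the encoding open.
            return new String(socket.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java WhoisSketch <whois-server> <name>");
            return;
        }
        System.out.println(lookup(args[0], args[1]));
    }
}
```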
Usually you will use WHOIS to look up the technical or administrative contact of a domain or network in case of technical issues (like DNS problems or routing trouble) or legal issues. Today I had a look at a .com domain and what I received was something like this:
    $ whois ibm.com
    Registrant:
    International Business Machines Corporation
    New Orchard Road
    Armonk, NY 10504
    US
    Domain Name: IBM.COM
    ------------------------------------------------------------------------
    Promote your business to millions of viewers for only $1 a month
    Learn how you can get an Enhanced Business Listing here for your domain name.
    Learn more at http://www.NetworkSolutions.com/
    ------------------------------------------------------------------------
    Administrative Contact:
    DNS Admin, IBM
    IBM Corporation
    New Orchard Road
    Armonk, NY 10504
    US
    +1.9147654227 fax: +1.9147654370
Network Solutions has long been known to seek additional revenue from their NIC services by adding "creative" features, but selling ads in WHOIS (apparently available since the end of last year) just sounds crazy.
Compression: gzip vs bzip2 vs 7-zip
A trade-off between time and space
Today I had a look at the different options to compress files (in this case for backup purposes) on an Ubuntu system. The most common tools to compress files are gzip and bzip2. They have both been around for a long time, are available on most systems by default and are nicely integrated with other utilities like GNU tar (using its -z and -j options).
7-zip and the algorithm it uses (LZMA) are not that common on UNIX-like operating systems. 7-zip is well known as a free alternative to WinZip on Windows systems and was started back in 1998. For Ubuntu, p7zip, a port of 7-zip to POSIX, is available in universe (sudo apt-get install p7zip).
My test file was a MySQL dump with a size of 163 MB that contains mostly text. I was interested in the compressed file size and in the time it takes to compress and uncompress the file.
Here are the results:
| Compressor | Size | Ratio | Compression | Decompression |
|---|---|---|---|---|
| gzip | 89 MB | 54 % | 0m 13s | 0m 05s |
| bzip2 | 81 MB | 49 % | 1m 30s | 0m 20s |
| 7-zip | 61 MB | 37 % | 1m 48s | 0m 11s |
For the test I ran all tools with their default settings, i.e. without providing any special options.
Gzip is still a great tool and provides good compression without consuming a lot of computation power. Bzip2 is much slower and only provides slightly better compression. 7-zip consumes a few more cycles than bzip2 but results in far smaller compressed files. Decompression is even faster with 7-zip than with bzip2.
So if time is important (think of on-the-fly compression), gzip is the tool of choice. If you don't care too much about processing speed and need very good compression, have a look at 7-zip. The only advantage bzip2 has over 7-zip is that bzip2 is part of most default installations and is more common. Let's hope this will change in the future; integration with GNU tar in particular would be great.
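For anyone who wants to repeat the gzip part of the measurement in-process, here is a small sketch using java.util.zip. The Java standard library ships no bzip2 or LZMA codec, so only gzip is covered, and the generated sample text is merely a stand-in for a real SQL dump. The ratio is computed the same way as in the table: compressed size as a percentage of the original, truncated to a whole number.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRatioSketch {

    /** Compresses a byte array with gzip at the library's default level. */
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reverses gzip(); used to confirm the round trip is lossless. */
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compressed size as a percentage of the original size, truncated. */
    static int ratioPercent(long compressedSize, long originalSize) {
        return (int) (100 * compressedSize / originalSize);
    }

    public static void main(String[] args) {
        // Made-up sample data that loosely resembles a text-heavy SQL dump.
        StringBuilder dump = new StringBuilder();
        for (int i = 0; i < 50_000; i++) {
            dump.append("INSERT INTO entries VALUES (").append(i)
                .append(", 'title ").append(i).append("');\n");
        }
        byte[] original = dump.toString().getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        byte[] compressed = gzip(original);
        long millis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("original:   " + original.length + " bytes");
        System.out.println("compressed: " + compressed.length + " bytes ("
                + ratioPercent(compressed.length, original.length) + " %) in " + millis + " ms");
    }
}
```

Applied to the table above, ratioPercent(89, 163) gives 54 and ratioPercent(61, 163) gives 37.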
The Advantage of Being Non-Agile
Sometimes being not so agile prevents you from adding crap to your system
Agile software development has been en vogue in this decade. It started with the popularity of Extreme Programming (XP) and Kent Beck's series of books on the topic.
One of the values stated in the Agile Manifesto is "Responding to change over following a plan". It is accompanied by the principle:
Welcome changing requirements, even late in development. Agile processes harness change for the customer's competitive advantage.
While I generally consider most agile practices as common sense, some of them may also cause trouble.
Being agile encourages customers to feed the development team with ideas on the spur of the moment, while non-agile processes enforce proper planning. As a result of their slower time to market, more formal processes prevent short-lived ideas from being implemented and consuming resources that would be better spent on future-proof core requirements. Besides, those short-lived requirements clutter the system if they are not properly removed once they are no longer needed.
A nice solution is to use different processes for different systems. Agile processes are best deployed for systems with a short lifetime and requirements that are not fully understood upfront. Examples of this kind of system are public-facing web applications like web stores. More formal processes are well suited for core systems with a long lifetime, where maintenance cost is an important factor and stability is key.
This separation allows new ideas to be tested in an agile environment and moved to core systems once they mature and turn out to be important features in the long run. By the time they finally reach the core system, the requirements are well understood and can easily be specified in a formal way - so the formal process is not a burden but a line of defense against feature creep.
Which systems belong to which category depends on the type of your business. If you are a bank your accounting systems will probably fit the "core" category as will the systems needed for compliance and trading. The public facing systems and systems mainly used for sales could benefit from an agile process.
The following questions may help you when you classify your own systems:
- How long do you expect the system to live?
- How stable are the requirements for the system?
- Will your system undergo frequent changes?
- How important is the system for your company in the long run?
If you are planning new systems you can also use this classification to define your system boundaries.
The specified call count is not a number: null
Acegi + DWR + IE6 - ActiveX = Boom!
Today we had an interesting bug in a small web application developed for a customer in the financial industry. The application is based on the Spring Framework, secured by Acegi Security and makes heavy use of AJAX (powered by DWR).
Everything went fine while testing with different browsers, from Firefox to Safari and Internet Explorer in different versions. Finally we started testing in the target environment - well locked down and without support for ActiveX controls from untrusted sites. IE6 needs ActiveX for its implementation of XMLHttpRequest (XHR) - the heart of AJAX. If XHR is not available, DWR automatically switches to using IFrames to emulate XHR. This usually works well and had already been used in the previous version. Nevertheless the application just didn't work: every remote call failed with a not so user-friendly error message: "The specified call count is not a number: null".
Remote debugging of Tomcat showed that DWR is trying to read the request data using req.getInputStream() in order to parse it. When using IFrames, reading from the request's input stream fails immediately and returns null; when using XHR, it works fine. The main difference is the content type of the requests: "application/x-www-form-urlencoded" for IFrames and "text/plain" for XHR. As IFrames do work without Acegi but fail with the Acegi filters in place, I guess Acegi messes with the requests when it wraps them in its SavedRequestAwareWrapper.
I didn't have the time to further track it down, but I created a small workaround that falls back to using req.getParameter() if reading the stream fails.
The following snippet shows the modification made to DWR's ParseUtil.java:
    in = new BufferedReader(new InputStreamReader(req.getInputStream()));
    while (true)
    {
        String line = in.readLine();
        if (line == null)
        {
            // Workaround: if the stream yielded no parameters at all,
            // fall back to the request parameters.
            if (paramMap.isEmpty())
            {
                Enumeration nameEnum = req.getParameterNames();
                while (nameEnum.hasMoreElements())
                {
                    String name = (String) nameEnum.nextElement();
                    paramMap.put(name, req.getParameter(name));
                }
            }
            break;
        }
        ...
And who is to blame? Well, as with many interesting problems that's hard to decide. It's just a combination of multiple pieces of software mixed with environment constraints. I guess it's something that just happens and reminds us that testing is not useless.
Update 2008-02-22
Joe has just released DWR 2.0.3 that includes a fix for this issue.
Will the BEA Acquisition Push Spring Framework?
Alternative stacks may gain more attention in response to Oracle's acquisition of BEA
BEA is well known for its J2EE application server WebLogic and recently extended its product line to a SOA stack branded Liquid. Oracle has its own J2EE components, mainly derived from Orion Server and branded as OC4J. They also have a SOA stack called Fusion.
So you might ask what the future of these products will be.
Oracle already acquired three different business application suites in the past: J. D. Edwards, PeopleSoft and Siebel. Does it make sense to develop three different product lines of business application suites and at least two different product lines of Java EE and SOA middleware? Probably not.
This uncertain future may make companies reevaluate their current technology stack for middleware applications. They will notice that there is an alternative beyond the IBM WebSphere dinosaur and JBoss, which is now Red Hat. The alternative is Apache Tomcat and Spring Framework, mixed with some Terracotta if you need distributed shared data.
Spring basically allows you to assemble your own application server. Transaction management, security, remoting, O/R mapping, clustering – just add what you need when you need it. Basically there is not much left where Spring does not offer a proven solution. On the other hand the features where commercial J2EE servers have been strong in the past (compared to other stacks) are becoming less important:
- EJB remoting will be replaced by web services
- entity bean style O/R mapping has already been replaced by Hibernate (where EJB3 tries to catch up)
- load distribution and fail-over are easy in a world of HTTP or JMS based web services
- clustering of data is less important in the stateless world of services
In the end the uncertainty caused by Oracle's acquisition of BEA could very well further increase the growth of Spring and the open source components it makes so easy to combine.