Tuesday, January 1, 2013

Maintainability and Eucalyptus

I recently blogged about the importance of maintainability for on-premise clouds. Among the steps to a successful on-premise cloud deployment identified in that post, Eucalyptus, as IaaS software, is heavily involved in the Deploy and Maintain phases.


I already mentioned the work done to make Eucalyptus installation easy peasy, so let me summarize it here. Eucalyptus is packaged for the main Linux distributions, so installation is as easy as configuring the repository and doing a yum install or apt-get install. Configuring Eucalyptus is still a bit more complex than I would like, since it requires registering the components with each other, but the steps can easily be automated, as demonstrated by our FastStart installation.
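On a RHEL/CentOS front end, the install-and-register cycle looks roughly like the sketch below. This is an illustration, not the authoritative procedure: the IP addresses, partition names, and component names are placeholders, and the official installation guide should be followed for the exact repository setup and registration steps.

```shell
# Sketch only: package names follow the 3.2 packaging, but the hosts,
# partitions, and component names below are placeholders.
yum install -y eucalyptus-cloud eucalyptus-cc eucalyptus-sc eucalyptus-walrus

# Initialize the cloud database and start the Cloud Controller.
euca_conf --initialize
service eucalyptus-cloud start

# Register the other components with the CLC (addresses are examples).
euca_conf --register-walrus  --partition walrus    --host 10.0.0.2 --component walrus-1
euca_conf --register-cluster --partition cluster01 --host 10.0.0.3 --component cc-1
euca_conf --register-sc      --partition cluster01 --host 10.0.0.3 --component sc-1
```

Once the pieces are scripted like this, rerunning a deployment on fresh hardware is a matter of minutes rather than an afternoon of manual steps.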

Although there is always room for improvement, as distributed systems go, I dare say we are getting as easy as it gets. Moreover, any good sysadmin already uses software to manage the infrastructure, so I see script-ability as the most important feature for making easy progress with custom installations (i.e., Eucalyptus deployment recipes to use with Ansible, Chef, and Puppet).
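To make the point concrete, here is a tiny, hypothetical shell sketch of the idempotency pattern that tools like Ansible, Chef, and Puppet are built around: check the current state first, and act only when it differs from the desired one. The file path and setting below are made up for illustration, not a real Eucalyptus configuration value.

```shell
# Hypothetical helper illustrating the check-then-act pattern that
# configuration-management recipes rely on. The file and line are
# illustrative only.
ensure_line() {
    line="$1"; file="$2"
    # Append the line only if an exact match is not already present,
    # so running the script repeatedly leaves the file unchanged.
    grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Running the helper twice still yields a single copy of the line.
ensure_line 'CLOUD_OPTS="--debug"' /tmp/eucalyptus.conf
ensure_line 'CLOUD_OPTS="--debug"' /tmp/eucalyptus.conf
```

A recipe built from idempotent steps like this can be safely rerun after a partial failure, which is exactly what makes scripted Eucalyptus deployments maintainable.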


If you follow our development, you already know that Eucalyptus 3.2 was recently released. There is ample coverage of the release, either in general (Rich's and Marten's blogs) or for specific features (David's, Andrew's, and Kyo's blogs), but if I wear my Cloud Admin hat, the part that didn't get enough coverage is the amount of work that went into making Eucalyptus more maintainable.

Eucalyptus 3.2 fixed issues.
Eucalyptus 3.2 had 350 fixed bugs, and those are only the reported ones, since quite a few got fixed while restructuring parts of the code. Peek at the list and you will see the ones related to the new features, but there is also a large amount of work done to make Eucalyptus more robust and hence maintainable. You don't believe me? Let me give you a sample:
  • reworked the inner code paths of the Storage Controller, which now prevents accidentally configuring the SC with an undesired backend;
  • added safety mechanisms to HA operation which prevent, or greatly reduce, the risk of split-brain Cloud Controllers;
  • more robust handling of orphan instances (this situation arises when the Node Controller is not able to relay its information all the way to the CLC in a timely manner);
  • plugged memory and database connection leaks (fairly annoying, since they required restarting components under particular use cases).
Likely you got more excited about our awesome new user console, but it's features like the ones listed above that give me the comfort of a solid infrastructure.

User Console screenshot taken from David's blog

Are we there yet?

As I mentioned before, there is always room for improvement. The bulk of the work for 3.2 went into hardening the code, covering all the corner cases, and improving QA coverage. I call all of this the invisible work, since it is neither flashy nor apparent on cursory inspection, yet it is what allows the infrastructure to survive the test of time.

With most of the invisible work done, what lies ahead is easier to understand and categorize. The list of work scoped for 3.3 includes a lot of great new features: autoscaling, CloudWatch, and ELB, for example. This is still scoping work, so if you really like a feature, go and tell us, or up-vote it in our issue tracker. Yet with all these new features, we are not losing focus on our infrastructure roots: in particular the work on Maintenance Mode and Networking, alongside a lot of other features that will make it much easier to deal with cloud resources, for example vm types and tagging.

So, are we there yet? As I mentioned in my previous blog, work on an infrastructure is done only when the infrastructure is no longer in use, so no, we are not there yet, but we are sure having a great ride.