Monday, September 19, 2011

Drinking Champagne

Some call it eating your own dog food, but we prefer the alternative (check the Wikipedia article if you don't know it), so we want to share our latest experience running our own services on Eucalyptus or, in other words, how we have been drinking our own Champagne.

Harvesting The Grapes

As with many enterprise companies, it is the responsibility of internal IT to deploy and maintain software tools, products and services that are a vital part of the company's day-to-day operations. The list of products and services is ever-growing and so the following is hopelessly incomplete, yet it can give you an idea of what we use:

Issue Tracker
this was one of the first services we deployed. We even have an article on how it came to be;
Web Sites
we have 2 main web sites, which we internally call www and open, and hopefully they cover all your needs for private clouds;
as an open source company, the way you download Eucalyptus is very important to us, and we are always finding ways to improve it. We are in the process of spinning up a dedicated service to handle this, so stay tuned for the launch;
Starter Images
all of our 25,000 (and counting) Eucalyptus clouds need images that can be turned into instances to be useful. We provide starter images, which we are in the process of refreshing and making even more useful: we will be launching the EMIs service very soon with newer images, a mechanism to customize them at run time, and more documentation;
Euca Projects
we recently launched Euca Projects: it is a young site but it already has very interesting projects and documentation. We certainly used it when we needed to change the ECC logo;
LDAP directory
we internally use an LDAP directory as the keystone of our identity management system. We always look for ways to improve our users' experience, and single sign-on ranks pretty high on our list: we plan to make LDAP the basis for single sign-on to all our services;
Internal wiki and more
we also have an internal wiki, internal repositories and other services which are relied upon by the different departments.

The First Glass 

In summary, we have important services with very different SLAs (from critical to nice-to-have), about 1 TB of very important data associated with them (with various levels of backups and data retention policies), a handful of hostnames and IPs associated with the instances, and different network and security requirements. We happened to hear about this wonderful new technology called private cloud early on, so we have all (or almost all) of the above running on our Eucalyptus-powered internal cloud, which we unimaginatively called production cloud. We have various teams working on the different services (the web-team, the support-team, engineering, sales, professional services, etc.) and Eucalyptus accommodates all of them nicely. To have fine control, some teams employ their own Cloud Application Architect (see cloud IT roles), while others rely on our Tech-team for the job, and the IT-team is our Cloud Administrator. The services are compartmentalized into their own security groups and cloud accounts, and they use their own Walrus buckets and EBS volumes, thus nicely separating security concerns and limiting the possibility of a rogue application bringing down all our infrastructure.

Refining The Process

Eucalyptus follows AWS semantics and offers instances the Metadata Service, which allows an instance to retrieve information about its creation, ownership and more. Amongst the data available to a running instance there is the user data: this data is passed at run time (euca-run-instances time, to be more precise) and it is up to the instance to use it. Our new instances (the EMIs project mentioned above) will have a very simple mechanism which allows them to execute any script passed as user data. We are converting all of our services so that they can be restarted using a script passed as user data: in a nutshell, the script will download the specific configuration and data from Walrus buckets and/or mount its own EBS volume when needed, then install and configure the wanted service. You can already see some of these scripts on Euca Projects: the idea is very simple and it makes handling our services much simpler. We already have planet, redmine (projects) and databases, and we will be adding the scripts for our web server soon.
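
To make the idea more concrete, here is a minimal sketch in Python of what such a boot-time mechanism could look like. It is not the code shipped in our images. It assumes the user data is served at the EC2-style address http://169.254.169.254/latest/user-data, and that a payload should only be executed when it starts with a shebang line. The function names (fetch_user_data, looks_like_script, run_user_data) are made up for this example.

```python
import os
import shlex
import subprocess
import tempfile
import urllib.request

# Assumption: the EC2-style metadata address, reachable from inside the instance.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def fetch_user_data(url=USER_DATA_URL, timeout=10):
    """Return the raw user-data bytes, or b"" if nothing could be fetched."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return b""


def looks_like_script(data):
    """Treat user data as a script only if it starts with a shebang."""
    return data.startswith(b"#!")


def run_user_data(data):
    """Run the user-data script and return its exit code (None if not a script)."""
    if not looks_like_script(data):
        return None
    # Read the interpreter (and its arguments) from the shebang line.
    first_line = data.splitlines()[0].decode("utf-8", "replace")
    interpreter = shlex.split(first_line[2:])
    if not interpreter:
        return None
    # Save the script to a temporary file and hand it to the interpreter.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".user-data") as handle:
        handle.write(data)
        path = handle.name
    try:
        return subprocess.call(interpreter + [path])
    finally:
        os.remove(path)
```

A boot-time hook (an init script, for example) would then simply call run_user_data(fetch_user_data()). Passing the script explicitly to its interpreter, rather than marking the file executable, is a design choice made here so the sketch also works when the temporary directory is mounted noexec.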

Rebuilding the Engine while Racing

All of the above helped us tremendously early this week when a disk started to act up. One of our main servers (the front end for production cloud) had a disk starting to go bad, causing all sorts of issues. While Eucalyptus really helps with the menial IT tasks, it is clearly not a substitute for proper planning. Our server was still running on a single big disk (the project to update it kept getting a lower priority than more fun activities), so this issue had the potential to cause a huge impact and long downtime for a number of our services. Our IT-team went into overdrive and, in the span of 1 day, planned the move: instances were moved off EBS volumes, Walrus backups were temporarily suspended, and the Eucalyptus Cloud Controller was shut down while we migrated the data to a secure location. Since we didn't have our SAN ready for action, we decided to go for a temporary RAID5 solution to give us more time to plan for the next big expansion. Externally, there was less than a minute of downtime for services using EBS (we needed to migrate the data to ephemeral storage and switch over). The instances with our precious services kept running while we serviced the front-end machine, and all was well when we restarted the Cloud Controller. Eucalyptus rediscovered the running instances, took ownership of them, and a few minutes later all the state was properly restored. We had another brief downtime when we moved the data back to EBS (for the services using it).

The Day After

The only casualty of this process was an instance that got accidentally terminated. It was a Cloud Operator error and we have taken steps to avoid it in the future (no more watching action movies when doing major upgrades!). In case you missed the announcement, Eucalyptus 3 is coming fast, and it will make it even easier to handle situations like the one we encountered: HA is part of Eucalyptus 3, and moving a Eucalyptus component will be much easier to do while keeping the cloud running. Amongst the lessons learned, I would say: remember to listen to the Cloud Architect when planning your deployment. Having some services depend on a single disk is always a bad idea, no matter how many backups you have in your vault. Share your Eucalyptus or cloud experience with us.