Wednesday, December 5, 2012

It's Maintainability, Stupid

I have been playing with FastStart and Silvereye for a while, as well as worked with the single-cloud script. I also wrote about the setup I use for testing Eucalyptus on a laptop. The latest tweets, and blogs clocked a Eucalyptus installation in less than 15 minutes. Quite impressive I have to say. Yet I think it's important not to lose prospective on the ultimate goal for an on-premise cloud.

On-Premise Cloud as Infrastructure

I see the installation of  Eucalyptus (or any on-premise cloud for that matter) as a step during the deployment of a cloud, and neither the first one, nor the most important one. The on-premise clouds I am referring to, are IaaS, and right now I want to focus on the Infrastructure part . When I think about Infrastructure I think of satellites, power grid, aqueducts, Internet, and so on. There are quite a few characteristics of an Infrastructure, but one in particular comes to my mind to identify a successful one: Maintainability.

Why Maintainability?

An Infrastructure's purpose is to provide or support basics services: when infrastructure collapses, severe consequences are to be expected (think blackouts, blocked highway, etc...). An Infrastructure needs to be dependable (how can you build a house on unstable foundations?), to have a long lifespan (I can think of temporary only for Proof Of Concept installations), to sustain a very different load or use throughout its useful life (think about Internet, highways  etc.. from their inception to what they are today), hence it needs to  adapt to the load (elasticity), to isolate and/or limit the scope of failures (resiliency), to be functioning and accessible (available), to be inter-operable with different versions and/or similar minded infrastructure, to isolate operator access (minimize human errors). In my mind the above encapsulates the essence of Maintainability. I guess I'm taking the Cloud Admin side by considering reliability and availability as part of maintainability - since an Infrastructure that is neither reliable, nor available is not maintainable from the admin point of view.

Deploying On-Premise Cloud

So, how do we deploy an on premise cloud successfully? I see most of the difficulties in the planning, preparing, and forecasting. One key element is to understand where the workload and the cloud will be in the future, and to have a path to migrate both the underlying physical system and the cloud software incrementally. Very easy, right?

Learn your Workload

In most cases an on-premise cloud is deployed for a specific application or set of applications, or departments, or group of users, or use-cases, so the workload is already implicit. Learn the needs to of the workload in term of compute, network, storage, and how parallel or spiky it is, and if possible at all, forecast the future workload.

Capturing at this stage a sample of the workload, or to create an artificial load which mimic the real workload, is a boon for the successive stages. Although it not always easy or possible, having a model of the workload will be very helpful to validate the physical infrastructure, and the cloud deployment.

Map Workload to Cloud Resources

Cloud Resources are basically three: compute (CPU and RAM used by the instances), storage, network (bandwidth, security/isolation, IP addresses). Understanding the workload needs in terms of Cloud Resources usage, will help to size the cloud appropriately. 

Also, usage will most likely vary with the Cloud Users becoming more savvy: it's common to see a very heavy reliance on EBS (for instances and volumes) when starting using the cloud, and to move more toward instance-store once the applications become cloud-aware. 

Prepare your Physical Infrastructure

With the workload defined in term of Cloud Resources, we can start isolating the possible bottlenecks and prepare the physical infrastructure to successfully run the workload. Note that some Cloud Resources may end up using multiple physical resources at the same time: for example, boot from EBS instances may tax at the same time the Storage Controller (EBS service provider) and the network (to allow the Node Controller to boot the instance). 

Also factor in the load incurred in fulfilling some operation: for example starting an instance may have the Node Controller fetch the Image from Walrus, copying it on its local disk, then starting the instance. Those operation may create contention on the network (instance traffic and image transfer), on Walrus (multiple NC asking for different images), or on the local disk (if the caching of the instances is done on the same disk where ephemeral storage resides).

The physical infrastructure plays also a very important role when thinking about scaling the cloud: if the storage cannot be easily expanded, if the network cannot be upgraded or reconfigured, growing the cloud to meet the forecast workload may be impossible without the need of a re-install or a long downtime


Well, we finally got here: with the physical infrastructure properly sized, we can start installing each component, Cloud Controller, Walrus, Cluster Controller(s), Storage Controller(s), and Node Controller(s) on their respective hosts. And yes, this step can take a lot less than 1/2 hour, although on production installation, with multi-cluster, HA, and SANs, it may take a bit longer.


Once the cloud is deployed, it truly becomes an infrastructure, and as such we need to ensure it stays up all the time, through upgrades (cloud software, host OS, router firmware, all should be up-gradable with hopefully no downtime for the cloud as a whole), failures (failure of a machine or a component should not impact the cloud, although it may impact an instance, a pending request, etc...), expansions (adding Node Controllers, clusters, storage, network), and load spikes (cloud should degrade gracefully and not collapse). 

Any of the above steps may happen after deployment (you may need to profile some problematic workload, or to deploy some new components), yet they all fall under the Maintain umbrella, since they are all needed to ensure that the Infrastructure fires on all cylinders and becomes invisible. Once the Infrastructure has been in place long enough, its usage will be taken for granted, and the only attentions will be received when there are deficiencies or problems (think about power grid and how black-outs or brown-outs get in the news). That is, when your cloud is not anymore maintainable, you will be in the news.