Saturday, February 17, 2007

The "Hidden Costs of Virtualization"

An interesting article arguing that, whereas OS virtualisation is sold on the cost savings achieved by higher CPU utilisation of existing resource, several cost factors seem to be overlooked. The most important of these is that operational costs scale with the number of OS instances; it is immaterial whether those instances are real or virtual. The article also points out that commercial OS virtualization software is not cheap; around $20,000 for VMware ESX on a 4-way Intel box, though open source solutions should in due course pull this pricing down.

However, I'd suggest that there are a number of additional considerations.

OS virtualisation, in itself, does nothing to address the inherent complexity issues within modern enterprise environments. Instead of a sprawl of physical machines with poorly managed applications and configurations, one can now extend those complexity and management issues into a virtualized resource space! Obviously, OS virtualization management is needed, and indeed it is provided, as commercial products, by the virtualization vendors.

In many respects, are we not back where we started? Sure, we can now drive up CPU utilisation, but the runtime infrastructure is more complex than ever. Meanwhile, business applications are still as brittle, as tightly coupled, and as change-resistant as they ever were!

Also, is increased CPU utilisation, at the cost of increased complexity, a good trade?

Driving up CPU utilization has got to be good, right? Indeed, many CIOs want to make dramatic OPEX savings by driving CPU utilisation to ~100%.

Yet, whilst running a large datacentre's CPU resource at single-figure utilisation levels is an obvious cost issue, what seems to be overlooked are the issues associated with running resources at near-maximum utilisation.

Load volatility is an obvious concern. If you achieve, on average, 80% utilization across your resource population, just how do you cope with peaks that require, say, 50% more resource? The standard response may be to outsource the extra resource requirement to a third-party utility compute supplier. Yet whilst frequently discussed by the industry, I'm not aware of many over-capacity deals. Quite the contrary: early entrants into the utility compute market have recently dismantled facilities due to lack of commercial interest.
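To put rough numbers on that, here is a purely illustrative Python sketch using the figures above: at 80% average utilisation, a peak needing 50% more resource demands 120% of the installed capacity, and the extra 20% has to come from somewhere.

```python
# Back-of-the-envelope headroom check (illustrative figures, not measurements).
def peak_utilisation(average_utilisation: float, peak_multiplier: float) -> float:
    """Utilisation demanded when average load spikes by peak_multiplier."""
    return average_utilisation * peak_multiplier

avg = 0.80      # 80% average utilisation across the resource pool
spike = 1.50    # a peak needing 50% more resource

demand = peak_utilisation(avg, spike)
print(f"Peak demand: {demand:.0%} of total capacity")
if demand > 1.0:
    print(f"Shortfall: {demand - 1.0:.0%} must come from somewhere else")

# Output:
#   Peak demand: 120% of total capacity
#   Shortfall: 20% must come from somewhere else
```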

Yet there is a more important issue; namely, operational risk. Evidence suggests that compute resource under excessive load is statistically more likely to experience software failure (reference). Moreover, any complex, tightly coupled system may suffer cascading failures; i.e. an initial component failure cascading into a major system outage.

Hopefully, such cascading failures are the exception; however, a component failure within a heavily loaded environment will always be more intrusive than in a lightly loaded equivalent, since, depending upon the priority of the lost service or component, other running services may need to be terminated to free sufficient resource.
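As a minimal sketch of what "freeing sufficient resource" might look like in practice, the Python fragment below sheds the lowest-priority services first until the displaced load fits into the remaining spare capacity. The service names, priorities and sizes are entirely hypothetical; real brokers will be far more nuanced.

```python
# Minimal sketch of priority-based load shedding after a component failure.
# Service names, priorities and sizes are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    priority: int   # higher number = more important
    cpu: float      # fraction of the pool's capacity this service consumes

def shed_to_fit(displaced: float, spare: float, running: list[Service]) -> list[Service]:
    """Terminate the lowest-priority services until the displaced load fits."""
    victims = []
    # Consider the least important services first.
    for svc in sorted(running, key=lambda s: s.priority):
        if spare >= displaced:
            break
        victims.append(svc)
        spare += svc.cpu
    return victims

running = [Service("batch-reports", 1, 0.3),
           Service("test-env", 2, 0.2),
           Service("order-entry", 9, 0.4)]
print([s.name for s in shed_to_fit(displaced=0.5, spare=0.1, running=running)])
# -> ['batch-reports', 'test-env']   (0.1 + 0.3 + 0.2 covers the displaced 0.5)
```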

Hence, Operational Risk and Data Centre Resource Utilization are issues that are fundamentally linked; linked by the sizes of the potential failure domains within the system.

Failure domains may be defined by / mapped to:
  • Physical locality / Physically shared resource.
  • Hardware type
  • Software type / version
  • Management / Security domains

Considering "physical" failure domains, the following usually exist:
  • A data centre facility - (complete power failure, halon release, collapse of network infrastructure)
  • A shared PDU - (possibly affecting 25% of data centre resource, assuming critical systems like SAN storage and IP networking are wired into at least two PDUs)
  • A network switch failure - perhaps impacting 50 servers if each is single-homed.
  • A cabinet failure - perhaps affecting the 100 processors in that cabinet.
  • A single physical machine.

Without spare usable capacity, re-provisioning the OS instances displaced by the failure of a single cabinet, or even a single machine, may prove challenging.
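A rough way to quantify this (illustrative Python, with assumed machine counts): if the largest failure domain you must survive is, say, a 100-processor cabinet out of a 1,000-machine pool, then average utilisation above roughly 90% leaves nowhere to re-provision what that cabinet was running.

```python
# Sketch: how hard can a resource pool be driven while still absorbing the
# loss of its largest failure domain? Figures are illustrative only.
def max_safe_utilisation(total_hosts: int, largest_domain_hosts: int) -> float:
    """Fraction of total capacity usable if the largest domain must be survivable."""
    return (total_hosts - largest_domain_hosts) / total_hosts

# e.g. a 1,000-machine pool whose largest failure domain is a 100-processor cabinet
print(f"{max_safe_utilisation(1000, 100):.0%}")   # -> 90%
# Run the pool at 95% average utilisation and there is simply no room left
# to re-provision the instances lost with that cabinet.
```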

Perhaps we need to think about things in a slightly different way?

Conventional Wisdom:

Large datacentres are running out of space and/or are limited by environmental considerations (power, air-conditioning requirements). Usually, there is little possibility of building a secondary large datacentre facility within the metro area, because the cost can be substantial and suitable real estate is not available. Hence, virtualize compute instances and maximally use the resource that is already there.


The Alternative:

Instead of a single large datacentre facility, adopt modular datacentres, distributed over a larger geographic region. The largest failure domain we care about is then a datacentre module instance, so for 'N' active modular datacentres we need 'N+1' to allow for the complete failure of any one instance. By modularizing and virtualizing the datacentre, we actually increase the ability to use the spare CPU resource within each datacentre instance, without impacting operational risk.
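The underlying arithmetic is simple. A small Python sketch of the N+1 ceiling, with illustrative module counts: with N active modules plus one spare-equivalent, each module can safely run at up to N/(N+1) utilisation, so the more (and smaller) the modules, the higher the utilisation you can sustain while still absorbing the loss of a whole module.

```python
# Sketch of the N+1 argument: with N+1 identical modules, the fabric can lose
# any one module and still carry the load, provided per-module utilisation
# stays at or below N/(N+1). Module counts here are illustrative.
def safe_module_utilisation(active_modules: int) -> float:
    """Ceiling on per-module utilisation for N active + 1 spare-equivalent module."""
    n = active_modules
    return n / (n + 1)

for n in (1, 3, 7):
    print(f"N={n}: run each of the {n + 1} modules at up to {safe_module_utilisation(n):.0%}")
# N=1: 50%   N=3: 75%   N=7: 88%   -- more, smaller modules allow higher utilisation
```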

In this respect, Sun's Black Box strategy was an interesting marketing move. I say "marketing" as, in reality, the hardware required to realise a modular datacentre is the easy bit! The difficulty lies in the unstated, but very real, requirement for a distributed, extremely robust composite application resource broker that seamlessly knits these modular resources together into one robust, distributed Enterprise/Utility Service Fabric.
