Release the Monkeys – How to Prevent an Epic IT Failure

The weekend’s BA outage provides an opportunity for all of us who are accountable for managing IT infrastructure to reflect on the engineering and operations behaviours we encourage within our organisations.

Service Resilience, Business Continuity and Disaster Recovery are all important aspects of a cohesive and comprehensive Service Management strategy.  In a world of Software as a Service and Public Cloud, however, we need to take this a step further: we need to look at prevention as well as cure and apply resilience engineering to business-critical systems.

As engineers and technology leaders, we must understand how to balance design, engineering and operations when building systems to enable failure prevention rather than cure.

Richard Slater, Principal Consultant

Resilience Engineering is a practice originally from the construction industry, whereby the design of buildings, infrastructure and utilities incorporates failsafe mechanisms, backups and redundancy to keep people alive when the worst happens.

There is another lesson to learn from the construction industry: resilience needs to be baked into the planning and architecture of a project; it is not something we can bolt on at the end.  Typically, within System Design, we talk about the Availability quality attribute, which incorporates resilience and represents the system’s ability to withstand failures, attacks and unexpected demand.  Architecture, however, is just the beginning of the story; we still need to think about resilience during the engineering of a project.

Release the Monkeys

The question of how you encourage engineering practices that improve the availability or resilience of a system doesn’t have a perfect answer.  As developers, we need to balance functional requirements, such as ‘Add this product to a bag’, with non-functional requirements, such as ‘The bag must survive the failure of a datacentre’.  Often the two compete, and it is easier to prioritise the former when the pressure is on.

Netflix came up with an innovative solution to this problem in the form of Chaos Monkey, a tool that simulates failure in AWS across Development, Testing and Production.  Chaos Monkey works by killing off AWS instances indiscriminately; this causes systems that have not been architected and engineered for resilience to fail unexpectedly.  That failure, in turn, gets engineering teams thinking about availability and resilience as part of their daily activities, crucially tightening the feedback loop between failure point and mitigation.  A minimal sketch of the idea is shown below.
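To make the pattern concrete, here is a minimal, illustrative sketch in Python using boto3 – not Netflix’s actual Chaos Monkey implementation.  It picks a random running EC2 instance from a pool that has opted in via a tag and terminates it; the tag name, region and dry-run default are assumptions for the example.

```python
# Illustrative chaos-monkey-style script (a sketch, not Netflix's Chaos Monkey).
# The region, tag name and dry-run default are assumptions for this example.
import random

import boto3
from botocore.exceptions import ClientError

REGION = "eu-west-1"        # assumed region
CHAOS_TAG = "chaos-opt-in"  # assumed tag marking instances that opt in to chaos testing


def find_candidates(ec2):
    """Return the IDs of running instances that carry the opt-in tag."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag-key", "Values": [CHAOS_TAG]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def release_the_monkey(dry_run=True):
    """Pick one opted-in instance at random and terminate it (dry run by default)."""
    ec2 = boto3.client("ec2", region_name=REGION)
    candidates = find_candidates(ec2)
    if not candidates:
        print("No opted-in instances found; nothing to terminate.")
        return
    victim = random.choice(candidates)
    print(f"Terminating {victim} (dry_run={dry_run})")
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, AWS replies with DryRunOperation instead of acting.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise


if __name__ == "__main__":
    release_the_monkey(dry_run=True)
```

Run on a schedule during working hours, even a toy like this forces teams to confront what happens when an instance disappears without warning – which is precisely the feedback loop Chaos Monkey is designed to create.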

Beyond Architecture and Engineering

Architects and Engineers can typically only work within business and technical constraints.  As business leaders, however, we have the capability to set a strategy that emphasises the organisation’s varied responsibilities towards customers, shareholders and regulators.  This strategy should strike a balance between ‘delivering at pace’ and ‘delivering safely and securely’.

Fortunately, we are in a world where we can employ the public cloud to provide instant-on, highly available, scalable services.  You can only exploit this fully by adopting cloud-first patterns that rely heavily on the cloud providers’ Platform-as-a-Service offerings; a lift-and-shift mentality is no longer sufficient to create highly available, resilient systems.

Richard Slater, Principal Consultant

Don’t Keep All of Your Eggs in One Basket

An interesting takeaway from the BA outage is that BA appears to have built highly available IT services across two datacentres near its Heathrow headquarters.  Whilst BA should be commended for using multiple datacentres, it may have failed to recognise the single point of failure inherent in locating both datacentres in the same region.  Hosting facilities within the same country, or within 300 km of each other, leaves them exposed to the same political, social and environmental factors; given the thunderstorms at the weekend, it is possible that both Heathrow facilities were affected by a lightning strike on the electrical infrastructure.

Of course, the public cloud is not immune to these political, social and environmental factors, although in the case of AWS and Azure the majority of outages have been caused by human error rather than external influence.  As an organisation, you can mitigate this by employing services from multiple clouds to deliver your most critical functionality; engineering systems that can be deployed across multiple cloud providers gives you immunity from the failure of any one provider.  A simple illustration of this approach is sketched below.
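As a toy illustration of the principle, the sketch below probes a primary endpoint hosted on one provider and falls back to a secondary on another.  The endpoint URLs are hypothetical placeholders, and in practice the routing decision would live in a DNS or traffic-management layer rather than a script; the point is simply that the system has somewhere else to go when one provider fails.

```python
# Illustrative provider-level failover check. The endpoint URLs are
# hypothetical placeholders, not real services; real routing would normally
# sit in a DNS or traffic-management layer.
from urllib.error import URLError
from urllib.request import urlopen

ENDPOINTS = [
    "https://checkout.example-aws.com/health",    # assumed primary, hosted on AWS
    "https://checkout.example-azure.com/health",  # assumed secondary, hosted on Azure
]


def first_healthy_endpoint(timeout_seconds=2):
    """Return the first endpoint that answers its health check, or None."""
    for url in ENDPOINTS:
        try:
            with urlopen(url, timeout=timeout_seconds) as response:
                if response.status == 200:
                    return url
        except URLError:
            continue  # this provider is unreachable; try the next cloud
    return None


if __name__ == "__main__":
    healthy = first_healthy_endpoint()
    print(f"Routing traffic to: {healthy or 'no healthy provider found'}")
```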

If you want to know more about designing for failure or resilience engineering, feel free to get in contact with Richard Slater, Principal Consultant at Amido. 

 
