The recent massive British Airways data centre outage highlights just how damaging data centre issues can be. Data centre managers are constantly replacing ageing infrastructure systems and components in order to avoid failures and downtime. However, most data centre outages aren’t actually caused by tech failure, but by issues with people and processes.
In BA's case, a power outage caused the initial downtime, but it was staff in the data centre not knowing how to safely restore the systems when powering them back on that led to the wider outage. This knowledge gap, attributed to outsourcing and insufficient training, made the initial outage far worse, turning it into a full-blown disaster. Data centre managers pour energy and resources into keeping infrastructure up to date, but the processes around those systems deserve just as much attention if operations are to remain resilient.
Two-thirds of data centre outages are related to process, not infrastructure systems. The costs related to an outage are far-reaching, including not only initial costs such as damage to mission-critical data, lost productivity and equipment damage but also legal and regulatory impacts and lost confidence and trust from stakeholders and customers. The losses of reputation and market share for BA have been significant.
If people and processes are so important in reducing downtime, data centre managers should take the following actions:
- Make maintenance coherent and repeatable
When servers are constantly being patched, regular maintenance routines need to be stored as repeatable, sustainable and updateable processes that don't rely solely on human knowledge. This reduces the risk of an outage while maintenance is being run.
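One way to picture this is maintenance stored as versioned data rather than tribal knowledge. The following is a minimal sketch, not a real tool; the routine name, step descriptions and `run_routine` helper are all hypothetical:

```python
# Hypothetical sketch: a maintenance routine stored as data, so it is
# repeatable and reviewable rather than living in one engineer's head.

MAINTENANCE_ROUTINES = {
    "monthly-patching": [
        {"step": "drain traffic from node", "verify": "no active connections"},
        {"step": "apply OS patches", "verify": "patch level matches baseline"},
        {"step": "reboot and health-check", "verify": "all services green"},
        {"step": "return node to pool", "verify": "load balancer sees node"},
    ],
}

def run_routine(name, execute):
    """Run each step of a stored routine, recording the outcome.

    `execute` is a callable supplied by the operator that performs one
    step and returns True on success.
    """
    log = []
    for step in MAINTENANCE_ROUTINES[name]:
        ok = execute(step)
        log.append((step["step"], ok))
        if not ok:
            break  # stop at the first failure so the system state is known
    return log

# Dry run: pretend every step succeeds.
log = run_routine("monthly-patching", lambda step: True)
```

Because the routine is data, it can be version-controlled, reviewed and updated after every incident, which is exactly what makes it sustainable.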
- Constantly learn and improve
When an issue does occur and recovery is needed, it is essential to update processes and procedures accordingly. If knowledge and information are gathered whenever there is an issue or outage and processes are regularly updated, the processes will be continuously improved and downtime will be reduced.
- Provide status visualisation
Infrastructure is often blamed for failed releases when, in reality, the software side of the business pushed out a release without knowing that maintenance was being carried out by those running the infrastructure. Making maintenance more predictable and surfacing reliable information reduces the number of these failures. Real-time visibility into what is being done to each system helps avoid collisions between software updates and infrastructure maintenance.
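A simple version of this collision check might look like the sketch below, where a release pipeline consults shared maintenance windows before deploying. The system names, dates and `safe_to_deploy` helper are illustrative assumptions; in practice the windows would come from a live status service, not a hard-coded list:

```python
# Hypothetical sketch: consult a shared maintenance calendar before a
# software release, so deploys don't collide with infrastructure work.
from datetime import datetime, timezone

# In reality these would be fetched from a status service; hard-coded here.
MAINTENANCE_WINDOWS = [
    # (system, window start, window end) -- all times UTC
    ("db-cluster-1",
     datetime(2017, 6, 10, 1, 0, tzinfo=timezone.utc),
     datetime(2017, 6, 10, 5, 0, tzinfo=timezone.utc)),
]

def safe_to_deploy(system, when):
    """Return True if no maintenance window covers `when` for `system`."""
    for sys_name, start, end in MAINTENANCE_WINDOWS:
        if sys_name == system and start <= when < end:
            return False
    return True

inside = datetime(2017, 6, 10, 3, 0, tzinfo=timezone.utc)   # mid-window
outside = datetime(2017, 6, 10, 9, 0, tzinfo=timezone.utc)  # after window
```

The value is not in the code itself but in having one authoritative, machine-readable source of status that both software and infrastructure teams trust.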
- Improve communication
Good communication and status visibility will improve response times and better enable teams to deal with issues. People need the best information available to make informed decisions and everyone needs to be aware of current status. There should also be a clear policy for how and when clients are notified, but communicating internally is just as important.
- Have recovery plans ready
Storing data centre recovery test plans as flexible templates will mean that subsequent tests are built on existing capabilities and knowledge rather than starting from scratch. It also means that when real-life disaster recovery plans are needed, the necessary information will be readily available, reducing the response time in a crisis.
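The idea of a recovery plan as a flexible template can be sketched as follows. The template fields, step list and `new_test_plan` helper are hypothetical, but they show how each test starts from the accumulated version rather than from scratch:

```python
# Hypothetical sketch: a recovery test plan stored as a template whose
# lessons accumulate, so each test builds on existing knowledge.
import copy

RECOVERY_TEMPLATE = {
    "scenario": None,                 # filled in per test
    "steps": [
        "confirm scope of outage",
        "fail over to standby systems",
        "verify data integrity",
        "restore primary and fail back",
    ],
    "lessons": [],                    # grows after every test or incident
}

def new_test_plan(scenario, lessons_so_far):
    """Instantiate the template, carrying forward accumulated lessons."""
    plan = copy.deepcopy(RECOVERY_TEMPLATE)  # leave the template untouched
    plan["scenario"] = scenario
    plan["lessons"] = list(lessons_so_far)
    return plan

plan = new_test_plan("total power loss",
                     ["restore systems in dependency order"])
```

When a real disaster strikes, the same template is the starting point, so responders are working from tested, current information instead of a blank page.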
With the average cost of a data centre outage rising (up 38% since 2010), managers need to do everything they can to reduce the risk of an outage and ensure the right processes are in place to deal with one quickly when it does occur. Reducing the risks involved in routine maintenance, increasing visibility for teams working on and using the services, and having repeatable recovery processes in place can significantly reduce downtime.