Data centre resilience - what to look for when selecting a 'resilient' data centre

The importance of resilience

Resilience, in the context of data centres, is the ability of a facility to remain operational during, and recover quickly from, unexpected events such as natural disasters, power outages, cyber-attacks, and other disruptions. For any organisation that depends on its IT, it’s of the utmost importance. But how do you measure the resilience of a data centre facility?

The data centre tier system

One clear marker of a data centre’s resilience is its tier classification, which describes the level of availability and reliability of the facility’s infrastructure. The tier system was created by the Uptime Institute, a global data centre authority, to provide a common language and set of standards for evaluating data centre resilience. It ranges from Tier 1 to Tier 4, with each tier representing a progressively higher level of availability and redundancy:

  • Tier 1: a single path for power and cooling with no redundant components (the lowest level of resilience).
  • Tier 2: a single path for power and cooling, but with redundant capacity components (for example, an extra UPS module, generator, or chiller).
  • Tier 3: multiple paths for power and cooling, designed so that any component can be maintained or upgraded without disrupting service.
  • Tier 4: the highest level of redundancy, with multiple independent paths for power and cooling that are fully fault tolerant, so a single failure does not interrupt service.
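
To put the tiers in perspective, an availability percentage can be converted into expected downtime per year. The short Python sketch below does that arithmetic. The percentages are the figures commonly quoted across the industry for each tier; they are not taken from this article and are not a guarantee from any provider, so treat them as illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

# Assumption: availability figures commonly quoted for each tier.
# Illustrative only; check a provider's actual certification and SLA.
ILLUSTRATIVE_AVAILABILITY = {
    "Tier 1": 99.671,
    "Tier 2": 99.741,
    "Tier 3": 99.982,
    "Tier 4": 99.995,
}


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected hours of downtime per year for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for tier, pct in ILLUSTRATIVE_AVAILABILITY.items():
        hours = annual_downtime_hours(pct)
        print(f"{tier}: {pct}% available -> about {hours:.1f} hours of downtime a year")
```

On these assumed figures, the gap between the lowest and highest tier is roughly 29 hours of downtime a year versus under half an hour.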

But, beyond the numbers, what should you be looking for?

Redundant power and cooling systems ...

…to ensure continuous operations in the event of a power outage or other disruptions, such as:

  • Dual power feeds from separate power grids or power substations to ensure that power is available even if one feed fails.
  • Backup generators to provide power in the event of a power outage.
  • An Uninterruptible Power Supply (UPS) system to provide immediate backup power when the mains supply fails, allowing critical systems to remain operational until the backup generators take over.
  • Redundant cooling systems such as backup air conditioning units or chillers, to ensure that the temperature remains within acceptable limits in the event of a cooling system failure.
  • Hot aisle/cold aisle design to optimise airflow and cooling efficiency. This design separates hot and cold airflows and directs cool air to the front of the racks where the servers draw in air.
  • Environmental monitoring to detect and alert staff to potential cooling or power issues.
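
As a rough illustration of the environmental monitoring point above, the sketch below checks sensor readings against limits and produces alerts for staff. Everything in it is hypothetical: the `Reading` fields, the rack names, and the temperature limits are assumptions made for the example, and a real facility would use thresholds from its own design and equipment specifications.

```python
from dataclasses import dataclass
from typing import List

# Assumption: illustrative inlet-temperature limits; real thresholds come
# from the facility's design and the equipment manufacturers' specifications.
INLET_TEMP_MIN_C = 18.0
INLET_TEMP_MAX_C = 27.0


@dataclass
class Reading:
    """One hypothetical sensor sample for a rack."""
    rack: str
    inlet_temp_c: float
    on_battery: bool  # True when the rack's feed is running on UPS battery


def check_reading(reading: Reading) -> List[str]:
    """Return alert messages for a reading; an empty list means no issues."""
    alerts = []
    if not INLET_TEMP_MIN_C <= reading.inlet_temp_c <= INLET_TEMP_MAX_C:
        alerts.append(
            f"{reading.rack}: inlet temperature {reading.inlet_temp_c:.1f} C "
            f"is outside {INLET_TEMP_MIN_C}-{INLET_TEMP_MAX_C} C"
        )
    if reading.on_battery:
        alerts.append(f"{reading.rack}: running on UPS battery power")
    return alerts


if __name__ == "__main__":
    samples = [
        Reading("rack-a1", 22.4, False),
        Reading("rack-a2", 29.1, False),
        Reading("rack-b1", 23.0, True),
    ]
    for sample in samples:
        for alert in check_reading(sample):
            print("ALERT", alert)
```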

Multiple network connections ...

… from different providers to ensure high availability and connectivity:

  • Carrier diversity and network redundancy: so there is no single point of failure if one carrier experiences an outage or network issue.
  • Geographical diversity: having network providers with geographically diverse routes can help ensure that a single natural disaster or other unexpected event doesn’t affect all of the network connections.
  • Network monitoring: to detect potential network issues before they affect connectivity (and identify and isolate issues when they do occur).
  • Load balancing: to distribute traffic across multiple network connections so no single network connection becomes overloaded.
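
The way carrier diversity and load balancing work together can be sketched in a few lines of Python. This is a simplified model, not how any particular router or provider does it: the `Link` class, the carrier names, and the capacities are all hypothetical, and the `healthy` flag stands in for whatever network monitoring would report in practice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Link:
    carrier: str           # hypothetical carrier name, assumed unique
    capacity_mbps: float
    healthy: bool = True   # in practice this would be set by network monitoring


def distribute_traffic(links: List[Link], demand_mbps: float) -> Dict[str, float]:
    """Split demand across the healthy links in proportion to their capacity."""
    healthy = [link for link in links if link.healthy]
    if not healthy:
        raise RuntimeError("no healthy links available")
    total_capacity = sum(link.capacity_mbps for link in healthy)
    if demand_mbps > total_capacity:
        raise RuntimeError("demand exceeds the capacity of the remaining links")
    return {
        link.carrier: demand_mbps * link.capacity_mbps / total_capacity
        for link in healthy
    }


if __name__ == "__main__":
    links = [Link("carrier-a", 1000), Link("carrier-b", 1000), Link("carrier-c", 500)]
    print(distribute_traffic(links, 500))

    links[0].healthy = False  # simulate an outage at one carrier
    print(distribute_traffic(links, 500))
```

When one carrier is marked as failed, its share is spread across the remaining links, which is the behaviour the list above describes: no single connection is a point of failure, and no single connection carries all of the load.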

Strong physical and cyber security measures ...

… to protect against threats such as cyber-attacks, theft, and vandalism.

Comprehensive disaster recovery plans ...

… including strategies to recover critical IT systems and data in the event of a disaster. A disaster recovery plan typically includes procedures and technologies for restoring critical systems, applications, and data in a timely manner, with minimal data loss and downtime. The goal is to ensure that business operations can be resumed as quickly as possible after a disaster, with minimal impact on productivity, revenue, and customer service. Typical components include:

  • Backup and recovery solutions: these include backup systems and software that allow data to be backed up and stored offsite, so that it can be quickly restored in the event of a disaster.
  • High availability solutions: including technologies such as clustering, load balancing, and failover, which allow critical systems and applications to be automatically switched to a redundant system in the event of a failure.
  • Hot, warm, and cold sites: alternative locations for data centre operations that are equipped, to varying degrees, with redundant systems, power and cooling infrastructure, and network connectivity. A hot site is ready to take over almost immediately, while warm and cold sites need progressively more time to bring into service.
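
The failover idea behind high availability solutions can be shown with a minimal Python sketch: try the systems in priority order and use the first one that passes a health check. The function name, the node names, and the health check are all assumptions made for illustration; real clustering software handles far more (state replication, split-brain protection, and so on).

```python
from typing import Callable, Optional, Sequence


def select_active(nodes: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all have failed."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


if __name__ == "__main__":
    # Hypothetical status table standing in for real health checks.
    status = {"primary": False, "standby": True}
    active = select_active(["primary", "standby"], lambda name: status[name])
    print("Serving from:", active)
```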

And more

Beyond these key elements, it’s important to see evidence of:

  • a data centre regularly testing and validating its disaster recovery plan to ensure that it can effectively protect against potential disasters and minimise downtime;
  • the ability to scale up or down quickly and efficiently to meet changing business requirements and demand;
  • a proactive maintenance programme to identify and address potential issues before they become major problems.

With so many factors to take into account, the only way to really get a view of the resilience of a data centre is to take a tour of the facility and speak to the people to whom you are entrusting your business-critical IT and workloads. If you’re looking for a London edge or Manchester-based data centre footprint, get in touch or arrange a tour of our colocation data centres.