Incident Report: August 5 Service Disruption to BUR Datacenter

On August 5, 2021, from approximately 7:21 PM to 9:03 PM UTC, multiple sites hosted in our BUR datacenter experienced a service disruption resulting in elevated 5xx status codes and slow response times. As of 9:03 PM UTC, our service was fully restored.

Chronology of Events

Times are in UTC.

  • 7:21 PM – Our monitoring system detects rising temperatures at our Los Angeles Datacenter. Work begins simultaneously to reroute traffic to decrease CPU load and to identify the physical cause of the temperature increase.
  • 7:26 PM – First alert received.
  • 7:55 PM – Datacenter confirms with VIP that a power issue caused the temperature increase.
  • 7:58 PM – VIP releases our initial Lobby Post concerning the incident.
  • 8:02 PM – We begin to see some cooling recovery.
  • 8:10 PM – Another rise in temperature in the datacenter is detected.
  • 8:17 PM – VIP identifies a cooling issue within the affected datacenter and works toward a resolution with on-premises staff. Work continues to offload CPU usage from the datacenter while a physical fix is pursued.
  • 8:33 PM – VIP receives notice that cooling has returned and the datacenter temperature is decreasing.
  • 8:42 PM – We start to see cooler temperatures within the datacenter. The resolution is still in progress and we continue to monitor.
  • 8:47 PM – Datacenter transfers load back to utility power, temperatures continue to decrease.
  • 9:03 PM – We see full recovery from the service disruption for affected applications hosted in the Los Angeles Datacenter.

Business Impact

This service disruption caused elevated levels of 5xx status codes as well as sporadic increases in loading times. This event affected applications served from our Los Angeles Datacenter.

  • The elevated levels of 5xx responses lasted approximately 70 minutes, starting around 19:20 UTC and returning to normal levels around 20:30 UTC, August 5, 2021.
  • Some customers also experienced longer response times during this event.

Root Cause Analysis

Why did this happen?

Servers in the BUR datacenter began throttling their CPUs to lower frequencies due to an excessive increase in temperatures. This resulted in sites hosted in the BUR datacenter returning 5xx errors. The increase in temperatures at the datacenter was related to the datacenter suffering a power loss, including loss of backup power. When the datacenter regained utility power, the temperatures dropped and the outage was resolved.
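For illustration, the sketch below shows one way to flag thermally throttled cores on a Linux host by comparing each core's current frequency against its maximum as reported by sysfs. This is not VIP's monitoring code; the threshold and output format are assumptions.

```python
#!/usr/bin/env python3
"""Rough sketch: flag CPU cores running well below their maximum frequency,
which can indicate thermal throttling. Paths and the threshold are illustrative
assumptions, not VIP's actual monitoring configuration."""

from pathlib import Path

THROTTLE_RATIO = 0.80  # assumed: below 80% of max frequency is suspicious


def read_khz(path: Path) -> int:
    return int(path.read_text().strip())


def throttled_cores(ratio: float = THROTTLE_RATIO) -> list[str]:
    flagged = []
    for cpufreq in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq")):
        cur = read_khz(cpufreq / "scaling_cur_freq")
        max_khz = read_khz(cpufreq / "cpuinfo_max_freq")
        if cur < max_khz * ratio:
            flagged.append(f"{cpufreq.parent.name}: {cur} kHz of {max_khz} kHz max")
    return flagged


if __name__ == "__main__":
    for core in throttled_cores():
        print(core)
```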

The BUR datacenter partially going offline due to the server cooling issue was not outside of expected mitigation procedures, and VIP did not experience a cascading outage; sites hosted outside of this datacenter were not impacted.

VIP does not support origin datacenter failover. There are multiple POP locations around the globe for edge traffic, but an issue at the origin datacenter cannot be fully mitigated when the root cause is outside of VIP's control.

Corrective Actions

Immediate Fix

  • When temperatures began increasing, a line of communication with the physical datacenter location was opened as technicians worked on the cooling issue.
  • VIP rerouted some traffic away from the datacenter to decrease CPU load and core temperatures (a rough sketch of this kind of load shedding follows this list).
  • The datacenter technicians were able to remedy the power issue that was causing the ineffective cooling.
  • Replicas for several large sites were in the process of being spun up; however, temperatures dropped before they were needed.
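
To make the load-shedding step concrete, here is a minimal sketch of how an edge POP might shed origin load by serving stale cached responses instead of forwarding requests to the overheating origin. The cache layout, TTLs, and the `origin_degraded` flag are hypothetical and do not describe VIP's actual edge behavior.

```python
"""Illustrative only: an edge POP sheds origin load during an incident by
serving stale cached responses instead of forwarding requests to the origin.
The cache layout, TTLs, and origin_degraded flag are hypothetical."""

import time

CACHE: dict[str, tuple[float, bytes]] = {}  # url -> (stored_at, body)
FRESH_TTL = 300                   # normally revalidate after 5 minutes
STALE_TTL_DURING_INCIDENT = 3600  # allow hour-old content while the origin is hot


def handle(url: str, origin_degraded: bool, fetch_origin) -> bytes:
    """Serve from the POP cache when possible; only hit the origin on a miss."""
    now = time.time()
    cached = CACHE.get(url)
    max_age = STALE_TTL_DURING_INCIDENT if origin_degraded else FRESH_TTL
    if cached and now - cached[0] < max_age:
        return cached[1]           # served from the POP, no origin CPU spent
    body = fetch_origin(url)       # falls through to the origin datacenter
    CACHE[url] = (now, body)
    return body
```

Serving stale content at the edge trades freshness for a lower request rate at the origin, which in turn lowers CPU load and heat output.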

Preventative Actions

Communications & Process Improvements

  • VIP is updating the Outage Protocol to increase the speed of external alerts.
  • VIP is working with datacenter management to receive additional incident details which will then be applied to future risk mitigation.

Technology Improvements

  • We are investigating systems-related service failover to our alternate datacenters.
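
As a purely hypothetical sketch of what such a failover could look like, a health-check loop might promote a standby origin after repeated failed checks against the primary; the datacenter names, check URLs, and thresholds below are illustrative only and do not represent a committed design.

```python
"""Hypothetical sketch of origin health-check failover. Datacenter names,
check URLs, and thresholds are illustrative and do not describe a VIP system."""

import time
import urllib.request

ORIGINS = {
    "primary": "https://bur.example.internal/healthz",
    "standby": "https://alt.example.internal/healthz",
}
FAILURES_BEFORE_FAILOVER = 3  # consecutive failed checks before failing over


def healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def monitor() -> None:
    active, failures = "primary", 0
    while True:
        if healthy(ORIGINS[active]):
            failures = 0
        elif active == "primary":
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                active = "standby"  # e.g. repoint edge POPs or DNS at the standby
                print("Failing origin traffic over to the standby datacenter")
        time.sleep(30)
```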