Incident Report: September 29 Service Disruption to DCA Data Center

On September 29, 2021, at 17:05 UTC (1:05 PM EDT), a VIP origin data center in Ashburn, Virginia (DCA) experienced a rapid increase in ambient temperature. The temperature increase triggered thermal throttling on a subset of VIP’s server infrastructure, which caused some sites to experience errors and intermittent slowness.

As temperatures returned to normal levels, VIP discovered that the rapid reallocation of resources away from the impacted infrastructure had triggered a resource limit. This limit prevented immediate recovery of the affected sites, but by 18:00 UTC (2:00 PM EDT) it had been raised and all affected sites were responding as expected.

Chronology of Events

Times are UTC.

  • 17:17 Internal teams notify VIP that alerts indicate higher-than-normal temperatures at DCA.
  • 17:19 VIP notifies customers via the VIP Lobby and begins providing regular status updates.
  • 17:20 Servers begin thermal throttling of CPU.
  • 17:24 VIP customers with sites at DCA begin reporting site availability issues.
  • 17:32 Cooling restored and data center temperatures begin to decrease.
  • 17:34 Thermal throttling ends.
  • 17:41 VIP monitors recovery and discovers an etcd resource limit issue preventing full recovery.
  • 17:59 VIP increases the limit, which resolves the issue.
  • 18:00 All sites have recovered.

Business Impact

WordPress VIP customer sites with DCA as their origin experienced intermittent slowness and reduced availability for approximately 40 minutes between 17:20 and 18:00 UTC (1:20 – 2:00 PM EDT) on September 29, 2021.

Root Cause Analysis

Why did this happen?

VIP leases data center space in Ashburn from a well-known vendor. The vendor is responsible for providing the space, power, and cooling. Their investigation into the root cause of the temperature issue is ongoing, so a root cause analysis (RCA) from the vendor is not yet available.

The problem that prevented immediate recovery once the thermal event had ended was caused by etcd exceeding the maximum configured database size. Etcd is a distributed key-value store used to track the state of VIP-hosted sites. During normal operation, the configured size was sufficient, but during the thermal event, thousands of sites were scheduled to move to unaffected servers simultaneously. This spike in activity caused etcd to exceed its configured limit.
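
As background on the limit involved: etcd enforces a backend storage quota (set with the --quota-backend-bytes flag), and once the database grows past it, etcd raises a NOSPACE alarm and accepts only key reads and deletes until space is freed or the quota is raised. The sketch below is a minimal, hypothetical example of computing how much of that quota is in use from etcd's Prometheus-format metrics output. It is not VIP tooling: the function names and the sample values are assumptions made for illustration, and the two metric names should be verified against the etcd version in use.

```python
import re

# Metric names believed to be exposed by recent etcd releases on /metrics.
# Treat them as assumptions and confirm them for your etcd version.
DB_SIZE_METRIC = "etcd_mvcc_db_total_size_in_bytes"
QUOTA_METRIC = "etcd_server_quota_backend_bytes"


def read_metric(metrics_text, name):
    """Return the value of an unlabeled metric from Prometheus text output."""
    pattern = re.compile(r"^" + re.escape(name) + r"\s+(\S+)\s*$", re.MULTILINE)
    match = pattern.search(metrics_text)
    if match is None:
        raise KeyError(name)
    return float(match.group(1))


def quota_usage(metrics_text):
    """Fraction of the backend quota currently used by the database."""
    size = read_metric(metrics_text, DB_SIZE_METRIC)
    quota = read_metric(metrics_text, QUOTA_METRIC)
    return size / quota


# Illustrative sample only: a 1.5 GiB database under a 2 GiB quota.
SAMPLE = """\
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd_mvcc_db_total_size_in_bytes 1.610612736e+09
# TYPE etcd_server_quota_backend_bytes gauge
etcd_server_quota_backend_bytes 2.147483648e+09
"""

usage = quota_usage(SAMPLE)
print(f"etcd quota usage: {usage:.0%}")
```

In practice the metrics text would be fetched from an etcd member's metrics endpoint rather than taken from a hard-coded sample.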

Remediation

Immediate Fix

  • Cooling was restored at the data center, and temperatures returned to normal.
  • The etcd quota was increased, which immediately allowed sites to start again.

Preventative Actions

Data Center

VIP is awaiting the official RCA from the data center vendor and looks forward to reviewing and discussing their plan to mitigate future risk.

Etcd Quota

The immediate action taken by VIP to increase the etcd quota should prevent a similar issue from occurring again. In addition, VIP is adding monitoring to ensure the configured limits can absorb large spikes in activity.
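
One way such monitoring could reason about headroom is to project the database size after a burst of rescheduling activity and compare it against the quota. The following is a rough sketch under stated assumptions, not a description of VIP monitoring: the function names, the per-event write size, the 80% alert ceiling, and all figures in the example are hypothetical.

```python
GIB = 1024 ** 3
KIB = 1024


def projected_usage(db_bytes, quota_bytes, events, bytes_per_event):
    """Fraction of the quota used if `events` writes of `bytes_per_event` land at once."""
    return (db_bytes + events * bytes_per_event) / quota_bytes


def can_absorb(db_bytes, quota_bytes, events, bytes_per_event, ceiling=0.8):
    """True if the projected usage stays at or below the alert ceiling."""
    usage = projected_usage(db_bytes, quota_bytes, events, bytes_per_event)
    return usage <= ceiling


# Hypothetical figures: a 1 GiB database, 5,000 sites rescheduled at once,
# and 256 KiB of new etcd data written per rescheduled site.
burst = dict(db_bytes=1 * GIB, events=5000, bytes_per_event=256 * KIB)

small_quota_ok = can_absorb(quota_bytes=2 * GIB, **burst)
large_quota_ok = can_absorb(quota_bytes=8 * GIB, **burst)
print("2 GiB quota absorbs burst:", small_quota_ok)
print("8 GiB quota absorbs burst:", large_quota_ok)
```

With these made-up numbers the burst would overrun a 2 GiB quota but fit within an 8 GiB one. A real check would need measured write sizes, and would also have to account for etcd retaining old key revisions until they are compacted.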