Incident Report: August 5 Service Disruption to BUR Datacenter

On August 5, 2021, from approximately 03:27 to 05:35 UTC, multiple sites hosted in our BUR datacenter experienced a service disruption that resulted in an increase in 500-level status codes and slow response times. As of 05:44 UTC, our service was fully restored.

Chronology of Events

Times are in UTC.

  • 03:27:13 – First alert received.
  • 03:30:00 – VIP Team acknowledges alert and begins investigation.
  • 03:31:00 – Service failure identified. Restoration of last backup required. Outage Protocol initiated.
  • 04:13:00 – Twitter Outage Notification posted.
  • 04:39:00 – Outage posted to WPVIP Lobby.
  • 04:56:00 – Restoration process is complete. VIP Team begins testing services. Sites start recovering.
  • 04:57:00 – Due to the number of sites recovering at the same time, the restored cluster encounters an out-of-memory (OOM) error.
  • 05:00:00 – VIP Team adds a Network Policy to limit traffic and allow the cluster to start up.
  • 05:35:00 – Services restored with increased resources. Network Policy removed allowing web traffic to
  • 05:44:00 – Service disruption resolved. WPVIP Lobby Updated. Twitter Updated.

Business Impact

This service disruption caused users to experience an increase in 500-level status codes and slower response times for uncached data on multiple sites hosted in the BUR (Los Angeles) datacenter from Aug 5, 2021 03:27:13 UTC until 05:44:00 UTC.
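For reference, the length of the impact window and the delay before the first external notification can be derived from the timestamps in the chronology. The snippet below is only a small arithmetic sketch; the variable names are illustrative and the values are copied from this report.

```python
from datetime import datetime, timezone

# Timestamps taken from the chronology above (all UTC).
first_alert = datetime(2021, 8, 5, 3, 27, 13, tzinfo=timezone.utc)
twitter_notice = datetime(2021, 8, 5, 4, 13, 0, tzinfo=timezone.utc)
resolved = datetime(2021, 8, 5, 5, 44, 0, tzinfo=timezone.utc)

# Total impact window: first alert until the disruption was resolved.
impact_window = resolved - first_alert

# Delay between the first alert and the first external notification.
notification_delay = twitter_notice - first_alert

print(f"Impact window:      {impact_window}")       # 2:16:47
print(f"Notification delay: {notification_delay}")  # 0:45:47
```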

Root Cause Analysis

Why did this happen?

The cluster powering Vitess (a database solution for deploying, scaling, and managing clusters of database instances) in the BUR datacenter failed. The WPVIP operator, which relies on Vitess topologies to render each site's wp-config.php, then reconciled sites with bad configurations, which brought them down.

Remediation

The cluster was restored, which fully resolved the outage.

Preventative Actions

  • VIP fixed the operator code so that, if Vitess cannot retrieve topology information from the cluster, the operator does not reconcile sites again and sites keep their existing configuration. This was the expected behavior for Vitess to begin with.
  • VIP is updating the Outage Protocol to increase the speed of external alerts.
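
The first preventative action can be illustrated with a minimal sketch of a reconcile step that leaves a site untouched when topology information is unavailable. This is not the WPVIP operator's actual code: the names Site, TopologyUnavailable, reconcile, and fetch_topology, the "primary" key, and the example hostnames are all hypothetical and exist only to show the idea of keeping the last known-good configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class TopologyUnavailable(Exception):
    """Hypothetical error: topology information could not be retrieved."""


@dataclass
class Site:
    name: str
    config: str  # stands in for the rendered wp-config.php


def reconcile(site: Site, fetch_topology: Callable[[str], Dict[str, str]]) -> bool:
    """Re-render a site's config. Return True only if the config was updated."""
    try:
        topology = fetch_topology(site.name)
    except TopologyUnavailable:
        # Topology lookup failed: keep the status quo instead of
        # rendering a config from missing data.
        return False

    primary = topology.get("primary")
    if not primary:
        # An empty or incomplete topology is treated like a failed lookup.
        return False

    site.config = f"DB_HOST={primary}"
    return True


# Usage: a failing lookup leaves the existing config in place.
def broken_lookup(name: str) -> Dict[str, str]:
    raise TopologyUnavailable(name)


def healthy_lookup(name: str) -> Dict[str, str]:
    return {"primary": "db-1.example.internal"}


site = Site(name="example-site", config="DB_HOST=db-0.example.internal")
updated_when_broken = reconcile(site, broken_lookup)
config_after_broken = site.config
updated_when_healthy = reconcile(site, healthy_lookup)
```

The design choice in this sketch is to fail closed: when the inputs to rendering are missing, the safest output is the previous configuration rather than a newly rendered one.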

If there are any questions or concerns related to this incident, please reach out to your VIP Relationship Manager or open a ticket via vip-support@wordpress.com.