Incident Report: Jan 12 Service Disruption

Overview

Between 20:30 and 20:46 UTC on 12 January 2022, WordPress VIP experienced a partial service disruption due to a code change that impacted how HTTP requests are routed within the WordPress VIP infrastructure. As a result, the majority of uncached requests for affected sites were served 503 responses during this time.

Chronology of Events

Date          UTC Time  Update
12 Jan. 2022  20:15     Code change release causes an internal API to generate incorrect configuration data.
12 Jan. 2022  20:30     As part of normal operations, VIP routing configurations dynamically update using data from an internal API. The data is incorrect because of the 20:15 release.
12 Jan. 2022  20:30:42  First failed request is recorded in the logs and internal alerts are received.
12 Jan. 2022  20:32     VIP begins investigation.
12 Jan. 2022  20:40     VIP identifies the problem.
12 Jan. 2022  20:43     VIP reverts the offending code and reloads routing configurations.
12 Jan. 2022  20:46:08  The last failed request resulting from this issue is recorded in our logs. Incident is resolved.
12 Jan. 2022  20:50     VIP Lobby updated, post-outage process begins.
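
The intervals implied by the chronology can be worked out directly from its timestamps. The short Python sketch below is illustrative only: the 20:15 and 20:43 entries are recorded to the minute, so the seconds are assumed to be :00, and the variable names are ours rather than anything from VIP tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps (UTC) from the chronology. Entries given only to the minute
# are assumed to fall at :00 seconds, so those intervals are approximate.
release = datetime.strptime("2022-01-12 20:15:00", FMT)        # assumed :00
first_failure = datetime.strptime("2022-01-12 20:30:42", FMT)
revert = datetime.strptime("2022-01-12 20:43:00", FMT)         # assumed :00
last_failure = datetime.strptime("2022-01-12 20:46:08", FMT)


def minutes_seconds(delta):
    """Format a timedelta as 'Xm YYs'."""
    total = int(delta.total_seconds())
    return f"{total // 60}m {total % 60:02d}s"


# Window in which failed requests were recorded.
impact = minutes_seconds(last_failure - first_failure)
# Time the incorrect data existed before routing picked it up.
latent = minutes_seconds(first_failure - release)
# Time from the revert to the last recorded failure.
recovery = minutes_seconds(last_failure - revert)

print(f"impact={impact} latent={latent} recovery={recovery}")
```

Under those assumptions, failed requests were recorded over a window of roughly 15 and a half minutes, and the last failure came about three minutes after the revert.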

What Happened

A code release caused an internal API to generate incorrect data. Our systems use this data to determine how HTTP requests are routed within the WordPress VIP infrastructure. With incorrect data, our systems were unable to forward traffic to the correct destination, so uncached requests on affected sites received HTTP 503 responses.

Remediation

The issue was addressed by reverting the code change that led to incorrect routing configurations and deploying the correct configurations. 

Future Prevention

The process for code releases is being reviewed to add procedural safeguards. Automated checks are also being investigated to reduce the chance of a similar problem happening in the future.
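
As an illustration of what one such automated check might look like, the sketch below validates a candidate routing configuration before it replaces the current one. It is not VIP's implementation: the `Route` shape, the `host:port` upstream format, the 20% change threshold, and the function names are all hypothetical, chosen only to show the general idea of refusing a dynamic update that looks malformed or changes too much at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    # Hypothetical record shape; the real configuration format is not public.
    site: str      # e.g. a hostname
    upstream: str  # assumed to be "host:port"


def validate_routes(new, current, max_change_ratio=0.2):
    """Return a list of problems; an empty list means the update looks safe."""
    problems = []
    if not new:
        problems.append("candidate configuration is empty")
    for route in new:
        host, sep, port = route.upstream.rpartition(":")
        if not (sep and host and port.isdecimal() and 0 < int(port) < 65536):
            problems.append(f"{route.site}: malformed upstream {route.upstream!r}")
    if current:
        old = {r.site: r.upstream for r in current}
        changed = sum(1 for r in new if old.get(r.site) != r.upstream)
        removed = len(set(old) - {r.site for r in new})
        ratio = (changed + removed) / len(old)
        if ratio > max_change_ratio:
            problems.append(f"{ratio:.0%} of routes would change at once")
    return problems


def apply_if_valid(new, current):
    """Keep the current configuration when the candidate fails validation."""
    problems = validate_routes(new, current)
    return (current, problems) if problems else (new, [])
```

A check like this trades some flexibility for safety: a legitimate large change would need an explicit override, but a bad data source would no longer propagate to routing automatically.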