Incident Report: June 11 Service Interruption

On Tuesday, June 11, the VIP Go platform experienced a widespread service interruption that lasted for 3 hours and 11 minutes. On behalf of the team here at VIP, I would like to apologize for the impact this had on each of you. We place the reliability of our platform at the heart of everything we do and we take great pride in the trust you place in us each and every day. Please know that we take this incident seriously. I’d like to walk you through our understanding of this incident as well as the steps we’re taking to help prevent future occurrences.

What happened

Yesterday’s incident was caused by a defect introduced in a software update to the deployment system we use to apply updates to the underlying application software on the VIP Go platform. As a result, client-specific WordPress application directories (e.g. plugins, themes, client MU plugins) were unavailable to the containers used to serve customer sites on VIP Go. Practically speaking, this means WordPress sites on our VIP Go platform displayed either a blank screen, a missing theme, or an incorrect theme.

The issue was resolved by reverting the update to the deployment system and then performing a rolling restart of all affected application containers across our platform.

Impact

  • WordPress sites on the VIP Go platform displayed either a blank screen, a missing theme, or an incorrect theme.
  • WordPress sites on the VIP Go platform were operating without the functionality provided by plugins, client MU plugins, or themes.
  • Code deploys to Node.js sites on the VIP Go platform were not fully executed: the code was deployed, but the deployed changes did not take effect.
  • After the incident, some WordPress sites had to re-enable the correct custom theme and restore theme options (e.g. menu and widget assignments).

Timeline

  • June 11th 2019 17:01 UTC: A change to our software stacks is deployed, which triggers the defect in the software stacks deployment process.
  • 17:01 UTC: Alerting is triggered and investigation begins.
  • 17:04 UTC: The change from 17:01 is reverted, but the issue is not resolved. Investigation continues.
  • 17:14 UTC: A change in our deployment process is identified as the cause. Testing begins on the impact of reverting this change.
  • 17:27 UTC: The deploy process change is reverted.
  • 17:30 UTC: Software stacks are redeployed across our network.
  • 18:00 UTC: Work begins to automate restarting all web containers safely across all data centers on a rack/switch basis.
  • 18:21 UTC: Container rolling restarts begin and containers reconnect with their custom themes and plugins.
  • 20:12 UTC: All sites on VIP Go are confirmed to be fully restored.
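
To make the rack-by-rack restart in the timeline more concrete, here is a minimal sketch of the general idea. It is an illustration only and not our actual tooling: the `Container` record, the rack labels, and the `restart_rack` callback are all hypothetical, and the sketch assumes each container is tagged with the rack it runs on. Restarting one rack at a time, and stopping as soon as a rack fails to come back, keeps the remaining racks serving traffic.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Container:
    """Hypothetical record for one web container and the rack it runs on."""
    name: str
    rack: str


def group_by_rack(containers: Iterable[Container]) -> Dict[str, List[Container]]:
    """Bucket containers by rack, preserving input order within each rack."""
    racks: Dict[str, List[Container]] = defaultdict(list)
    for container in containers:
        racks[container.rack].append(container)
    return dict(racks)


def rolling_restart(
    containers: Iterable[Container],
    restart_rack: Callable[[str, List[Container]], bool],
) -> List[str]:
    """Restart one rack at a time, in sorted rack order.

    `restart_rack` is an assumed callback that restarts every container in
    the given rack and returns True only once they are healthy again.
    The loop stops at the first rack that fails, so untouched racks keep
    serving. Returns the racks that were restarted successfully.
    """
    completed: List[str] = []
    for rack, members in sorted(group_by_rack(containers).items()):
        if not restart_rack(rack, members):
            break
        completed.append(rack)
    return completed
```

In practice the callback would also wait for health checks to pass before returning, which is what makes the restart "rolling" rather than simultaneous.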

Future Prevention

We are working on implementing additional safeguards and process improvements designed to prevent similar issues from happening again. These include (but are not limited to):

  • Updating Docker volume management to prevent disconnections.
  • Deploying future changes to a limited scope first, to contain the effects of any uncaught issues.
  • Updating our automated testing on releases prior to production rollouts.
  • Updating internal tooling to facilitate faster global rollbacks.
  • Disabling the default WordPress behavior of automatically switching themes when the current theme is not found (via the `validate_current_theme` filter).
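
As a rough illustration of the limited-scope deployment item in the list above, the sketch below rolls a change out in progressively larger stages and halts as soon as a health check fails. It is a simplified example under stated assumptions, not a description of our deployment system: the `deploy` and `healthy` callbacks, the host names, and the default stage sizes are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stage_sizes: Sequence[int] = (1, 5),
) -> Tuple[List[str], bool]:
    """Deploy to growing slices of `hosts`, checking health after each stage.

    `stage_sizes` are cumulative host counts for the early stages (assumed
    values); a final stage covering every host is always appended. Returns
    the hosts that received the change and whether the rollout completed.
    """
    cutoffs = [min(n, len(hosts)) for n in stage_sizes] + [len(hosts)]
    done = 0
    for cutoff in cutoffs:
        # Deploy only to hosts not covered by an earlier stage.
        for host in hosts[done:cutoff]:
            deploy(host)
        done = max(done, cutoff)
        # Halt before widening the rollout if any updated host is unhealthy.
        if not all(healthy(h) for h in hosts[:done]):
            return list(hosts[:done]), False
    return list(hosts[:done]), True
```

With this shape, a defect that surfaces on the first host stops the rollout after a single deploy, which also limits how much has to be rolled back.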

Again, we sincerely apologize for yesterday’s interruption and appreciate the immense trust you place in us every single day. We are in the process of directly connecting with every customer who has been impacted. If you have any further questions or concerns related to this incident, please reach out to me (ng@automattic.com) or your VIP Relationship Manager, or open a ticket via vip-support@wordpress.com.