Resolved: VIP Go Availability Issues

On February 12th at 16:35 UTC, a platform-level change introduced an error on the VIP Go platform. This caused availability issues for a small subset of production and non-production sites. The change was reverted at 16:42 UTC, and we’re taking steps to ensure the issue doesn’t happen again.

The issue is now resolved and all sites should be functioning normally. We apologize for the trouble!

If you have any questions or are experiencing any potentially related issues, please open a support ticket, and we will be happy to assist.

Incident Report

(published February 20, 2019 at 1:00 UTC)

Summary

On February 12, 2019 at 16:34 UTC, a deployment introduced fatal errors that caused six minutes of downtime for some applications on the VIP Go platform. The downtime impacted only a small number of environments.

Timeline

  • 16:34 UTC: VIP Team deploys changeset.
  • 16:35 UTC: Monitoring indicates increased error rates. VIP Team begins investigation.
  • 16:40 UTC: VIP Team deploys revert.
  • 17:02 UTC: VIP Team publishes Lobby post.

Root Cause

The changeset attempted to change the load order of a globally loaded plugin. A dependency that some applications relied on (Jetpack_Photon) was not yet available in the earlier load order, which resulted in the following fatal error:

PHP Fatal error: Uncaught Error: Class 'Jetpack_Photon' not found in /var/www/wp-content/mu-plugins/a8c-files.php:128
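
To illustrate the class of failure, here is a minimal sketch of one common way to make load-order-sensitive code degrade gracefully: check that the class is loaded before using it. This is not the code in a8c-files.php and not the fix that was deployed. The helper name example_with_dependency, the example URL, and the placeholder callback are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of VIP Go or Jetpack): run $callback only if
 * $class_name is already loaded, otherwise return $fallback.
 */
function example_with_dependency(string $class_name, callable $callback, $fallback)
{
    // Passing false as the second argument skips autoloading, so this only
    // reports whether the class has been defined at this point in the load order.
    if (!class_exists($class_name, false)) {
        return $fallback;
    }

    return $callback();
}

// Assumed example input; any URL would do.
$url = 'https://example.com/image.jpg';

$result = example_with_dependency(
    'Jetpack_Photon',
    function () use ($url) {
        // Placeholder for work that needs the dependency.
        return $url . '?w=300';
    },
    $url // Fall back to the unmodified URL if the dependency is missing.
);

echo $result, PHP_EOL;
```

Run on its own, outside WordPress, the Jetpack_Photon class is not defined, so the sketch takes the fallback branch and prints the unmodified URL instead of raising a fatal error. Whether a silent fallback is acceptable depends on the feature; in some cases failing loudly in pre-deploy testing is preferable.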

Impact

A small number of production environments were impacted. During the outage, uncached requests to those environments (i.e. requests that reached Origin servers) returned an error page with a 5xx status code, so the requested content was inaccessible. These errors accounted for 0.89% of all requests for the timespan.

Future Prevention

We’re looking at two areas of improvement:

  • Improving our integration and pre-deploy testing to account for this specific use case so that similar issues can be caught earlier in the development process.
  • Improving our time-to-revert to reduce impact if similar issues come up again in the future.