Resolved: WordPress.com VIP Dashboard Service Outage

Root cause analysis:

Here is an update on what caused this issue on November 16th and what we have done to prevent it from happening again.  First, a bit of background on jobs on WordPress.com.  To speed up common actions such as publishing a post, WordPress.com defers much of the work triggered by these actions to jobs that run after, and asynchronously to, the action itself.  We currently run about 25 million of these jobs daily across WordPress.com. The jobs system understands priority, which allows us to use the same system for both important and less important work. Here is the timeline of events:

9:15AM PST: A member of our support team flagged a large number of pending items for one of the lower-priority tasks run by the jobs system.  An initial investigation showed that the workers that normally process this task had stopped running.  Unfortunately, the monitoring we had in place to catch this was also broken by an unrelated problem.

10:52AM PST: One of our engineers manually restarted the task, which began to process the large backlog of items in the queue.  Unfortunately, the task was started with a concurrency of 10 instead of the designed concurrency of 1.

10:54AM PST: Our systems team was alerted to a performance degradation of the jobs system and began their investigation.

10:56AM PST: Our engineering team stopped the task that had been started at 10:52AM.

11:02AM PST: Everything returned to normal.

There were a couple of takeaways from this event that will prevent a similar issue from happening in the future:

  • We have improved the monitoring to ensure that all jobs, even low priority ones, are running as expected.
  • We have started working on a change that lets a developer specify, at the time a job is created, the maximum concurrency at which that job may run. Previously this limit was only documented; enforcing it programmatically will ensure errors like this can’t happen in the future.
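
To illustrate the second takeaway, here is a minimal Python sketch of how a concurrency cap declared at job creation might be enforced when workers are started. It is not the actual WordPress.com jobs system; the names `JobDefinition`, `ConcurrencyLimitError`, and `resolve_worker_count` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


class ConcurrencyLimitError(ValueError):
    """Raised when a job is started with more workers than it allows."""


@dataclass(frozen=True)
class JobDefinition:
    """Hypothetical job definition; the cap is declared once, at creation."""
    name: str
    max_concurrency: int = 1

    def __post_init__(self):
        if self.max_concurrency < 1:
            raise ValueError("max_concurrency must be at least 1")


def resolve_worker_count(job: JobDefinition, requested: int) -> int:
    """Return the worker count to use, rejecting requests above the job's cap."""
    if requested < 1:
        raise ValueError("requested worker count must be at least 1")
    if requested > job.max_concurrency:
        raise ConcurrencyLimitError(
            f"{job.name}: requested {requested} workers, "
            f"but the job allows at most {job.max_concurrency}"
        )
    return requested
```

With a check like this, a manual restart that asks for 10 workers on a job created with a maximum concurrency of 1 would fail immediately, rather than relying on the operator having read the documentation.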

——–

From approximately 18:54 to 19:02 UTC (10:54 to 11:02 PST), wp-admin pages for some WordPress.com VIP sites were unavailable or unresponsive. This outage was caused by an overload of the asynchronous jobs service, which in turn affected dashboard web servers.

This disruption did not affect the VIP Go service.

Automattic/WordPress.com is auditing the responsible code and processes to ensure they do not cause any further outages.

We apologize for the disruption. Please contact VIP Support if you have any additional questions.

For real-time updates on service availability, please follow our status Twitter account at @WPVIPStatus.