Category Archives: Service & Systems Status

WordPress VIP Status Page

From June 1, 2022, WordPress VIP service disruptions and incidents will no longer be announced in the VIP Lobby. Please check our WordPress VIP Status Page for any known issues before opening an urgent ticket.

For the latest updates, we recommend subscribing to the status page (email or RSS feed) using the subscribe button at the top of the page.

Why are we making this change?

A standalone status page provides automated monitoring of our key services as well as testing sites in each data center, giving customers a single location for service monitoring, site disruptions, and incident reports. This change also means faster, more timely updates regarding data center or platform-wide issues. Going forward, the VIP Lobby will focus on announcing product enhancements, releases, and other important platform messages.

Russia-Ukraine Crisis and WPVIP Readiness

In light of the recent Russian invasion of Ukraine, we wanted to take a moment to let you know about WordPress VIP’s commitment to business continuity for our customers at this time.

Global Customer Support

WordPress VIP is a fully distributed business and our team spans more than 20 countries. We are committed to continuing to provide a high level of support operations even during geopolitical conflict.

Platform Security

WordPress VIP protects customers from attacks at all layers of our infrastructure, from our global CDN down to the code running on customer sites. Our systems continuously monitor for suspicious activity, enabling immediate automated or human response to threats. We maintain emergency and contingency plans, including redundant storage and procedures for recovering data in the event of a service interruption. More details on VIP’s security features can be found here.

User Security Advisory

We advise all customers to follow best practices when it comes to securing devices, accounts, and access to WordPress VIP tooling. WordPress VIP publishes a checklist of security recommendations that we encourage all users to adhere to.

Multi-factor authentication (MFA) is required for all VIP Cloud and WordPress Administrator accounts on VIP. However, we also highly recommend enforcing MFA for other WordPress roles, such as roles with post editing capabilities. To enable MFA on additional roles, please see our documentation.

If you are an administrator, we recommend auditing all privileged WordPress and VIP Cloud accounts to ensure the correct permissions are assigned and any unused accounts are removed.

Your VIP account team is here to support you. If you have any questions, please do not hesitate to reach out to your VIP Relationship Manager or open a ticket via our support portal.

Sincerely,
Steph Yiu
Chief Customer Officer, WordPress VIP

Incident Report: Feb 13 Service Disruption

Overview

Between 11:33 and 12:35 UTC on 13 February 2022, WordPress VIP experienced a partial service disruption due to a Distributed Denial of Service (DDoS) attack. As a result, affected sites saw an intermittent increase in latency, timeouts, and 503 errors.

Chronology of Events

Date	UTC Time	Update
13 Feb. 2022	11:33	DDoS detected against the VIP Platform.
	11:36	VIP Edge Caches report being unable to reach Origin Data Centers.
	11:39	DDoS target identified.
	12:05	Targeted traffic-blocking rules implemented.
	12:35	Issue mitigated. Latency and error rates return to normal.
	12:36	VIP Lobby post updated.

What Happened

A Distributed Denial of Service (DDoS) attack caused congestion on a subset of VIP’s Globally Distributed Edge Cache, resulting in intermittent latency, timeouts, and 503 errors. Targeted blocks were implemented, which mitigated the attack and returned latency and error rates to normal.

Further infrastructure details can be found at https://wpvip.com/infrastructure/

Future Prevention

VIP’s proactive monitoring and automated DDoS mitigation systems have been updated to more easily identify DDoS attacks of this nature. Additionally, the processes and tools used to identify and mitigate attacks are being reviewed to strengthen protection and to reduce the time between detecting an attack and mitigating it.

RESOLVED: Site availability issues on the WordPress VIP Platform

The VIP Platform experienced a service interruption between 11:35 and 12:29 UTC which impacted some requests on WordPress VIP sites.

The issue has been resolved, and we’ll post more details as soon as they are available.

Apologies for the trouble! Please open a support ticket if you have any questions, and we’ll be happy to assist.

Incident Report: Jan 12 Service Disruption

Overview

Between 20:30 and 20:46 UTC on 12 January 2022, WordPress VIP experienced a partial service disruption due to a code change that impacted how HTTP requests are routed within the WordPress VIP infrastructure. As a result, the majority of uncached requests for affected sites were served 503 responses during this time.

Chronology of Events

Date	UTC Time	Update
12 Jan. 2022	20:15	Code change release causes an internal API to generate incorrect configuration data.
12 Jan. 2022	20:30	As part of normal operations, VIP routing configurations dynamically update using data from an internal API. The data is incorrect because of the previous update.
12 Jan. 2022	20:30:42	First failed request is recorded in the logs and internal alerts received.
12 Jan. 2022	20:32	VIP begins investigation.
12 Jan. 2022	20:40	VIP identifies the problem.
12 Jan. 2022	20:43	VIP reverts the offending code and reloads routing configurations.
12 Jan. 2022	20:46:08	The last failed request resulting from this issue is recorded in our logs. Incident is resolved.
12 Jan. 2022	20:50	VIP Lobby updated, post-outage process begins.
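
The intervals implied by this chronology can be worked out directly. The sketch below is illustrative only: it copies four timestamps from the table (minute-level entries are assumed to fall at :00 seconds) and computes how long the faulty release sat before the first failure, how long the revert took once failures began, and the total window of failed requests. The variable names are chosen for this example.

```python
from datetime import datetime

# Timestamps copied from the chronology above (all UTC, 12 Jan. 2022).
# Entries given only to the minute are assumed to be at :00 seconds.
FMT = "%Y-%m-%d %H:%M:%S"
release = datetime.strptime("2022-01-12 20:15:00", FMT)
first_failure = datetime.strptime("2022-01-12 20:30:42", FMT)
revert = datetime.strptime("2022-01-12 20:43:00", FMT)
last_failure = datetime.strptime("2022-01-12 20:46:08", FMT)


def seconds_between(start, end):
    """Return the whole number of seconds from start to end."""
    return int((end - start).total_seconds())


# Time the bad release was live before any request failed.
latent_seconds = seconds_between(release, first_failure)
# Time from the first failed request to the revert.
time_to_revert_seconds = seconds_between(first_failure, revert)
# Total window in which failed requests were recorded.
impact_seconds = seconds_between(first_failure, last_failure)

print(latent_seconds, time_to_revert_seconds, impact_seconds)
```

Under these assumptions the window of failed requests is 926 seconds (15 minutes 26 seconds), which matches the "between 20:30 and 20:46 UTC" summary in the overview.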

What Happened

A code release caused incorrect data to be materialized by an internal API. Our systems use this data to determine how HTTP requests are routed within the WordPress VIP Infrastructure. With incorrect data, our systems could not forward traffic to the correct destination, so uncached requests on affected sites received HTTP 503 errors.

Remediation

The issue was addressed by reverting the code change that led to incorrect routing configurations and deploying the correct configurations.

Future Prevention

The process for code releases is being reviewed to add procedural safeguards. Automated checks are also being investigated to minimize the chance of a similar problem happening in the future.
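
As an illustration of what one such automated check might look like, here is a minimal sketch that validates a candidate routing table before it replaces the active one, and keeps the last known-good table if validation fails. This is not VIP's implementation: the data shape (hostname mapped to a list of `host:port` upstreams), the function names, and the sample values are all assumptions made for the example.

```python
def validate_routes(routes):
    """Return a list of problems found in a routing table.

    `routes` is assumed to map a site hostname to a list of
    "host:port" upstream strings. An empty result means the table
    passed these (deliberately simple) checks.
    """
    if not routes:
        return ["routing table is empty"]
    problems = []
    for host, upstreams in routes.items():
        if not upstreams:
            problems.append(f"{host}: no upstreams")
            continue
        for upstream in upstreams:
            if not isinstance(upstream, str) or ":" not in upstream:
                problems.append(f"{host}: malformed upstream {upstream!r}")
    return problems


def next_config(current, candidate):
    """Adopt `candidate` only if it validates; otherwise keep `current`."""
    return current if validate_routes(candidate) else candidate


# Hypothetical sample data.
known_good = {"example.com": ["10.0.0.1:8080"]}
broken = {"example.com": []}
updated = {"example.com": ["10.0.0.2:8080"]}

print(validate_routes(broken))
print(next_config(known_good, broken) is known_good)
```

The design choice here is to fail closed: a candidate that does not validate never becomes active, so stale but working routes are preferred over fresh but incorrect ones.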

Incident Report: September 29 Service Disruption to DCA Data Center

On Sept. 29, 2021 at 17:05 UTC (1:05 PM EDT) a VIP origin data center in Ashburn, Virginia (DCA) experienced a rapid increase in ambient temperature. The temperature increase triggered thermal throttling on a subset of VIP’s server infrastructure which caused some sites to experience errors and intermittent slowness.

As temperatures returned to normal levels, VIP discovered a resource limit triggered by the rapid reallocation of resources away from the impacted infrastructure. This issue prevented immediate recovery of the affected sites, but by 18:00 UTC (2:00 PM EDT) the issue was resolved and all affected sites were responding as expected.

Chronology of Events

Times are UTC.

17:17 VIP is notified by internal teams that alerts indicate that temperatures are higher than normal at DCA.
17:19 VIP notifies customers via the VIP Lobby and begins providing regular status updates.
17:20 Servers begin thermal throttling of CPU.
17:24 VIP customers with sites at DCA begin reporting site availability issues.
17:32 Cooling restored and data center temperatures begin to decrease.
17:34 Thermal throttling ends.
17:41 VIP monitors recovery and discovers an etcd resource limit issue preventing full recovery.
17:59 VIP increases the limit, which resolves the issue.
18:00 All sites have recovered.

Business Impact

WordPress VIP customer sites with DCA as their origin experienced intermittent slowness and reduced availability for approximately 40 minutes between 17:20 and 18:00 UTC (1:20 – 2:00 PM EDT) on September 29, 2021.

Root Cause Analysis

Why did this happen?

VIP leases data center space in Ashburn from a well-known vendor. The vendor is responsible for providing the space, power, and cooling. Their investigation into the root cause of the temperature issue is ongoing, so we don’t have an RCA at this time.

The problem that prevented immediate recovery once the thermal event had ended was caused by etcd exceeding the maximum configured database size. Etcd is a distributed key value store used to track the state of VIP hosted sites. During normal operation, the configured size was sufficient, but during the thermal event, thousands of sites were scheduled to move to unaffected servers simultaneously. This spike in activity caused etcd to exceed its configured limit.

Remediation

Immediate Fix

The data center temperature issue was addressed, ensuring that temperatures returned to normal.
The etcd quota was increased, which immediately allowed sites to resume starting.

Preventative Actions

Data Center

VIP is awaiting the official RCA from the data center, and looks forward to reviewing and discussing their plan to mitigate future risk.

Etcd Quota

The immediate action taken by VIP to increase the etcd quota should prevent a similar issue from occurring again. In addition, VIP is adding monitoring to ensure the configured limits can absorb large spikes in activity.
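
A headroom check of the kind described above could be sketched as follows. This is an assumption-laden illustration rather than VIP's monitoring: etcd's backend size limit is set with its `--quota-backend-bytes` option, but the per-event write size, the number of simultaneous rescheduling events, the 80% safety margin, and all sample figures below are hypothetical.

```python
GIB = 1024 ** 3


def projected_usage(current_bytes, events, bytes_per_event):
    """Estimate database size after a burst of `events` writes."""
    return current_bytes + events * bytes_per_event


def can_absorb(quota_bytes, current_bytes, events, bytes_per_event, safety=0.8):
    """Return True if the projected size stays within `safety` of the quota."""
    projected = projected_usage(current_bytes, events, bytes_per_event)
    return projected <= quota_bytes * safety


# Hypothetical figures: a 1.2 GB database and a burst of 5,000
# rescheduling events that each add roughly 200 kB of state.
current = 1_200_000_000
burst_events = 5_000
per_event = 200_000

print(can_absorb(2 * GIB, current, burst_events, per_event))
print(can_absorb(8 * GIB, current, burst_events, per_event))
```

With these made-up numbers the burst does not fit under a 2 GiB quota but does fit under an 8 GiB quota, which is the kind of signal a monitor could alert on before a real spike occurs.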

Incident Report: August 5 Service Disruption to BUR Datacenter

On August 5, 2021, from approximately 7:21 PM – 9:03 PM UTC, multiple sites hosted on our BUR datacenter experienced a service disruption resulting in an increase in status codes in the 500s and slow response times. As of 9:03 PM UTC, our service was fully restored.

Chronology of Events

Times are in UTC.

7:21 PM – Our monitoring system discovers rising temperature levels at our Los Angeles Datacenter. Work begins to reroute traffic to decrease CPU load and, in parallel, to discover the physical cause of the temperature increase.
7:26 PM – First alert received.
7:55 PM – Datacenter confirms with VIP that a power issue caused a temperature increase.
7:58 PM – VIP releases our initial Lobby Post concerning the incident.
8:02 PM – We begin to see some cooling recovery.
8:10 PM – Another rise in temperature in the datacenter is detected.
8:17 PM – VIP identifies a cooling issue within the affected datacenter and works toward resolution with on-premises staff. Work continues to offload CPU usage from the Datacenter while working towards a physical fix.
8:33 PM – VIP receives notice that cooling has returned and the datacenter temperature is decreasing.
8:42 PM – We start to see cooler temperatures within the datacenter. The resolution is still in progress and we continue to monitor.
8:47 PM – Datacenter transfers load back to utility power, temperatures continue to decrease.
9:03 PM – We now see full recovery of service disruption for affected applications hosted in the Los Angeles Datacenter.

Business Impact

This service disruption event caused elevated levels of status codes in the 500s as well as sporadic instances of increased loading times. This event affected applications served from our Los Angeles Datacenter.

The elevated levels of 5xx responses lasted approximately 70 minutes, starting around 19:20 UTC and returning to normal levels around 20:30 UTC, August 5, 2021.

Some customers also experienced longer response times during this event.

Root Cause Analysis

Why did this happen?

Servers in the BUR datacenter started throttling CPUs to lower frequencies due to an excessive increase in temperatures. This resulted in sites hosted in the BUR datacenter returning 5xx errors. The increase in temperatures at the datacenter was related to the datacenter suffering power loss, including loss of backup power. When the datacenter regained utility power, the temperatures dropped and resolved the outage.

BUR going partially offline due to the server cooling issue is within the expected mitigation procedure, and VIP did not experience a cascading outage: no sites outside of this datacenter were impacted.

VIP does not support an origin datacenter failover. There are multiple POP locations around the globe for edge traffic, but an issue at the origin datacenter cannot be fully mitigated when the core issue is outside of the control of VIP.

Corrective Actions

Immediate Fix

When temperature increases started, a line of communication with the physical datacenter location was opened as technicians worked on the cooling issue.
VIP rerouted some traffic from the datacenter to decrease the CPU load and decrease the core temperatures.
The datacenter technicians were able to remedy the power issue that was causing the ineffective cooling.
Replicas for several big sites were in the process of being spun up; however, temperatures dropped before they were needed.

Preventative Actions

Communications & Process Improvements

VIP is updating the Outage Protocol to increase the speed of external alerts.
VIP is working with datacenter management to receive additional incident details which will then be applied to future risk mitigation.

Technology Improvements

We are investigating systems-related service failover to our alternate datacenters.

Incident Report: August 5 Service Disruption to BUR Datacenter

On August 5, 2021, from approximately 03:27 – 05:35 AM UTC, multiple sites hosted on our BUR datacenter experienced a service disruption resulting in an increase in status codes in the 500s and slow response times. As of 05:44 AM UTC, our service was fully restored.

Chronology of Events

Times are in UTC.

03:27:13 – First alert received.
03:30:00 – VIP Team acknowledges alert and begins investigation.
03:31:00 – Service failure identified. Restoration of last backup required. Outage Protocol initiated.
04:13:00 – Twitter Outage Notification posted.
04:39:00 – Outage posted to WPVIP Lobby.
04:56:00 – Restoration process is complete. VIP Team begins testing services. Sites start recovering.
04:57:00 – Due to the number of sites recovering at once, the restored cluster encounters an out-of-memory (OOM) error.
05:00:00 – VIP Team adds a Network Policy to limit traffic and allow the cluster to start up.
05:35:00 – Services restored with increased resources. Network Policy removed allowing web traffic to
05:44:00 – Service disruption resolved. WPVIP Lobby Updated. Twitter Updated.

Business Impact

This service disruption event caused users to experience an increase in status codes in the 500s and slower response times for uncached data on multiple sites hosted in the Los Angeles Datacenter from Aug 5, 2021 03:27:13 UTC until 05:44:00 UTC.

Root Cause Analysis

Why did this happen?

The cluster powering Vitess (a database solution for deploying, scaling, and managing clusters of database instances) running in the BUR datacenter broke. The WPVIP operator, which relies on Vitess topologies for rendering wp-config.php, reconciled bad configs for sites and brought them down.

Remediation

The cluster was restored, which fully resolved the outage.

Preventative Actions

VIP fixed the operator code so that if Vitess cannot retrieve topology information from the cluster, the operator won’t reconcile sites again and sites will keep the status quo. This was the expected behavior for Vitess to begin with.
VIP is updating the Outage Protocol to increase the speed of external alerts.
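
The operator fix described above follows a common pattern: when the source of truth is unreachable, skip reconciliation and leave the existing state untouched. The sketch below illustrates that pattern only. The exception, function names, and callback signatures are hypothetical and do not reflect the actual WPVIP operator or the Vitess API.

```python
class TopologyUnavailable(Exception):
    """Raised by a (hypothetical) lookup when topology cannot be retrieved."""


def reconcile(site, current_config, fetch_topology, render_config):
    """Return the config a site should run with.

    If topology information cannot be retrieved, keep the status quo
    by returning `current_config` unchanged instead of rendering a
    config from missing or bad data.
    """
    try:
        topology = fetch_topology(site)
    except TopologyUnavailable:
        return current_config
    return render_config(site, topology)


# Hypothetical stand-ins for the real lookup and renderer.
def failing_lookup(site):
    raise TopologyUnavailable(site)


def working_lookup(site):
    return {"db_host": "db-1.internal"}


def render(site, topology):
    return f"{site} -> {topology['db_host']}"


existing = "example.com -> db-0.internal"
print(reconcile("example.com", existing, failing_lookup, render))
print(reconcile("example.com", existing, working_lookup, render))
```

In the failing case the site keeps its existing configuration; only a successful topology lookup leads to a newly rendered one.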

If there are any questions or concerns related to this incident, please reach out to your VIP Relationship Manager or open a ticket via vip-support@wordpress.com.

WordPress VIP Network Issue – July 29

We are aware of a networking issue with an upstream network provider in our DFW data center. Our teams are currently working on routing around this issue, and we’re already seeing some sites recover.

Sorry for the trouble! We are working on the issue, and will follow up with another update as we investigate further.

We will continue to update this post and tweet out status updates from @wpvipstatus until the issue is resolved.

If you have any questions, please open a support ticket and we will be happy to assist.

13:16 UTC: Service is back to normal: The majority of sites have now recovered.

13:31 UTC: Service is normal: We’ve addressed the issue with an upstream network provider, and all sites have now recovered.

Resolved: Brief, Localized Service Degradation

A network problem in our Los Angeles datacenter affected a subset of infrastructure between 19:52 and 20:00 UTC. For sites with their origin in Los Angeles and with containers in the affected area of the data center, elevated rates of 503 errors and timeouts for uncached content would have been noticeable. At this time the cause has been identified and service levels are back to normal.

Questions?

If you have any questions related to this incident, please open a support ticket and we will be happy to assist.