Partial service degradation of London load balancer

Traffic routed to our London (LHR) edge data center may intermittently experience elevated levels of 503 errors. Requests routed to other data centers are not affected. We are actively investigating this incident and are working on a resolution.

Beginning Tuesday, June 22, at 16:40 UTC, our team performed a load-balancer cycling procedure to mitigate a performance regression affecting application stability. This was in response to instability affecting applications with traffic going through our London data center over the preceding days.

If you have any questions relating to this incident, please open a support ticket, and we will be happy to assist.

Incident Report: December 17 Performance Issue

On Thursday, December 17 at 17:35 UTC (December 17 at 12:35 p.m. ET), the VIP Platform experienced degraded performance that lasted approximately 61 minutes.

VaultPress is a service VIP uses to facilitate backup access. The VaultPress service experienced degraded performance that slowed down requests made to it from VIP sites, in some cases leading to timeouts or 503 errors. Intermittent issues on some sites occurred between 17:35 UTC and 18:36 UTC. The incident included a VaultPress outage that occurred between 18:04 and 18:06 UTC.

We take incidents like this very seriously and would like to outline what happened, as well as the steps we have taken to help prevent future occurrences.

What Happened

The VaultPress plugin communicates with Automattic servers through a PHP shutdown hook. When the VaultPress service degraded, thread shutdown took as much as an additional 30 seconds, because the hook makes three sequential API requests, each with a 10-second timeout. This affected responses to uncached requests, such as those from logged-in users or for pages not in cache or expiring from it. Most users were initially served from the Varnish cache on the load balancers. For sites with significant traffic, the uncached traffic tied up enough concurrent PHP threads to cause the load balancers to issue 503 responses. That made the affected sites partially unavailable for a period of time that varied by site.
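
The 30-second figure follows from three sequential requests each waiting out a full 10-second timeout. The Python sketch below illustrates that arithmetic. The function names are hypothetical and are not part of VaultPress or the VIP codebase, and the shared-budget variant is one common defensive pattern, not necessarily the mitigation that was deployed.

```python
def worst_case_delay(num_requests: int, timeout_s: float) -> float:
    """Upper bound on added time when every sequential request times out."""
    return num_requests * timeout_s


def delay_with_budget(num_requests: int, timeout_s: float, budget_s: float) -> float:
    """Same bound when all requests must share one overall time budget."""
    remaining = budget_s
    spent = 0.0
    for _ in range(num_requests):
        if remaining <= 0:
            break  # budget exhausted: skip the remaining requests
        wait = min(timeout_s, remaining)  # never wait past the shared budget
        spent += wait
        remaining -= wait
    return spent


print(worst_case_delay(3, 10.0))        # 30.0
print(delay_with_budget(3, 10.0, 5.0))  # 5.0
```

With per-request timeouts alone, the delay grows with the number of requests. A shared budget caps the total regardless of how many requests the hook makes.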

Impact

During the performance issue, up to 13% of VIP sites may have been affected, and of those sites, up to 26% of requests returned a 501-599 error (most often, a 503 error). Across all sites, the rate of these errors was 0.16% during the incident. This means that VIP site visitors may have encountered one of these error types when attempting to access the front or back-end of their VIP applications. This situation was intermittent and was partially mitigated by our caching. It did not affect all sites. Sites were only affected if VaultPress was installed, the shutdown hook fired, and traffic was sufficiently high.

Not Impacted

Sites not attempting connections to VaultPress during the outage would not have been affected. Additionally, the incident did not affect site backups, which are stored in a redundant manner on the VIP platform.

Timeline

  • On December 17th at 17:35 UTC, VaultPress began experiencing issues.
  • At 17:56 UTC, we received the first reports of issues.
  • At 18:13 UTC, we identified that the issue was related to VaultPress and the shutdown hook and started preparing mitigation code.
  • At 18:24 UTC, we deployed code to mitigate the problem.
  • At 18:56 UTC, the VaultPress outage was resolved, but monitoring continued.
  • At 20:23 UTC, additional rate limiting was implemented as a safeguard.

Future Prevention

We have already implemented additional safeguards and process improvements designed to prevent similar issues from happening again. These include (but are not limited to):

  • Architecture changes in the VaultPress service that prevent the type of outage that occurred.
  • Improving the rate limiting in VIP sites, so that even if an outage were to occur, VIP sites would remain unaffected.
  • Improving defensive timeouts and “circuit breaker” functionality to prevent issues with external services from affecting sites.
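
As an illustration of the “circuit breaker” idea mentioned above, here is a minimal Python sketch. It is not VIP's implementation: the class name, the thresholds, and the fallback behavior are all assumptions chosen for the example. After a set number of consecutive failures, calls to the external service are skipped for a cool-down period, so a slow dependency cannot tie up request threads.

```python
import time


class CircuitBreaker:
    """Skip calls to a failing service until a cool-down period has passed."""

    def __init__(self, max_failures=3, reset_after_s=60.0, clock=time.monotonic):
        self.max_failures = max_failures    # hypothetical threshold
        self.reset_after_s = reset_after_s  # hypothetical cool-down
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback  # open: fail fast without calling the service
            self.opened_at = None  # cool-down over: allow one trial call
            self.failures = self.max_failures - 1  # one more failure re-opens
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            return fallback
        self.failures = 0  # success closes the circuit fully
        return result
```

A caller would wrap each request to the external service in `breaker.call(...)` and treat the fallback value as "skipped". The injectable `clock` argument exists only to make the cool-down testable.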

Questions?

If you have any questions related to this incident, please open a support ticket and we will be happy to assist.

This incident was unrelated to a recent incident related to certificates.

Incident Report: December 17 00:00 UTC Service Interruption

On Thursday, December 17 at 00:00 UTC (December 16 at 7 p.m. ET), the VIP Platform experienced a widespread service interruption that lasted approximately 13 minutes. Some services, like uploading images to the file server, were impacted for up to 33 minutes.

We take incidents like this very seriously and would like to outline what happened, as well as the steps we have taken to help prevent future occurrences.

What Happened

Connections to the VIP origin-servers and VIP Files Service were disrupted because of an expired certificate. This happened because, although a new certificate had been issued and deployed, it did not get updated across the entire VIP network. While our existing monitoring caught the expiry, it did not catch that a subset of the network had not been updated.

Impact

During the service interruption, 93% of all requests were successful, while 7% returned a 503 error. This means that VIP site visitors may have encountered a 503 error when attempting to access the front or back-end of their VIP applications.

Timeline

  • On December 17th at 00:00 UTC, the old certificate still in use on a subset of hosts expired.
  • At 00:11 UTC, we received the first reports of interruptions to uploading files for back-end users.
  • At 00:20 UTC, we received the first report of uncached requests not loading, affecting external website visitors.
  • At 00:33 UTC, the issue was fully resolved with the correct certificate loading across the VIP Platform.

Future Prevention

We have already implemented additional safeguards and process improvements designed to prevent similar issues from happening again. These include (but are not limited to):

  • Extending alerts for expiring certificates to all VIP hosts using a TLS certificate, in order to ensure that certificate updates are propagated to the whole platform.
  • Improving the existing monitoring for TLS certificates with fine-tuned checks that initially trigger 30 days before a certificate expires and repeat with an increased frequency when the expiration date approaches.
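
The escalating check schedule described above could look roughly like the following Python sketch. Only the 30-day starting point comes from this report. The function names, the tiers below 30 days, and the per-host mapping are illustrative assumptions, not the actual VIP monitoring configuration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def next_check_interval(expires_at: datetime, now: datetime) -> Optional[timedelta]:
    """Return how long to wait before the next expiry alert check.

    Returns None while the certificate is more than 30 days from expiry.
    The tiers below 30 days are assumptions made for this example.
    """
    remaining = expires_at - now
    if remaining > timedelta(days=30):
        return None  # outside the alert window
    if remaining > timedelta(days=7):
        return timedelta(days=1)
    if remaining > timedelta(days=1):
        return timedelta(hours=6)
    return timedelta(hours=1)  # under a day left, or already expired


def hosts_in_alert_window(expiry_by_host: Dict[str, datetime], now: datetime) -> List[str]:
    """List every host whose own certificate is inside the alert window."""
    return sorted(
        host
        for host, expires_at in expiry_by_host.items()
        if next_check_interval(expires_at, now) is not None
    )
```

Checking the expiry date reported by each host, rather than the date of the most recently issued certificate, is what would surface a subset of hosts still serving the old one.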

Questions?

If you have any questions related to this incident, please open a support ticket and we will be happy to assist.

Resolved: Intermittent Issues on the VIP Platform

The VIP Platform is currently experiencing intermittent issues. Customers may see slow requests or timeouts. Our team is actively investigating and will provide updates in this post.


Update 19:02 UTC: Our team is working on resolving the issue. This has largely been resolved, and our team is continuing to monitor the situation.


Update 19:36 UTC: We’re seeing stability across the platform. Our team is continuing to monitor the situation. We can also confirm that this intermittent issue is unrelated to a recent certificate issue.


Update 20:22 UTC: The issue reported earlier is now resolved. Our VaultPress service experienced degraded performance that slowed down requests made to it from VIP sites, in some cases leading to timeouts or 503 errors. Intermittent issues on some sites occurred between 17:35 UTC and 18:36 UTC. We’ve implemented changes to the way VaultPress operates to prevent this issue from recurring. We’re also looking at how we can add protections at the VIP level.

Resolved: Service Interruption on the VIP Platform

The VIP platform experienced a brief service interruption that started at 00:00 UTC with impacts to uploading files, then uncached requests were impacted starting at 00:20 UTC. Our team is actively investigating and we will provide an update on the cause both here and on our Twitter account (@wpvipstatus) as soon as possible.


Update 00:47 UTC: Our team was able to resolve the issue as of 00:33. We are still investigating the root cause of our service disruption and will continue to provide updates here as they are available.


Update 01:16 UTC: The service interruption was the result of an expired certificate left in place on a subset of servers that was not caught by our existing monitoring. Service was restored when the updated certificate was installed. We are adding additional monitoring to prevent this from recurring in the future. Please see our incident report for more details.

Notice: New Relic Temporarily Disabled

This notice relates to the following platforms: VIP Go

UPDATE (Friday, November 6 @ 22:00 UTC):

New Relic continues to be disabled across the VIP Go Platform.

Our team is still working with New Relic engineers to resolve the performance issues, but the root cause has not yet been determined.

We’ll publish a new post on Monday with more details about our investigation so far as well as options for customers that would like to restore their access in a limited manner.


UPDATE (Monday, November 2 @ 16:00 UTC):

New Relic is still disabled across the VIP Go platform but has been enabled on a select number of sites.

Working with the New Relic team, we have upgraded the PHP agent to a newer release. While there were some minor improvements noticed, the overall performance costs were still too high (compared to the expected baselines). Our investigation with their team continues.


UPDATE (Tuesday, October 27 @ 16:00 UTC):

New Relic continues to be disabled across the VIP Go Platform.

One clarification: our initial post mistakenly mentioned that this was a “maintenance” change. This change was an emergency response made after discovering that the New Relic integration was causing unreasonably high resource usage across the platform, resulting in degraded application performance (notably higher response times and increased error rates such as 5xx errors) and an increased risk of instability for sites. We know how important application monitoring is to our customers, so this decision was not made lightly.

We’ve started initial testing to re-enable New Relic in a limited manner for a select number of sites until the root cause is identified and fixed (and proper New Relic coverage can be restored). Those investigations are ongoing with the New Relic team.


UPDATE (Friday, October 23 @ 22:00 UTC):

New Relic is still disabled and our investigation into the performance issues is ongoing. As mentioned yesterday, we have other streams of work in progress to explore alternate ways of surfacing performance and monitoring information (on a limited scale); we hope to have more information on this early next week. Note that we’re going to pause updates over the weekend but will have another update on Monday.


UPDATE (Thursday, October 22 @ 22:00 UTC):

New Relic continues to be disabled across the VIP Go Platform. Our investigation is ongoing and further changes have been made to narrow down the cause. We’re also continuing with work that will allow us to re-enable New Relic in a limited manner until the root cause is fixed. And finally, we’re looking at alternate ways to surface instrumentation, logging, and performance data that we know is critical to our customers.


UPDATE (Wednesday, October 21 @ 22:00 UTC):

New Relic is still disabled across the VIP Go Platform. We’re continuing our investigation alongside the New Relic team and have attempted some changes to rule out possible causes. We have a few other leads to follow and we’ll continue to update here as we learn more.


UPDATE (Tuesday, October 20 @ 22:00 UTC):

New Relic continues to be disabled across the VIP Go Platform. It was originally disabled because it was observed to have an adverse impact on site performance (much slower response times, higher error rate, etc.). We are working with New Relic support to identify the root cause of the performance impact and mitigate it. Our team is also working on a plan which would allow us to re-enable New Relic in a limited manner until the root cause is fixed.


Beginning at 17:00 UTC today, Monday, October 19, the New Relic application monitoring service will be disabled on VIP Go for some maintenance work. Disabling this service will not impact the performance of VIP Go applications.

For debugging slow queries in the interim, we recommend enabling Query Monitor.

We will update the VIP lobby when the New Relic service is re-enabled.

If you have questions about your application’s performance, or about New Relic on the VIP Go platform, please open a support ticket.

Resolved: Network Availability Issues at Some Edge Locations

We recently experienced an issue with an upstream network provider that led to a service degradation at some edge locations on our platform, and requests to those locations may have resulted in slow load times or errors.

Sorry for the trouble! Service has been restored, and we will add more details to this post soon.

If you have any questions or are currently experiencing availability issues, please email vip-support@wordpress.com.

[Resolved] GitHub Performance Issues Affecting WordPress VIP

Resolved: GitHub is reporting resolution of performance issues, and all services are operating normally. All VIP services should be operating as expected at this time. If you experience any further issues, please reach out to us directly at vip-support@wordpress.com.

Update: GitHub has deployed a fix and is monitoring recovery. We are continuing to monitor the situation. (17:32 UTC)

We are aware of ongoing performance issues on GitHub which are affecting some VIP sites. The issues may affect code deploys. We are monitoring the situation, and will follow up with another alert once this is resolved.

We will continue to update this post and tweet out status updates from @wpvipstatus until the issue is resolved. You can also subscribe for updates directly from GitHub regarding this incident here:
https://www.githubstatus.com/incidents/phnch1rww464

If you have any questions, or are experiencing any issues, please email vip-support@wordpress.com.

WordPress VIP COVID-19 Readiness

In light of recent COVID-19 events, we wanted to take a moment to let you know about WordPress VIP’s commitment to business continuity during this time. We place the reliability of our platform at the heart of everything we do and we take great pride in the trust you place in us each and every day.

Global Customer Support

WordPress VIP is a fully distributed business, with all of our employees accustomed to working outside a traditional office setting. Our team spans nearly 20 countries, and together with our parent company, Automattic, we have led the way on distributed work for more than 15 years. We are committed to providing our normal operations throughout this time.

Platform Availability

The WordPress VIP platform resides on our own infrastructure with built-in redundancy across more than 25 data centers worldwide. As a distributed team, our systems and automations are designed to be managed remotely by our team. We will continue to work closely with our business partners and technology service providers to ensure the same level of availability, service, and performance that our enterprise customers have grown accustomed to.

Health and Community

We prioritize the health and safety of our team, customers, partners, and local communities. For that reason, we have cancelled all work-related travel for our employees in the near future, and continue to monitor and adhere to advisories from governments and health organizations.

We deeply value our partnership with you and remain committed to providing excellent service throughout this challenging time. If you have any questions or concerns, please do not hesitate to reach out to our team.

Sincerely,
Steph Yiu
Head of Support

Notice: Let’s Encrypt Cert Reissuance

(02:23 UTC) Update: We have confirmed that any and all impacted domains have had a certificate successfully re-issued.

(00:17 UTC) Update: It was incorrectly reported that this action was complete. At this time the re-issuing of certs is ongoing.

We’ve completed the reissuance of Let’s Encrypt certs affected by the Let’s Encrypt announcement on March 3rd, 2020.

From the Let’s Encrypt announcement: “Due to the 2020.02.29 CAA Rechecking Bug, we unfortunately need to revoke many Let’s Encrypt TLS/SSL certificates. We’re e-mailing affected subscribers for whom we have contact information.”

No action is needed on the part of VIP clients using Let’s Encrypt certificates. The VIP Team was notified by Let’s Encrypt, and began reissuance of the affected certificates. At this time, the reissuance has been completed.

If you have any questions, please open a support ticket and we’ll be happy to assist.