Explanation – San Antonio Data Center Outage

As we mentioned earlier this week, WordPress.com experienced a partial outage and service degradation when one of our three data centers was taken completely offline by a fiber cut. I want to provide more information about how this occurred, what the impact was, and what we are doing to prevent it from happening again.

BACKGROUND

WordPress.com is served in real time out of three data centers.

Currently, we use DNS to distribute traffic between locations, and we have a few ways of managing these DNS entries during an outage. Most of WordPress.com uses wildcard DNS (*.wordpress.com). Some other records have systems in place that automatically remove the DNS record of any location that is not responding, and the rest are managed semi-automatically, meaning a human has to run the commands to remove the affected IP addresses. All of our DNS records have a low TTL to ensure that clients receive the most recent records.
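
The automatic-removal idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not the actual tooling: the function name, the location names, and the addresses (taken from the reserved documentation ranges) are all hypothetical, and the "keep every record if nothing looks healthy" rule is an assumption added so that a faulty health check cannot empty the record set.

```python
from typing import Dict, List


def active_records(locations: Dict[str, List[str]],
                   healthy: Dict[str, bool]) -> List[str]:
    """Return the IP addresses that should stay in a DNS record set.

    locations maps a (hypothetical) location name to its IP addresses.
    healthy maps a location name to the result of its latest health check;
    a location with no health result is treated as not responding.
    """
    live = [
        ip
        for name, ips in locations.items()
        if healthy.get(name, False)
        for ip in ips
    ]
    if not live:
        # Assumption: if every location looks down, the health check itself
        # is the more likely culprit, so publish all records unchanged.
        return [ip for ips in locations.values() for ip in ips]
    return live


# Hypothetical record set: one address per location.
locations = {
    "san_antonio": ["192.0.2.10"],
    "location_b": ["198.51.100.10"],
    "location_c": ["203.0.113.10"],
}
health = {"san_antonio": False, "location_b": True, "location_c": True}
print(active_records(locations, health))
```

With a low TTL on the record, clients pick up the reduced record set shortly after it is published.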

Additionally, for domains hosted on WordPress.com, we return multiple IP addresses. This allows us to take advantage of DNS failover, sometimes called client retry: if one address does not respond, modern browsers will automatically try a different one. Relying on browsers to behave in a certain way is not ideal, so we only depend on this in two situations: when a broken IP address has not been removed from DNS, and during the window between our systems removing the broken addresses and clients refreshing their list.
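
Client retry can be illustrated with a short Python sketch. It assumes a client that simply tries each returned address in order with a short timeout; real browsers differ in ordering and timing, and the function name, default port, and timeout here are illustrative choices rather than anything a particular browser does.

```python
import socket
from typing import Callable, Iterable


def connect_with_retry(addresses: Iterable[str],
                       port: int = 80,
                       timeout: float = 3.0,
                       connect: Callable = socket.create_connection):
    """Try each IP address in turn and return the first successful connection.

    connect defaults to socket.create_connection and can be swapped out
    for testing. Raises ConnectionError if every address fails.
    """
    last_error = None
    for ip in addresses:
        try:
            return connect((ip, port), timeout)
        except OSError as exc:
            # This address is unreachable or timed out; move on to the next.
            last_error = exc
    raise ConnectionError("all addresses failed") from last_error
```

The cost of this approach is visible in the timeout: a client pays the full wait on each dead address before it reaches a working one, which is why removing broken addresses from DNS is still preferable.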

TIMELINE

February 12th 2013, all times UTC –

0600 – Complete loss of connectivity to San Antonio datacenter.
0605 – San Antonio datacenter removed from production for 90% of traffic (everything but the semi-automatically managed DNS records mentioned above).
0605 – 0700 – Troubleshooting network connectivity, trying to determine ETA when service will be restored.
0700 – No ETA for service restoration, decision made to remove San Antonio IP addresses from VIP DNS records.
0700 – 0742 – Testing and verification of DNS updates.
0745 – DNS updates for all VIPs complete.
0930 – Connectivity restored to San Antonio (one link back online).
1113 – Redundant connectivity restored.
1130 – DNS changes reverted, traffic back to San Antonio.
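
For readers who want the elapsed times, the intervals in the timeline above can be worked out with a small Python helper. The helper name is hypothetical, and it assumes both timestamps fall on the same day, which holds for every entry in this timeline.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes elapsed between two same-day HHMM timestamps."""
    fmt = "%H%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Loss of connectivity until DNS updates for all VIPs were complete.
print(minutes_between("0600", "0745"))  # 105
# Loss of connectivity until the first link came back online.
print(minutes_between("0600", "0930"))  # 210
# Loss of connectivity until traffic was returned to San Antonio.
print(minutes_between("0600", "1130"))  # 330
```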

CAUSE

Our San Antonio facility is connected to the Internet via two fiber links that run between San Antonio and Dallas. From Dallas, traffic is sent to the rest of the world. To ensure that the links were redundant, our data center provider contracted with two different companies and verified that the fiber ran along different physical routes. Unfortunately, in the past 24 months both companies were acquired and, without our knowledge, the lines were relocated into the same physical fiber bundle. We are still not 100% sure when this change was made, but we think it was in the past few months. Maintenance work by the fiber owner on February 12th took both circuits offline, causing a complete loss of network connectivity to the data center. Unfortunately, the maintenance notification never made it to our provider or to us; otherwise, we would have taken proactive steps to remove the data center from production.

IMPACT

For subdomains like photos.digitalize.ca, which use a DNS CNAME record to point to WordPress.com, things returned to normal after 5-10 minutes. For top-level domains, like raanan.com, our stats show that about 5% of traffic was impacted between 0600 and 0745. No data was lost.

REMEDIATION

Our provider has been in the process of replacing the existing fiber links with new ones. The end result will be a redundant circuit ring through one provider and a separate redundant circuit through another. We are taking the appropriate precautions to ensure that these new circuits will run on completely separate paths. We are also going to obtain confirmation from both fiber providers that there will not be any work on our circuits for the next 8 weeks, which is the ETA for the new circuits to be fully operational.

We are exploring the possibility of bringing an additional network provider into our San Antonio facility. This would mean that even if both redundant fiber connections to Dallas were impacted, the location would still be able to communicate with the Internet.

We have some tentative plans to switch away from DNS-level distribution between locations and instead use a network routing topology called anycast. This has the advantage of providing faster failover and requiring less manual intervention. Our current anycast implementation was tested during this outage and worked as expected. Until the switch is complete, we will be working on ways to make the current failover faster. Normally, the DNS updates would have been made much more quickly, but unfortunately our scripts required some re-verification after our recent data center migration in December/January. We wanted to make sure we didn’t break anything more severely during the DNS update process.

If you are not using our DNS servers, our ability to protect your sites from exposure to outages like this is limited. We urge you to use our DNS servers if at all possible.

We apologize for the service disruption and, as always, will use this opportunity to make WordPress.com VIP the best it can be.

Barry Abrahamson
Systems and Infrastructure Engineering
