Amazon’s most recent failures with its EC2 server farm highlight how important it is for corporate America to have a solid disaster recovery plan in place, one that can withstand natural disasters and other emergencies, cloud or no cloud. A national network restoration plan is currently in place for specialized services to ensure that critical operations such as ambulance, police, telecommunications carriers, and critical financial services are restored first. But tech giants like Amazon do not qualify for priority restoration, and they continue to struggle to build a solid redundant backbone that can deliver 99.9999% uptime when force majeure strikes.
As we know, millions of customers worldwide were left without Instagram, Netflix, Pinterest, and Twitter this weekend, all wondering how this could happen to a company the size of Amazon. After spending more than 15 years on the carrier side of telecommunications, negotiating vendor contracts within the telecom industry as well as others, here are a few insights into how major networking actually works.
First of all, force majeure is written into all carrier contracts, voiding all uptime SLAs (service level agreements) in the event of storms, floods, earthquakes, and the like. Since 9/11, carriers have also included a clause voiding the uptime guarantee in the event of a terrorist attack. Most server farms, and other vendors offering a myriad of hosted solutions, include these clauses in their contracts as well.
Secondly, redundancy is generally built into the business plan and typically includes anticipated time frames for service restoration, from partial to complete. Some companies guarantee a 72-hour turnaround for partial restoration, while others allow a more extended time frame. During this window the vendors re-route traffic to their backup location(s) while assessing the damage at the primary location, if possible. Chances are that companies that do not qualify for priority restoration will build their plans around the emergency services to keep service restoration time frames as short as possible.
Next, companies save their data in backup locations, and they also work with their customers to test all cut-overs and roll-backs to ensure things go smoothly and minimize data loss in the event of an emergency.
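The cut-over testing described above can be sketched roughly like this. This is a minimal illustration, not any vendor's actual procedure: the function names and the checksum-comparison approach are my own assumptions about how you might verify a backup before switching traffic to it.

```python
import hashlib

def checksum(records):
    """Hash a sorted snapshot of records so the primary and backup
    copies can be compared regardless of storage order."""
    digest = hashlib.sha256()
    for record in sorted(records):
        digest.update(record.encode())
    return digest.hexdigest()

def cutover_drill(primary_records, backup_records):
    """Dry-run a cut-over: confirm the backup site's data matches the
    primary before any traffic is switched, so a roll-back stays safe."""
    if checksum(primary_records) != checksum(backup_records):
        return "abort: backup out of sync, let replication catch up first"
    return "cut-over safe: backup verified, roll-back path preserved"
```

Running a drill like this regularly, rather than only during an emergency, is the point of the testing the article describes: you find the out-of-sync backup on a quiet Tuesday, not during the storm.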
Once the impact hits, hosted-solution providers start turning up additional temporary circuits and re-routing traffic to the backup locations. If you start seeing service restored within 72 hours, the website you’re trying to reach has purchased the maximum uptime guarantee and partial restoration is proceeding.
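The re-routing logic behind that restoration window can be sketched in a few lines. Everything here is illustrative: the site names, health flags, and the 72-hour figure come from the article's example, not from any standard or real provider's configuration.

```python
from datetime import datetime, timedelta

# Hypothetical site list; names and health flags are for illustration only.
SITES = [
    {"name": "primary-east", "healthy": False},  # damaged primary location
    {"name": "backup-west", "healthy": True},    # backup taking re-routed traffic
]

def pick_endpoint(sites):
    """Return the first healthy site, mimicking a re-route to a backup
    location while the primary is being restored. None means total outage."""
    for site in sites:
        if site["healthy"]:
            return site["name"]
    return None

def within_restoration_window(outage_start, now, hours=72):
    """Check whether partial restoration is still inside the 72-hour
    window the article cites for the top-tier uptime guarantee."""
    return now - outage_start <= timedelta(hours=hours)
```

In practice the "pick a healthy endpoint" decision is made by DNS changes, BGP announcements, or load-balancer health checks rather than application code, but the priority-ordered fallback is the same idea.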
Now, why isn’t Instagram on Facebook’s network? Simple. Depending on how the buy-out agreement is structured, Instagram may remain contractually bound to Amazon’s service.
If Facebook is not obligated by Instagram’s contract, it is performing due diligence on its own capacity and redundancy to ensure it has the bandwidth (technical, hardware, etc.) to meet Instagram’s demand. The next steps include hardware purchases, resource sizing, and business analysis of how to move the data, along with numerous test plans covering everything from network stress to turn-up. Considering the size of Amazon, Instagram, and Facebook, this would be roughly a nine-month project, though a well-developed scope and dedicated resources can cut that time frame by at least three months.
All in all, if you’re a Tweep, Pinner, Flixer, Gramite, or whatever you go by, and you can connect to your favorite sites via PC, iPad, or mobile app (which has probably been updated), the site obviously falls into the 72-hour restoration tier. Anticipate lag times, periodic perceived outages, and the like, and know that Amazon has a plan and is diligently fixing the problem.
As for the capacity issues Amazon seems to encounter quite frequently – well, they need to kick some overpriced engineers in the rear for their failure to anticipate demand and plan accordingly.