Welcome to our ongoing series covering downtime blunders around the world (wide web). We first covered a series of downtime incidents in 2017, and now we're making it a regular feature on the Downtime Prevention Blog at Blue Matador (monitoring AWS since 2016). Until we help rid the world of the evils of service interruptions, we'll report on some of the more notable or interesting incidents here.
World of Tanks - March 18, 2018
Free-to-play World of Tanks, the multiplayer armored-combat game available on PS4, Xbox 360, Xbox One, and Windows 10, received the first major overhaul of its 8-year life on March 21st. But the update didn't come without a major downtime incident. In preparation for the update, the game was inaccessible to players for a full 24 hours. The game was developed by Cyprus-based Wargaming Group but is truly an international phenomenon, so players around the globe were affected. (If you're like us and had to look Cyprus up on a map, know that it's a large, independent island country in the Mediterranean Sea.)
The update leads us to a question: what is the definition of "downtime"? If an outage is planned, does it still count as downtime? Many DevOps engineers would say no, but we think that any time customers can't access your product or service when they want to, it's an outage to them.
Yet this is how ops teams have traditionally skirted their SLA obligations. If downtime is "planned," it can go unreported on status boards, so everything still looks good, at least on the website. (Here's a hint: Customers don't like downtime, planned or unplanned.)
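To put a number on the gap between what a status board reports and what customers experience, here is a rough sketch of the arithmetic. The 24 hours of planned downtime comes from the incident above; the 31-day month and the assumption of zero unplanned downtime are ours, chosen only for illustration, and the `availability` helper is a hypothetical name.

```python
def availability(total_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the reporting period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    return 100.0 * (total_hours - downtime_hours) / total_hours


HOURS_IN_MONTH = 31 * 24   # assumption: a 31-day reporting period (744 hours)
PLANNED_HOURS = 24.0       # the planned maintenance window described above
UNPLANNED_HOURS = 0.0      # assumption: no other incidents that month

# What the status board shows if planned downtime is excluded.
reported = availability(HOURS_IN_MONTH, UNPLANNED_HOURS)

# What customers actually experienced.
experienced = availability(HOURS_IN_MONTH, UNPLANNED_HOURS + PLANNED_HOURS)

print(f"Reported: {reported:.2f}%  Experienced: {experienced:.2f}%")
```

Under those assumptions, the status board would show 100% while players saw roughly 96.77%, which is well short of the "three nines" many SLAs promise.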
Skype - March 13, 2018
Telephone communications software company Skype, which Microsoft acquired in 2011 for $8.5 billion, went down for more than an hour early in the morning of March 13 (Pacific time). The service, which commands some 40% of the international telephone call market, was completely unavailable for calls to mobiles or landlines for the entirety of the incident.
Skype's official description of the incident reads:
Service outage of calls to mobiles and landlines affects most Skype users and causes serious problems when calling mobiles and landlines from Skype. Calls might not connect at all and might drop unexpectedly.
Customers started going to Skype's support forum (among other places) to complain.
It's worth noting that a service outage doesn't necessarily make the entire product inaccessible. For example, instant messaging and SMS relay worked just fine during Skype's downtime. But when the downtime affects your product's core competency (in this case, international calling), customers can view the whole product as being down, which can make the incident look larger in your customers' eyes than in your own.
Brazilian Bitcoin Exchange Foxbit - March 10, 2018
For 72 hours, Brazil's largest cryptocurrency exchange, Foxbit, suffered a major outage caused by a software error in its withdrawal system (via CCN, not to be confused with CNN). The bug, which was only detected after the fact, allowed customers to withdraw from their bitcoin balances twice but only have one withdrawal counted.
Some $270,000 (USD) worth of bitcoin (about 30 BTC total) slipped through the withdrawal bug, prompting the exchange to take its servers offline in an emergency maintenance mode while it worked to fix the problem. In the process, many customers' data was corrupted, and the company has been working to restore lost funds and reconcile the mismatched data.
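We don't know what Foxbit's withdrawal code looks like, but one common guard against this class of bug is to make withdrawals idempotent: every request carries a unique ID, and a replayed ID is never debited twice. The sketch below is purely illustrative. The `Ledger` class, the `request_id` parameter, and the in-memory storage are all our own assumptions, not anything from the exchange.

```python
import threading
from decimal import Decimal


class InsufficientFunds(Exception):
    """Raised when an account cannot cover the requested withdrawal."""


class Ledger:
    """Hypothetical in-memory ledger with idempotent withdrawals."""

    def __init__(self, balances):
        self._balances = dict(balances)
        self._processed = {}            # request_id -> (account, amount)
        self._lock = threading.Lock()   # serialize check-and-debit

    def balance(self, account):
        return self._balances[account]

    def withdraw(self, request_id, account, amount):
        """Debit once per request_id. Returns False for a replayed request."""
        with self._lock:
            if request_id in self._processed:
                return False  # already counted: do not debit again
            if amount <= 0:
                raise ValueError("amount must be positive")
            if self._balances[account] < amount:
                raise InsufficientFunds(account)
            # Debit and record the request inside the same critical section,
            # so the balance and the withdrawal history cannot drift apart.
            self._balances[account] -= amount
            self._processed[request_id] = (account, amount)
            return True


ledger = Ledger({"alice": Decimal("1.0")})
first = ledger.withdraw("req-1", "alice", Decimal("0.4"))
replay = ledger.withdraw("req-1", "alice", Decimal("0.4"))  # duplicate submit
print(first, replay, ledger.balance("alice"))
```

A real exchange would keep the processed-request record in the same database transaction as the balance update. The point here is only that the balance check, the debit, and the bookkeeping happen as a single step.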
The outage and the software flaw highlight security concerns surrounding Bitcoin as a currency, which has come under fire internationally from lawmakers and private entities alike. They also highlight an important problem with downtime: even when it is self-induced for valid reasons, it can have negative repercussions, like corrupting your users' data. Good thing Foxbit says it has more than 7,500 BTC that wasn't affected in the incident.
Downtime doesn't have to happen — even for planned outages. Ready for your organization to receive predictive recommendations that prevent service outages from starting in the first place?