If you had trouble loading your favourite sites at the end of February, fret not: you weren’t alone. Websites such as Giphy, Quora, Slack and Imgur were inaccessible for a period of time on the 28th of February after many Amazon servers went down, causing sites, apps and other services that rely on the cloud-based system to slow to a crawl and, in some cases, stop working.
Amazon acknowledged the outage and closely monitored its systems, narrowing the problem down to a Northern Virginia location, which was the source of the S3 web services errors. The issue was remedied within hours, but it’s only now that the company has released the reason why the popular service went down. The reason? A typo.
On the date of the outage, the Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37 AM PST, an authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that are used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
The removed servers also supported two other S3 subsystems, which caused the initial mistake to spiral out of control. As a result, all the connected systems had to undergo a full restart. While this may sound simple, one would be surprised to find out that many Amazon servers haven’t been restarted in years, which prolonged the downtime for a portion of the internet.
“We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”
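To make the safeguard Amazon describes a little more concrete, here is a minimal sketch of what such a check could look like. It is not Amazon's actual tool: the names `Subsystem`, `check_removal`, `CapacityError` and `max_per_step`, and all the numbers, are hypothetical and chosen purely for illustration. The idea is simply that a removal request is refused if it is too large for a single step or if it would leave any dependent subsystem below its minimum capacity.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass
class Subsystem:
    # All fields are illustrative; real capacity models are far richer.
    name: str
    active_servers: int
    min_required: int  # assumed floor below which the subsystem cannot operate


def check_removal(subsystems, count, max_per_step=2):
    """Validate a request to remove `count` servers shared by `subsystems`.

    Returns `count` if the request is safe, otherwise raises CapacityError.
    """
    if count <= 0:
        raise CapacityError("count must be a positive number of servers")
    # Rate limit: never remove more than a small batch in one step.
    if count > max_per_step:
        raise CapacityError(
            f"refusing to remove {count} servers at once (limit {max_per_step})"
        )
    # Floor check: every subsystem that depends on these servers must stay
    # at or above its minimum required capacity after the removal.
    for s in subsystems:
        remaining = s.active_servers - count
        if remaining < s.min_required:
            raise CapacityError(
                f"{s.name} would drop to {remaining} servers, "
                f"below its minimum of {s.min_required}"
            )
    return count


if __name__ == "__main__":
    shared = [
        Subsystem("subsystem_a", active_servers=10, min_required=8),
        Subsystem("subsystem_b", active_servers=10, min_required=9),
    ]
    for requested in (1, 2, 50):  # 50 stands in for a mistyped input
        try:
            check_removal(shared, requested)
            print(f"remove {requested}: allowed")
        except CapacityError as err:
            print(f"remove {requested}: blocked ({err})")
```

With a check like this in front of the command, a mistyped input is rejected before any server is touched, rather than being carried out and discovered afterwards.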
At the end of its lengthy post, Amazon apologised for the impact the outage had on its customers, their applications, end users and businesses. Considering how many services rely on Amazon servers, Netflix among them, it’s a good thing that Amazon is adding safety checks to its systems to prevent further outages.
