Amazon's $150 Million Server Outage Caused By IT Worker Typo

Amazon says a typo caused its cloud-computing service to fail earlier this week.

On Tuesday, part of Amazon Web Services stopped working. The company's so-called simple storage service, or S3, provides features ranging from file sharing to web feeds.

In an online statement, Amazon described the circumstances of the disruptive typo this way:

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process.

"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."

The company did not say what, exactly, the "authorized S3 team member" mistyped. It did say that it took about three hours to bring part of the system back up, and more than four hours before S3 was back to normal.

The Wall Street Journal reported that the outage "cost companies in the S&P 500 index $150 million, according to Cyence Inc., a startup that specializes in estimating cyber-risks. Apica Inc., a website-monitoring company, said 54 of the internet's top 100 retailers saw website performance slow by 20% or more."

"People reported outages and delays on services like Slack, Trello, Sprinklr, Venmo and even Down Detector, which is the site that shows where real time outages are occurring," reported CNN Money.

The tech site Gizmodo, whose own website was disrupted, reported that the forum site Quora was also affected.

Even Apple relies on the Amazon system for some of its own cloud services, and parts of its iCloud service were disrupted.

Amazon said it has changed its protocol for the routine, temporary removal of servers from its system so that server capacity is taken offline more slowly, among other safeguards.
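Amazon has not published the tool itself, so the following is only a rough sketch of what such a safeguard could look like. Every name in it (`plan_removal`, `min_capacity`, `max_per_step`) is hypothetical, and the minimum-capacity check is an assumed example of the "other safeguards" rather than a detail from this report. The idea is that a mistyped, oversized request is rejected outright, and a valid one is split into small batches so capacity goes offline gradually.

```python
def plan_removal(active, requested, min_capacity, max_per_step):
    """Split a server-removal request into small batches.

    All parameters are hypothetical illustrations:
      active        -- servers currently in service
      requested     -- servers the operator asked to remove
      min_capacity  -- assumed floor the subsystem must stay above
      max_per_step  -- most servers that may be removed in one batch
    """
    if requested < 0 or max_per_step < 1:
        raise ValueError("requested must be >= 0 and max_per_step >= 1")

    # Reject the whole request if it would leave too few servers running.
    if active - requested < min_capacity:
        raise ValueError(
            f"removing {requested} of {active} servers would drop below "
            f"the minimum of {min_capacity}"
        )

    # Break the request into batches no larger than max_per_step.
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(max_per_step, remaining)
        batches.append(step)
        remaining -= step
    return batches
```

Under these assumptions, `plan_removal(active=100, requested=5, min_capacity=80, max_per_step=2)` returns `[2, 2, 1]`, while a mistyped `requested=50` raises an error instead of removing anything.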

"This will prevent an incorrect input from triggering a similar event in the future," the company wrote. (NPR)