Have you ever visited your favorite website only to be welcomed by an error page or a note saying that the site is down? Or have you ever left a site after a few minutes because you couldn’t do anything on it, since everything was moving too damn slow or clicking buttons didn’t do anything?
There’s a huge chance you’ve encountered a site outage. Most of the time, when a site has an outage, its owners quickly inform users by putting up a note saying that there’s a service outage and that they’re doing everything they can to determine the cause and restore operations.
Sites often have a blog or a developer site where they tell people what’s going on, but users often ignore those or don’t even know they exist. You often hear people cursing a site, saying that it’s always down or that it’s useless, and even swearing never to use it again, without making any effort to find out why the outage happened.
Online shopping site Etsy recently had two outages, one on July 30 and a second on August 10. Although hacked sites are common these days, the Etsy outages weren’t hack-related. They were actually side effects of work to improve the site and its service.
The company’s IT team sat down and did a postmortem on these events so that everyone can learn what happened, how it happened, and how it might be avoided in the future. They published their findings in an article called “Demystifying Site Outages,” and I encourage you to read it if you work in any operations or administration role.
TL;DR: How the outages went down
Etsy recently expanded its reach by becoming available in more languages, so the team was slowly rolling out an update to their database server software. They were upgrading one server at a time, with the goal that users would never notice the upgrade. Meanwhile, they noticed an issue with their nightly database backups: when a server was finishing its backup, it would stall or lock for a few seconds, sometimes for as long as 30 seconds. This usually happened at around 3 A.M., when site traffic isn’t that heavy. Still, when you’re a customer waiting for your order to be confirmed, a few seconds can be unnerving.
So the Etsy team decided to address that issue as well, since they were already making site improvements. With the database upgrade still rolling out slowly, they wanted to push the backup fix out first. But when their engineers did that, deploying the backup fix also pushed the in-progress database upgrade to all the servers at once, which slowed the site down. So instead of leaving the site running slowly, they decided to take it down so that their customers wouldn’t run into problems and so that they could check that everything was in order.
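For readers who like to see the idea in code, here is a minimal sketch of what "upgrading one server at a time" can look like. It is not Etsy's actual tooling. The names `rolling_upgrade`, `upgrade`, and `is_healthy` are hypothetical, and the health check is an assumption added for illustration.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(
    servers: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade servers one at a time, stopping at the first unhealthy one.

    Returns the servers that were upgraded and passed the health check.
    Both callbacks are hypothetical stand-ins for real deployment steps.
    """
    completed: List[str] = []
    for server in servers:
        upgrade(server)             # touch exactly one server per step
        if not is_healthy(server):  # halt before the problem can spread
            break
        completed.append(server)
    return completed
```

The point of going one server at a time is that a bad upgrade affects a single machine. What hurt in the July 30 outage was the opposite situation: the upgrade reached every server at once.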
As for the August 10 outage, the team was making room for more unique ID numbers. Their servers must be told what the range of these unique IDs will be so that enough space and memory can be set aside for them. The problem occurred when some of the space set aside wasn’t large enough for a specific range of unique IDs. So they again had to take down the site to fix the space allotment and make sure that everything was in order.
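To make the "not enough space for the IDs" problem concrete, here is a small sketch. It assumes, purely for illustration, that an ID is stored in a signed 32-bit field. The source does not say which sizes Etsy actually used, and `first_overflow` is a hypothetical helper.

```python
# Assumed field sizes for illustration only; the article does not name them.
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit field can hold
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit field can hold


def first_overflow(start: int, count: int, max_value: int):
    """Return the first ID in [start, start + count) that does not fit
    in a field whose largest value is max_value, or None if all fit."""
    if count <= 0:
        return None
    last = start + count - 1
    if last <= max_value:
        return None
    return max(start, max_value + 1)


# A range that starts just below the 32-bit limit runs out of room quickly,
# while the same range fits easily in a 64-bit field.
near_limit = INT32_MAX - 2
print(first_overflow(near_limit, 5, INT32_MAX))
print(first_overflow(near_limit, 5, INT64_MAX))
```

If some fields are widened and others are left at the smaller size, the IDs only fail where the smaller fields remain, which is the kind of mismatch the paragraph above describes.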
So before calling a site “stupid” or “useless” because it happens to be down, do a little background check. Sometimes the people behind it are just doing their best to make users happy and not piss them off. To learn more about the Etsy outages, read “Demystifying Site Outages.”