Another Azure outage, and why regional failover isn't straightforward
[This is a repost from my personal blog.]
It's been a rough month for my sites in the East US Azure region. On March 16, a network issue made it all fall down for about four hours. Today, on April 9, just a few weeks later, I've endured what might be the longest downtime I've ever had in the 18 years I've had sites online, including the time an ISP had to move my aging server and fight a fire in the data center. It will probably be a while until we see a root cause analysis (RCA), but the initial notes referred to a memory issue with storage. The sites were down for around 7.5 hours this time, and the rolling uptime over the last 30 days is now down to 98.5%. That's not very good. Previous outages include the four hours on 3/16/16, two hours on 3/17/15, and two hours for the epic, multi-region failure on 11/20/14. Fortunately, none of these involved data loss, and avoiding data loss is the thing cloud services most need to get right. I moved into Azure about two years ago.
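For the curious, here's a rough sketch of where a number like that comes from. It assumes the only downtime in the 30-day window was the two outages above (March 16 is 24 days before April 9, so it still counts), and it uses my approximate durations, so it lands close to the 98.5% figure rather than exactly on it. The function name is just something I made up for illustration.

```python
HOURS_IN_WINDOW = 30 * 24  # a 30-day rolling window is 720 hours


def uptime_percent(outage_hours, window_hours=HOURS_IN_WINDOW):
    """Return uptime as a percentage, given a list of outage durations in hours."""
    downtime = sum(outage_hours)
    return 100.0 * (window_hours - downtime) / window_hours


# Approximate durations from this post: about 4 hours on March 16,
# around 7.5 hours on April 9.
outages = [4.0, 7.5]
print(f"{uptime_percent(outages):.1f}% uptime over the last 30 days")
```

To put it another way, roughly 11.5 hours of downtime in 720 hours is about 1.6% of the month spent offline.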
Here's the thing: I know full well that CoasterBuzz and PointBuzz don't support life or launch rockets. The ad revenue lost is really not that much, which you could probably guess considering how much I complain about it. Still, the sites are an important time waster for a lot of people, and I've spent a lot of years trying to get some solid Google juice. When the sites are down, it harms their reputation with users and with the search engine bots that are trying to figure out how important the sites are. There is a cost, even if it isn't financial.
My costs are lower, and my flexibility and toolbox are better, since moving to Azure. No question about it. The hassle-free and inexpensive nature of SQL databases in particular is huge, especially the backups and the ability to roll back to previous points in time through the log. That said, the downtime in all but the broken 2014 incident came from regional issues, and the only way to get around that is to have everything duplicated and on standby in another region.
If the only issue were the apps themselves, this would be super easy to handle with Azure Traffic Manager (ATM). Sites go down, boom, they route to a different region. Where things get less obvious is when you have a database failure. Today's failure appears to have been caused by a failure of the underlying storage for the databases, so the apps returned 500 errors. In this case, ATM would presumably reroute traffic to my standby region, where I would have the sites ready to go and pointing to the failover database, also in the other region.
In today's case, I'm not sure if that would have worked. The documentation says that the database failover won't happen until the primary database is listed as being "degraded," but for the entire 7.5 hours today, it was listed in the portal as being "online." It most certainly was not. The secondary database won't come online until the other fails. I assume I could manually force it, but I'm not sure. I'm also not sure what happens when the original comes back online in terms of synchronization, and designating it back as the primary. And what if the apps went down but the databases were fine? Traffic would roll to the other region, but wouldn't be able to connect to the local databases because they're not failed over (and no, I don't want to connect to a database across the country).
So really, there are two issues here. The first is the cost, which even for my little enterprise would add up a bit over the course of a year. The secondary databases in another region would add around $25 per month. Backup sites would cost another $75 a month. ATM cost would be negligible. An extra hundred bucks seems like an awful lot for what I'm trying to do. I did see a good hack suggestion that says you can put the backup sites in free mode, and manually scale up if you need them, then point ATM at them.
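Here's the back-of-the-napkin math as a small sketch. The dollar figures are my estimates from above. Treating the ATM cost as zero, and assuming the free-mode backup sites cost nothing until they're scaled up, are simplifications on my part.

```python
# Estimated monthly costs in US dollars, taken from the figures in this post.
SECONDARY_DATABASES = 25
BACKUP_SITES = 75
TRAFFIC_MANAGER = 0  # "negligible" in the post, treated as zero here


def yearly(monthly_cost):
    """Convert a monthly cost to a yearly cost."""
    return monthly_cost * 12


# Everything duplicated and running in a second region.
full_standby = SECONDARY_DATABASES + BACKUP_SITES + TRAFFIC_MANAGER

# The "free mode" hack: backup sites cost nothing until manually scaled up.
free_mode_standby = SECONDARY_DATABASES + TRAFFIC_MANAGER

print(f"Full standby: ${full_standby}/month, ${yearly(full_standby)}/year")
print(f"Free-mode hack: ${free_mode_standby}/month, ${yearly(free_mode_standby)}/year")
```

The hack cuts the bill by three quarters, but it trades money for a manual step: someone has to be around to scale the sites up.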
The second problem is that the automation is far from perfect. In the sites-down, databases-up scenario, it would fail. Today, with the databases listed as "online" but really not working, it would have failed too. I wouldn't feel comfortable getting on a cruise knowing that while I'm at sea there could be a problem.
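To make those scenarios concrete, here's a toy model of the failover as I understand it. It is not real Azure code and doesn't call any Azure API. The rules are assumptions based on my reading of the docs: traffic moves whenever the primary sites stop responding properly, and the database only fails over when the platform reports it as "degraded." The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    apps_up: bool      # are the primary-region sites themselves running?
    db_working: bool   # is the primary database actually serving queries?
    db_reported: str   # what the platform says: "online" or "degraded"


def outcome(s: Scenario) -> str:
    # Assumption: the sites return 500s when their database is broken, so
    # traffic leaves the primary region if either the sites or the DB is down.
    traffic_moves = (not s.apps_up) or (not s.db_working)
    # Assumption: automatic database failover keys off the reported status,
    # not off whether the database really works.
    db_fails_over = s.db_reported == "degraded"

    if not traffic_moves:
        return "served from primary region"
    if db_fails_over:
        return "served from standby region"
    return "down: standby sites have no usable local database"


scenarios = [
    Scenario("normal day", True, True, "online"),
    Scenario("textbook database failure", True, False, "degraded"),
    Scenario("sites down, databases up", False, True, "online"),
    Scenario("today: database broken but listed online", True, False, "online"),
]

for s in scenarios:
    print(f"{s.name}: {outcome(s)}")
```

Only the textbook failure ends well. The last two cases are the ones I described, and in this model both leave the sites down unless someone forces the database failover by hand.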
This is mostly academic, and I realize that. If I have to deal with a few hours now and then with the sites down, so be it. Like I said, they're unimportant time wasters. It's just that 98.5% uptime in the last 30 days sucks. I know they can do better.