Code Defect Crashed Azure Cloud Services
Apparently even the best programmers make mistakes. Microsoft this week attributed an April 1 Azure cloud services crash to a code defect.
The outage was brief: services such as Dynamics 365 and Xbox Live generally came back online within an hour of the problems starting at about 5:30 p.m. EDT. Still, the Redmond, Wash., software giant said that recovery time “exceeded our design goal.”
Those services and several others crashed because Azure DNS experienced a service availability issue, Microsoft said, leaving customers unable to resolve domain names for cloud services.
Microsoft described the root cause as follows:
Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches. As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service.
Steps the company is taking to address the issue include:
- Repair the code defect so that all requests can be efficiently handled in cache.
- Improve the automatic detection and mitigation of anomalous traffic patterns.
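The retry feedback loop Microsoft describes can be illustrated with a toy model. The sketch below is not Microsoft's code and uses no real traffic figures. The function name, the parameters and every number in it are hypothetical, chosen only to show how a drop in cache efficiency combined with client retries can turn a steady query rate into a growing one.

```python
def simulate(new_per_tick, cache_hit_rate, backend_capacity,
             retries_per_failure, ticks):
    """Toy model of retry amplification (all values are hypothetical).

    Each tick, new queries plus retries from the previous tick arrive.
    Cache hits are answered at the edge; misses go to a backend that can
    serve only `backend_capacity` queries per tick. Every unserved query
    is retried `retries_per_failure` times on the next tick.
    """
    pending_retries = 0
    offered_history = []
    for _ in range(ticks):
        offered = new_per_tick + pending_retries
        backend_load = offered * (1 - cache_hit_rate)
        failed = backend_load - min(backend_load, backend_capacity)
        pending_retries = failed * retries_per_failure
        offered_history.append(offered)
    return offered_history


# Healthy cache: most queries never reach the backend, so nothing fails.
healthy = simulate(new_per_tick=1000, cache_hit_rate=0.9,
                   backend_capacity=200, retries_per_failure=2, ticks=5)

# Degraded cache: the same incoming rate now overloads the backend,
# and the retries keep adding to the load on every tick.
degraded = simulate(new_per_tick=1000, cache_hit_rate=0.5,
                    backend_capacity=200, retries_per_failure=2, ticks=5)

print("healthy: ", healthy)
print("degraded:", degraded)
```

In this model the retries are indistinguishable from new queries, which mirrors Microsoft's point that its spike mitigation treated them as legitimate traffic and did not drop them.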
A post about the incident on the Downdetector site garnered 1,396 comments from users.
The outage came about two weeks after a March 18 incident involving intermittent failures of Azure Key Vault, one of three incidents reported that month. Microsoft publishes its posts about Azure outages on the Azure status history page.