Breaking

Friday, September 7, 2018

Microsoft South Central U.S. data center outage takes down a number of cloud services

A cooling problem in Microsoft's South Central U.S. data center seems to be causing issues for a number of Microsoft Cloud services users in the U.S. and beyond.


A number of U.S.-based customers connecting to Microsoft cloud services, including Office 365, Azure Active Directory and Visual Studio Team Services, are reporting outages this morning, September 4. The issue seems to stem from problems in Microsoft's South Central U.S. data center.

I'm hearing from users on Twitter that the problems are also affecting users in other regions who rely on services that authenticate with Azure Active Directory or who are attempting to use the Azure Resource Manager.

An Office 365 service health status page notes that some users have been unable to authenticate or connect to Office 365 services. The Office 365 admin center also has been down for these customers since at least 9:09 a.m. UTC (5:09 a.m. ET), according to that page.

"We've determined that a data center issue caused a subset of the Office 365 service to become degraded. We're connecting some of the affected services to an alternate infrastructure, while remediating the underlying issue within the data center," Microsoft's message says.

Office 365 doesn't run on Azure, but it does use Azure Active Directory authentication services.

An Azure Support message on Twitter acknowledged the problem in the South Central U.S. and said that "engineers were still working on the resolution of an issue affecting resources" there. Microsoft is "actively monitoring for impact to other regions also," the message said, directing customers to see status.azure.com for further updates. But as a number of Twitter users noted, they were unable to access that page.

Those who can see the Azure status page may see this message (as of 10:25 a.m. ET):

"Starting at 09:29 UTC on 04 Sep 2018, customers in South Central US may experience difficulties connecting to resources hosted in this region. Engineers have isolated an issue with cooling in one part of the data center, which caused a localized spike in temperature, as the preliminary root-cause, which has now been mitigated. Automated data center procedures to ensure data and hardware integrity went into effect when temperatures hit a specified threshold and critical hardware entered a structured power down process. Engineers are now in the process of restoring power to affected devices as part of the ongoing mitigation process.

"Some services may also be experiencing intermittent authentication issues due to downstream Azure Active Directory impact, and engineers are separately working on mitigation options for this also.

"The next update will be provided at 15:00 UTC or as events warrant."


I've heard from one contact that a lightning strike in the San Antonio, Texas, area may have caused the cooling system to go down in Microsoft's data center there.

Update (September 4, 11 a.m. ET): The Office 365 service health status page was updated, noting that Exchange, Power BI, SharePoint, Teams and Intune users all may be affected. The page says "This issue could potentially affect any of your users who are hosted out of the San Antonio data center," but I'm hearing from users outside the U.S. who say they are also affected. Next update from Microsoft is scheduled for 4:30 p.m. UTC (12:30 p.m. ET).

Update (September 4, 1:45 p.m. ET): I'm hearing from a number of users worldwide that their Microsoft cloud services are coming back.

Here's an update about why some customers not using the South Central U.S. data center may have been affected, from Alex Simons of the Azure Identity division (via Twitter):

"Many of you are asking -- here's a quick update. Azure AD Service has been experiencing load issues in North America with several high volume tenants doing massive auth retries. Availability in some NA (North American) tenants has dropped to 70%. Tenants in Europe and Asia are not affected.

"Mitigations are underway and capacity has now returned to >90% in North America and should be back to full availability very soon. We are deeply sorry for any negative impact to our customers are experiencing."

(Just to be clear, I've been hearing from Microsoft cloud customers outside the U.S. who have been affected, in spite of what Simons' tweet says.)
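For readers wondering how "massive auth retries" can make an outage worse: clients that retry immediately after a failed sign-in add load to a service that is already struggling. A common client-side mitigation is exponential backoff with random jitter. The Python below is only a minimal sketch of that general technique, not anything Microsoft has described; `call_with_backoff`, its parameters and the `request` callable are hypothetical names chosen for illustration.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call request() until it succeeds or max_attempts is reached.

    After each failure, wait a random time between 0 and
    min(cap, base * 2**attempt) seconds ("full jitter"), so that many
    clients failing at the same moment do not all retry in lockstep.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

A caller would wrap its own sign-in call, for example `call_with_backoff(lambda: get_token())`, where `get_token` stands in for whatever authentication call the application makes. The `sleep` parameter is injectable only so the waiting can be replaced in tests.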

Update (September 4, 3:50 p.m. ET): I still am having problems seeing the Azure status page, but this is the latest from that page (forwarded to me by a Microsoft spokesperson), which does confirm the lightning strike and cooling-system impact -- plus the impact on services in other regions:

CUSTOMER IMPACT: Starting at 09:29 UTC on 04 Sep 2018, customers in South Central US may experience difficulties connecting to resources hosted in this region. Additionally, this issue impacts Azure Active Directory authentication, which may impact services in other regions.

PRELIMINARY ROOT CAUSE: A severe weather event, including lightning strikes, occurred near one of the South Central US datacenters. This resulted in a power voltage increase that impacted cooling systems. Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process.

ENGINEERING STATUS: Engineers have successfully restored power to the datacenter. Additionally, engineers have recovered a majority of the impacted network devices. While some services are starting to see signs of recovery, mitigation efforts are still ongoing.

NEXT UPDATE: The next update will be provided by 20:00 UTC or as events warrant.

Update (September 4, 4:50 p.m. ET): The Azure Support account on Twitter said that, as of 4:15 p.m. ET, engineers had restored power to the South Central U.S. data center. (Because of the cooling system problem, "automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a shutdown process," they said.)

Non-regional services like Azure Active Directory "encountered an operational threshold for processing requests through the South Central US datacenter. Initial attempts to fail over into other datacenters resulted in temporary traffic congestion for those reasons."

Recovery efforts were still ongoing as of this update from Azure Support.


