Azure experienced an issue in Azure Active Directory that disrupted access to Office 365 apps and the Azure Admin Portal for two hours or more for some users. Microsoft apologized for it on Monday as part of a March 15 “preliminary root cause analysis” notice explaining what happened.
The notice explained how an internal “cross-cloud migration” exercise, which aimed to improve the Azure AD service, became the origin of the service disruption for some enterprises.
The March 15 disruptions affected users of Azure Admin Portal, Exchange, Teams, SharePoint, Storage, Key Vault, and other major apps.
All of the services were eventually restored, except for “Intune and the Microsoft Managed Desktop,” according to the Microsoft 365 Status Twitter feed.
The incident traces back to Microsoft’s decision to keep a signing key from expiring on its usual schedule in order to facilitate the Azure AD migration. The automated process Microsoft used ignored the key’s ‘retain’ status, so tokens signed by the key came to be seen as ‘untrustworthy.’
The resulting disruptions led Microsoft to roll back to the last saved point to address the problem.
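The failure mode described above can be illustrated with a small sketch. It is not Microsoft’s implementation: the `SigningKey` class, the `retain` flag, the `prune_keys` function, and its `honor_retain` switch are all hypothetical names invented for illustration, and HMAC stands in for whatever signing scheme Azure AD actually uses. The point is only to show how a rotation job that ignores a ‘retain’ marker can make otherwise valid tokens fail verification.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass
class SigningKey:
    key_id: str
    secret: bytes
    expires_at: int       # hypothetical expiry timestamp
    retain: bool = False  # hypothetical flag: keep the key past its expiry


def prune_keys(keys, now, honor_retain=True):
    """Return the keys that are still published after a rotation run.

    With honor_retain=False the job behaves like the buggy automation
    described in the article: it drops expired keys even when retained.
    """
    kept = {}
    for key_id, key in keys.items():
        expired = key.expires_at <= now
        if expired and not (honor_retain and key.retain):
            continue  # expired and not protected: drop the key
        kept[key_id] = key
    return kept


def sign(key, payload):
    """Produce a (key_id, payload, mac) token signed with the given key."""
    mac = hmac.new(key.secret, payload, hashlib.sha256).hexdigest()
    return (key.key_id, payload, mac)


def is_trusted(token, keys):
    """A token is trusted only if its signing key is still published."""
    key_id, payload, mac = token
    key = keys.get(key_id)
    if key is None:
        return False  # unknown key: the token cannot be validated
    expected = hmac.new(key.secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac, expected)


if __name__ == "__main__":
    keys = {"k1": SigningKey("k1", b"demo-secret", expires_at=100, retain=True)}
    token = sign(keys["k1"], b"user=alice")

    # Rotation that respects the flag keeps the key, so the token stays valid.
    print(is_trusted(token, prune_keys(keys, now=200)))
    # Rotation that ignores the flag removes the key, so the token is rejected.
    print(is_trusted(token, prune_keys(keys, now=200, honor_retain=False)))
```

In this sketch the token itself never changes; only the set of published keys does, which is why a rollback of the key state is enough to make the same tokens verifiable again.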
In its notice, Microsoft says that it is undertaking a two-stage process to improve Azure AD without causing further disruptions. The process will add a backend Safe Deployment Process (SDP) system intended to prevent several classes of risk, including the recent key problem.
Microsoft 365 users have been through outages caused by the Azure AD service before. A few years ago, configuration changes caused a 2.5-hour downtime. In 2018, an Azure AD hub in Texas was struck by lightning, causing outages that lasted more than a day.
Even Tony Redmond, a Microsoft Most Valuable Professional, described Azure AD as the ‘Achilles Heel’ of Microsoft 365.