A typo in a code upgrade for a scale unit of Microsoft Azure DevOps recently caused the deletion of 17 production databases. The lengthy recovery period subsequently led to 10 hours of downtime in the Azure region of southern Brazil.
According to a statement from Microsoft, administrators of Azure DevOps, a suite of application lifecycle services, regularly take snapshots of production databases to investigate reported customer issues or test improvements. In doing so, a system in the background creates these snapshots daily and deletes them after a certain amount of time.
Typo in pull request
During a recent code upgrade that replaced outdated Microsoft.Azure.Management packages with supported Azure.ResourceManager-NuGet packages, a typo snuck into the large pull request, the tech giant says.
This typo, not immediately noticed, caused the removal task to remove the full scale unit Azure SQL server for the South Brazil region with 17 databases. As a result, customers could no longer use the server in question.
10 hours of downtime
The recovery process took as much as 10 hours. Primarily because customers cannot restore Azure SQL servers themselves but must leave it to Microsoft’s Azure engineers, this took quite some time, according to the tech giant.
In addition, the deleted databases had different backup configurations. Getting these backups aligned took even more recovery time.
In addition, after all the databases were back online, the full scale unit remained inaccessible to customers. This was due to complex problems with the tech giant’s Web servers. As customer traffic came online, these servers received an overload of traffic, causing them to go offline still.
Microsoft has apologized to its customers and implemented several fixes to prevent a recurrence.