
Microsoft uses unsupervised learning to collect data about disruptions to cloud services. According to the researchers, the approach removes the need to annotate large amounts of training data.

Structured data is very valuable, especially in the field of cloud services. Not only can this data be used to build AI models aimed at assessing urgency (triage), but it is also useful for automating certain processes for engineers.

The SoftNER framework (short for Software artifact Named-Entity Recognition) collects knowledge by parsing unstructured data, detecting entities in outage descriptions and categorizing them. The framework identifies structural patterns in the descriptions and applies bootstrapping, label propagation and a multi-task model to turn the text into structured data.

Data de-noising

SoftNER starts each run by de-noising the collected data (descriptions, shell scripts, etc.) to get rid of useless content and markup such as HTML tags. Each description is then segmented into sentences, and the sentences are split into tokens.
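The de-noising, sentence segmentation and tokenization steps can be illustrated with a minimal Python sketch. This is not SoftNER's code: the function names `denoise`, `segment` and `tokenize` and the regular expressions are assumptions made for illustration, and a production pipeline would likely use a proper HTML parser and tokenizer.

```python
import html
import re

# Illustrative patterns only; the article does not describe SoftNER's actual rules.
TAG_RE = re.compile(r"<[^>]+>")                   # anything that looks like an HTML tag
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")    # whitespace that follows ., ! or ?


def denoise(raw: str) -> str:
    """Remove HTML tags, decode HTML entities and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> list[str]:
    """Split cleaned text into sentences at end punctuation followed by whitespace."""
    return [s for s in SENTENCE_END_RE.split(text) if s]


def tokenize(sentence: str) -> list[str]:
    """Split on whitespace and trim trailing punctuation, keeping dotted values intact."""
    tokens = (tok.rstrip(".,;:!?") for tok in sentence.split())
    return [tok for tok in tokens if tok]


raw = "<p>Disk &amp; network errors in West US. Source IP 10.0.0.5 timed out!</p>"
sentences = segment(denoise(raw))
tokens = [tokenize(s) for s in sentences]
```

Splitting sentences only where punctuation is followed by whitespace keeps values such as `10.0.0.5` in one piece, which matters for the tagging step that comes next.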

After this, the framework performs entity tagging (problem types, locations, etc.) and data-type tagging (for IP addresses, URLs and more). These labels are then propagated to matching entities elsewhere in the data: if the IP address ‘127.0.0.1’ is tagged as a ‘source IP’, all other occurrences of ‘127.0.0.1’ are tagged as a ‘source IP’ as well.

96 percent accurate

Researchers tested the framework on 41,000 disruptions at Microsoft over a period of two months. The average description had 472 words, and the framework managed to extract 77 valid entities for every 100 descriptions with an accuracy of more than 96 percent. According to the research team, it is accurate enough to function as a triage tool at Microsoft.

The team plans to use SoftNER in the future to evaluate bug reports and improve existing incident reporting and management tools. “Incident management is a key part of building and operating large-scale cloud services,” the researchers wrote in the paper. “We show that the extracted knowledge can be used for building significantly more accurate models for critical incident management tasks.”