CrowdStrike reveals cause of global Windows blue screen problems

An ordinary bug with enormous consequences

CrowdStrike shares more details about the update bug that led to global IT chaos. A full-fledged Root Cause Analysis will follow, but a preliminary and already quite comprehensive Post Incident Review (PIR) was released today.

A routine update to a configuration file for the CrowdStrike Falcon sensor led to 8.5 million crashed Windows systems. We previously covered the impact and initial findings of this incident in detail. Now CrowdStrike is telling us exactly what went wrong and how it will improve its own processes.

Also read: CrowdStrike warns against fake recovery manual

Update to Rapid Response Content

CrowdStrike cannot simply update its Falcon sensor. This is because it runs at the kernel level, a privileged state that requires Windows Hardware Quality Labs (WHQL) validation. WHQL is Microsoft’s testing process for validating these impactful drivers. The CrowdStrike sensor itself is updated via what the company calls “Sensor Content.” This allows it to modify on-sensor AI and ML models and build out other features for the longer term. CrowdStrike emphasizes that Quality Assurance (QA) for these types of updates is significant. In addition, responsibility has always been shared with Microsoft. On top of that, Sensor Content is rolled out gradually, so any issue that manages to circumvent QA should never directly affect all users. In effect, CrowdStrike has a digital stop button available to halt the rollout if worrisome reports come in.

Friday’s IT incident came from a different kind of update, however, emanating from a new Rapid Response Content release. These updates appear much more frequently than Sensor Content, several times a month in fact, and deliver updated “Channel Files.” These enable the sensor to detect new, specific forms of malicious behaviour, allowing CrowdStrike to respond to the most current threats. Ideally, the end user doesn’t notice any of this.

Data from the cloud (via the Content Configuration System) ends up in the Channel Files. These files store and validate Template Instances. Again, CrowdStrike emphasizes that any errors are handled “gracefully” thanks to safeguards in the Content Interpreter. On paper, then, CrowdStrike had everything locked down to prevent large-scale problems.
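To give a sense of what “graceful” handling of bad content looks like in practice, here is a minimal C++ sketch of a parser that validates a record before touching it and returns an error instead of reading memory it was never given. The record layout and names are hypothetical and purely illustrative; this is not CrowdStrike’s actual Content Interpreter.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical record layout, for illustration only: a 4-byte template id,
// a 4-byte payload length, then the payload itself.
struct ContentRecord {
    uint32_t template_id;
    std::vector<uint8_t> payload;
};

// Returns std::nullopt on malformed input instead of crashing.
std::optional<ContentRecord> ParseRecord(const uint8_t* data, size_t size) {
    constexpr size_t kHeaderSize = 8;
    // Graceful handling: reject anything too short to contain a header,
    // rather than reading memory that was never supplied.
    if (data == nullptr || size < kHeaderSize) {
        return std::nullopt;
    }

    uint32_t id = 0;
    uint32_t payload_len = 0;
    std::memcpy(&id, data, sizeof(id));
    std::memcpy(&payload_len, data + sizeof(id), sizeof(payload_len));

    // Bounds check before touching the payload.
    if (payload_len > size - kHeaderSize) {
        return std::nullopt;
    }

    std::vector<uint8_t> payload(data + kHeaderSize,
                                 data + kHeaderSize + payload_len);
    return ContentRecord{id, std::move(payload)};
}
```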

Best practices deployed, but…

In the run-up to the global IT failure, then, things went smoothly for a long time. The Falcon sensor has been around since 2013, and recent updates have gone smoothly as well, CrowdStrike claims. There is evidence to the contrary, as it happens: several Linux distributions reportedly faced a similar incident in April. Regardless: in February, March and April, updates to various CrowdStrike components went out without widespread problems and with all the required green checkmarks from the control mechanisms already in place.

An emerging attack technique targeting internal operating system communications forced CrowdStrike to launch a new “Template Type.” Two new Template Instances based on this type appeared on July 19, both of which were validated. The new Template Type itself had previously been rolled out successfully. However, due to a bug in CrowdStrike’s Content Validator, one of the July 19 updates passed this digital checkup when it should have been rejected.
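As an illustration of how a validator bug can let flawed content through, consider this heavily simplified C++ check that only verifies the shape of a Template Instance and never inspects the values themselves. It is a generic, hypothetical sketch of a validation gap, not a reconstruction of CrowdStrike’s actual Content Validator.

```cpp
#include <string>
#include <vector>

// Hypothetical, illustrative representation of a template instance.
struct TemplateInstance {
    std::vector<std::string> fields;
};

// A "shape only" check: it confirms the instance has the expected number of
// fields, but never verifies that each field's value is something the sensor
// can actually act on. Content with a valid shape but a bogus value would
// sail straight through a check like this.
bool ValidateInstance(const TemplateInstance& instance, size_t expected_fields) {
    return instance.fields.size() == expected_fields;
}
```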

The successful debut of the new Template Type, confidence in the Content Validator’s output and previous successful Instance deployments led to the final, fatal deployment. CrowdStrike confirms what many had already suggested: the Windows failure was due to the Falcon sensor requesting a memory address that did not exist. Dereferencing a “null pointer” is a classic, common programming error, one that can occur in C++, as it did here, or in most other languages that operate close to the system.
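For readers less familiar with the term, the snippet below shows the class of error in question: dereferencing a pointer that does not refer to valid memory, alongside the defensive check that prevents it. In user mode such a read merely crashes one process; in kernel mode, where the Falcon sensor runs, Windows halts with a blue screen. The struct and function names are invented for illustration.

```cpp
#include <cstdio>

struct Config {
    int threshold;
};

int ReadThreshold(const Config* config) {
    // Unsafe: if config is nullptr, this dereference is undefined behaviour
    // and will typically crash the program.
    return config->threshold;
}

int ReadThresholdSafely(const Config* config, int fallback) {
    // Defensive variant: check the pointer before using it.
    return config != nullptr ? config->threshold : fallback;
}

int main() {
    Config c{42};
    std::printf("%d\n", ReadThresholdSafely(&c, 0));       // prints 42
    std::printf("%d\n", ReadThresholdSafely(nullptr, 0));  // prints 0, no crash
    // ReadThreshold(nullptr); // would dereference a null pointer and crash
}
```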

Prevention is better than cure

A subtle error with gigantic consequences, then, and one that is prompting CrowdStrike to make major changes to its own policy. Rapid Response Content testing is getting a revamp with six new checks, mostly familiar best practices from the developer community such as rollback testing, fuzzing and stability testing. The Content Validator bug is said to have been fixed by now. In addition, the Content Interpreter, which reads information from Channel Files, should be better at catching erroneous information from now on. With these changes, CrowdStrike is trying to ensure this never happens again.
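Fuzzing, one of those newly added checks, is a well-established technique: a tool feeds a parser enormous volumes of random and mutated input and flags anything that makes it crash or trip a sanitizer. Below is a minimal libFuzzer-style harness for the hypothetical ParseRecord() sketch shown earlier; the header name and build command are assumptions for illustration, not part of CrowdStrike’s actual test suite.

```cpp
#include <cstddef>
#include <cstdint>

// Assumes the ParseRecord() sketch from earlier is available via this
// hypothetical header; the harness itself is a standard libFuzzer entry point.
#include "channel_file_parser.h"

// Build with: clang++ -g -O1 -fsanitize=fuzzer,address fuzz_parser.cc
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    // The parser must handle arbitrary bytes without crashing or reading
    // out of bounds; the fuzzer hunts for inputs that break that promise.
    ParseRecord(data, size);
    return 0;
}
```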

Should all other safeguards fail, an outage of this size wouldn’t reappear, if CrowdStrike is to be believed. This is because Rapid Response Content will roll out gradually from now on. “Canary deployment” gives a small group of users first access to the latest updates, which (as is now evident) may contain errors. Users will additionally have more control over when updates occur. From now on, it will be up to IT teams to determine how eagerly they want to take advantage of up-to-date protection. Presumably this will be a big point of discussion with their management teams, who, like the rest of the world, couldn’t help but notice Friday’s IT perils. Finally, release notes will clarify exactly what CrowdStrike is defending against with each new update.
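Canary deployment itself is a generic pattern rather than anything CrowdStrike-specific. The sketch below shows one common way to gate such a rollout: hash each host into a fixed bucket and only offer the update to hosts below the current rollout percentage, widening that percentage only when the earlier cohorts report no problems. All names and thresholds are illustrative assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Deterministically map a host to a 0-99 bucket, so the same machine always
// lands in the same cohort, and gate the update on the rollout percentage.
bool HostIsInRollout(const std::string& host_id, unsigned rollout_percent) {
    size_t bucket = std::hash<std::string>{}(host_id) % 100;
    return bucket < rollout_percent;
}

// Example: start at 1% (the canary group), then widen to 10%, 50% and 100%
// only if no problems are reported at each stage.
```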

Speed over security

The extensive checks on the aforementioned Sensor Content are not only logical, but simply mandatory: Microsoft does not allow anyone to release kernel-level Windows drivers without a WHQL check. Yet CrowdStrike chose to push its other updates via Rapid Response Content faster than more stringent checks would allow. This usually went well, in part because controls were already in place that generally worked properly.

That was until a single previously undiscovered bug in the Content Validator threw a spanner in the works. In fact, the only direct cause CrowdStrike cites is this slight software error, one that could have happened to anyone. The company stresses that other tools, such as the Content Interpreter, ought to have been mature enough to recognize bad input. Those tools failed to pick up the slack as well. The implication is that this kind of incorrect input simply wasn’t covered.

The laundry list of newly introduced checks shows that validation of Rapid Response Content was simply too light. CrowdStrike has long done just fine with its old way of working, which may also have given the company a competitive advantage by shipping updates faster than others. That ties in with its claim that Falcon detects threats six to eleven times faster than competing solutions. Criticism of CrowdStrike’s update speed, over which users had no control, is now obvious. Users can finally determine when updates occur, so CrowdStrike’s need for speed no longer sets the agenda. In addition, more information will be available about which threats the company fends off better with each update. These things should have already been in place, but CrowdStrike seems to have figured that out by now, too.

Also read: Fake CrowdStrike fixes spread malware