Background and Impact
On July 19, 2024, starting around 5:30 UTC (early afternoon in China), Windows users worldwide began experiencing Blue Screen of Death (BSOD) errors.
The error screen displayed a “Recovery” message, indicating that Windows had not loaded correctly. Soon after, the internet was flooded with memes and jokes about blue screens, targeting Microsoft or CrowdStrike. For example, someone posted a picture purporting to show a refrigerator whose BSOD prevented it from being opened (in fact, the picture resulted from replacing the image on the display's Android system, and a malfunctioning display screen does not stop a refrigerator door from being opened).
CrowdStrike, the company responsible for this incident, is a popular American cybersecurity service provider that was founded in 2011 and went public in 2019. Its core product, Falcon, is an Endpoint Detection and Response (EDR) solution that runs on user endpoints, particularly Windows systems, to detect security threats and provide active defense capabilities.
On July 19, 2024, at 4:09 UTC, CrowdStrike released a sensor content update for the Falcon product on Windows systems. This update, part of the Falcon platform’s protection mechanism, affected Windows driver files or files that could influence the Windows kernel. Although the error was corrected by 5:27 UTC, many Windows users were unable to receive the fixed file because their machines could no longer boot past the blue screen.
The widespread adoption of the Falcon product meant that the impact was enormous, even though the issue only affected Windows users with the product installed, or more specifically, users whose systems were online and powered on between 4:09 UTC and 5:27 UTC on July 19. The issue affected hosts running Falcon sensor for Windows version 7.11 and above, impacting approximately 8.5 million Windows devices, less than 1% of Windows devices worldwide.
Various sectors were affected, including airports, hospitals, transportation, media, hotels, and restaurants. This led to airport shutdowns (American Airlines canceled 3,400 flights that day), with many passengers unable to depart on time and some even resorting to handwritten tickets.
CrowdStrike’s stock price also took a hit, dropping 11% on the day of the incident. It has since fallen about 23% in total, from $343 per share to $264 at the time of writing (July 23).
CrowdStrike’s Response
Following the incident, CrowdStrike released a temporary recovery solution through channels like Reddit, as affected users couldn’t receive the corrected configuration file while their systems were stuck at the blue screen.
The solution is:
- Restarting the system in safe mode or recovery mode
- Opening the C:\Windows\System32\drivers\CrowdStrike directory
- Finding and deleting the “C-00000291*.sys” file
- Restarting the system
However, this method did not work directly for users who had enabled Windows BitLocker, because the encrypted drive cannot be accessed from safe mode or recovery mode without the BitLocker recovery key.
That evening, CrowdStrike CEO George Kurtz issued a statement apologizing to customers and users, clarifying that Mac and Linux systems were unaffected, and that the incident was caused by a content file flaw, not a cyberattack. He also warned that someone might try to exploit this incident for malicious purposes.
Indeed, there were attempts to exploit the situation, including phishing emails disguised as CrowdStrike support, phone scams impersonating CrowdStrike employees, and fake security personnel offering false remediation advice or selling malicious programs claiming to solve the issue.
Microsoft’s Response
Although Microsoft didn’t directly cause the incident, some technical experts questioned why Windows allows third-party products to use kernel-level programs or files. They suggested adopting an approach similar to that of macOS, which doesn’t allow third-party developers to obtain kernel-level access.
In January this year, CrowdStrike CEO George Kurtz had raised similar concerns to Microsoft, pointing out that Windows system issues could put customers and the US government at risk.
Microsoft deflected the system risk issue to the EU. A spokesperson stated that, following earlier complaints and an understanding reached with the EU, Microsoft agreed in 2009 to provide system-level access to third-party companies or manufacturers, so that security software companies could work with the Windows operating system.
Reactions from Others
Kaspersky, a well-established Russian security company, posted a cheeky tweet on July 19:
“You wouldn’t see this with any of our products.”
This post was met with indignation, with people pointing out Kaspersky’s own blue screen error notifications in 2020 and 2023 caused by their products.
A person named Vincent Flibustier posted images and videos claiming to be a CrowdStrike employee fired on his first day for releasing a file update. He said the company told him not to push to production on Friday, but he believed he released the update on Thursday, making his dismissal unfair.
In reality, this person is known for fake news research, and the image he posted showed clear signs of manipulation.
Root Cause Analysis
The updated configuration file, also known as a “channel file,” is part of the Falcon sensor’s behavior protection mechanism. These files are updated multiple times daily to reflect the Tactics, Techniques, and Procedures (TTPs) newly discovered by CrowdStrike. This frequent update mechanism is why the impact on Windows systems was so rapid and widespread.
The channel file is located in Windows systems at:
C:\Windows\System32\drivers\CrowdStrike\
Channel file names begin with “C-”, and each Falcon channel file has a unique numerical identifier. In this case, the affected channel file number was 291, so its filename started with “C-00000291-” and had a “.sys” extension (CrowdStrike stated that, despite the extension, this file is not a kernel file).
The update to configuration file 291 was intended to update Falcon’s detection and response rules for controlling Named Pipes execution in Windows systems. Named Pipes are a mechanism for inter-process communication in Windows, allowing data exchange between different processes. Since some Command and Control (C2) malware uses Named Pipes to establish covert communication channels, EDR products need to monitor Named Pipe security, such as blocking the creation of malicious Named Pipes.
While some speculated that the 291 file was filled with null bytes, CrowdStrike said in its official statement that the problem was not related to null bytes. In fact, the updated file contained a logical error that led to incorrect memory allocation. The validation logic was also flawed, causing the driver to continue executing. The improper memory allocation ultimately resulted in a PAGE_FAULT_IN_NONPAGED_AREA error, preventing the system from booting normally.
When creating a Named Pipe, the system allocates a memory buffer. If the allocated memory address is unreasonable or points to a non-existent address, and the system continues to read from these non-existent or non-paged memory areas, it triggers a PAGE_FAULT_IN_NONPAGED_AREA blue screen error. Non-paged areas, which are not paged to the physical hard drive and always use physical memory, are used to run core system programs. Since the system kernel has the highest system privileges, kernel errors can have severe consequences. The blue screen error is actually a self-protection mechanism of the Windows system to prevent further damage to kernel programs and data.
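From the image above, we can see that the driver attempted to read from memory address 0x9C. An address this low is typical of dereferencing a null or uninitialized pointer plus a small offset. Windows keeps the lowest part of the virtual address space unmapped, so a kernel-mode read from 0x9C inevitably causes a system exception. (The range 0x0000–0x03FF holds the interrupt vector table only in x86 real mode, not in the virtual address space in which the Windows kernel runs.)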
In C/C++ development, great caution is required when using memory addresses, especially when using pointers. Pointer validation is crucial, for example:
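A minimal C++ sketch of such validation follows. It is not CrowdStrike's code: `RuleEntry` and `read_flags` are hypothetical names, used only to show that both the pointer and the index are checked before any dereference takes place.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical record layout, used only for illustration.
struct RuleEntry {
    std::uint32_t id;
    std::uint32_t flags;
};

// Returns the flags of entry `index`, or std::nullopt when the table pointer
// is null or the index is out of range. Nothing is dereferenced until both
// checks have passed.
std::optional<std::uint32_t> read_flags(const RuleEntry* table,
                                        std::size_t count,
                                        std::size_t index) {
    if (table == nullptr) {
        return std::nullopt;  // reject a null table pointer
    }
    if (index >= count) {
        return std::nullopt;  // reject an out-of-bounds read
    }
    return table[index].flags;
}
```

A null check alone does not prove that an address is mapped. In kernel code, the pointer and the element count must themselves come from a trusted, validated source, because there is no safety net below the kernel to catch a bad read.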
It appears that CrowdStrike’s developers made an error in pointer validation, and that this faulty file was not detected before being released to production environments.
Lessons Learned
From a software development process perspective, this problem could be solved in three stages:
1. Code Writing Stage
Techniques such as Test-Driven Development (TDD), pair programming, and peer review can help identify such flaws.
However, TDD requires writing numerous test cases, increasing overall code volume. Pair programming may seem wasteful to many companies, and peer reviews can become burdensome due to non-standard code submissions. Consequently, many companies in China neglect or disregard quality at the code writing stage.
2. Software Testing Stage
Some believe that memory faults like this are difficult to detect in automated unit and integration testing, requiring manual testing by highly skilled developers.
However, given the nature of the failure, any automated integration test environment using the same Windows system as the production environment would immediately detect the BSOD. This indicates that CrowdStrike had gaps in their testing process or didn’t conduct detailed testing for these routinely updated files.
Similar issues due to insufficient testing have occurred before. In the Cloudflare outage on July 2, 2019, for example, a ReDoS vulnerability in their WAF rules caused a 30-minute outage of their global CDN network and WAF products.
3. Change Release Stage
For security product updates affecting users globally, a canary release or gradual rollout approach should be adopted. This involves releasing the update to a small group first, monitoring its performance, then gradually expanding the release while continuously monitoring, before finally updating all users.
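The gradual rollout described above can be sketched in C++ as follows. This is an illustrative sketch under stated assumptions, not a description of how CrowdStrike or any other vendor ships updates: the names `in_rollout`, `StageResult`, and `evaluate_stage`, and the idea of bucketing hosts by a hash of a host ID, are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// 32-bit FNV-1a hash. Unlike std::hash, its result is the same on every
// platform and run, so a host always lands in the same bucket.
std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// A host receives the update when its bucket (0..99) is below the current
// rollout percentage. Raising the percentage only ever adds hosts.
bool in_rollout(const std::string& host_id, unsigned percent) {
    return fnv1a(host_id) % 100u < percent;
}

// Telemetry gathered for the hosts updated in the current stage.
struct StageResult {
    std::size_t updated;  // hosts that installed the update
    std::size_t crashed;  // of those, hosts that then reported a crash
};

enum class Decision { Advance, Hold, Halt };

// Halt as soon as the crash rate exceeds the limit, hold until enough hosts
// have reported, and advance to the next percentage otherwise.
Decision evaluate_stage(const StageResult& r,
                        std::size_t min_sample,
                        double max_crash_rate) {
    if (r.crashed > 0 &&
        static_cast<double>(r.crashed) >
            max_crash_rate * static_cast<double>(r.updated)) {
        return Decision::Halt;
    }
    if (r.updated < min_sample) {
        return Decision::Hold;
    }
    return Decision::Advance;
}
```

A release controller would step through percentages such as 1, 10, 50, and 100, calling `evaluate_stage` between steps. A crash that hits every updated host would then stop the rollout at the first stage instead of reaching all customers.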
CrowdStrike may have made the same mistake as Cloudflare in their 2019 incident by not implementing a gradual release strategy. While delaying security updates could potentially expose customers to attacks and subsequent losses, an erroneous release can cause even greater disruption and impact for all of a company’s customers.
In China, few companies and individuals use CrowdStrike products, so domestic users and businesses were largely unaffected by this incident. This is thanks to years of promotion and application of China’s own independent security vendors and product ecosystems that are more suited to national conditions. However, this doesn’t mean that having an independent product market and ecosystem can prevent similar incidents from occurring.
The root cause of this incident lies in CrowdStrike’s software quality management and implementation issues. The three aspects mentioned above may not be fully implemented in many software companies in China either. If a similar problem were to occur with a widely used software company in China, the impact could cover the entire country (for example, on September 1, 2015, an upgrade to Alibaba Cloud’s Aegis caused normal files on customer ECS instances to be quarantined).
Conclusion
It’s crucial to learn lessons and gain experience from various incidents, starting from research and development management, quality, and security.
We must make software development safer and security development more reassuring, step by step.