What didn’t CrowdStrike Reveal in their Incident Root Cause Analysis on the Outage?

Overview

In the rapidly evolving landscape of cybersecurity, even industry leaders aren’t immune to critical flaws. CrowdStrike, a name synonymous with cutting-edge protection, recently faced a significant challenge when a routine update led to widespread system crashes. But beyond the immediate disruption, this incident reveals deeper, more troubling issues within their security architecture. As we peel back the layers of this event, the question looms: Is your trust in cybersecurity giants like CrowdStrike misplaced? This analysis uncovers the technical missteps, persistent vulnerabilities, and the potentially far-reaching consequences for organizations relying on their solutions. If you think your defenses are impenetrable, think again: this is a wake-up call you can’t afford to ignore. What didn’t CrowdStrike reveal? Let’s review CrowdStrike’s “External Technical Root Cause Analysis — Channel File 291” and see what it tells us about the incident’s root causes, technical specifics, and broader implications for users and the cybersecurity community.

What Happened in this “291 Incident” Report?

On July 19, 2024, CrowdStrike released a sensor configuration update for Windows systems. This update caused Windows hosts to crash with Blue Screen of Death (BSOD) errors, leading to significant disruptions. Here is a quick summary:

  • Background on CrowdStrike Falcon Sensor:

    • The CrowdStrike Falcon sensor uses AI and machine learning models to protect systems by detecting and responding to threats. These models are regularly updated based on threat intelligence from various sources.
  • Introduction of New Detection Capability:

    • In February 2024, a new detection feature was added to the Falcon sensor (version 7.11) to identify threats exploiting Windows interprocess communication (IPC) mechanisms, such as named pipes. This feature was built on a new IPC Template Type that defined 21 input parameter fields.
  • The Issue with Parameter Mismatch:

    • Template Instances for the new feature were delivered via Channel File 291 and were defined against all 21 input parameter fields. However, the sensor code that invokes the Content Interpreter supplied only 20 input values. This discrepancy was not caught during multiple testing phases because the tests used wildcard matching for the 21st field, so the missing value was never read.
  • Deployment of Problematic Content:

    • On July 19, 2024, new instances of the IPC template were deployed, and one of these instances required a non-wildcard match for the 21st parameter, triggering the issue. The sensor’s Content Interpreter had been given only 20 input values, so comparing against the 21st resulted in an out-of-bounds memory read, leading to system crashes (BSOD errors).
  • Resulting System Crashes:

    • The error occurred when the sensor tried to read the 21st input parameter, which had never been supplied, leading to system crashes on affected Windows hosts.
  • Root Cause and Summary:

    • The crash was caused by a combination of the input mismatch, an existing out-of-bounds read issue in the Content Interpreter, and the lack of testing for this specific scenario. CrowdStrike says it has since taken steps to prevent such issues from recurring by enhancing the testing and validation applied to new IPC Template Instances.
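
The failure mode summarized above can be illustrated with a minimal Python sketch. This is not CrowdStrike's code: the function `matches`, the field values, and the `"pipe-name"` criterion are hypothetical, and only the counts (21 template fields, 20 supplied inputs) come from the report. In Python the unchecked read raises `IndexError`; in kernel-mode native code the equivalent read touches invalid memory and crashes the host.

```python
WILDCARD = "*"
TEMPLATE_FIELD_COUNT = 21  # fields defined by the IPC Template Type (from the report)
SENSOR_INPUT_COUNT = 20    # values the sensor code actually supplied (from the report)


def matches(criteria, inputs):
    """Return True when every non-wildcard criterion equals its input.

    Hypothetical stand-in for the interpreter's matching loop. It has no
    bounds check, mirroring the defect described in the report.
    """
    for index, expected in enumerate(criteria):
        if expected == WILDCARD:
            continue  # wildcard: the input at this index is never read
        if inputs[index] != expected:  # unchecked read
            return False
    return True


inputs = [f"value{i}" for i in range(SENSOR_INPUT_COUNT)]

# Instance used in testing: the 21st field is a wildcard, so it is never read.
wildcard_instance = [WILDCARD] * TEMPLATE_FIELD_COUNT
wildcard_result = matches(wildcard_instance, inputs)

# Instance like the one deployed on July 19: a concrete criterion on field 21.
strict_instance = [WILDCARD] * 20 + ["pipe-name"]
try:
    matches(strict_instance, inputs)
    outcome = "no fault"
except IndexError:
    outcome = "out-of-bounds read"
```

The sketch shows why wildcard-only testing could not surface the defect: the faulty read happens only when a concrete criterion forces the interpreter to look at the 21st value.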

Channel File 291 Incident Findings and Mitigations According to the CrowdStrike Report

  • Mismatch in IPC Template Fields:

    • Finding: The IPC Template Type in Channel File 291 expected 21 input fields, but the sensor only supplied 20, leading to a mismatch not caught during testing.
    • Mitigation: Added validation during sensor compile time to ensure the number of input fields matches the new template expectations.
  • Missing Runtime Array Bounds Check on Channel File 291:

    • Finding: The Content Interpreter tried to access a 21st field that didn’t exist, causing out-of-bounds memory access.
    • Mitigation: Implemented bounds checks on the Content Interpreter for Rapid Response Content in Channel File 291 to prevent out-of-bounds reads and ensured the input array matches expected inputs.
  • Inadequate Template Type Testing:

    • Finding: Testing did not include cases with non-wildcard matching criteria, which led to missing the out-of-bounds error.
    • Mitigation: Expanded automated tests to include non-wildcard criteria and additional scenarios that better reflect production use.
  • Logic Error in Content Validator:

    • Finding: The Content Validator incorrectly allowed the mismatch between expected and provided inputs.
    • Mitigation: Added new checks in the Content Validator to prevent such mismatches and restrict wildcard usage in problematic fields on Channel 291.
  • Limited Validation of Template Instances:

    • Finding: Stress testing missed the mismatched input issue, which led to system crashes in production.
    • Mitigation: Updated test procedures to ensure each Template Instance is thoroughly tested before deployment.
  • Need for Staged Deployment:

    • Finding: A lack of staged deployment led to the widespread impact of the issue.
    • Mitigation: Implemented staged deployment with additional deployment layers and acceptance checks, giving customers more control over Rapid Response Content updates.
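
Two of the mitigations listed above, the Content Validator check and the runtime bounds check, can be sketched in Python as follows. This is an illustration under stated assumptions, not CrowdStrike's implementation: `ValidationError`, `validate_instance`, `read_input`, and `safe_matches` are hypothetical names introduced here.

```python
WILDCARD = "*"


class ValidationError(Exception):
    """Raised when a template instance must not be shipped (hypothetical)."""


def validate_instance(criteria, supplied_input_count):
    """Pre-deployment check: reject any instance whose non-wildcard
    criteria reference a field the sensor does not supply."""
    for index, expected in enumerate(criteria):
        if index >= supplied_input_count and expected != WILDCARD:
            raise ValidationError(
                f"field {index + 1} has a concrete criterion, but only "
                f"{supplied_input_count} inputs are supplied"
            )


def read_input(inputs, index):
    """Runtime check: return None instead of reading past the array."""
    if 0 <= index < len(inputs):
        return inputs[index]
    return None


def safe_matches(criteria, inputs):
    """Matching loop that treats a missing input as a non-match."""
    for index, expected in enumerate(criteria):
        if expected == WILDCARD:
            continue
        actual = read_input(inputs, index)
        if actual is None or actual != expected:
            return False
    return True


inputs = [f"value{i}" for i in range(20)]
strict_instance = [WILDCARD] * 20 + ["pipe-name"]

try:
    validate_instance(strict_instance, len(inputs))
    rejected = False
except ValidationError:
    rejected = True

runtime_result = safe_matches(strict_instance, inputs)
```

The two checks are deliberately redundant. The validator stops a bad instance before it ships, and the bounds check keeps the host running if one ships anyway.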

What CrowdStrike Claims They Fixed – Channel File 291

According to the CrowdStrike Incident Root Cause Analysis report, CrowdStrike has implemented several measures to address the issues identified during the incident and its review of the Falcon sensor code. They claim these fixes have made their systems safer and more reliable:

  1. Enhanced Testing Procedures: CrowdStrike stated, “We have augmented our testing procedures to include additional layers of validation, particularly focusing on compatibility checks with various operating systems.” The aim is to catch potential conflicts before deployment and so prevent similar issues in the future.

  2. Update Deployment Process: The report mentions improvements in the deployment process: “We have refined our update deployment process to include staged rollouts, reducing the impact of any unforeseen issues by allowing for smaller, controlled groups to receive updates initially.”

  3. Incident Response Enhancements: CrowdStrike claims, “Our incident response protocols have been updated to ensure faster detection and resolution of any issues, with enhanced monitoring and automated rollback capabilities.”
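
To make the staged rollout idea concrete, here is a minimal Python sketch of a ring-based deployment that halts when a ring's failure rate exceeds a threshold. The ring sizes, the 1% threshold, and the names `staged_rollout` and `apply_update` are assumptions made for illustration; the report does not specify them. The sketch only halts the rollout and does not model rollback.

```python
def staged_rollout(hosts, cumulative_fractions, apply_update, max_failure_rate=0.01):
    """Deploy ring by ring and stop as soon as a ring looks unhealthy.

    cumulative_fractions: share of the fleet covered after each ring,
        for example (0.01, 0.10, 1.0).
    apply_update: callable that returns True if the host is healthy
        after the update (hypothetical health probe).
    """
    attempted = 0
    for fraction in cumulative_fractions:
        # Each ring must contain at least one new host.
        end = min(len(hosts), max(attempted + 1, round(len(hosts) * fraction)))
        ring = hosts[attempted:end]
        if not ring:
            break  # the whole fleet has already been covered
        failures = sum(1 for host in ring if not apply_update(host))
        attempted = end
        if failures / len(ring) > max_failure_rate:
            return {"attempted": attempted, "halted": True}
    return {"attempted": attempted, "halted": False}


fleet = [f"host-{i}" for i in range(1000)]

# A faulty update that crashes every host it touches.
faulty = staged_rollout(fleet, (0.01, 0.10, 1.0), lambda host: False)

# A healthy update that succeeds everywhere.
healthy = staged_rollout(fleet, (0.01, 0.10, 1.0), lambda host: True)
```

With these assumed ring sizes, a universally faulty update reaches 10 of 1,000 hosts before the rollout stops, rather than the whole fleet at once.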

How They Claim Users Are Safer

CrowdStrike asserts that these changes make their platform more resilient and secure. They state, “With these enhancements, our customers can be assured of improved stability and security in their cybersecurity defenses. The augmented testing and deployment procedures significantly reduce the likelihood of similar incidents.”

Here is the CrowdStrike Narrative of What Technically Happened – Channel File 291 Incident

The root cause was a defect in a content update for the CrowdStrike Falcon® sensor. This update introduced a configuration that conflicted with the sensor code running on Windows hosts, causing them to crash. The flaw lay in the mismatch between the 21 input fields the IPC Template Type defined and the 20 values the sensor supplied, a path that testing had never exercised with non-wildcard matching criteria. This is the “Channel File 291” issue.

Comparison with Our Root Cause Analysis of CrowdStrike’s Security Flaws and Future Risks of What Happened

Claim 1: Injection of Unvalidated Code into Kernel

CrowdStrike’s recent incident has highlighted several concerning practices regarding their update process. Our analysis indicates that CrowdStrike injects code dynamically into the kernel, bypassing the security controls established by Microsoft’s Kernel Code Signing program. They refer to this code as “content” to sidestep EV code signing guidelines, thus avoiding the rigorous validation typically required by Microsoft.

  • Evidence: The root cause analysis from CrowdStrike acknowledges that the content update was pushed without adequate validation, leading to system crashes (BSOD) on Windows hosts. This indicates a lapse in adhering to secure coding and update practices, as they bypassed kernel-level security protocols meant to prevent such incidents.

Claim 2: Lack of Validation of Injected Code

The incident report confirms that the “content” injected into the kernel was not properly validated. This oversight allowed for a configuration error to disrupt systems widely.

  • Evidence: The report states that the update process failed to catch a critical configuration error, which should have been detected through rigorous validation checks before deployment. This lack of proper validation is a significant security flaw, as it opens the door for potential exploitation by malicious actors.

Future Risks and Security Implications

The CrowdStrike architecture that allows for bypassing Kernel Code Signing Security controls can potentially lead to Remote Code Execution (RCE) and Local Privilege Escalation (LPE) vulnerabilities. This poses a serious risk to users of CrowdStrike’s products, as it undermines the fundamental security mechanisms put in place by the operating system vendor (Microsoft).

  • RCE and LPE Risks: By injecting unvalidated code into the kernel, CrowdStrike’s approach can be exploited to execute arbitrary code with high-level privileges. This could allow attackers to take control of affected systems, exfiltrate sensitive data, or disrupt critical operations.

SEC Reporting Requirements

Given the potential severity of these vulnerabilities, continued use of CrowdStrike products under these conditions could be considered a material cybersecurity risk. As per SEC rules, such material cybersecurity incidents must be reported, as they could significantly impact the company’s operational integrity and shareholder value.

  • Material Cybersecurity Incidents: The SEC requires companies to disclose significant cybersecurity incidents that could affect their financial performance. The described vulnerabilities in CrowdStrike’s update mechanism certainly qualify as they pose a serious threat to the security of systems using their products.

Criticality Score

On a criticality scale, this incident rates high due to the widespread impact on business operations and the potential security risks associated with system downtime. With the impact currently estimated at between $2B and $5B and class action lawsuits against CrowdStrike lining up, this will be a huge shakeup for the industry.

Company Root Cause Analysis Response

CrowdStrike promptly acknowledged the issue and deployed a fix. They have since enhanced their testing procedures, update deployment processes, and incident response protocols to prevent future occurrences for CrowdStrike customers.

Why it Matters

What didn’t CrowdStrike reveal? Our analysis shows that CrowdStrike’s update process still involves injecting unvalidated code into the kernel. This practice bypasses critical security checks, posing a serious risk of RCE and LPE vulnerabilities. Such vulnerabilities could be exploited by attackers to gain control over affected systems, exfiltrate data, or disrupt operations. The lack of a clear commitment to having updates fully validated through Microsoft’s processes means these risks persist.

Updated Mitigation Strategies if you plan to continue to use CrowdStrike Falcon Sensor

  1. Rigorous Code Validation: Ensure all updates, especially those involving kernel-level modifications, undergo thorough validation according to Microsoft’s Kernel Code Signing program.

  2. Third-Party Audits: Engage independent security firms to audit update processes and validate the security of deployed code.

  3. Enhanced Incident Monitoring: Implement advanced monitoring tools to detect any anomalies in update deployment and rollback swiftly in case of issues.

  4. User Contingency Planning: Users should develop and maintain robust incident response plans, including quick rollback options and backup protocols, to mitigate risks from faulty updates.

Updated Strategic Truths

  1. Strategic Trust: While CrowdStrike released this after-action report on the incident, it did not examine the incident from a first-principles viewpoint, nor did it acknowledge that it has an architectural flaw and is circumventing the EV code signing requirements to inject “content” directly into the kernel.
  2. Organizational Leadership (e.g., Board, CEO, CSO or CISO, CIO): All are at risk if they do not report the vulnerabilities and systemic risks, linked directly to the Falcon sensor, that they are accepting.
  3. Trust in Automated Systems Is Not Absolute: Even with robust protocols, human oversight and validation remain critical in cybersecurity.
  4. Transparency Is Crucial: Vendors must be transparent about their processes, especially when they involve bypassing standard security controls.
  5. Continuous Improvement Is Essential: Cybersecurity is an evolving field, and vendors must continuously refine their practices to address emerging threats and vulnerabilities.

Conclusion

Despite CrowdStrike’s improvements, significant risks remain due to their practice of injecting unvalidated code into the kernel. This ongoing vulnerability undermines the security of systems using their products, presenting serious risks of RCE and LPE. CrowdStrike must address these flaws and align with best practices to ensure their solutions are secure and reliable. Continued use of their products without these corrections should be reported to the SEC, as it constitutes a material cybersecurity risk. This is what CrowdStrike didn’t reveal.

For more details, you can review the full CrowdStrike Root Cause Analysis report.