Lessons from the Crowdstrike Outage: The Importance of Robust Edge Orchestration

Younes Khadraoui

Published on July 27, 2024


Last week, a massive outage affected over 8 million computers worldwide running Crowdstrike. The crash disrupted thousands of businesses and organizations, causing widespread chaos: IT systems went down, flights were canceled, banks halted transactions, factories stopped production, retail shops ceased operations, and companies came to a standstill. This incident shows the challenges companies face with large-scale application deployments and highlights the urgent need for robust edge orchestration to prevent such widespread disruptions and ensure operational continuity.

The issue was traced to a defective update pushed by Crowdstrike, which caused millions of computers around the world to crash with a blue screen of death. Like Crowdstrike, most software and AI applications deployed across distributed devices are regularly updated to improve performance, accuracy, and user experience, among other reasons. Consequently, they are susceptible to similar problems:

  • Tight integration with the Operating System

    The Crowdstrike tool is tightly integrated with the OS, so any bug or crash it introduces has a direct impact on the entire system. Similarly, many software and AI applications are deployed and run directly on host systems, which means that a single faulty application can crash the whole machine. It also means developers must thoroughly test their software on every system where it will be deployed to ensure stability and compatibility.

  • Difficulty rolling back to a previous working version

    Just like Crowdstrike, many software and AI applications lack clear and straightforward rollback processes. When an update causes issues, the absence of a robust rollback mechanism can lead to prolonged downtime and increased operational challenges. This deficiency underscores the need for comprehensive version control and rollback strategies in software deployment. Implementing such strategies would allow organizations to swiftly return to a known good state, minimizing downtime and mitigating the impact on business operations.

  • Poor orchestration & update scheduling

    The Crowdstrike update was pushed to all devices around the world simultaneously. Given the potential impact of a faulty update across so many devices, a more gradual and customized orchestration is necessary. Deploying an update to a small subset of devices first (known as a canary deployment in software release practice) confirms that it works correctly before a full rollout. Other types of software can have different but equally significant impacts. Imagine a point-of-sale application crashing in thousands of stores and halting business operations, or an AI video surveillance application failing while it secures sensitive sites. Just like Crowdstrike, these applications require careful deployment strategies to prevent widespread disruption.

  • Difficulty troubleshooting remotely

    Once the issue was identified, Crowdstrike promptly released a corrected update and a workaround to recover the affected systems. However, the corrected update only helped systems that had not already crashed, and the workaround required physical access to each affected device. It took companies days to work through all their computers, and many are still struggling with the issue at the time of writing. This situation clearly demonstrates the importance of having an efficient and secure tool for remotely troubleshooting devices and applications deployed at the Edge.
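
To make the canary and rollback ideas above concrete, here is a minimal sketch in Python of a staged rollout. It is an illustration only, not Crowdstrike's or Namla's implementation: the Device class, the staged_rollout function, the health check callback, and the stage fractions (5%, 25%, 100%) are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Device:
    """Hypothetical edge device that remembers previously installed versions."""
    name: str
    version: str
    history: List[str] = field(default_factory=list)

    def install(self, version: str) -> None:
        # Keep the current version so it can be restored later.
        self.history.append(self.version)
        self.version = version

    def rollback(self) -> None:
        # Restore the most recent previous version, if there is one.
        if self.history:
            self.version = self.history.pop()


def staged_rollout(
    devices: Sequence[Device],
    new_version: str,
    is_healthy: Callable[[Device], bool],
    stages: Sequence[float] = (0.05, 0.25, 1.0),  # assumed wave sizes
) -> bool:
    """Update devices in growing waves; undo everything on the first failure."""
    updated: List[Device] = []
    for fraction in stages:
        # Cumulative number of devices that should be updated after this wave.
        target = max(1, int(len(devices) * fraction))
        wave = devices[len(updated):target]
        for device in wave:
            device.install(new_version)
            updated.append(device)
        # Stop before the next wave if any device in this wave is unhealthy.
        if not all(is_healthy(device) for device in wave):
            for device in updated:
                device.rollback()
            return False
    return True
```

With 20 devices, the first wave touches a single canary, the second brings the total to five, and only the last wave reaches the whole fleet. If a health check fails in an early wave, the devices in later waves are never updated, and the ones already updated return to their previous version.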

Namla Cloud Native Edge Orchestration & Management

Namla adopts a Cloud Native Edge approach by leveraging Kubernetes as its underlying orchestration framework. This enables companies and organizations to benefit from a cloud-native environment where they can deploy container- and VM-based applications, effectively decoupling them from the OS and ensuring complete isolation between applications. This isolation prevents the entire system from being impacted if one application crashes. Additionally, Namla Edge components themselves are container-based agents, allowing them to be deployed on any underlying system without impacting the system’s stability.


Kubernetes' robust orchestration capabilities enable the scheduling of complex deployment policies, enhancing reliability and scalability. Alongside Namla's GitOps architecture, this approach provides full traceability of updates and configuration changes across the entire infrastructure. This traceability allows for easy identification of faulty updates and facilitates seamless rollbacks to previous stable versions, ensuring minimal downtime and maintaining system integrity.
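
As a rough illustration of why a traceable change history makes rollbacks easier, the sketch below models a deployment log and looks up the last version of an application that passed its health checks. It does not reflect Namla's actual data model or API: the Change record, its fields, and the last_known_good function are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Change:
    """Hypothetical entry in a GitOps-style deployment log."""
    revision: str  # e.g. a Git commit id (illustrative values only)
    app: str
    version: str
    healthy: bool  # outcome of post-deployment checks


def last_known_good(log: List[Change], app: str) -> Optional[Change]:
    """Return the most recent healthy change for an app, or None if there is none."""
    # The log is assumed to be ordered from oldest to newest.
    for change in reversed(log):
        if change.app == app and change.healthy:
            return change
    return None


log = [
    Change("a1f3", "pos-app", "1.4.0", healthy=True),
    Change("b7c2", "camera-ai", "2.0.1", healthy=True),
    Change("c9d8", "pos-app", "1.5.0", healthy=False),  # the faulty update
]

target = last_known_good(log, "pos-app")
print(target.version if target else "no healthy version recorded")
```

Because every change is recorded with its revision and outcome, the faulty update is easy to spot and the rollback target can be found with a simple lookup rather than guesswork.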

Namla provides a secure management channel outside of Kubernetes, along with a remote terminal on each Edge device. This setup allows for quick and secure connections to troubleshoot any device system, even if its Kubernetes status is down. Additionally, Namla offers a remote terminal inside each container, enabling deeper troubleshooting of the application itself remotely.

The Crowdstrike event is likely just the tip of the iceberg: it drew attention because of its scale and its impact on everyday life. Many similar, smaller incidents probably affect businesses without making headlines, because they lack a clear strategy for deploying software at the distributed Edge.

As AI is increasingly deployed at the Edge for critical applications such as healthcare and public safety, CTOs and Heads of IT in large corporations, as well as System Integrators and MSPs, must begin integrating Edge orchestration and management at the infrastructure level and adopting cloud native architectures.

By using Namla, companies benefit from Kubernetes to deploy and orchestrate applications at the Edge while maintaining the ability to control and remotely access their devices and deployed applications for quick troubleshooting. This approach allows them to deploy AI and software at scale, securely push updates, and quickly roll back to previous working versions if issues arise, ensuring minimal disruption and maintaining operational stability.