Navigating the Largest IT Outage in History: Lessons from the CrowdStrike Incident
Posted 26 Jul at 8:19 pm in Business Continuity, Business, Security
Last week, CrowdStrike, a leading cybersecurity firm, triggered the largest IT outage in history. The cause was not a hacker group but a faulty update that the company pushed out.
The extent of the disruption is worth recognizing: it affected an estimated 8.5 million Windows PCs across sectors and organizations of every size. Small businesses, large multinational corporations, government agencies, hospitals, and critical infrastructure all encountered the infamous blue screen of death.
The massive impact stems from Windows being the world’s most widely used operating system and CrowdStrike being one of the most popular endpoint security tools. A faulty update from CrowdStrike therefore can, and did, have widespread consequences.
Even those who were not CrowdStrike customers felt the impact, with flights delayed and canceled, gas stations and grocery stores unable to complete transactions, and critical services like police and fire dispatch delayed.
While we are still learning details about the events, there are a few key takeaways from this global outage.
No Single Solution is Bulletproof:
Even top-tier cybersecurity firms like CrowdStrike, employing some of the brightest minds in the industry, can face unexpected challenges. Endpoint Detection and Response (EDR) tools like CrowdStrike’s Falcon are critical in cyber defense because they respond immediately, suspending services when malicious activity is detected. However, the outage highlights that deploying them, or any software solution for that matter, is not without risk. Any vendor that claims to be immune to such disruptions is simply not being truthful.
Incident Response Plans Must Include All Potential Risks:
The CrowdStrike event demonstrates that major outages and disruptions can come from many sources, not just cyberattacks. Business leaders need to ensure their incident response plans encompass all possible risks, including system failures and human errors, and should think outside the box to uncover other unforeseen vulnerabilities.
Every incident response plan should include a business continuity strategy. How will your organization conduct business as usual if such an event occurs? And if you can’t conduct business as usual, how do you get at least the most critical components of your business up and running temporarily?
Regularly revisiting and testing these plans is essential for responding swiftly and effectively and for minimizing damage and downtime. Training employees to recognize and react to different types of incidents is also crucial in maintaining operational resilience.
Although automated mechanisms that can roll back bad updates exist, many companies sidestep them because they can require a lot of space. Companies should update or invest more in monitoring tools, potentially bringing AI-based tools into the mix, to test for and detect such issues.
Having redundant systems that let you switch to a backup quickly is key.
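For readers who want to see what an automated rollback can look like in practice, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how CrowdStrike or any particular product handles updates: the `Host` class, the `staged_rollout` function, and the `is_healthy` check are all hypothetical names invented for this example. The idea is to update a small "canary" group first, check its health, and roll back before the rest of the fleet is ever touched.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Host:
    """Hypothetical stand-in for a managed machine (not a real product API)."""
    name: str
    version: str
    history: List[str] = field(default_factory=list)

    def install(self, version: str) -> None:
        # Remember the previous version so it can be restored later.
        self.history.append(self.version)
        self.version = version

    def roll_back(self) -> None:
        # Restore the most recent previous version, if there is one.
        if self.history:
            self.version = self.history.pop()


def staged_rollout(
    hosts: List[Host],
    new_version: str,
    is_healthy: Callable[[Host], bool],
    canary_count: int = 1,
) -> bool:
    """Update a small canary group first; roll back and stop on failure."""
    canaries, rest = hosts[:canary_count], hosts[canary_count:]

    for host in canaries:
        host.install(new_version)

    if not all(is_healthy(h) for h in canaries):
        for host in canaries:
            host.roll_back()
        return False  # the remaining hosts were never touched

    for host in rest:
        host.install(new_version)
    return True


# Example: a made-up health check that rejects one specific bad version.
fleet = [Host(f"pc-{i}", "1.0") for i in range(5)]
ok = staged_rollout(fleet, "1.1-bad", is_healthy=lambda h: h.version != "1.1-bad")
```

In this example the bad version reaches only one machine, which is then restored to its previous version. A real health check would be the hard part, and keeping previous versions around is where the space cost mentioned above comes from.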
Thank Your IT Guys – They Work Tirelessly for You
The reality is that IT often goes unnoticed until there’s a problem. IT professionals work tirelessly behind the scenes, often without recognition or thanks, to ensure systems run smoothly and updates such as this one are applied in a timely manner.
Yes, there should be better policies for protecting businesses from an event such as this. But more testing and more procedures make things less nimble, and striking that balance is an ongoing and difficult act for IT.
IT is a demanding job, and those on the front lines who are constantly troubleshooting and resolving issues deserve empathy and respect. Take a moment to thank your IT team. They put in countless hours to keep everything running seamlessly.