As enterprises continue to invest in their business-technology transformations, the risks associated with increasingly widespread IT service outages continue to rise. A recent survey conducted by Wakefield Research on behalf of incident response platform provider PagerDuty revealed an alarming expectation among IT and business executives regarding the likelihood of significant service disruptions in the near future.  

According to the survey, 88% of executives believe that an incident as severe as the July global IT outage will occur within the next 12 months, with this sentiment relatively even across the UK (91%), the US (89%) and Australia (88%). 

The CrowdStrike outage of July 19, 2024, is a cold reminder of the fragility of increasingly connected digital systems. An error in what should have been a routine software update to CrowdStrike’s Falcon security platform triggered a catastrophic chain reaction, reportedly causing approximately 8.5 million Windows devices worldwide to crash. This incident is widely considered the largest IT outage in history, as it paralyzed critical services across various business sectors, including health care, finance, transportation and government. 

Hospitals reportedly reverted to paper processes, 911 services faced disruptions, and major airlines like Delta and United canceled thousands of flights, leaving millions of passengers stranded. The financial impact proved substantial, with estimates suggesting that Fortune 500 companies may have lost up to $5.4 billion in revenues and gross profit. While CrowdStrike identified and deployed a fix within 79 minutes, the aftermath lingered for days, and in some cases for weeks, as IT administrators worldwide scrambled to restore affected systems manually. The event underscores the vital role of working backup systems, system redundancies and careful software update processes.

July 2024: A Service Outage Not to be Forgotten 

If the PagerDuty Outage Survey is any indication, it’s not an incident business executives or IT leaders will soon forget. However, Michael Farnum, advisory CISO at technology services firm Trace3, said that in a perfect world, the July 2024 outage would not change how business or IT leaders think about technology service outages. The lesson is something that shouldn’t need teaching. “It’s something of a forgotten lesson from the days when traditional anti-virus updates, and even typical operating system and software patches (which still happen) caused widespread outages,” Farnum said. 

Farnum further explained that when an organization depends on any core structure, such as an operating system, the Domain Name System, or the Border Gateway Protocol, it must plan for such potential outages and ensure its systems remain resilient. “It makes sense to put less emphasis on a particular risk because of past performance, but the lesson here is to not depend on past performance to be an indicator of future results,” Farnum said. 

The survey also uncovered a significant shift in focus from defending against external attacks toward ensuring resiliency: 86% of executives said they now recognize that they have prioritized security at the expense of their organizations’ readiness for service disruptions. That realization is prompting a reevaluation of strategies across organizations, likely because of the July global IT outage, which resulted in substantial consequences for many companies, including lost revenue or inability to process sales transactions (37%), delayed response times to customer or internal requests (39%), and interrupted access to critical business systems or applications (39%).

“It’s still very important to prioritize the protection of data and systems from external threats, but organizations are now realizing that they can’t neglect the risks associated with downtime or outages caused by service disruptions,” Eric Johnson, CIO at PagerDuty told Techstrong ITSM. “The July global IT outage demonstrated how critical it is for organizations to also ensure resiliency against service outages that can be caused by factors outside of cyberattacks,” Johnson said. 

Mitigating the Impact of Future Outages 

While it is statistically unlikely that another outage on the scale of the CrowdStrike incident will occur within the next year, the incident remains front of mind as investments in digital transformation increase digital dependencies. “The technology footprint of organizations continues to grow increasingly complex as new applications and services are built on top of old ones, resulting in numerous interdependencies across the stack. Most organizations today are tied by digital operations, and what makes things different from years before is that the blast radius is no longer localized,” Johnson said of the potential impact of service disruptions.

In response to these challenges, most executives (55%) have observed a shift toward continually evaluating and improving preparedness rather than relying on one-time investments in new systems or protocols. There is also a growing emphasis on collaboration, with 60% of executives wanting to prioritize more organizational partnerships with IT teams. Opinions are divided on the best approach to preventing and mitigating service disruptions, with executives in Australia (58%), Japan (57%) and the UK (52%) favoring the use of AI tools for proactive prevention. In contrast, US executives are evenly split between AI tools and collaboration with incident management experts.

Yet many experts, including Scott Crawford, information security research head at S&P Global Market Intelligence, say that the July 2024 incident highlighted the rising threats to technology services from an increased concentration of technical risk. “Too much dependence on too few vendors shifts the focus to how those vendors will mitigate that risk,” Crawford said. “In that case, CrowdStrike was pushing updates to all targets with a flaw in the process. So, address the process and give users latitude over which targets, as they have done,” he said.
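One common way to avoid pushing a flawed update to every target at once is a staged (ring-based) rollout that halts when too many targets in a ring fail. The following is a minimal sketch of that general idea, not a description of CrowdStrike's actual process; the function name `staged_rollout`, the ring sizes and the failure threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    updated: list[str]   # targets that accepted the update
    failed: list[str]    # targets where the update failed
    halted: bool         # True if the rollout stopped early


def staged_rollout(
    targets: Sequence[str],
    apply_update: Callable[[str], bool],
    ring_sizes: Sequence[int] = (1, 10, 100),   # hypothetical ring sizes
    max_failure_rate: float = 0.05,             # hypothetical threshold
) -> RolloutResult:
    """Update targets ring by ring; stop if a ring fails too often."""
    updated: list[str] = []
    failed: list[str] = []
    sizes = list(ring_sizes)
    start = 0
    while start < len(targets):
        # Use the next configured ring size; once they run out,
        # the final ring covers every remaining target.
        size = sizes.pop(0) if sizes else len(targets) - start
        ring = targets[start:start + size]
        ring_failures = 0
        for target in ring:
            if apply_update(target):
                updated.append(target)
            else:
                failed.append(target)
                ring_failures += 1
        # Halt before touching later rings if this ring looks unhealthy.
        if ring_failures / len(ring) > max_failure_rate:
            return RolloutResult(updated, failed, halted=True)
        start += len(ring)
    return RolloutResult(updated, failed, halted=False)
```

With a broken update, only the first ring is affected: calling `staged_rollout(hosts, lambda host: False)` stops after a single host instead of reaching the whole fleet. Letting the caller choose `targets` and `ring_sizes` is one way to give users the “latitude over which targets” that Crawford describes.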

“Heading off future such events would require greater visibility into those issues. Private sector vendors aren’t likely to disclose what could be sensitive intellectual property, so a lot of those exposures may remain unknown,” he added. Ultimately, Crawford advises organizations to consider their “threat models” for availability and whether they can satisfactorily address the risks they identify.

Looking ahead, the survey highlights a universal acknowledgment of the potential magnitude of IT service disruptions, with 100% of executives reporting an increased focus on preparing for future incidents. The key actions they intend to take include increasing budgets for technology solutions (41% overall, with Japan leading at 49%) and improving communication about preparedness protocols, particularly in Australia, the UK and the US.  

As organizations navigate an increasingly interdependent digital environment, these findings underscore the critical need for robust incident management strategies and cross-departmental collaboration to mitigate the impact of inevitable service disruptions. 

PagerDuty’s Johnson said organizations should implement new processes that drive communication, preparation and a culture of partnership between IT and the broader organization. “Improving the collaboration between IT and the different organizational teams can help ensure alignment between business objectives and tech initiatives. It’s equally as important to define clear objectives and metrics with IT as it is for executives to invest in their understanding of IT,” he said.  

Additionally, to improve service availability, Johnson advises organizations to implement robust monitoring and alerting systems as a first step, followed by enhancing clear communication plans and cross-department coordination. “Next, companies should conduct regular stress testing and drills, including ‘failure’ drills that test the response to system outages. This helps ensure readiness and refine recovery processes. Leveraging AI and automation is also important,” Johnson said. 
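As a rough illustration of the monitoring-and-alerting step, the sketch below tracks the failure rate of recent health checks over a sliding window and signals an alert when it crosses a threshold. It is a simplified example under stated assumptions, not PagerDuty's product or API; the class name `ErrorRateMonitor` and its default window, threshold and minimum sample count are hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window failure-rate check (illustrative defaults only)."""

    def __init__(self, window: int = 10, threshold: float = 0.3,
                 min_samples: int = 5) -> None:
        # Keep only the most recent `window` health-check results.
        self.results: deque[bool] = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok: bool) -> bool:
        """Record one check result; return True if an alert should fire."""
        self.results.append(ok)
        if len(self.results) < self.min_samples:
            # Too little data to judge; avoid alerting on a single blip.
            return False
        failures = sum(1 for result in self.results if not result)
        return failures / len(self.results) >= self.threshold
```

The same class can be exercised in a “failure” drill of the kind Johnson mentions: feed it simulated failed checks and confirm that the alert fires when expected.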

Ultimately, it’s a matter of “when” not “if” another significant disruption occurs, warned Johnson. “Whether it’s tomorrow or one year from now … another major incident will occur. As the July global IT outage showed us, if organizations start planning for such incidents tomorrow, it’s too late. They need to start today to automate their operations, streamline processes and reinforce their digital infrastructure to help ensure resiliency,” Johnson concluded. 
