What does the CrowdStrike outage teach us about operational resilience?

The historic IT outage that affected 8.5 million Microsoft Windows devices is a cautionary tale about the need for stringent operational resilience testing and planning, especially as regulatory expectations continue to rise.

08 November 2024 7 mins read
By Jennie Clarke
Written by humans

In brief:

  • On 19 July, 2024, the CrowdStrike IT outage affected a historic 8.5 million Microsoft Windows devices across critical infrastructure like airports, healthcare services, and banks
  • The FCA has recently published a release outlining lessons learned from the outage and how firms fared in terms of operational resilience preparedness
  • Regulators are increasing their focus on operational resilience requirements – and firms need to make sure they have the right solutions and partners in place to meet them

The ‘blue screen of death’ might be a humorous cliché among tech users and IT professionals, but there was nothing amusing about the ‘BSoD’ that appeared on some 8.5 million Microsoft Windows devices on 19 July, 2024 as a result of the historic CrowdStrike systems outage.

The impact of what was intended as a “routine software update” was felt by individuals and firms across a huge range of sectors, including aviation, media, healthcare, finance, and other critical industries. The complete loss of service caused huge disruption, with users and customers unable to access systems or communicate for many hours.

The Financial Conduct Authority (FCA) has released its “lessons for operational resilience” that firms can take on board as a result of the outage. With multiple regulators setting out clear expectations and legislation for firms to meet on operational resilience, understanding the requirements – and how technology can help meet them – is a business imperative.

Learning the hard way

On 31 October, 2024, the FCA released its “insights, observations, and key lessons from how firms responded to the CrowdStrike outage and their preparedness for future incidents.” The regulator highlighted that the CrowdStrike issue continued a trend, as “between 2022 and 2023, third-party related issues were the leading cause of operational incidents reported.”

The FCA believes that this type of incident is a key example of “the importance of firms continuing to become operationally resilient” in line with regulatory rules, specifically PS21/3: Building operational resilience (more on this below). The regulator encourages “all firms, regardless of how they were affected by the CrowdStrike incident … to improve their ability to respond to and recover from future disruptions.”

Encouraging firms to invest in operational resilience and follow its rules, the FCA also highlighted how some firms had ridden out the worst of the disruption by:

  • Mapping important business services and the resources needed to deliver them in order to effectively prioritize which key services to get back online first
  • Testing scenarios that were “severe but plausible”, including those impacting multiple business-critical services at the same time
  • Clearly outlining and testing crisis communications strategies to quickly and efficiently communicate with customers and stakeholders

Operational resilience – what are the rules and how do they impact you?

Financial organizations are increasingly subject to operational resilience requirements from regulatory bodies across multiple territories. As unpredictable, “severe but plausible” events become increasingly common – from CrowdStrike to COVID – firms are expected to have strategies and technologies in place to ensure business continuity and to satisfy the growing number of regulatory requirements.

EU and U.K.

EU Digital Operational Resilience Act (DORA)

A relatively “new kid on the block”, DORA is a European Union regulation that entered into force in January 2023 and applies from 17 January 2025. It covers financial entities operating in the EU, including U.K. firms that provide services there. The Act is built on five pillars whose requirements firms need to meet:

  • Operational resilience testing across basic and advanced levels to ensure business operations can withstand disruption
  • ICT risk management, including implementing internal and external controls and governance strategies
  • Third-party risk monitoring and management via key contractual provisions
  • Establishing an incident reporting framework, including factors such as customer impact, downtime duration, economic impact, and geographical spread
  • Information sharing across the industry so firms are notified of potential operational and cyber risks

FCA – PS21/3: Building operational resilience

The final rules around PS21/3: Building operational resilience require that “by no later than March 31, 2025” regulated firms including banks and insurers must have:

  • Performed mapping and testing so that they are able to remain within impact tolerances for each important business service
  • Made the necessary investments to enable them to operate consistently in the event of unexpected disruption
  • Identified any vulnerabilities in their existing operational resilience
  • Included third-party services in scenario testing (or ensure these parties have conducted their own)
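The mapping and scenario-testing steps above can be illustrated with a minimal sketch. It assumes each important business service has a single impact tolerance expressed in hours and is treated as unavailable for as long as its worst-affected resource. The service names, resource names, tolerances, and the `run_scenario` helper are all hypothetical and are not taken from PS21/3 itself.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessService:
    """An important business service and the resources it depends on (hypothetical values)."""
    name: str
    impact_tolerance_hours: float  # maximum tolerable disruption, an assumed unit
    resources: set[str] = field(default_factory=set)


def run_scenario(services, outage_hours):
    """Return the services that would breach their impact tolerance.

    outage_hours maps a resource name to how long it is assumed unavailable.
    A service is assumed to be down for as long as its worst-affected resource.
    """
    breaches = {}
    for service in services:
        downtime = max(
            (outage_hours.get(r, 0.0) for r in service.resources),
            default=0.0,
        )
        if downtime > service.impact_tolerance_hours:
            breaches[service.name] = downtime
    return breaches


# Hypothetical mapping of services to internal and third-party resources.
services = [
    BusinessService("payments", 2.0, {"core-banking", "endpoint-security-vendor"}),
    BusinessService("client-communications", 4.0, {"email-provider"}),
    BusinessService("reporting", 24.0, {"core-banking", "data-warehouse"}),
]

# Hypothetical "severe but plausible" scenario: a faulty third-party update
# takes endpoints offline for six hours.
scenario = {"endpoint-security-vendor": 6.0}
breaches = run_scenario(services, scenario)
print(breaches)
```

Including third-party resources in the same mapping means a vendor outage can be tested in the same way as an internal failure, which is the point of the final requirement in the list.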

U.S.

Office of the Comptroller of the Currency (OCC)

In March, 2024, the Office of the Comptroller of the Currency (OCC) set out a focus on “exploring baseline operational resilience requirements for large banks with critical operations, including third-party service providers.” The five “baseline requirements” that the OCC will focus on include:

  • Establishing clear definitions for identifying critical activities and core business lines
  • Defining tolerances for disruption
  • Requiring testing and validation of resilience capabilities
  • Incorporating third-party risk management expectations
  • Stipulating clear communication expectations among stakeholders and counterparties

Commodity Futures Trading Commission (CFTC)

The CFTC proposed a rulemaking in December, 2023, requiring that firms establish risk appetites and tolerance limits as part of an operational resilience framework built around three pillars:

  • Information and technology security
  • A third-party relationship program to manage risks presented by mission-critical third-party service providers
  • A business continuity and disaster recovery plan

Financial Industry Regulatory Authority (FINRA)

FINRA has established rules, including Rule 4370, that require firms to have written Business Continuity Plans (BCPs) that must include:

  • Annual reviews of an established plan to ensure effectiveness
  • Automated data backup and recovery that is readily accessible
  • Identification of mission-critical systems
  • Risk assessments of critical third-party vendors and response plans for disruption to their services

How does Global Relay App ensure operationally resilient business communications?

As the FCA’s observations highlighted, communications are critical to resilience and recovery during severe disruption. They are vital to identifying the scope of an issue, rallying the resources needed to address it, and communicating with teams, stakeholders, and customers worldwide in order to manage responses and reputational impacts.

Global Relay App has been designed to keep your lines of communication open constantly and consistently, delivering uninterrupted access – from wherever you’re working – to business-critical channels including:

  • Emails
  • Calendars
  • Contacts
  • Chats
  • Phone calls
  • SMS/Text messages
  • WhatsApp
  • WeChat

In the face of global outages from providers like Microsoft and Google, Global Relay partners are able to coordinate and continue providing services across all their lines of business communication, backed up by next-gen private cloud, multiple data centers, and always-on support.

Global Relay Archive enables access to 100% of your business communication history to date, across both external and internal channels, including images, attachments, voice notes, and other critical file types.

Global Relay underpins and supports the operational resilience of your organization, working continuously and in tandem with your BCPs to guarantee business-as-usual communications, both within your organization and externally. You can work confidently knowing that your business is resilient and meeting regulatory expectations.

With legacy and public cloud vendors becoming increasingly prone to disruption, financial organizations must ensure they have an operationally resilient solution to keep the lights on when other providers go dark.

 
